next-token-prediction

Next-token prediction in JavaScript — build fast language and diffusion models.

Stars: 135

Next-Token Prediction is a JavaScript library that builds a language model from a body of text and returns predictions for the next word, phrase, or pixel. With enough training data and a good chat interface, it is intended as an alternative to well-known decoder-only models like GPT and Mistral. Simple usage trains from a built-in data bootstrap; advanced usage lets you supply a prebuilt dataset or train on your own .txt files. Its goals are to provide autocomplete, autocorrect, spell checking, and search/lookup, to create pixel and audio transformers for other prediction formats, and to simplify LLM methodologies.

README:

Next-Token Prediction

Create a language model based on a body of text and get high-quality predictions (next word, next phrase, next pixel, etc.). With enough training data and a good chat interface, this can be used instead of well-known decoder-only models like GPT, Mistral, etc.
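
To make the idea of "a language model based on a body of text" concrete, here is a minimal bigram sketch: it counts which word follows which, then predicts the most frequent follower. This is an illustration only, not this library's implementation; `tokenize`, `trainBigrams`, and `predictNext` are hypothetical names that do not come from the package.

```javascript
// Illustrative bigram model. NOT the next-token-prediction implementation;
// all function names here are hypothetical.

// Split text into lowercase word tokens
const tokenize = text => text.toLowerCase().match(/[a-z']+/g) || [];

// Count, for every token, how often each other token follows it
const trainBigrams = text => {
  const counts = new Map();
  const tokens = tokenize(text);

  for (let i = 0; i < tokens.length - 1; i++) {
    const current = tokens[i];
    const next = tokens[i + 1];

    if (!counts.has(current)) counts.set(current, new Map());

    const followers = counts.get(current);
    followers.set(next, (followers.get(next) || 0) + 1);
  }

  return counts;
};

// Return the most frequent follower of `token`, or null if it was never seen
const predictNext = (model, token) => {
  const followers = model.get(token.toLowerCase());

  if (!followers) return null;

  let best = null;
  let bestCount = 0;

  for (const [candidate, count] of followers) {
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }

  return best;
};

const model = trainBigrams(
  'the sun is bright. the sun is warm. the moon is pale.'
);

console.log(predictNext(model, 'the')); // sun
```

In this toy corpus "sun" follows "the" twice and "moon" once, so "sun" wins. More text gives the counts more evidence, which is why prediction quality depends on the amount of training data.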

Install

npm i next-token-prediction

Usage

Simple (from a built-in data bootstrap)

Put this /training/ directory in the root of your project.

Now you just need to create your app's index.js file and run it. Your model will start training on the .txt files located in /training/documents/. After training is complete it will run these 4 queries:

const { Language: LM } = require('next-token-prediction');

const MyLanguageModel = async () => {
  const agent = await LM({
    bootstrap: true
  });

  // Predict the next word

  agent.getTokenPrediction('what');

  // Predict the next 5 words

  agent.getTokenSequencePrediction('what is', 5);

  // Complete the phrase

  agent.complete('hopefully');

  // Get a top k sample of completion predictions

  agent.getCompletions('The sun');
};

MyLanguageModel();
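
Predicting "the next 5 words" can be thought of as repeated next-word prediction, where each predicted word becomes the input for the next step. The sketch below shows that greedy loop in isolation. It is an assumption about the general technique, not a description of how `getTokenSequencePrediction` works internally; `table`, `nextToken`, and `predictSequence` are hypothetical stand-ins.

```javascript
// Stand-in predictor backed by a hand-written lookup table (hypothetical data).
// A real model would derive these transitions from training text.
const table = new Map([
  ['what', 'is'],
  ['is', 'the'],
  ['the', 'time']
]);

const nextToken = token => table.get(token) ?? null;

// Greedy sequence prediction: feed each predicted token back in as the next
// input, stopping early if the predictor has nothing to offer.
const predictSequence = (prompt, length, predict = nextToken) => {
  const words = prompt.trim().split(/\s+/);
  const output = [];
  let current = words[words.length - 1];

  for (let i = 0; i < length; i++) {
    const next = predict(current);

    if (next === null) break;

    output.push(next);
    current = next;
  }

  return output.join(' ');
};

console.log(predictSequence('what is', 5)); // the time
```

The loop stops after two words here because the stand-in table has no entry for "time"; with a larger model it would continue until it reaches the requested length.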

Advanced (provide trainingData or create it from .txt files)

Put this /training/ directory in the root of your project.

Because training data was committed to this repo, you can optionally skip training, and just use the bootstrapped training data, like this:

const { dirname } = require('path');
const __root = dirname(require.main.filename);

const { Language: LM } = require('next-token-prediction');
const OpenSourceBooksDataset = require(`${__root}/training/datasets/OpenSourceBooks`);

const MyLanguageModel = async () => {
  const agent = await LM({
    dataset: OpenSourceBooksDataset
  });

  // Complete the phrase

  agent.complete('hopefully');
};

MyLanguageModel();

Or, train on your own provided text files:

const { dirname } = require('path');
const __root = dirname(require.main.filename);

const { Language: LM } = require('next-token-prediction');

const MyLanguageModel = async () => {
  // The following .txt files should exist in a `/training/documents/`
  // directory in the root of your project

  const agent = await LM({
    files: [
      'marie-antoinette',
      'pride-and-prejudice',
      'to-kill-a-mockingbird',
      'basic-algebra',
      'a-history-of-war',
      'introduction-to-c-programming'
    ]
  });

  // Complete the phrase

  agent.complete('hopefully');
};

MyLanguageModel();
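
Since the names in `files` must match .txt files in `/training/documents/`, it can help to check for missing documents before training starts. The helper below is a hypothetical convenience, not part of the library. It assumes each entry maps to `<name>.txt` and that the training directory sits under the current working directory.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not part of next-token-prediction): returns the names
// from `files` that have no matching `<name>.txt` file in `documentsDir`.
const findMissingDocuments = (documentsDir, files) =>
  files.filter(
    name => !fs.existsSync(path.join(documentsDir, `${name}.txt`))
  );

// Assumption: the app is run from the project root
const documentsDir = path.join(process.cwd(), 'training', 'documents');

const missing = findMissingDocuments(documentsDir, [
  'marie-antoinette',
  'pride-and-prejudice'
]);

if (missing.length > 0) {
  console.warn(`Missing training documents: ${missing.join(', ')}`);
}
```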

Run tests

npm test

Examples

Readline Completion

UI Autocomplete

Videos

https://github.com/bennyschmidt/next-token-prediction/assets/45407493/68c070bd-ee03-4b7e-8ba3-3885f77fd9f9

https://github.com/bennyschmidt/next-token-prediction/assets/45407493/cd4a1102-5a82-4a6f-abb8-e96805fa65fd

Browser example: Fast autocomplete

With more training data you can get more suggestions, eventually hitting a tipping point where it can complete anything.

https://github.com/bennyschmidt/next-token-prediction/assets/45407493/942bdabf-4bf5-4d7a-b0db-2331d8c3dd18
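
As a rough picture of why more training data yields more suggestions, the sketch below ranks known words by frequency for a typed prefix. It is a simplified stand-in, not the code behind the browser example; `buildVocabulary` and `suggest` are hypothetical names.

```javascript
// Illustrative prefix autocomplete. NOT the library's implementation.

// Count how often each word appears in the training text
const buildVocabulary = text => {
  const counts = new Map();

  for (const word of text.toLowerCase().match(/[a-z']+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  return counts;
};

// Return up to `limit` words starting with `prefix`, most frequent first;
// ties are broken alphabetically so the output is deterministic
const suggest = (vocabulary, prefix, limit = 3) =>
  [...vocabulary.entries()]
    .filter(([word]) => word.startsWith(prefix.toLowerCase()))
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([word]) => word);

const vocabulary = buildVocabulary(
  'the sun sets; the sun rises; the sea is calm; a storm is coming'
);

console.log(suggest(vocabulary, 's')); // [ 'sun', 'sea', 'sets' ]
```

A prefix that never appears in the training text returns an empty list, so a larger corpus directly increases how many inputs can be completed.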

Inspiration

3Blue1Brown video on YouTube

Goals

  1. Provide a high-quality text prediction library for:
  • autocomplete
  • autocorrect
  • spell checking
  • search/lookup

  2. Create pixel and audio transformers for other prediction formats

  3. Demystify LLMs & simplify methodologies

  4. Make a high-quality, free/open chat-focused LLM in JavaScript, and an equally sophisticated image-focused diffusion model. Working on this here.
