LLM-Scratch

Stars: 51

LLM-Scratch is a minimal implementation of a GPT-style Large Language Model built from scratch in PyTorch. It is intended for learning and experimentation, focusing on autoregressive text generation. The codebase is clean, modular, and extensible, and it uses a simple character-level tokenizer, so there are no external dependencies such as BPE or SentencePiece. The architecture includes token and positional embeddings, stacked transformer blocks with masked multi-head self-attention, feed-forward networks, residual connections, and layer normalization, followed by a language modeling head. Training uses next-token prediction with cross-entropy loss and the AdamW optimizer, with data sampled in fixed-length blocks. Configuration parameters are centralized for easy experimentation and reproducibility.

README:

LLM From Scratch (PyTorch)

This project is a minimal implementation of a GPT-style Large Language Model built from scratch using PyTorch.
It is designed for learning and experimentation, not production use.


What This Model Does

  • Builds a small, fixed vocabulary with a simple character-level tokenizer
  • Implements multi-head self-attention, feed-forward layers, and layer normalization
  • Trains a GPT-like transformer on raw text
  • Generates text autoregressively (see the sampling sketch below)

Each module is intentionally separated to keep the codebase clean, modular, and extensible.
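
As a rough illustration of the autoregressive generation step, the sketch below samples one token at a time from the model's output distribution and appends it to the context. The function name, the `(logits, loss)` return signature, and the `block_size` cropping are assumptions about the implementation, not the repository's exact code.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    # idx: (batch, time) tensor of token IDs used as the prompt.
    # Assumes model(idx) returns (logits, loss), with logits of shape (B, T, vocab_size).
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                     # crop context to the last block_size tokens
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]                           # keep only the final position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample the next token
        idx = torch.cat((idx, next_id), dim=1)              # append and continue
    return idx
```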


Tokenization

The model uses a character-level tokenizer, which:

  • Converts each unique character into an integer ID
  • Creates a small, fixed vocabulary
  • Makes the learning process easier to understand
  • Avoids external dependencies like BPE or SentencePiece

While inefficient for large-scale models, character tokenization is ideal for learning and experimentation.
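
A character-level tokenizer of this kind usually amounts to a pair of lookup tables built from the training corpus. The sketch below is illustrative; the variable names (`stoi`, `itos`, `encode`, `decode`) are assumptions, not necessarily those used in the repository.

```python
# `text` is assumed to be the raw training corpus loaded as a single string.
chars = sorted(set(text))
vocab_size = len(chars)                       # small, fixed vocabulary

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> character

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)
```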


Model Architecture

The model follows a GPT-style Transformer architecture:

  1. Token Embedding
    Converts token IDs into dense vectors.

  2. Positional Embedding
    Adds information about token positions in the sequence.

  3. Transformer Blocks (stacked)

    • Masked self-attention
    • Feed-forward network
    • Residual connections
    • Layer normalization
  4. Language Modeling Head
    Projects hidden states to vocabulary logits.

The model is trained using causal (autoregressive) language modeling, where it predicts the next token given previous tokens.
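
The sketch below shows one common way these pieces fit together in PyTorch. It is not the repository's exact code: module names and hyperparameter defaults are illustrative, and it uses `torch.nn.MultiheadAttention` for brevity where the repository may implement attention by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Pre-norm transformer block: masked self-attention + feed-forward, each with a residual connection."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ff = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        # causal mask: position t may only attend to positions <= t
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x):
        T = x.size(1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=self.mask[:T, :T])
        x = x + attn_out                 # residual connection around attention
        x = x + self.ff(self.ln2(x))     # residual connection around feed-forward
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd=128, n_head=4, n_layer=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # 1. token embedding
        self.pos_emb = nn.Embedding(block_size, n_embd)   # 2. positional embedding
        self.blocks = nn.Sequential(                      # 3. stacked transformer blocks
            *[Block(n_embd, n_head, block_size) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)      # 4. language modeling head

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)         # (B, T, n_embd)
        x = self.blocks(x)
        logits = self.lm_head(self.ln_f(x))               # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss
```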


Training Objective

The training task is next-token prediction:

Given a sequence of tokens, predict the next token at every position (see the training-loop sketch after the list below).

  • Loss function: Cross-Entropy Loss
  • Optimizer: AdamW
  • Training data is sampled in fixed-length blocks
  • Gradients are backpropagated through the full context block
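
Put together, this objective makes the training loop only a few lines. The sketch below reuses the tokenizer and `GPT` sketches above; the hyperparameter values and the `get_batch` helper are illustrative assumptions, not the repository's actual code.

```python
import torch

# Hyperparameter names mirror the Configuration section; values are illustrative.
batch_size, block_size, max_iters, learning_rate = 32, 128, 5000, 3e-4
device = "cuda" if torch.cuda.is_available() else "cpu"

# `encode`, `text`, and `vocab_size` come from the tokenizer sketch above; `GPT` from the architecture sketch.
data = torch.tensor(encode(text), dtype=torch.long)

def get_batch():
    """Sample a batch of random fixed-length blocks and their next-token targets."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets are inputs shifted by one
    return x.to(device), y.to(device)

model = GPT(vocab_size, block_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    xb, yb = get_batch()
    logits, loss = model(xb, yb)             # loss is cross-entropy over next-token predictions
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```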

Configuration

All important parameters are defined in one place:

Examples:

  • Batch size
  • Context length (block size)
  • Embedding dimension
  • Number of attention heads
  • Number of transformer layers
  • Learning rate
  • Training iterations

This makes experimentation simple and reproducible.
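
One common way to centralize these settings is a small dataclass (or a plain dict of module-level constants). The sketch below is illustrative; the values are assumed defaults, not the repository's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class Config:
    batch_size: int = 32
    block_size: int = 128        # context length
    n_embd: int = 128            # embedding dimension
    n_head: int = 4              # number of attention heads
    n_layer: int = 4             # number of transformer layers
    learning_rate: float = 3e-4
    max_iters: int = 5000        # training iterations

cfg = Config()
model = GPT(vocab_size, cfg.block_size, cfg.n_embd, cfg.n_head, cfg.n_layer)
```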

