llm-baselines
Stars: 58
LLM-baselines is a modular codebase to experiment with transformers, inspired by NanoGPT. It provides a quick and easy way to train and evaluate transformer models on a variety of datasets. The codebase is well-documented and easy to use, making it a great resource for researchers and practitioners alike.
README:
A modular codebase to experiment with transformers, inspired by NanoGPT.
Install dependencies:
pip install -r requirements.txt
Run a simple training on the Wikitext dataset:
python ./src/main.py
The above command trains a 213.34M-parameter model (see the "Results on wikitext" section for more details). Training takes a bit less than 5 hours on a single 40GB A100. It runs for 15k iterations with a batch size of 50x4 (4 gradient accumulation steps), i.e. a speed of 0.84 iterations per second, or about 86k tokens per second. You should reach a perplexity of around 18.5.
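As a quick sanity check, these throughput numbers follow directly from the configuration above (pure arithmetic, no repo code involved):
batch_size = 50          # sequences per micro-batch
acc_steps = 4            # gradient accumulation steps
sequence_length = 512    # tokens per sequence
iters_per_sec = 0.84

tokens_per_iter = batch_size * acc_steps * sequence_length  # 102,400 tokens
tokens_per_sec = tokens_per_iter * iters_per_sec            # ~86k tokens/s
hours_for_15k_iters = 15_000 / iters_per_sec / 3600         # ~4.96 h

print(f"{tokens_per_sec:,.0f} tokens/s, {hours_for_15k_iters:.2f} h")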
If you don't have that much VRAM:
python ./src/main.py --n_layer 12 --sequence_length 256
If you have very limited resources, try the Shakespeare dataset with a character-based tokenizer:
python ./src/main.py --n_layer=2 --n_head=4 --n_embd=128 --sequence_length=256 --dataset=shakespeare-char --device=cpu --vocab_size=96
Here are the possible parameters you can use (copypasta from config/base.py):
# General training params
parser.add_argument('--batch_size', default=50, type=int)
parser.add_argument('--acc_steps', default=4, type=int)
parser.add_argument('--seed', default=0, type=int)
parser.add_argument('--device', default='cuda:0', type=str)
parser.add_argument('--iterations', default=15000, type=int)
parser.add_argument('--lr', default=2e-3, type=float)
parser.add_argument('--warmup_percent', default=0.02, type=float)
parser.add_argument('--weight_decay', default=1e-3, type=float)
parser.add_argument('--beta1', default=0.9, type=float)
parser.add_argument('--beta2', default=0.95, type=float)
parser.add_argument('--scheduler', default='cos', choices=['linear', 'cos', 'none'])
parser.add_argument('--opt', default='adamw', choices=['adamw', 'sgd'])
parser.add_argument('--eval_freq', default=200, type=int) # in iterations
parser.add_argument('--results_base_folder', default="./exps", type=str)
# Dataset params
parser.add_argument('--dataset', default='wikitext', choices=['wikitext', "shakespeare-char", 'arxiv', "arxiv2000", "arxiv+wiki", 'openwebtext2'])
parser.add_argument('--vocab_size', default=50304, type=int)
parser.add_argument('--data_in_ram', action='store_true') # force the data to RAM, mostly useless except for openwebtext2
# Model params
parser.add_argument('--model', default='base', choices=['base', 'sparse-heads-q'])
parser.add_argument('--use_pretrained', default="none", type=str) # 'none', 'gpt-2' or a path to the pretrained model
parser.add_argument('--dropout', default=0.2, type=float)
parser.add_argument('--n_head', default=12, type=int)
parser.add_argument('--n_layer', default=24, type=int) # depth in att + ff blocks
parser.add_argument('--n_embd', default=768, type=int) # embedding size / hidden size ...
parser.add_argument('--sequence_length', default=512, type=int)
parser.add_argument('--dtype', default=torch.bfloat16, type=torch.dtype)
parser.add_argument('--bias', default=False, type=bool)
parser.add_argument('--no_compile', action='store_true') # if true then model is not compiled
# logging params (WandB)
parser.add_argument('--wandb', action='store_true') # whether to use wandb or not
parser.add_argument('--wandb_project', default="my-project", type=str)
parser.add_argument('--wandb_run_prefix', default="none", type=str) # is added before the autogenerated experiment name
parser.add_argument('--eval_seq_prefix', default="The history of Switzerland ", type=str) # prefix used to generate sequences
# Distributed args
parser.add_argument('--distributed_backend', default=None, type=str, required=False,
choices=distributed.registered_backends()) # distributed backend type
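For intuition, here is a minimal sketch of the learning-rate shape implied by the defaults above (scheduler='cos' with warmup_percent=0.02): linear warmup followed by cosine decay. This is an assumption about the usual meaning of these flags, not the repo's actual implementation, which lives in the optim code:
import math

def lr_at(it, iterations=15_000, lr=2e-3, warmup_percent=0.02):
    # Hypothetical helper: linear warmup for the first 2% of iterations,
    # then cosine decay from lr down to 0.
    warmup_iters = int(warmup_percent * iterations)  # 300 iterations
    if it < warmup_iters:
        return lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / max(1, iterations - warmup_iters)
    return 0.5 * lr * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(300), lr_at(14_999))  # ramping up, peak lr, near zero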
You need to provide your wandb authorization key in order to send the data to your wandb account. If you start jobs on a server without access to a prompt, you can set the WANDB_API_KEY variable within your script:
# this is a script that could be executed on a server
pip install -r requirements.txt # install req.
export WANDB_API_KEY="put your authorize key here, to find it: https://wandb.ai/authorize"
python ./src/main.py --wandb --wandb_project "my awesome project" --n_layer 7 --model base --seed 123
The structure of the project is the following:
src/
main.py # pick the right data, model, and training function
config/
__init__.py # contains CONFIG_FORMAT_TO_MODULE_MAP mapping the name given to the --config_format flag with a python conf file
base.py # config for the base model
sparse.py # config for some sparsehq model
data/
utils.py # contains the get_dataset function
wikitext.py # load/process wikitext
arxiv.py # load/process arxiv
shakespeare.py # load/process the Shakespeare dataset
models/
utils.py # contains the get_model function
base.py # contains the standard transformer base architecture
sparsehq.py # a fork of base.py with a different architecture
optim/
utils.py # contains eval and get_batch functions
base.py # training function for the base model
sparse.py # training function for the sparsehq model
distributed/
# code to enable simple distributed training
Given the above structure, to add your own model you can simply fork the ./src/models/base.py file, make your modifications, and then, if you need a custom training loop or evaluation, fork ./src/optim/base.py as well. You also need to fork the ./src/config/base.py file to add your own parameters, which implies adding your new config to the mapping CONFIG_FORMAT_TO_MODULE_MAP in ./src/config/__init__.py.
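For illustration, registering a forked config could look roughly like this; the module name my_model and the key names are hypothetical, so check the actual file for the real mapping:
# Sketch of ./src/config/__init__.py with a new entry added.
from . import base, sparse, my_model  # my_model is your fork of base.py

CONFIG_FORMAT_TO_MODULE_MAP = {
    "base": base,
    "sparse": sparse,
    "my-model": my_model,  # the key is what you pass to --config_format
}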
To add a new dataset, create a new file in the data folder; check wikitext.py for the expected format.
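As a rough sketch, a new dataset module might look like the following. The exact interface is set by get_dataset in data/utils.py, so the function name, tokenizer, and return format here are assumptions modeled on NanoGPT-style pipelines, with wikitext.py as the reference:
# src/data/mydataset.py -- hypothetical dataset module.
import numpy as np
import tiktoken  # assuming the GPT-2 BPE tokenizer

def get_mydataset_data(datasets_dir="./datasets"):
    # Read raw text, tokenize it, and return flat arrays of token ids.
    enc = tiktoken.get_encoding("gpt2")
    with open(f"{datasets_dir}/mydataset.txt") as f:
        text = f.read()
    ids = np.array(enc.encode_ordinary(text), dtype=np.uint16)
    split = int(0.9 * len(ids))  # simple 90/10 train/val split
    return {"train": ids[:split], "val": ids[split:]}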
Trying to get the best perplexity as fast as possible, I settled on a model (213.34M parameters) with the following parameters:
- n_embd: 768
- n_head: 12
- dropout: 0.2 (0.1 also works fine)
- n_layer: 24
- sequence_length: 512
- batch_size: 50
- acc_steps: 4
- iterations: 15000
- lr: 0.002
- warmup_percent: 0.02
The training reaches convergence after a bit less than 5 hours (on one 40GB A100), with a perplexity of around 18.5, which I believe to be good for a non-pretrained model. The figure below shows the evolution of perplexity for two models with and without dropout, illustrating the importance of dropout. It should be possible to reach a similar perplexity without dropout by using a smaller batch size, but this would probably extend the duration of the training.
A lighter and faster (12 layers instead of 24) but still very good config on wikitext (the equivalent command is shown after the list):
- n_embd: 768
- n_head: 12
- n_layer: 12
- batch_size: 55
- sequence_length: 512
- acc_steps: 3
- dropout: 0.2
- iterations: 15000
- lr: 0.002
- warmup_percent: 0.02
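Expressed as a command line, only the values that differ from the defaults in config/base.py need to be passed:
python ./src/main.py --n_layer 12 --batch_size 55 --acc_steps 3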
Alternative AI tools for llm-baselines
Similar Open Source Tools
pandas-ai
PandaAI is a Python platform that enables users to interact with their data in natural language, catering to both non-technical and technical users. It simplifies data querying and analysis, offering conversational data analytics capabilities with minimal code. Users can ask questions, visualize charts, and compare dataframes effortlessly. The tool aims to streamline data exploration and decision-making processes by providing a user-friendly interface for data manipulation and analysis.
flow-prompt
Flow Prompt is a dynamic library for managing and optimizing prompts for large language models. It facilitates budget-aware operations, dynamic data integration, and efficient load distribution. Features include CI/CD testing, dynamic prompt development, multi-model support, real-time insights, and prompt testing and evolution.
axar
AXAR AI is a lightweight framework designed for building production-ready agentic applications using TypeScript. It aims to simplify the process of creating robust, production-grade LLM-powered apps by focusing on familiar coding practices without unnecessary abstractions or steep learning curves. The framework provides structured, typed inputs and outputs, familiar and intuitive patterns like dependency injection and decorators, explicit control over agent behavior, real-time logging and monitoring tools, minimalistic design with little overhead, model agnostic compatibility with various AI models, and streamed outputs for fast and accurate results. AXAR AI is ideal for developers working on real-world AI applications who want a tool that gets out of the way and allows them to focus on shipping reliable software.
aioimaplib
aioimaplib is a Python library inspired by imaplib and imaplib2, aiming to port imaplib with asyncio for asynchronous benefits. It provides functionalities to interact with IMAP servers using asyncio, including checking mailbox, waiting for new messages, handling IDLE command, threading, IMAP command concurrency, logging configuration, and authentication with OAuth2. The library is tested with various IMAP servers like dovecot, Gmail, Outlook, Yahoo, etc. Developers are encouraged to contribute by improving, bug fixing, testing with other IMAP servers, and providing feedback. The library supports most IMAP4rev1 commands from RFC3501 and plans to implement additional commands like 'STARTTLS', 'AUTHENTICATE', 'COMPRESS', 'SETACL', 'DELETEACL', 'GETACL', 'MYRIGHTS', 'LISTRIGHTS', 'GETQUOTA', 'GETQUOTAROOT', 'SETQUOTA', 'SORT', 'THREAD', 'ID', 'NAMESPACE', 'CATENATE', and tests with other servers.
KaibanJS
KaibanJS is a JavaScript-native framework for building multi-agent AI systems. It enables users to create specialized AI agents with distinct roles and goals, manage tasks, and coordinate teams efficiently. The framework supports role-based agent design, tool integration, multiple LLMs support, robust state management, observability and monitoring features, and a real-time agentic Kanban board for visualizing AI workflows. KaibanJS aims to empower JavaScript developers with a user-friendly AI framework tailored for the JavaScript ecosystem, bridging the gap in the AI race for non-Python developers.
create-million-parameter-llm-from-scratch
The 'create-million-parameter-llm-from-scratch' repository provides a detailed guide on creating a Large Language Model (LLM) with 2.3 million parameters from scratch. The blog replicates the LLaMA approach, incorporating concepts like RMSNorm for pre-normalization, SwiGLU activation function, and Rotary Embeddings. The model is trained on a basic dataset to demonstrate the ease of creating a million-parameter LLM without the need for a high-end GPU.
azure-functions-openai-extension
Azure Functions OpenAI Extension is a project that adds support for OpenAI LLM (GPT-3.5-turbo, GPT-4) bindings in Azure Functions. It provides NuGet packages for various functionalities like text completions, chat completions, assistants, embeddings generators, and semantic search. The project requires .NET 6 SDK or greater, Azure Functions Core Tools v4.x, and specific settings in Azure Function or local settings for development. It offers features like text completions, chat completion, assistants with custom skills, embeddings generators for text relatedness, and semantic search using vector databases. The project also includes examples in C# and Python for different functionalities.
langfair
LangFair is a Python library for bias and fairness assessments of large language models (LLMs). It offers a comprehensive framework for choosing bias and fairness metrics, demo notebooks, and a technical playbook. Users can tailor evaluations to their use cases with a Bring Your Own Prompts approach. The focus is on output-based metrics practical for governance audits and real-world testing.
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
marqo
Marqo is more than a vector database; it's an end-to-end vector search engine for both text and images. Vector generation, storage, and retrieval are handled out of the box through a single API. No need to bring your own embeddings.
LLM-Microscope
This repository contains the official implementation of the code for the paper 'Your Transformer is Secretly Linear'. It provides functions for calculating anisotropy score, intrinsic dimension, linearity score, and centered linearity score based on pseudo-random features. Additionally, a pip package is available for easy installation. Users can also download the dataset used in the paper for further analysis.
llm-sandbox
LLM Sandbox is a lightweight and portable sandbox environment designed to securely execute large language model (LLM) generated code in a safe and isolated manner using Docker containers. It provides an easy-to-use interface for setting up, managing, and executing code in a controlled Docker environment, simplifying the process of running code generated by LLMs. The tool supports multiple programming languages, offers flexibility with predefined Docker images or custom Dockerfiles, and allows scalability with support for Kubernetes and remote Docker hosts.
instructor-js
Instructor is a TypeScript library for structured extraction, powered by LLMs and designed for simplicity, transparency, and control. Whether you're a seasoned developer or just starting out, you'll find Instructor's approach intuitive and steerable.
cappr
CAPPr is a tool for text classification that does not require training or post-processing. It allows users to have their language models pick from a list of choices or compute the probability of a completion given a prompt. The tool aims to help users get more out of open source language models by simplifying the text classification process. CAPPr can be used with GGUF models, Hugging Face models, models from the OpenAI API, and for tasks like caching instructions, extracting final answers from step-by-step completions, and running predictions in batches with different sets of completions.
For similar jobs
LLM-FineTuning-Large-Language-Models
This repository contains projects and notes on common practical techniques for fine-tuning Large Language Models (LLMs). It includes fine-tuning LLM notebooks, Colab links, LLM techniques and utils, and other smaller language models. The repository also provides links to YouTube videos explaining the concepts and techniques discussed in the notebooks.
lloco
LLoCO is a technique that learns documents offline through context compression and in-domain parameter-efficient finetuning using LoRA, which enables LLMs to handle long context efficiently.
camel
CAMEL is an open-source library designed for the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. To facilitate research in this field, we implement and support various types of agents, tasks, prompts, models, and simulated environments.
python-tutorial-notebooks
This repository contains Jupyter-based tutorials for NLP, ML, AI in Python for classes in Computational Linguistics, Natural Language Processing (NLP), Machine Learning (ML), and Artificial Intelligence (AI) at Indiana University.
EvalAI
EvalAI is an open-source platform for evaluating and comparing machine learning (ML) and artificial intelligence (AI) algorithms at scale. It provides a central leaderboard and submission interface, making it easier for researchers to reproduce results mentioned in papers and perform reliable & accurate quantitative analysis. EvalAI also offers features such as custom evaluation protocols and phases, remote evaluation, evaluation inside environments, CLI support, portability, and faster evaluation.
Weekly-Top-LLM-Papers
This repository provides a curated list of weekly published Large Language Model (LLM) papers. It lists the most important LLM papers for each week, organized by month and year, making it easy to find the most recent and relevant research in the field of LLMs.
self-llm
This project is a Chinese tutorial for domestic beginners based on the AutoDL platform, providing full-process guidance for various open-source large models, including environment configuration, local deployment, and efficient fine-tuning. It simplifies the deployment, use, and application process of open-source large models, enabling more ordinary students and researchers to better use open-source large models and helping open and free large models integrate into the lives of ordinary learners faster.