1.5-Pints

A compact LLM pretrained in 9 days by using high quality data

Stars: 230

Visit

1.5-Pints is a repository that provides a recipe to pre-train models in 9 days, aiming to create AI assistants comparable to Apple OpenELM and Microsoft Phi. It includes model architecture, training scripts, and utilities for 1.5-Pints and 0.12-Pint developed by Pints.AI. The initiative encourages replication, experimentation, and open-source development of Pint by sharing the model's codebase and architecture. The repository offers installation instructions, dataset preparation scripts, model training guidelines, and tools for model evaluation and usage. Users can also find information on finetuning models, converting lit models to HuggingFace models, and running Direct Preference Optimization (DPO) post-finetuning. Additionally, the repository includes tests to ensure code modifications do not disrupt the existing functionality.

README:

1.5-Pints

A recipe to pre-train models in 9 days, to become comparable AI assistants to the likes of Apple OpenELM and Microsoft Phi.

This repo contains the model architecture, training scripts, and utilities of 1.5-Pints and 0.12-Pint, developed by Pints.AI. By providing access to the model's codebase and architecture, this initiative seeks to facilitate the replication, experimentation, and further open-source development of Pint.

Join us at Discord: https://discord.com/invite/RSHk22Z29j

Paper & Citation

@misc{tan202415pintstechnicalreportpretraining,
      title={1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data},
      author={Calvin Tan and Jerome Wang},
      year={2024},
      eprint={2408.03506},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.03506},
}

Installation

Recommended OS

Typically just stick to Ubuntu 22.04 LTS x86-64. Debian 12 has been tested to work as well.

GOTCHA1: Dont use arm64 / aarch64. xformers does not support ARM64 processors.

GOTCHA2: We should not install system-wide CUDA using apt. It is best to constrain the CUDA installation to within the conda environment, so that different projects can use different CUDA versions.

Install conda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
sh Miniconda3-latest-Linux-x86_64.sh

Source just to be sure conda cli will be available:

source ~/.bashrc

Sometimes if you still face conda: command cannot be found, you can find the installation and source it:

Note: This path assumes you took up the default installation settings. Otherwise, find where you installed it.

source ~/miniconda3/etc/profile.d/conda.sh

Clone this repo

git clone https://github.com/Pints-AI/1.5-Pints.git && \
cd 1.5-Pints

Create conda env

conda create --prefix ./.conda python=3.10 && \
conda activate ./.conda

Note: Stick to Python 3.10. 3.12 breaks a lot of things as of now (23 Feb 2024), and 3.11 has not been tested.

Install CUDA toolkit

conda install nvidia/label/cuda-12.1.1::cuda-toolkit

Install requirements

pip install -r requirements.txt && \
pip install flash-attn --no-build-isolation && \
pip install -r pretrain/requirements.txt

Note: The pip install for dropout_layer_norm can take up ~30 minutes to build depending on the machine.

Install git-lfs if you haven't

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs

Download the pints-ai Expository Prose pretrain dataset

cd /path/to/dataset_dir
git clone https://huggingface.co/datasets/pints-ai/Expository-Prose-V1

Prepare the dataset

python -m prepare_dataset.standard_parquet \
--source_path /path/to/dataset_dir  \
--train_val_split_ratio 0.9 \
--max_cores 60 \
--destination_path /path/to/output_dir

Refer to prepare_dataset folder for the dataset preparation scripts.

max_cores is not required if you don't OOM on high core machines.

Train the model

Lightning

fabric run model \
--accelerator=cuda \
--devices=8 \
pretrain/main.py \
--data_dir data/output \
--out_dir ../1.5-pints \
--gpus 8 \
--global_batch_size 512 \
--learning_rate 4e-4 \
--micro_batch_size 8 \
--max_step 96180 \
--warmup_steps 2000 \
--weight_decay 0.1 \
--beta1 0.9 \
--beta2 0.95 \
--grad_clip 1.0  \
--min_lr 4e-5 \
--model_name 1.5-Pints-2k \
--wandb_name <run_name> \
--wandb_project <project_name> \
--tokenizer_dir tokenizer/pints

Note1: --devices and --gpus must be the same. See pretrain.py's setup arguments for all parameters that you can adjust.

Note2: Select the architecture (layers/dimensions/heads) configuration using --model_name. This must be in lit_gpt/config.py.

Note3: Select a micro_batch_size to optimize GPU memory. So far once started, it remains stable, even during validation. micro_batch_size need not be a number that batch_size is divisible by. batch_size is derived from global_batch_size / devices.

Note4: Modify TRAIN_DATA_CONFIG in pretrain/main.py to decide on the datasets used for training. Ensure that the dataset is prepared beforehand.

Wandb

If you are asked for the wandb API key, you can login and get from: https://wandb.ai/authorize

Finetune the model

cd finetune && \
pip install -r requirements.txt

Convert pretrained weights

Before you start finetuning, you need to convert the pretrain weights:

python convert/convert_pretrained_checkpoint.py --checkpoint_dir path/to/checkpoint --output_dir path/to/output

lightning run \
--accelerator=cuda \
--devices=8 \
finetune/full.py \
--checkpoint_dir <path to lit_model.pth> \
--out_dir ~/1.5-pints-2k/ep2/step-00045000/finetuned \
--model_name 1.5-Pints-2k \
--gpus 8 \
--train.save_interval 6000 \
--train.global_batch_size 512 \
--train.micro_batch_size 8 \
--train.lr_warmup_steps 1125 \
--train.epoch 5 \
--train.learning_rate 2e-5 \
--train.max_seq_length 2048 \
--train.beta1 0.9 \
--train.beta2 0.95 \
--train.weight_decay 0.1 \
--logger_name wandb \
--tokenizer_dir tokenizer/pints \
--known_data_max_seq_length 2048 \
--wandb_project <project name>

Run Direct Preference Optimization (DPO)

DPO is opted for use post-finetuning. See here for the execution process.

Evaluate the model

See here

Use the model

Convert lit models to HuggingFace models (pytorch and safetensors)

python convert_lit_to_hf.py \
--checkpoint_name lit_model.pth \
--directory ../models/1.5-pints \
--model_name 1.5-Pints-2k \
--output_config=True \
--safetensors=True \
--delete_pytorch_model=True

Note: We found better success using the safetensors file. Therefore it's recommended to use it instead of pytorch_model.bin.

Use it with huggingface

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/model/path")
model = AutoModelForCausalLM.from_pretrained("/model/path")

prompt = '''<|im_start|>user
Do not go gentle into that good night.<|im_end|>
<|im_start|>assistant
'''

tokenized_input = tokenizer.encode(prompt)
tokenized_output = model.generate(tokenized_input)
print(tokenizer.decode(tokenized_output))

Code Testing

This codebase comes with tests. If you need to make modifications, you can run the tests to ensure your modifications did not disrupt the existing code.

Install test requirements:

pip install -r requirements.test.txt

Run pytest:

python -m pytest --verbose

For Tasks:

Click tags to check more tools for each tasks

train the model finetune the model evaluate the model convert lit models run direct preference optimization (dpo)

For Jobs:

ai researcher machine learning engineer data scientist natural language processing specialist research scientist

Alternative AI tools for 1.5-Pints

Similar Open Source Tools

1.5-Pints

github

: 230

ChatSim

ChatSim is a tool designed for editable scene simulation for autonomous driving via LLM-Agent collaboration. It provides functionalities for setting up the environment, installing necessary dependencies like McNeRF and Inpainting tools, and preparing data for simulation. Users can train models, simulate scenes, and track trajectories for smoother and more realistic results. The tool integrates with Blender software and offers options for training McNeRF models and McLight's skydome estimation network. It also includes a trajectory tracking module for improved trajectory tracking. ChatSim aims to facilitate the simulation of autonomous driving scenarios with collaborative LLM-Agents.

github

: 284

llm-rankers

llm-rankers is a repository that provides implementations for Pointwise, Listwise, Pairwise, and Setwise Document Ranking using Large Language Models. It includes various methods for reranking documents retrieved by a first-stage retriever, such as BM25. The repository offers examples and code snippets for using LLMs to improve document ranking performance in information retrieval tasks. Additionally, it introduces a new setwise reranker called Rank-R1 with reasoning ability.

github

: 111

mods

AI for the command line, built for pipelines. LLM based AI is really good at interpreting the output of commands and returning the results in CLI friendly text formats like Markdown. Mods is a simple tool that makes it super easy to use AI on the command line and in your pipelines. Mods works with OpenAI, Groq, Azure OpenAI, and LocalAI To get started, install Mods and check out some of the examples below. Since Mods has built-in Markdown formatting, you may also want to grab Glow to give the output some _pizzazz_.

github

: 3.4k

code2prompt

code2prompt is a command-line tool that converts your codebase into a single LLM prompt with a source tree, prompt templating, and token counting. It automates generating LLM prompts from codebases of any size, customizing prompt generation with Handlebars templates, respecting .gitignore, filtering and excluding files using glob patterns, displaying token count, including Git diff output, copying prompt to clipboard, saving prompt to an output file, excluding files and folders, adding line numbers to source code blocks, and more. It helps streamline the process of creating LLM prompts for code analysis, generation, and other tasks.

github

: 5.1k

hash

HASH is a self-building, open-source database which grows, structures and checks itself. With it, we're creating a platform for decision-making, which helps you integrate, understand and use data in a variety of different ways.

github

: 1.2k

moly

Moly is an AI LLM client written in Rust, showcasing the capabilities of the Makepad UI toolkit and Project Robius, a framework for multi-platform application development in Rust. It is currently in beta, allowing users to build and run Moly on macOS, Linux, and Windows. The tool provides packaging support for different platforms, such as `.app`, `.dmg`, `.deb`, AppImage, pacman, and `.exe` (NSIS). Users can easily set up WasmEdge using `moly-runner` and leverage `cargo` commands to build and run Moly. Additionally, Moly offers pre-built releases for download and supports packaging for distribution on Linux, Windows, and macOS.

github

: 197

ML-Bench

ML-Bench is a tool designed to evaluate large language models and agents for machine learning tasks on repository-level code. It provides functionalities for data preparation, environment setup, usage, API calling, open source model fine-tuning, and inference. Users can clone the repository, load datasets, run ML-LLM-Bench, prepare data, fine-tune models, and perform inference tasks. The tool aims to facilitate the evaluation of language models and agents in the context of machine learning tasks on code repositories.

github

: 344

ChatDBG

ChatDBG is an AI-based debugging assistant for C/C++/Python/Rust code that integrates large language models into a standard debugger (`pdb`, `lldb`, `gdb`, and `windbg`) to help debug your code. With ChatDBG, you can engage in a dialog with your debugger, asking open-ended questions about your program, like `why is x null?`. ChatDBG will _take the wheel_ and steer the debugger to answer your queries. ChatDBG can provide error diagnoses and suggest fixes. As far as we are aware, ChatDBG is the _first_ debugger to automatically perform root cause analysis and to provide suggested fixes.

github

: 825

aio-theme

github

: 71

bilingual_book_maker

The bilingual_book_maker is an AI translation tool that uses ChatGPT to assist users in creating multi-language versions of epub/txt/srt files and books. It supports various models like gpt-4, gpt-3.5-turbo, claude-2, palm, llama-2, azure-openai, command-nightly, and gemini. Users need ChatGPT or OpenAI token, epub/txt books, internet access, and Python 3.8+. The tool provides options to specify OpenAI API key, model selection, target language, proxy server, context addition, translation style, and more. It generates bilingual books in epub format after translation. Users can test translations, set batch size, tweak prompts, and use different models like DeepL, Google Gemini, Tencent TranSmart, and more. The tool also supports retranslation, translating specific tags, and e-reader type specification. Docker usage is available for easy setup.

github

: 7.8k

log10

Log10 is a one-line Python integration to manage your LLM data. It helps you log both closed and open-source LLM calls, compare and identify the best models and prompts, store feedback for fine-tuning, collect performance metrics such as latency and usage, and perform analytics and monitor compliance for LLM powered applications. Log10 offers various integration methods, including a python LLM library wrapper, the Log10 LLM abstraction, and callbacks, to facilitate its use in both existing production environments and new projects. Pick the one that works best for you. Log10 also provides a copilot that can help you with suggestions on how to optimize your prompt, and a feedback feature that allows you to add feedback to your completions. Additionally, Log10 provides prompt provenance, session tracking and call stack functionality to help debug prompt chains. With Log10, you can use your data and feedback from users to fine-tune custom models with RLHF, and build and deploy more reliable, accurate and efficient self-hosted models. Log10 also supports collaboration, allowing you to create flexible groups to share and collaborate over all of the above features.

github

: 96

iree-amd-aie

This repository contains an early-phase IREE compiler and runtime plugin for interfacing the AMD AIE accelerator to IREE. It provides architectural overview, developer setup instructions, building guidelines, and runtime driver setup details. The repository focuses on enabling the integration of the AMD AIE accelerator with IREE, offering developers the tools and resources needed to build and run applications leveraging this technology.

github

: 87

MindSearch

MindSearch is an open-source AI Search Engine Framework that mimics human minds to provide deep AI search capabilities. It allows users to deploy their own search engine using either close-source or open-source language models. MindSearch offers features such as answering any question using web knowledge, in-depth knowledge discovery, detailed solution paths, optimized UI experience, and dynamic graph construction process.

github

: 4.8k

tiledesk-dashboard

Tiledesk is an open-source live chat platform with integrated chatbots written in Node.js and Express. It is designed to be a multi-channel platform for web, Android, and iOS, and it can be used to increase sales or provide post-sales customer service. Tiledesk's chatbot technology allows for automation of conversations, and it also provides APIs and webhooks for connecting external applications. Additionally, it offers a marketplace for apps and features such as CRM, ticketing, and data export.

github

: 258

moxin

Moxin is an AI LLM client written in Rust to demonstrate the functionality of the Robius framework for multi-platform application development. It is currently in early stages of development and not fully functional. The tool supports building and running on macOS and Linux systems, with packaging options available for distribution. Users can install the required WasmEdge WASM runtime and dependencies to build and run Moxin. Packaging for distribution includes generating `.deb` Debian packages, AppImage, and pacman installation packages for Linux, as well as `.app` bundles and `.dmg` disk images for macOS. The macOS app is not signed, leading to a warning on installation, which can be resolved by removing the quarantine attribute from the installed app.

github

: 136

For similar tasks

1.5-Pints

github

: 230

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 855

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.3k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 30.6k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675