EAGLE
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Stars: 395
Eagle is a family of Vision-Centric High-Resolution Multimodal LLMs that enhance multimodal LLM perception using a mix of vision encoders and various input resolutions. The model features a channel-concatenation-based fusion for vision experts with different architectures and knowledge, supporting up to over 1K input resolution. It excels in resolution-sensitive tasks like optical character recognition and document understanding.
README:
📢[arXiv] [HuggingFace] [Demo] [Model Zoo] [Data]
Eagle is a family of Vision-Centric High-Resolution Multimodal LLMs. It presents a thorough exploration to strengthen multimodal LLM perception with a mixture of vision encoders and different input resolutions. The model contains a channel-concatenation-based "CLIP+X" fusion for vision experts with different architectures (ViT/ConvNets) and knowledge (detection/segmentation/OCR/SSL). The resulting family of Eagle models support up to over 1K input resolution and obtain strong results on multimodal LLM benchmarks, especially resolution-sensitive tasks such as optical character recognition and document understanding.
- [TODO] Vision encoder model weights with pre-alignment.
- [2024/09] Our Eagle-8B-Plus and Eagle-34B-Plus are released on huggingface.
- [2024/09] Evaluation code Here.
- [2024/08] Release the inference and training code of Eagle.
- [2024/08] Release the SFT data of Eagle Eagle-SFT-1.8M.
- [2024/08] The online demo of Eagle-X5-13B-Chat is available at Demo.
- [2024/06] 🏆 Winning the 2nd Place in CVPR24 Challenge on Driving with Language Website
- [2024/05] Serving as the 2D VLM pre-training for OmniDrive.
- Models & Performance
- Visual Examples
- Install
- Training Data
- Checkpoint Preparation
- Training
- Evaluation
- Gradio Demo
Here is the model trained on our organized 1.8M supervised fine-tuning data.
Model | LLM | Pretrain | SFT | GQA | MME | MMMU(Val) | OCR | SQA(I) | POPE | TextVQA | InfoVQA | VizWiz | SEED(I) | VQAv2 | MathVista | MMBench | ChartQA | DocVQA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Eagle-X4-7B | Vicuna-7B | LLaVA-v1.5 | 1.8M | 64.8 | 1561 | 34.9 | 540 | 70.5 | 88.4 | 70.9 | 47.4 | 50.8 | 73.4 | 83.4 | 37.3 | 67.8 | 67.5 | 78.8 |
Eagle-X5-7B | Vicuna-7B | LLaVA-v1.5 | 1.8M | 64.9 | 1528 | 36.3 | 529 | 69.8 | 88.8 | 71.2 | 47.4 | 54.4 | 73.9 | 83.4 | 37.0 | 68.4 | 67.8 | 78.6 |
Eagle-X4-13B | Vicuna-13B | LLaVA-v1.5 | 1.8M | 66.3 | 1627 | 36.9 | 561 | 73.1 | 87.7 | 73.9 | 50.7 | 56.2 | 74.4 | 83.8 | 37.6 | 69.9 | 70.5 | 79.9 |
Eagle-X5-13B | Vicuna-13B | LLaVA-v1.5 | 1.8M | 66.2 | 1609 | 36.6 | 574 | 72.8 | 87.8 | 74.2 | 51.8 | 59.3 | 74.1 | 83.8 | 38.8 | 69.2 | 69.9 | 79.4 |
Knowledge | General | Document | Vision | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLM | Model | Avg | SQA(I) | MMMU(Val) | MathVista | AI2D | Avg | MME | MMB | SEED | GQA | Avg | ChartQA | OCR | TextVQA | DocVQA | Avg | MMVP | RWQA |
Llama 3-8B | Mini-Gemini-HD | 55.7 | 75.1 | 37.3 | 37 | 73.5 | 72.7 | 1606 | 72.7 | 73.2 | 64.5 | 62.9 | 59.1 | 47.7 | 70.2 | 74.6 | 40.4 | 18.7 | 62.1 |
LLaVA-NeXT | 55.6 | 72.8 | 41.7 | 36.3 | 71.6 | 72.5 | 1604 | 72.1 | 72.7 | 65.2 | 63.9 | 69.5 | 49.0 | 64.6 | 72.6 | 49.4 | 38.7 | 60.1 | |
Cambrian-1 | 61.3 | 80.4 | 42.7 | 49.0 | 73.0 | 73.1 | 1547 | 75.9 | 74.7 | 64.6 | 71.3 | 73.3 | 62.4 | 71.7 | 77.8 | 57.6 | 51.3 | 64.2 | |
Ealge-X4-8B-Plus | 64.2 | 84.3 | 43.4 | 52.7 | 76.1 | 73.8 | 1559 | 75.9 | 76.3 | 64.9 | 76.6 | 80.1 | 62.6 | 77.1 | 86.6 | 69.1 | 71.6 | 66.5 | |
Vicuna-13B | Mini-Gemini-HD | 54.1 | 71.9 | 37.3 | 37.0 | 70.1 | 70.7 | 1597 | 68.6 | 70.6 | 63.7 | 60.8 | 56.6 | 46.6 | 70.2 | 69.8 | 38.4 | 19.3 | 57.5 |
LLaVA-NeXT | 53.7 | 73.5 | 36.2 | 35.1 | 70.0 | 69.9 | 1575 | 70.0 | 65.6 | 65.4 | 62.9 | 62.2 | 51.4 | 67.1 | 70.9 | 47.6 | 36.0 | 59.1 | |
Cambrian-1 | 60.2 | 79.3 | 40.0 | 48.0 | 73.6 | 73.7 | 1610 | 75.7 | 74.4 | 64.3 | 71.3 | 73.8 | 61.9 | 72.8 | 76.8 | 52.2 | 41.3 | 63.0 | |
Ealge-X4-13B-Plus | 63.0 | 82.0 | 41.0 | 54.4 | 74.0 | 74.6 | 1651 | 75.7 | 74.8 | 65.3 | 75.1 | 77.6 | 61.9 | 75.5 | 85.4 | 61.4 | 58.0 | 64.8 | |
Yi-34B | Mini-Gemini-HD | 62.4 | 77.7 | 48.0 | 43.4 | 80.5 | 76.2 | 1659 | 80.6 | 75.3 | 65.8 | 68.1 | 67.6 | 51.8 | 74.1 | 78.9 | 52.3 | 37.3 | 67.2 |
LLaVA-NeXT | 62.5 | 81.8 | 46.7 | 46.5 | 74.9 | 76.0 | 1633 | 79.3 | 75.9 | 67.1 | 67.7 | 68.7 | 54.5 | 69.5 | 78.1 | 54.2 | 47.3 | 61.0 | |
Cambrian-1 | 67.0 | 85.6 | 49.7 | 53.2 | 79.7 | 76.8 | 1689 | 81.4 | 75.3 | 65.8 | 71.9 | 75.6 | 60.0 | 76.7 | 75.5 | 60.3 | 52.7 | 67.8 | |
Ealge-X5-34B-Plus | 68.6 | 85.5 | 51.8 | 57.9 | 79.1 | 76.3 | 1677 | 81.0 | 75.6 | 64.9 | 75.4 | 77.2 | 62.4 | 78.8 | 83.0 | 68.3 | 67.0 | 69.5 |
Please following the guide here to prepare the environment on Linux OS.
- Clone this repository
git clone https://github.com/NVlabs/EAGLE.git
cd Eagle
- Create environment and install package
conda create -n eagle python=3.10 -y
conda activate eagle
pip install --upgrade pip # enable PEP 660 support
pip install -r requirements.txt
pip install .
- Install additional packages for training cases
pip install flash-attn --no-build-isolation
If you have any questions about the environment setup, please follow the instruction video.
We use the same pretraining data as LLaVA v1.5, please download the data from here.
We have compiled all the data and images used in our supervised fine-tuning together. Please download the data from here. After cloning this dataset, please run the following commands to extract all the images:
cd Eagle-1.8M
cat images.tar.part_* > images.tar.gz
tar -xvf images.tar.gz
Please note that while the images have been packaged for convenience, the original dataset licenses remain unchanged. By downloading our data, you agree to the licensing terms of each source dataset. A detailed list of the data sources used in our fine-tuning data mixture is provided below:
Version | Dataset Name | Sample Number | Note |
---|---|---|---|
LLaVA v1.5 | 665k | Multi-modal conversation | |
DocVQA | 39k | Document understanding | |
synDog-EN | 50k | OCR | |
ChartQA | 28k | Chart understanding | |
DVQA | 25k | Chart understanding | |
AI2D | 15k | Open-Hermes 2.5 | |
ShareGPT-4V | 100k | Detailed caption generated by GPT-4V | |
laion-GPT4V | 11k | Detailed caption generated by GPT-4V | |
LVIS-Instruct4V | 220k | Multi-modal conversation | |
LRV-Instruct | 150k | Multi-modal conversation | |
Geo170k | 120k | Math | |
LLaVAR | 20k | OCR | |
Visual7W | 70k | Visual Question Answering | |
Open-Hermes 2.5 | 300k | Text | |
Initial Version | Total | 1.8M |
Please provide the pretrained model weights for EVA-02 vision tower pretrained on detection task. You can download the checkpoint here and place it in the checkpoints/pretrained_models/
directory.
The weights of other models, including Vicuna, Segment Anything Model, Pix2Struct, ConvNeXt, and CLIP will be automatically downloaded from huggingface during the first run.
The training process for Eagle follows a standard two-stage approach: pretraining and supervised fine-tuning. In the first stage, only the projector's weights are updated. In the second stage, all parameters are fine-tuned. The batch sizes for the pretraining and fine-tuning stages are 256 and 128, respectively. All settings and hyperparameters are identical to those in LLaVA-v1.5 except that we will unfrozen the vision tower's parameters during the second stage.
In default we use 32 NVIDIA A100 80G GPU to conduct the training. Please modify the per_device_train_batch_size
and gradient_accumulation_steps
if you are using different amount of GPUs.
If you are using a slurm cluster, please use the following command to submit a job.
srun \
--partition $your_partition \
--gres "gpu:8" \
--ntasks_per_node 1 \
-N 4 \
--job-name $RUN_NAME \
"bash $CMD $RUN_NAME"
You can specify the RUN_NAME
and CMD
variables to run different models according to the following table:
Model | Language Model | Script |
---|---|---|
Eagle-X4 | Vicuna-7B | scripts/pretrain-eagle-x4-vicuna-7b.sh |
Eagle-X4 | Vicuna-13B | scripts/pretrain-eagle-x4-vicuna-13b.sh |
Eagle-X5 | Vicuna-7B | scripts/pretrain-eagle-x5-vicuna-7b.sh |
Eagle-X5 | Vicuna-13B | scripts/pretrain-eagle-x5-vicuna-13b.sh |
Remember to set the $PATH_TO_PRETRAINING_DATA
in each script to the downloaded pretraining data. After you have complete the pretraining, you will get a file named mm_projector.bin
in the checkpoint folder.
After pretraining is complete, a projector weight file `` will be saved in the checkpoint directory. Please set the $PATH_TO_PRETRAINED_PROJECTOR
to the path of this projector weights.
You can use the same sumbit code as the pretraining, and use the script in the following table to launch the supervised fine-tuning.
Model | Language Model | Script |
---|---|---|
Eagle-X4 | Vicuna-7B | scripts/finetune-eagle-x4-vicuna-7b-1.8m.sh |
Eagle-X4 | Vicuna-13B | scripts/finetune-eagle-x4-vicuna-13b-1.8m.sh |
Eagle-X5 | Vicuna-7B | scripts/finetune-eagle-x5-vicuna-7b-1.8m.sh |
Eagle-X5 | Vicuna-13B | scripts/finetune-eagle-x5-vicuna-13b-1.8m.sh |
Before submit the job, you should correctly set the $PATH_TO_SFT_DATA
and $PATH_TO_PRETRAINED_PROJECTOR
in each script.
If you have limited GPU resources or memory, please considering the following:
- use
scripts/zero3.json
orscripts/zero3_offload.json
as the Deepspeed training config instead of the defaultzero2.json
- use gradient accumulation and reduce the per-device batch size
We evaluate MME, MMBench, SEED, MathVista, POPE, ScienceQA, GQA, OCRBench, TextVQA, and ChartQA using LMMs-Eval. For better reproducibility, we have included the specific version we used in this repository. Please follow their guidelines and use the following commands to perform the evaluation:
bash scripts/eval_lmms_eval/eval-mme-seed-mmmu-pope-sqa-gqa-ocrbench-textvqa-chartqa.sh $REPO_ID_OR_LOCAL_PATH $MODEL_NAME $CONV_MODE
# MODEL_NAME can be any name, just to dinstinguish different runs.
# CONV_MODE should be the name of the conversation template during triaining, i.e., "vicuna_v1" for Vicuna, "llama3" for Llama3, and "yi_34b_chatml_direct" for Yi-34B.
We set up an online demo here. You can also run this demo on your own machine by running:
python gradio_demo.py \
--model-path ${MODEL_CKPT}
--conv-mode vicuna_v1
Our inference code is here. You can set you own 'image_path' here and 'question' here.
import os
import torch
import numpy as np
from eagle import conversation as conversation_lib
from eagle.constants import DEFAULT_IMAGE_TOKEN
from eagle.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from eagle.conversation import conv_templates, SeparatorStyle
from eagle.model.builder import load_pretrained_model
from eagle.utils import disable_torch_init
from eagle.mm_utils import tokenizer_image_token, get_model_name_from_path, process_images, KeywordsStoppingCriteria
from PIL import Image
import argparse
from transformers import TextIteratorStreamer
from threading import Thread
model_path = "NVEagle/Eagle-X5-13B-Chat"
conv_mode = "vicuna_v1"
image_path = "assets/georgia-tech.jpeg"
input_prompt = "Describe this image."
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path,None,model_name,False,False)
if model.config.mm_use_im_start_end:
input_prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + input_prompt
else:
input_prompt = DEFAULT_IMAGE_TOKEN + '\n' + input_prompt
conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], input_prompt)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
image = Image.open(image_path).convert('RGB')
image_tensor = process_images([image], image_processor, model.config)[0]
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.to(device='cuda', non_blocking=True)
image_tensor = image_tensor.to(dtype=torch.float16, device='cuda', non_blocking=True)
with torch.inference_mode():
output_ids = model.generate(
input_ids.unsqueeze(0),
images=image_tensor.unsqueeze(0),
image_sizes=[image.size],
do_sample=True,
temperature=0.2,
top_p=0.5,
num_beams=1,
max_new_tokens=256,
use_cache=True)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"Image:{image_path} \nPrompt:{input_prompt} \nOutput:{outputs}")
If you find this project useful, please cite our work:
@article{shi2024eagle,
title = {Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders},
author={Min Shi and Fuxiao Liu and Shihao Wang and Shijia Liao and Subhashree Radhakrishnan and De-An Huang and Hongxu Yin and Karan Sapra and Yaser Yacoob and Humphrey Shi and Bryan Catanzaro and Andrew Tao and Jan Kautz and Zhiding Yu and Guilin Liu},
journal={arXiv:2408.15998},
year={2024}
}
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for Llama-2, Llama-3, and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
- LLaVA: the codebase we built upon. Thanks for the great pioneer open-source project!
- LMMs-Eval: many thanks to the LMMs-Lab for their wonderful and easy-to-use evaluation tools!
- LLaVA-HR: we borrow some code on flexible input CLIP encoder from LLaVA-HR!
- Cambrian-1: thanks Cambrian project contributors for their efforts in organizing open-source data for us!
- Thanks to the VILA team and the RADIO team for their helps and discussions. Check out these awesome works from NVIDIA!
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for EAGLE
Similar Open Source Tools
EAGLE
Eagle is a family of Vision-Centric High-Resolution Multimodal LLMs that enhance multimodal LLM perception using a mix of vision encoders and various input resolutions. The model features a channel-concatenation-based fusion for vision experts with different architectures and knowledge, supporting up to over 1K input resolution. It excels in resolution-sensitive tasks like optical character recognition and document understanding.
CogVLM2
CogVLM2 is a new generation of open source models that offer significant improvements in benchmarks such as TextVQA and DocVQA. It supports 8K content length, image resolution up to 1344 * 1344, and both Chinese and English languages. The project provides basic calling methods, fine-tuning examples, and OpenAI API format calling examples to help developers quickly get started with the model.
goodai-ltm-benchmark
This repository contains code and data for replicating experiments on Long-Term Memory (LTM) abilities of conversational agents. It includes a benchmark for testing agents' memory performance over long conversations, evaluating tasks requiring dynamic memory upkeep and information integration. The repository supports various models, datasets, and configurations for benchmarking and reporting results.
oss-fuzz-gen
This framework generates fuzz targets for real-world `C`/`C++` projects with various Large Language Models (LLM) and benchmarks them via the `OSS-Fuzz` platform. It manages to successfully leverage LLMs to generate valid fuzz targets (which generate non-zero coverage increase) for 160 C/C++ projects. The maximum line coverage increase is 29% from the existing human-written targets.
TRACE
TRACE is a temporal grounding video model that utilizes causal event modeling to capture videos' inherent structure. It presents a task-interleaved video LLM model tailored for sequential encoding/decoding of timestamps, salient scores, and textual captions. The project includes various model checkpoints for different stages and fine-tuning on specific datasets. It provides evaluation codes for different tasks like VTG, MVBench, and VideoMME. The repository also offers annotation files and links to raw videos preparation projects. Users can train the model on different tasks and evaluate the performance based on metrics like CIDER, METEOR, SODA_c, F1, mAP, Hit@1, etc. TRACE has been enhanced with trace-retrieval and trace-uni models, showing improved performance on dense video captioning and general video understanding tasks.
nntrainer
NNtrainer is a software framework for training neural network models on devices with limited resources. It enables on-device fine-tuning of neural networks using user data for personalization. NNtrainer supports various machine learning algorithms and provides examples for tasks such as few-shot learning, ResNet, VGG, and product rating. It is optimized for embedded devices and utilizes CBLAS and CUBLAS for accelerated calculations. NNtrainer is open source and released under the Apache License version 2.0.
llm-deploy
LLM-Deploy focuses on the theory and practice of model/LLM reasoning and deployment, aiming to be your partner in mastering the art of LLM reasoning and deployment. Whether you are a newcomer to this field or a senior professional seeking to deepen your skills, you can find the key path to successfully deploy large language models here. The project covers reasoning and deployment theories, model and service optimization practices, and outputs from experienced engineers. It serves as a valuable resource for algorithm engineers and individuals interested in reasoning deployment.
llm-export
llm-export is a tool for exporting llm models to onnx and mnn formats. It has features such as passing onnxruntime correctness tests, optimizing the original code to support dynamic shapes, reducing constant parts, optimizing onnx models using OnnxSlim for performance improvement, and exporting lora weights to onnx and mnn formats. Users can clone the project locally, clone the desired LLM project locally, and use LLMExporter to export the model. The tool supports various export options like exporting the entire model as one onnx model, exporting model segments as multiple models, exporting model vocabulary to a text file, exporting specific model layers like Embedding and lm_head, testing the model with queries, validating onnx model consistency with onnxruntime, converting onnx models to mnn models, and more. Users can specify export paths, skip optimization steps, and merge lora weights before exporting.
Model-References
The 'Model-References' repository contains examples for training and inference using Intel Gaudi AI Accelerator. It includes models for computer vision, natural language processing, audio, generative models, MLPerf™ training, and MLPerf™ inference. The repository provides performance data and model validation information for various frameworks like PyTorch. Users can find examples of popular models like ResNet, BERT, and Stable Diffusion optimized for Intel Gaudi AI accelerator.
openinference
OpenInference is a set of conventions and plugins that complement OpenTelemetry to enable tracing of AI applications. It provides a way to capture and analyze the performance and behavior of AI models, including their interactions with other components of the application. OpenInference is designed to be language-agnostic and can be used with any OpenTelemetry-compatible backend. It includes a set of instrumentations for popular machine learning SDKs and frameworks, making it easy to add tracing to your AI applications.
kangaroo
Kangaroo is an AI-powered SQL client and admin tool for popular databases like SQLite, MySQL, PostgreSQL, etc. It supports various functionalities such as table design, query, model, sync, export/import, and more. The tool is designed to be comfortable, fun, and developer-friendly, with features like code intellisense and autocomplete. Kangaroo aims to provide a seamless experience for database management across different operating systems.
open-llms
Open LLMs is a repository containing various Large Language Models licensed for commercial use. It includes models like T5, GPT-NeoX, UL2, Bloom, Cerebras-GPT, Pythia, Dolly, and more. These models are designed for tasks such as transfer learning, language understanding, chatbot development, code generation, and more. The repository provides information on release dates, checkpoints, papers/blogs, parameters, context length, and licenses for each model. Contributions to the repository are welcome, and it serves as a resource for exploring the capabilities of different language models.
LLamaTuner
LLamaTuner is a repository for the Efficient Finetuning of Quantized LLMs project, focusing on building and sharing instruction-following Chinese baichuan-7b/LLaMA/Pythia/GLM model tuning methods. The project enables training on a single Nvidia RTX-2080TI and RTX-3090 for multi-round chatbot training. It utilizes bitsandbytes for quantization and is integrated with Huggingface's PEFT and transformers libraries. The repository supports various models, training approaches, and datasets for supervised fine-tuning, LoRA, QLoRA, and more. It also provides tools for data preprocessing and offers models in the Hugging Face model hub for inference and finetuning. The project is licensed under Apache 2.0 and acknowledges contributions from various open-source contributors.
sane-airscan
sane-airscan is a SANE backend that supports driverless scanning using Apple AirScan (eSCL) and Microsoft WSD protocols. It automatically chooses between the two protocols and has been tested with various devices from Brother, Canon, Dell, Kyocera, Lexmark, Epson, HP, OKI, Panasonic, Pantum, Ricoh, Samsung, and Xerox. The backend allows for automatic and manual device discovery and configuration, supports scanning from platen and ADF in color and grayscale modes, and works with both IPv4 and IPv6. It does not require installation and does not conflict with vendor-provided proprietary software.
Awesome-LLM-Safety
Welcome to our Awesome-llm-safety repository! We've curated a collection of the latest, most comprehensive, and most valuable resources on large language model safety (llm-safety). But we don't stop there; included are also relevant talks, tutorials, conferences, news, and articles. Our repository is constantly updated to ensure you have the most current information at your fingertips.
auto-dev-vscode
AutoDev for VSCode is an AI-powered coding wizard with multilingual support, auto code generation, and a bug-slaying assistant. It offers customizable prompts and features like Auto Dev/Testing/Document/Agent. The tool aims to enhance coding productivity and efficiency by providing intelligent assistance and automation capabilities within the Visual Studio Code environment.
For similar tasks
document-ai-samples
The Google Cloud Document AI Samples repository contains code samples and Community Samples demonstrating how to analyze, classify, and search documents using Google Cloud Document AI. It includes various projects showcasing different functionalities such as integrating with Google Drive, processing documents using Python, content moderation with Dialogflow CX, fraud detection, language extraction, paper summarization, tax processing pipeline, and more. The repository also provides access to test document files stored in a publicly-accessible Google Cloud Storage Bucket. Additionally, there are codelabs available for optical character recognition (OCR), form parsing, specialized processors, and managing Document AI processors. Community samples, like the PDF Annotator Sample, are also included. Contributions are welcome, and users can seek help or report issues through the repository's issues page. Please note that this repository is not an officially supported Google product and is intended for demonstrative purposes only.
step-free-api
The StepChat Free service provides high-speed streaming output, multi-turn dialogue support, online search support, long document interpretation, and image parsing. It offers zero-configuration deployment, multi-token support, and automatic session trace cleaning. It is fully compatible with the ChatGPT interface. Additionally, it provides seven other free APIs for various services. The repository includes a disclaimer about using reverse APIs and encourages users to avoid commercial use to prevent service pressure on the official platform. It offers online testing links, showcases different demos, and provides deployment guides for Docker, Docker-compose, Render, Vercel, and native deployments. The repository also includes information on using multiple accounts, optimizing Nginx reverse proxy, and checking the liveliness of refresh tokens.
unilm
The 'unilm' repository is a collection of tools, models, and architectures for Foundation Models and General AI, focusing on tasks such as NLP, MT, Speech, Document AI, and Multimodal AI. It includes various pre-trained models, such as UniLM, InfoXLM, DeltaLM, MiniLM, AdaLM, BEiT, LayoutLM, WavLM, VALL-E, and more, designed for tasks like language understanding, generation, translation, vision, speech, and multimodal processing. The repository also features toolkits like s2s-ft for sequence-to-sequence fine-tuning and Aggressive Decoding for efficient sequence-to-sequence decoding. Additionally, it offers applications like TrOCR for OCR, LayoutReader for reading order detection, and XLM-T for multilingual NMT.
searchGPT
searchGPT is an open-source project that aims to build a search engine based on Large Language Model (LLM) technology to provide natural language answers. It supports web search with real-time results, file content search, and semantic search from sources like the Internet. The tool integrates LLM technologies such as OpenAI and GooseAI, and offers an easy-to-use frontend user interface. The project is designed to provide grounded answers by referencing real-time factual information, addressing the limitations of LLM's training data. Contributions, especially from frontend developers, are welcome under the MIT License.
LLMs-at-DoD
This repository contains tutorials for using Large Language Models (LLMs) in the U.S. Department of Defense. The tutorials utilize open-source frameworks and LLMs, allowing users to run them in their own cloud environments. The repository is maintained by the Defense Digital Service and welcomes contributions from users.
LARS
LARS is an application that enables users to run Large Language Models (LLMs) locally on their devices, upload their own documents, and engage in conversations where the LLM grounds its responses with the uploaded content. The application focuses on Retrieval Augmented Generation (RAG) to increase accuracy and reduce AI-generated inaccuracies. LARS provides advanced citations, supports various file formats, allows follow-up questions, provides full chat history, and offers customization options for LLM settings. Users can force enable or disable RAG, change system prompts, and tweak advanced LLM settings. The application also supports GPU-accelerated inferencing, multiple embedding models, and text extraction methods. LARS is open-source and aims to be the ultimate RAG-centric LLM application.
EAGLE
Eagle is a family of Vision-Centric High-Resolution Multimodal LLMs that enhance multimodal LLM perception using a mix of vision encoders and various input resolutions. The model features a channel-concatenation-based fusion for vision experts with different architectures and knowledge, supporting up to over 1K input resolution. It excels in resolution-sensitive tasks like optical character recognition and document understanding.
erag
ERAG is an advanced system that combines lexical, semantic, text, and knowledge graph searches with conversation context to provide accurate and contextually relevant responses. This tool processes various document types, creates embeddings, builds knowledge graphs, and uses this information to answer user queries intelligently. It includes modules for interacting with web content, GitHub repositories, and performing exploratory data analysis using various language models.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.