InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) based on InternLM2-7B, excelling in free-form text-image composition and comprehension. It boasts several amazing capabilities and applications:
* **Free-form Interleaved Text-Image Composition**: InternLM-XComposer2 can effortlessly generate coherent and contextual articles with interleaved images following diverse inputs like outlines, detailed text requirements, and reference images, enabling highly customizable content creation.
* **Accurate Vision-language Problem-solving**: InternLM-XComposer2 accurately handles diverse and challenging vision-language Q&A tasks based on free-form instructions, excelling in recognition, perception, detailed captioning, visual reasoning, and more.
* **Awesome performance**: InternLM-XComposer2 based on InternLM2-7B not only significantly outperforms existing open-source multimodal models in 13 benchmarks but also **matches or even surpasses GPT-4V and Gemini Pro in 6 benchmarks**.

We release the InternLM-XComposer2 series in four versions:
* **InternLM-XComposer2-4KHD-7B** 🤗: The high-resolution multi-task trained VLLM with InternLM-7B as the initialization of the LLM, for _High-resolution understanding_, _VL benchmarks_, and _AI assistant_ use.
* **InternLM-XComposer2-VL-7B** 🤗: The multi-task trained VLLM with InternLM-7B as the initialization of the LLM, for _VL benchmarks_ and _AI assistant_ use. **It ranks as the most powerful vision-language model based on 7B-parameter-level LLMs, leading across 13 benchmarks.**
* **InternLM-XComposer2-VL-1.8B** 🤗: A lightweight version of InternLM-XComposer2-VL based on InternLM-1.8B.
* **InternLM-XComposer2-7B** 🤗: The further instruction-tuned VLLM for _Interleaved Text-Image Composition_ with free-form inputs.

Please refer to the Technical Report and the 4KHD Technical Report for more details.
InternLM-XComposer-2.5
Thanks to the community for the HuggingFace Demo | OpenXLab Demo of InternLM-XComposer-2.5.
Join us on Discord and WeChat.
We release InternLM-XComposer2.5-OmniLive, a comprehensive multimodal system for long-term streaming video and audio interactions. Please refer to the project page for details.
InternLM-XComposer-2.5-OmniLive: A Specialized Generalist Multimodal System for Streaming Video and Audio Interactions
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
ShareGPT4V: Improving Large Multi-modal Models with Better Captions
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
InternLM-XComposer-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V-level capabilities with merely a 7B LLM backend. IXC-2.5 is trained with 24K interleaved image-text contexts and can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to perform exceptionally well in tasks requiring extensive input and output contexts.
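RoPE (rotary position embedding) encodes a token's position by rotating pairs of query and key features through angles proportional to that position, so the attention score between two tokens depends only on their relative offset. Because the rotation is defined for any integer position, the same weights can still be applied past the 24K training length. The following is a minimal, generic sketch of standard RoPE in plain Python, assuming the conventional base of 10000; it is not IXC-2.5's implementation and does not show whatever scaling the model may apply when extending to 96K.

```python
import math
from typing import List


def rope_angles(position: int, dim: int, base: float = 10000.0) -> List[float]:
    """Rotation angle for each of the dim/2 feature pairs at `position`.

    Standard RoPE: pair i rotates at frequency base ** (-2i / dim).
    """
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def apply_rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by their RoPE angles."""
    if len(vec) % 2:
        raise ValueError("RoPE needs an even feature dimension")
    out: List[float] = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.5, 0.8]
k = [1.0, 0.4, -0.7, 0.2]
# Same 5-token offset, once at short positions and once far beyond 24K:
near = dot(apply_rope(q, 105), apply_rope(k, 100))
far = dot(apply_rope(q, 90_005), apply_rope(k, 90_000))
print(near, far)  # equal up to floating-point rounding
```

The two scores match because rotating q by angle a and k by angle b leaves their dot product dependent only on b - a; this is the property that lets a RoPE model process positions it never saw during training, although quality at those positions is a separate, empirical question.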
- Ultra-High Resolution Understanding: IXC-2.5 enhances the dynamic resolution solution proposed in IXC2-4KHD with a native 560 × 560 ViT vision encoder, supporting high-resolution images with any aspect ratio.
- Fine-Grained Video Understanding: IXC-2.5 treats videos as an ultra-high-resolution composite picture consisting of tens to hundreds of frames, allowing it to capture fine details through dense sampling and higher resolution for each frame.
- Multi-Turn Multi-Image Dialogue: IXC-2.5 supports free-form multi-turn multi-image dialogue, allowing it to naturally interact with humans in multi-round conversations.
- Webpage Crafting: IXC-2.5 can be readily applied to create webpages by composing source code (HTML, CSS, and JavaScript) following text-image instructions.
- Composing High-Quality Text-Image Articles: IXC-2.5 leverages specially designed Chain-of-Thought (CoT) and Direct Preference Optimization (DPO) techniques to significantly enhance the quality of its written content.
- Awesome performance: IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.
Please refer to Technical Report for more details.
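The dynamic-resolution idea in the first item can be illustrated by planning how many 560 × 560 crops an image of arbitrary aspect ratio is cut into. The rule below is an assumption chosen for illustration: widen the grid one column at a time while the total tile count stays within a budget (`max_tiles`, arbitrarily 24 here), resize the image to that width while keeping its aspect ratio, then pad the height up to whole tiles. The actual IXC2-4KHD / IXC-2.5 preprocessing may differ, for instance by also encoding a global low-resolution view.

```python
from typing import Tuple

TILE = 560  # side length of one ViT crop, per the description above


def plan_tiles(width: int, height: int, max_tiles: int = 24) -> Tuple[int, int, int, int]:
    """Return (cols, rows, resized_w, resized_h) under a hypothetical tiling rule.

    Portrait images are handled by swapping sides so the longer side is the width.
    Integer arithmetic avoids floating-point rounding in the aspect-ratio maths.
    """
    portrait = height > width
    if portrait:
        width, height = height, width

    def rows_for(cols: int) -> int:
        # Tile rows needed for `cols` columns at this aspect ratio: ceil(cols * h / w).
        return -(-cols * height // width)

    cols = 1
    while (cols + 1) * rows_for(cols + 1) <= max_tiles:
        cols += 1

    new_w = cols * TILE
    new_h = new_w * height // width  # keep the aspect ratio (floored to whole pixels)
    rows = -(-new_h // TILE)  # pad the height up to a whole number of tiles

    if portrait:  # swap back so the result describes the original orientation
        return rows, cols, new_h, new_w
    return cols, rows, new_w, new_h


print(plan_tiles(1920, 1080))  # (6, 4, 3360, 1890)
```

A 1920 × 1080 frame therefore becomes a 6 × 4 grid (24 crops) after resizing to 3360 × 1890 and padding to 3360 × 2240, while a portrait image of the same size yields the transposed grid.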
For the best experience, please keep the audio on while enjoying the video.
https://github.com/InternLM/InternLM-XComposer/assets/147793160/8206f07f-3166-461e-a631-9cbcdec6ae75
Please refer to Chinese Demo for the demo of the Chinese version.
- 2024.12.12: InternLM-XComposer2.5-OmniLive-7B is publicly available.
- 2024.07.15: ModelScope Swift supports InternLM-XComposer2.5-7B for finetuning and inference.
- 2024.07.15: LMDeploy supports InternLM-XComposer2.5-7B for 4-bit quantization and inference.
- 2024.07.15: InternLM-XComposer2.5-7B-4bit is publicly available.
- 2024.07.03: InternLM-XComposer2.5-7B is publicly available.
- 2024.07.01: ShareGPT4V is accepted by ECCV 2024.
- 2024.04.22: The finetune code of InternLM-XComposer2-VL-7B-4KHD-7B is publicly available.
- 2024.04.09: InternLM-XComposer2-4KHD-7B and its evaluation code are publicly available.
- 2024.04.09: InternLM-XComposer2-VL-1.8B is publicly available.
- 2024.02.22: We release DualFocus, a framework for integrating macro and micro perspectives within MLLMs to enhance vision-language task performance.
- 2024.02.06: InternLM-XComposer2-7B-4bit and InternLM-XComposer2-VL-7B-4bit are publicly available on Hugging Face and ModelScope.
- 2024.02.02: The finetune code of InternLM-XComposer2-VL-7B is publicly available.
- 2024.01.26: The evaluation code of InternLM-XComposer2-VL-7B is publicly available.
- 2024.01.26: InternLM-XComposer2-7B and InternLM-XComposer2-VL-7B are publicly available on Hugging Face and ModelScope.
- 2024.01.26: We release a technical report with more details on the InternLM-XComposer2 series.
- 2023.11.22: We release ShareGPT4V, a large-scale, highly descriptive image-text dataset generated by GPT4-Vision, and a superior large multimodal model, ShareGPT4V-7B.
- 2023.10.30: InternLM-XComposer-VL achieved the top-1 ranking in both Q-Bench and Tiny LVLM.
- 2023.10.19: Support for inference on multiple GPUs. Two 4090 GPUs are sufficient for deploying our demo.
- 2023.10.12: A 4-bit demo is supported; model files are available on Hugging Face and ModelScope.
- 2023.10.8: InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on ModelScope.
- 2023.9.27: The evaluation code of InternLM-XComposer-VL-7B is publicly available.
- 2023.9.27: InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on Hugging Face.
- 2023.9.27: We release a technical report with more details on our model series.
Model | Usage | Transformers (HF) | ModelScope | Release Date |
---|---|---|---|---|
InternLM-XComposer-2.5 | Video Understanding, Multi-image Multi-turn Dialog, 4K Resolution Understanding, Web Craft, Article creation, Benchmark | 🤗internlm-xcomposer2.5 | internlm-xcomposer2.5 | 2024-07-03 |
InternLM-XComposer2-4KHD | 4K Resolution Understanding, Benchmark, VL-Chat | 🤗internlm-xcomposer2-4khd-7b | internlm-xcomposer2-4khd-7b | 2024-04-09 |
InternLM-XComposer2-VL-1.8B | Benchmark, VL-Chat | 🤗internlm-xcomposer2-vl-1_8b | internlm-xcomposer2-vl-1_8b | 2024-04-09 |
InternLM-XComposer2 | Text-Image Composition | 🤗internlm-xcomposer2-7b | internlm-xcomposer2-7b | 2024-01-26 |
InternLM-XComposer2-VL | Benchmark, VL-Chat | 🤗internlm-xcomposer2-vl-7b | internlm-xcomposer2-vl-7b | 2024-01-26 |
InternLM-XComposer2-4bit | Text-Image Composition | 🤗internlm-xcomposer2-7b-4bit | internlm-xcomposer2-7b-4bit | 2024-02-06 |
InternLM-XComposer2-VL-4bit | Benchmark, VL-Chat | 🤗internlm-xcomposer2-vl-7b-4bit | internlm-xcomposer2-vl-7b-4bit | 2024-02-06 |
InternLM-XComposer | Text-Image Composition, VL-Chat | 🤗internlm-xcomposer-7b | internlm-xcomposer-7b | 2023-09-26 |
InternLM-XComposer-4bit | Text-Image Composition, VL-Chat | 🤗internlm-xcomposer-7b-4bit | internlm-xcomposer-7b-4bit | 2023-09-26 |
InternLM-XComposer-VL | Benchmark | 🤗internlm-xcomposer-vl-7b | internlm-xcomposer-vl-7b | 2023-09-26 |
We evaluate InternLM-XComposer-2.5 on 28 multimodal benchmarks, including the image benchmarks MMDU, MMStar, RealWorldQA, Design2Code, DocVQA, Infographics VQA, TextVQA, ChartQA, OCRBench, DeepForm, WTQ, VisualMRC, TabFact, MathVista, MMMU, AI2D, MME, MMBench, MMBench-CN, SEED-Bench, HallusionBench, and MM-Vet, and the video benchmarks MVBench, MLVU, Video-MME, MMBench-Video, and TempCompass.
See Evaluation Details here.
Compared with closed-source APIs and previous SOTAs on Multi-Image dialog and General Visual QA Benchmarks.
Model | MVBench | MLVU | Video-MME | MMBench-Video | TempCompass | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench | DeepForm | WTQ | VisualMRC | TabFact |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Previous SOTA | VideoChat2 | InternVL1.5 | LIVA | InternVL1.5 | Qwen-VL | InternVL1.5 | InternVL1.5 | InternVL1.5 | InternVL1.5 | GLM-4v | DocOwl 1.5 | DocOwl 1.5 | DocOwl 1.5 | DocOwl 1.5 |
Previous SOTA size | 7B | 26B | 34B | 26B | 7B | 26B | 26B | 26B | 26B | 9B | 8B | 8B | 8B | 8B |
Previous SOTA score | 60.4 | 50.4 | 59.0 | 42.0 | 58.4 | 90.9 | 83.8 | 72.5 | 80.6 | 77.6 | 68.8 | 40.6 | 246.4 | 80.2 |
GPT-4V | 43.5 | 49.2 | 59.9 | 56.0 | --- | 88.4 | 78.5 | 75.1 | 78.0 | 51.6 | --- | --- | --- | --- |
Gemini-Pro | --- | --- | 75.0 | 49.3 | 70.6 | 88.1 | 74.1 | 75.2 | 74.6 | 68.0 | --- | --- | --- | --- |
Ours | 69.1 | 58.8 | 55.8 | 46.9 | 67.1 | 90.9 | 82.2 | 69.9 | 78.2 | 69.0 | 71.2 | 53.6 | 307.5 | 85.2 |
- Python 3.8 and above
- PyTorch 1.12 and above (2.0 and above recommended)
- CUDA 11.4 and above recommended (for GPU users)
- flash-attention2 is required for high-resolution usage of InternLM-XComposer2.5.

Before running the code, make sure you have set up the environment and installed the required packages: check that you meet the requirements above, then install the dependent libraries. Please refer to the installation instructions.
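The minimum versions listed above can also be checked programmatically before installing anything else. This is a small, hypothetical helper (`MINIMUMS`, `parse_version`, and `meets_minimum` are illustrative names, not part of the repository) that compares dotted version strings numerically; it reads the installed versions from `platform.python_version()`, `torch.__version__`, and `torch.version.cuda`, and treats the importable module name `flash_attn` as the sign that flash-attention is installed.

```python
import importlib.util
import platform
import re
from typing import Tuple

# Minimum versions from the requirements list above.
MINIMUMS = {"python": "3.8", "pytorch": "1.12", "cuda": "11.4"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.1.0+cu118' into (2, 1, 0), ignoring any non-numeric suffix."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if `installed` >= `minimum`, comparing version components numerically."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    # Zero-pad so that '2.0' and '2.0.0' compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


if __name__ == "__main__":
    print("Python OK:", meets_minimum(platform.python_version(), MINIMUMS["python"]))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch OK:", meets_minimum(torch.__version__, MINIMUMS["pytorch"]))
        if torch.version.cuda is None:
            print("CPU-only PyTorch build; CUDA check skipped")
        else:
            print("CUDA OK:", meets_minimum(torch.version.cuda, MINIMUMS["cuda"]))
    print("flash-attn installed:", importlib.util.find_spec("flash_attn") is not None)
```

Comparing parsed integer tuples rather than raw strings matters here: as strings, "1.9" would sort after "1.12".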
We provide a simple example to show how to use InternLM-XComposer-2.5 with 🤗 Transformers.
Video Understanding
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
model.tokenizer = tokenizer
query = 'Here are some frames of a video. Describe this video in detail'
image = ['./examples/liuxiang.mp4',]
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, his = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
#The video opens with a shot of an athlete, dressed in a red and yellow uniform with the word "CHINA" emblazoned across the front, preparing for a race.
#The athlete, Liu Xiang, is seen in a crouched position, focused and ready, with the Olympic rings visible in the background, indicating the prestigious setting of the Olympic Games. As the race commences, the athletes are seen sprinting towards the hurdles, their determination evident in their powerful strides.
#The camera captures the intensity of the competition, with the athletes' numbers and times displayed on the screen, providing a real-time update on their performance. The race reaches a climax as Liu Xiang, still in his red and yellow uniform, triumphantly crosses the finish line, his arms raised in victory.
#The crowd in the stands erupts into cheers, their excitement palpable as they witness the athlete's success. The video concludes with a close-up shot of Liu Xiang, still basking in the glory of his victory, as the Olympic rings continue to symbolize the significance of the event.
query = 'tell me the athlete code of Liu Xiang'
image = ['./examples/liuxiang.mp4',]
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, history=his, do_sample=False, num_beams=3, use_meta=True)
print(response)
#The athlete code of Liu Xiang, as displayed on his uniform in the video, is "1363".
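In the example above, the .mp4 path is passed straight to model.chat, and frame extraction happens inside the model's remote code. As an illustration of the dense sampling described under Fine-Grained Video Understanding, the hypothetical helper below computes evenly spaced frame indices, one common strategy; the default budget of 32 frames is an arbitrary assumption, not the model's actual setting.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int = 32) -> List[int]:
    """Pick `num_samples` frame indices spread evenly over a clip.

    Each index comes from the middle of one of `num_samples` equal segments,
    so both the start and the end of the video are represented. If the clip
    has no more frames than requested, every frame is returned once.
    """
    if total_frames <= 0:
        return []
    if total_frames <= num_samples:
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# A 10-second clip at 25 fps has 250 frames; sample 8 of them.
print(uniform_frame_indices(250, 8))  # [15, 46, 78, 109, 140, 171, 203, 234]
```

Raising `num_samples` trades GPU memory and context length for finer temporal coverage, which is the lever the long-context design is meant to make affordable.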
Multi-Image Multi-Turn Dialog
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
model.tokenizer = tokenizer
query = 'Image1 <ImageHere>; Image2 <ImageHere>; Image3 <ImageHere>; I want to buy a car from the three given cars, analyze their advantages and weaknesses one by one'
image = ['./examples/cars1.jpg',
'./examples/cars2.jpg',
'./examples/cars3.jpg',]
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, his = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
#To analyze the advantages and disadvantages of each car, we need to consider factors such as brand reputation, performance, design, cost, and maintenance.
#1. Mercedes-Benz: - Advantages: Known for its luxury and reliability, the Mercedes-Benz brand offers a high level of comfort, advanced technology, and superior craftsmanship. The vehicle in the image appears to be an SUV, which is versatile for both city driving and off-road conditions. - Disadvantages: Typically, Mercedes-Benz vehicles are more expensive compared to other brands, and they may require more frequent maintenance due to their luxury status.
#2. Bugatti: - Advantages: Bugatti is renowned for producing some of the fastest and most powerful cars in the world. The vehicle in the image is a sports car, likely offering an exhilarating driving experience with its high-performance engine and advanced aerodynamics. - Disadvantages: Bugatti cars are extremely expensive, making them less accessible to the average consumer. They also require specialized knowledge for maintenance and may not be suitable for everyday driving due to their high-performance nature.
#3. BMW: - Advantages: BMW is a well-respected manufacturer known for its innovative engineering and design. The car in the image appears to be an electric vehicle (EV), which offers advantages such as lower emissions, reduced noise pollution, and potentially lower operating costs. - Disadvantages: EVs like the one in the image may have a shorter range compared to traditional gasoline-powered vehicles, limiting their suitability for long-distance travel.
#Additionally, the initial cost of purchasing an EV can be higher than that of a conventional car. In conclusion, each car has its own set of advantages and disadvantages. The Mercedes-Benz offers luxury and versatility, the Bugatti provides high performance and exclusivity, and the BMW represents innovation in electric vehicle technology. The choice ultimately depends on the buyer's priorities, such as comfort, performance, environmental impact, and budget.
query = 'Image4 <ImageHere>; How about the car in Image4'
image.append('./examples/cars4.jpg')
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, history=his, use_meta=True)
print(response)
#The car in Image4 is a red sports car, which appears to be a Ferrari. Ferrari is a renowned Italian manufacturer known for producing some of the most iconic and high-performance sports cars in the world. - Advantages: Ferrari vehicles are synonymous with speed, luxury, and engineering excellence.
#The car in the image likely offers an exhilarating driving experience with its powerful engine, advanced aerodynamics, and high-quality craftsmanship. The red color adds to the car's aesthetic appeal, making it stand out on the road. - Disadvantages: Ferrari cars are extremely expensive, making them less accessible to the average consumer.
#They also require specialized knowledge for maintenance and may not be suitable for everyday driving due to their high-performance nature. In conclusion, the Ferrari in Image4 represents a pinnacle of automotive engineering and design, offering unmatched performance and luxury.
#However, its high cost and specialized maintenance requirements make it less practical for everyday use compared to the other vehicles in the images.
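In this dialog, the first query labels three images with `<ImageHere>` placeholders, and the follow-up query labels only the newly appended fourth image while the image list keeps all four. The hypothetical helpers below (not part of the model's API) build queries in that "ImageN <ImageHere>; question" style and check that the placeholders across all turns add up to the number of image paths. That one-placeholder-per-image correspondence is inferred from the example, not from documented behavior.

```python
from typing import List

PLACEHOLDER = "<ImageHere>"  # placeholder string used in the example queries


def build_query(question: str, first_index: int, num_new_images: int) -> str:
    """Prefix `question` with 'ImageN <ImageHere>; ' labels for newly added images.

    `first_index` is the 1-based number of the first new image, so a follow-up
    turn that adds a fourth image uses first_index=4.
    """
    labels = [f"Image{first_index + i} {PLACEHOLDER}" for i in range(num_new_images)]
    return "; ".join(labels + [question])


def placeholders_match(queries: List[str], images: List[str]) -> bool:
    """Check that all turns together contain one placeholder per image path."""
    return sum(q.count(PLACEHOLDER) for q in queries) == len(images)


turn1 = build_query("I want to buy a car from the three given cars, "
                    "analyze their advantages and weaknesses one by one", 1, 3)
turn2 = build_query("How about the car in Image4", 4, 1)
images = ['./examples/cars1.jpg', './examples/cars2.jpg',
          './examples/cars3.jpg', './examples/cars4.jpg']
print(turn2)                                        # Image4 <ImageHere>; How about the car in Image4
print(placeholders_match([turn1, turn2], images))  # True
```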
High Resolution Image Understanding
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
model.tokenizer = tokenizer
query = 'Analyze the given image in a detail manner'
image = ['./examples/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
#The infographic is a visual representation of various facts about Dubai. It begins with a statement about Palm Jumeirah, highlighting it as the largest artificial island visible from space. It then provides a historical context, noting that in 1968, there were only a few cars in Dubai, contrasting this with the current figure of more than 1.5 million vehicles.
#The infographic also points out that Dubai has the world's largest Gold Chain, with 7 of the top 10 tallest hotels located there. Additionally, it mentions that the crime rate is near 0%, and the income tax rate is also 0%, with 20% of the world's total cranes operating in Dubai. Furthermore, it states that 17% of the population is Emirati, and 83% are immigrants.
#The Dubai Mall is highlighted as the largest shopping mall in the world, with 1200 stores. The infographic also notes that Dubai has no standard address system, with no zip codes, area codes, or postal services. It mentions that the Burj Khalifa is so tall that its residents on top floors need to wait longer to break fast during Ramadan.
#The infographic also includes information about Dubai's climate-controlled City, with the Royal Suite at Burj Al Arab costing $24,000 per night. Lastly, it notes that the net worth of the four listed billionaires is roughly equal to the GDP of Honduras.
Instruction to Webpage
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
model.tokenizer = tokenizer
query = 'A website for Research institutions. The name is Shanghai AI lab. Top Navigation Bar is blue.Below left, an image shows the logo of the lab. In the right, there is a passage of text below that describes the mission of the laboratory.There are several images to show the research projects of Shanghai AI lab.'
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response = model.write_webpage(query, seed=202, task='Instruction-aware Webpage Generation', repetition_penalty=3.0)
print(response)
# see the Instruction-aware Webpage Generation.html
See the Instruction to Webpage results here.
Resume to Webpage
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
model.tokenizer = tokenizer
## the input should be a resume in markdown format
query = './examples/resume.md'
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response = model.resume_2_webpage(query, seed=202, repetition_penalty=3.0)
print(response)
See the Resume to Webpage results here.
Screenshot to Webpage
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
model.tokenizer = tokenizer
query = 'Generate the HTML code of this web image with Tailwind CSS.'
image = ['./examples/screenshot.jpg']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response = model.screen_2_webpage(query, image, seed=202, repetition_penalty=3.0)
print(response)
See the Screenshot to Webpage results here.
Write Article
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True)
model.tokenizer = tokenizer
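# English translation of the Chinese prompt below: "Read the following material and write as required.
# The film Chang An (长安三万里) is thought-provoking. It does not focus solely on the splendour of the Tang dynasty; it also shows the dark side of that grand era: the monopoly of resources by old aristocratic clans, the steady decline of court politics, and talented young men whose ambitions went unfulfilled. Gao Shi, finding no path to office, could only return home and quietly cultivate himself. Although Li Bai was recommended by Princess Yuzhen and appointed to the Hanlin Academy, he merely became a court writer for Emperor Xuanzong and could never truly realize his wish to serve the state. Yet in the film's climax, the 'Qiang Jin Jiu' (Bring in the Wine) sequence, the middle-aged, pot-bellied Li Bai leads everyone into the sky on immortal cranes, soaring from the water's surface past waterfalls to the Milky Way and into the palace of the immortals; Li Bai runs wildly, clinking cups with the immortals, until everyone leaps into the whirlpool-like Ninth Heaven. The lowliness of the body and the hardships of the worldly path cannot confine the soaring of the spirit: 'Heaven made me with talents that must be of use; a thousand pieces of gold spent will all come back again.'
# Throughout history, many people who lived in hardship, suffered setbacks, or were tormented by illness went through 'frustration' (失意) in life, yet this instead gave their lives 'poetry' (诗意). For young people today who are pursuing the value of life, how should they treat life's regrets and hardships? What self-persistence and self-identity does a poetic life involve? Write an essay on the two keywords 'frustration' and 'poetry'.
# Requirements: choose an angle, establish your thesis, use a clear genre, and write your own title; do not use a template or plagiarize; do not reveal personal information; at least 800 characters."
query = '阅读下面的材料，根据要求写作。 电影《长安三万里》的出现让人感慨，影片并未将重点全落在大唐风华上，也展现了恢弘气象的阴暗面，即旧门阀的资源垄断、朝政的日益衰败与青年才俊的壮志难酬。高适仕进无门，只能回乡沉潜修行。李白虽得玉真公主举荐，擢入翰林，但他只是成为唐玄宗的御用文人，不能真正实现有益于朝政的志意。然而，片中高潮部分《将进酒》一节，人至中年、挂着肚腩的李白引众人乘仙鹤上天，一路从水面、瀑布飞升至银河进入仙宫，李白狂奔着与仙人们碰杯，最后大家纵身飞向漩涡般的九重天。肉身的微贱、世路的坎坷，拘不住精神的高蹈。“天生我材必有用，千金散尽还复来。” 古往今来，身处闲顿、遭受挫折、被病痛折磨，很多人都曾经历了人生的“失意”，却反而成就了他们“诗意”的人生。对正在追求人生价值的当代青年来说，如何对待人生中的缺憾和困顿?诗意人生中又有怎样的自我坚守和自我认同?请结合“失意”与“诗意”这两个关键词写一篇文章。 要求:选准角度，确定立意，明确文体，自拟标题;不要套作，不得抄袭;不得泄露个人信息;不少于 800 字。'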
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response = model.write_artical(query, seed=8192)
print(response)
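# Model output, translated from Chinese:
#Poetic Living Lies in Steadfastness
#The Caigentan says: "In times of leisure, keep a mind ready for urgency; in busy times, keep room for leisure." In life there are always times of frustration. When facing regrets and hardships, only by living poetically can we add a touch of brightness to life. What is poetic living? Poetic living lies in holding fast to one's original heart, facing regrets directly, and transcending oneself, finding the value of life amid frustration.
#To live poetically, one must hold fast to one's original heart and meet adversity calmly.
#Tao Yuanming resolutely resigned as magistrate of Pengze and retired to the countryside: "Picking chrysanthemums beneath the eastern fence, I calmly behold the southern mountains," finding solace among mountains and rivers. Wang Wei, disappointed in his official career, immersed himself in poetry and wine all day: "When the mood strikes I often go alone; the joy of it only I myself know," living at ease amid poetry and wine. Li Bai's career did not go smoothly and he was sent away from court with a gift of gold, yet he remained bold and heroic: "Heaven made me with talents that must be of use; a thousand pieces of gold spent will all come back again," staying open-minded amid frustration. By holding fast to one's original heart, one can guard one's spiritual home in times of frustration and fill life with poetry. Conversely, if one cannot hold fast to one's original heart and only panders to the world in pursuit of promotion, then even in a high position one will lose the joy of life.
#To live poetically, one must face regrets directly and transcend oneself.
#"Before Xisai Mountain the white egrets fly; peach blossoms, flowing water, the mandarin fish are plump. A green bamboo hat, green willow branches, half a jin of wine, a single fishing line. The whole world is like this, so why not return here?" Bai Juyi's "Yu Ge Zi" expresses the wish of so many: no struggles for power, no poverty or desolation, only green mountains and clear waters, with egrets and gulls for company. Such a free and easy life is enchanting. Yet Bai Juyi did not actually retreat to the mountains; instead he faced life directly, transcended himself, and wrote poem after poem that was both poetic and full of concern for reality. If Bai Juyi had only tried to escape life, how could he have produced the exquisite simile "The thick strings clatter like a sudden rain; the thin strings whisper like private talk"? If Bai Juyi had only thought of retreating to the mountains, how could he have written lines like "Such music should exist only in heaven; how could the human world be worthy of Bai Juyi"?
#To live poetically, one must face regrets directly and hold fast to one's original heart.
#Li Wenbo suffers from ALS, and his doctors said he would not live beyond five years, but he did not give up his love of music; instead he fought the disease and played beautiful music. Sun Jialin has had cerebral palsy since childhood, but refusing to accept the whims of fate, he eventually won the national "Most Beautiful Teacher" title. Shi Tiesheng was tormented by illness, yet he could still ask, "I often take stock in my heart: what do I have?", and thereby set out on the path of literature, leaving a rich cultural legacy to later generations. These people did not run away; they chose to face the regrets of life directly, transcending themselves while holding fast to their original hearts, and ultimately realized their own value.
#To live poetically is to hold fast to one's original heart amid frustration and to transcend oneself amid regret. When facing life's regrets and setbacks, by holding fast to one's original heart and transcending oneself, one will surely write a glorious chapter of one's own.
#May you and I both live poetically!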
query = 'Please write a blog based on the title: French Pastries: A Sweet Indulgence'
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response = model.write_artical(query, seed=8192)
print(response)
#French Pastries: A Sweet Indulgence
#The French are well known for their love of pastries, and it's a love that is passed down through generations. When one visits France, they are treated to an assortment of baked goods that can range from the delicate macaron to the rich and decadent chocolate mousse. While there are many delicious types of pastries found in France, five stand out as being the most iconic. Each of these pastries has its own unique qualities that make it special.
#1. Croissant
#One of the most famous pastries from France is the croissant. It is a buttery, flaky pastry that is best enjoyed fresh from the bakery. The dough is laminated with butter, giving it its signature layers. Croissants are typically eaten for breakfast or brunch, often accompanied by coffee or hot chocolate.
#2. Macaron
#The macaron is a small, delicate French confection made from almond flour, powdered sugar, and egg whites. The macaron itself is sandwiched with a ganache or jam filling. They come in a variety of colors and flavors, making them a popular choice for both casual snacking and upscale desserts.
#3. Madeleine
#The madeleine is a small shell-shaped cake that is light and sponge-like. It is often flavored with lemon or orange zest and sometimes dipped in chocolate. Madeleines are perfect for an afternoon snack with tea or coffee.
#4. Éclair
#The éclair is a long, thin pastry filled with cream and topped with chocolate glaze. It is a classic French treat that is both sweet and satisfying. Éclairs can be found in bakeries all over France and are often enjoyed with a cup of hot chocolate.
#5. Tarte Tatin
#The tarte Tatin is an apple tart that is known for its caramelized apples and puff pastry crust. It is named after the Tatin sisters who created the recipe in the late 19th century. Tarte Tatin is best served warm with a scoop of vanilla ice cream.
#These pastries are just a few of the many delicious treats that France has to offer. Whether you are a seasoned traveler or a first-time visitor, indulging in French pastries is a must-do activity. So go ahead, treat yourself; you deserve it!
If you have multiple GPUs but the memory of each GPU is not enough to hold the entire model, you can split the model across multiple GPUs. First, install accelerate using the command: pip install accelerate. Then, run the following script to chat:
# chat with 2 GPUs
python example_code/example_chat.py --num_gpus 2
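Splitting a model across GPUs amounts to assigning each module, in particular each decoder layer, to a device. The sketch below shows one simple, hypothetical way to compute such an assignment: it spreads `num_layers` decoder layers in contiguous, nearly equal blocks over `num_gpus` devices, keeps the input-side modules on the first GPU, and keeps the final norm and output head on the last. The module names (`vit`, `vision_proj`, `model.tok_embeddings`, `model.layers.{i}`, `model.norm`, `output`) and the default of 32 layers for the 7B backbone are assumptions for illustration, not taken from example_code/example_chat.py; accelerate's `dispatch_model` accepts a dictionary of this shape as its `device_map`.

```python
from typing import Dict


def even_device_map(num_layers: int = 32, num_gpus: int = 2) -> Dict[str, int]:
    """Assign decoder layers to GPUs in contiguous, nearly equal blocks.

    Module names are hypothetical placeholders; adapt them to the real model.
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    # Input-side modules (vision encoder, projector, token embeddings) on GPU 0.
    device_map: Dict[str, int] = {"vit": 0, "vision_proj": 0, "model.tok_embeddings": 0}
    base, extra = divmod(num_layers, num_gpus)
    layer = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        for _ in range(base + (1 if gpu < extra else 0)):
            device_map[f"model.layers.{layer}"] = gpu
            layer += 1
    # Final norm and output head on the last GPU, so activations flow 0 -> ... -> last.
    last = num_gpus - 1
    device_map["model.norm"] = last
    device_map["output"] = last
    return device_map


for name, gpu in even_device_map(num_layers=8, num_gpus=2).items():
    print(name, "->", f"cuda:{gpu}")
```

Because GPU 0 also hosts the vision encoder and embeddings, a real split may want to give it slightly fewer decoder layers than this even division does.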
For optimized inference of the InternLM-XComposer2.5 model, we recommend using LMDeploy.
In the following subsections, we will introduce the usage of LMDeploy with the internlm-xcomposer2d5-7b model as an example.
First of all, please install the PyPI package with pip install lmdeploy. By default, it depends on CUDA 12.x. For a CUDA 11.x environment, please refer to the installation guide.
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('internlm/internlm-xcomposer2d5-7b')
image = load_image('examples/dubai.png')
response = pipe(('describe this image', image))
print(response.text)
For more on using the VLM pipeline, including multi-image inference or multi-turn chat, please see this guide.
We offer 4-bit quantized models via LMDeploy to reduce memory requirements. For a memory usage comparison, please refer to here.
from lmdeploy import TurbomindEngineConfig, pipeline
from lmdeploy.vl import load_image
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline('internlm/internlm-xcomposer2d5-7b-4bit', backend_config=engine_config)
image = load_image('examples/dubai.png')
response = pipe(('describe this image', image))
print(response.text)
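A back-of-the-envelope estimate shows why 4-bit weights reduce memory requirements: weight storage is roughly parameter count × bits per parameter ÷ 8 bytes. The sketch assumes a round 7 billion parameters and counts weights only; it ignores the exact size of the vision encoder, AWQ's per-group scales and zero-points, activations, and the KV cache, so real usage (see the memory comparison referenced above) will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes, divided by 2**30."""
    return num_params * bits_per_param / 8 / 2**30


params = 7e9  # assumed round parameter count for a "7B" model
for label, bits in [("fp16 / bf16", 16), ("4-bit AWQ (weights only)", 4)]:
    print(f"{label}: {weight_memory_gib(params, bits):.1f} GiB")
```

This prints about 13.0 GiB for 16-bit weights and 3.3 GiB for 4-bit weights, a 4× reduction in weight storage before those overheads are added back.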
- Please refer to our finetune scripts.
- Inference and finetune support from ModelScope Swift
We provide code for users to build a web UI demo. Please use gradio==4.13.0.
Please run the command below for Chat / Composition:
# For Multimodal Chat
python gradio_demo/gradio_demo_chat.py
# For Free-form Text-Image Composition
python gradio_demo/gradio_demo_composition.py
The user guide for the UI demo is given HERE. If you wish to change the default folder of the model, please use the --code_path=new_folder option.
If you find our models / code / papers useful in your research, please consider giving ⭐ and citations, thx :)
@article{internlmxcomposer2_5_OL,
title={InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions},
author={Pan Zhang and Xiaoyi Dong and Yuhang Cao and Yuhang Zang and Rui Qian and Xilin Wei and Lin Chen and Yifei Li and Junbo Niu and Shuangrui Ding and Qipeng Guo and Haodong Duan and Xin Chen and Han Lv and Zheng Nie and Min Zhang and Bin Wang and Wenwei Zhang and Xinyue Zhang and Jiaye Ge and Wei Li and Jingwen Li and Zhongying Tu and Conghui He and Xingcheng Zhang and Kai Chen and Yu Qiao and Dahua Lin and Jiaqi Wang},
journal={arXiv preprint arXiv:2412.09596},
year={2024}
}
@article{internlmxcomposer2_5,
title={InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output},
author={Pan Zhang and Xiaoyi Dong and Yuhang Zang and Yuhang Cao and Rui Qian and Lin Chen and Qipeng Guo and Haodong Duan and Bin Wang and Linke Ouyang and Songyang Zhang and Wenwei Zhang and Yining Li and Yang Gao and Peng Sun and Xinyue Zhang and Wei Li and Jingwen Li and Wenhai Wang and Hang Yan and Conghui He and Xingcheng Zhang and Kai Chen and Jifeng Dai and Yu Qiao and Dahua Lin and Jiaqi Wang},
journal={arXiv preprint arXiv:2407.03320},
year={2024}
}
@article{internlmxcomposer2_4khd,
title={InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD},
author={Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Bin Wang and Linke Ouyang and Songyang Zhang and Haodong Duan and Wenwei Zhang and Yining Li and Hang Yan and Yang Gao and Zhe Chen and Xinyue Zhang and Wei Li and Jingwen Li and Wenhai Wang and Kai Chen and Conghui He and Xingcheng Zhang and Jifeng Dai and Yu Qiao and Dahua Lin and Jiaqi Wang},
journal={arXiv preprint arXiv:2404.06512},
year={2024}
}
@article{internlmxcomposer2,
title={InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model},
author={Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Bin Wang and Linke Ouyang and Xilin Wei and Songyang Zhang and Haodong Duan and Maosong Cao and Wenwei Zhang and Yining Li and Hang Yan and Yang Gao and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
journal={arXiv preprint arXiv:2401.16420},
year={2024}
}
@article{internlmxcomposer,
title={InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition},
author={Pan Zhang and Xiaoyi Dong and Bin Wang and Yuhang Cao and Chao Xu and Linke Ouyang and Zhiyuan Zhao and Shuangrui Ding and Songyang Zhang and Haodong Duan and Wenwei Zhang and Hang Yan and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
journal={arXiv preprint arXiv:2309.15112},
year={2023}
}
The code is licensed under Apache-2.0, while the model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English / Chinese). For other questions or collaborations, please contact [email protected].
Alternative AI tools for InternLM-XComposer
Similar Open Source Tools
CodeGeeX4
CodeGeeX4-ALL-9B is an open-source multilingual code generation model based on GLM-4-9B, offering enhanced code generation capabilities. It supports functions like code completion, code interpreter, web search, function call, and repository-level code Q&A. The model has competitive performance on benchmarks like BigCodeBench and NaturalCodeBench, outperforming larger models in terms of speed and performance.
FlagEmbedding
FlagEmbedding focuses on retrieval-augmented LLMs and currently consists of the following projects:
* **Long-Context LLM**: Activation Beacon
* **Fine-tuning of LM**: LM-Cocktail
* **Embedding Model**: Visualized-BGE, BGE-M3, LLM Embedder, BGE Embedding
* **Reranker Model**: LLM rerankers, BGE Reranker
* **Benchmark**: C-MTEB
MeloTTS
MeloTTS is a high-quality multilingual text-to-speech library by MyShell.ai. It supports various languages, including English (American, British, Indian, Australian), Spanish, French, Chinese, Japanese, and Korean. The Chinese speaker also supports mixed Chinese and English. The library is fast enough for real-time inference on CPU and offers options such as use without installation, local installation, and training on custom datasets. The Python API and model cards are available in the repository and on HuggingFace. The community can join the Discord channel for discussions and collaboration opportunities. Contributions are welcome, and the library is under the MIT License. MeloTTS is based on TTS, VITS, VITS2, and Bert-VITS2.
MMLU-Pro
MMLU-Pro is an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. It integrates more challenging, reasoning-focused questions and increases answer choices per question, significantly raising difficulty. The dataset comprises over 12,000 questions from academic exams and textbooks across 14 diverse domains. Experimental results show a significant drop in accuracy compared to the original MMLU, with greater stability under varying prompts. Models utilizing Chain of Thought reasoning achieved better performance on MMLU-Pro.
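The larger answer set makes the benchmark harder in a measurable way. MMLU-Pro offers ten options per question instead of MMLU's four, so the accuracy expected from uniform random guessing falls from 1/4 to 1/10. The sketch below computes that baseline and a plain exact-match accuracy. It is an illustration, not the benchmark's official evaluation code, and the helper names are hypothetical.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a question with num_options choices."""
    return Fraction(1, num_options)


def accuracy(predictions, answers) -> float:
    """Fraction of questions whose predicted option letter equals the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


print(chance_accuracy(4), chance_accuracy(10))  # 1/4 1/10
print(accuracy(["A", "C", "J"], ["A", "B", "J"]))  # two of three correct
```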
inference
Xorbits Inference (Xinference) is a powerful and versatile library designed to serve language, speech recognition, and multimodal models. With Xorbits Inference, you can effortlessly deploy and serve your own models or state-of-the-art built-in models using just a single command. Whether you are a researcher, developer, or data scientist, Xorbits Inference empowers you to unleash the full potential of cutting-edge AI models.
ABQ-LLM
ABQ-LLM is a novel arbitrary bit quantization scheme that achieves excellent performance under various quantization settings while enabling efficient arbitrary bit computation at the inference level. The algorithm supports precise weight-only quantization and weight-activation quantization. It provides pre-trained model weights and a set of out-of-the-box quantization operators for arbitrary bit model inference in modern architectures.
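The text does not describe ABQ-LLM's own algorithm. The basic idea of weight-only quantization at an arbitrary bit width can still be sketched with a generic min-max (asymmetric) uniform quantizer. This is a standard textbook scheme used here for illustration, not ABQ-LLM's method, and the function names are hypothetical.

```python
def quantize(weights, bits):
    """Min-max (asymmetric) uniform quantization of floats to unsigned `bits`-bit codes."""
    if bits < 1:
        raise ValueError("bits must be >= 1")
    qmax = 2 ** bits - 1  # largest code, e.g. 7 for 3-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    zero_point = round(-w_min / scale)  # integer code that represents 0.0
    codes = [min(max(round(w / scale) + zero_point, 0), qmax) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(q - zero_point) * scale for q in codes]


weights = [-0.81, -0.2, 0.05, 0.33, 0.9]
for bits in (2, 3, 4, 8):
    codes, scale, zp = quantize(weights, bits)
    err = max(abs(w - r) for w, r in zip(weights, dequantize(codes, scale, zp)))
    print(f"{bits}-bit codes={codes} max_abs_error={err:.4f}")
```

The same code path handles any bit width. Fewer bits mean a coarser `scale` and therefore a larger reconstruction error.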
keras-llm-robot
The Keras-llm-robot Web UI project is an open-source tool designed for offline deployment and testing of various open-source models from the Hugging Face website. It allows users to combine multiple models through configuration to achieve functionalities like multimodal, RAG, Agent, and more. The project consists of three main interfaces: chat interface for language models, configuration interface for loading models, and tools & agent interface for auxiliary models. Users can interact with the language model through text, voice, and image inputs, and the tool supports features like model loading, quantization, fine-tuning, role-playing, code interpretation, speech recognition, image recognition, network search engine, and function calling.
DB-GPT-Hub
DB-GPT-Hub is an experimental project leveraging Large Language Models (LLMs) for Text-to-SQL parsing. It includes stages like data collection, preprocessing, model selection, construction, and fine-tuning of model weights. The project aims to enhance Text-to-SQL capabilities, reduce model training costs, and enable developers to contribute to improving Text-to-SQL accuracy. The ultimate goal is to achieve automated question-answering based on databases, allowing users to execute complex database queries using natural language descriptions. The project has successfully integrated multiple large models and established a comprehensive workflow for data processing, SFT model training, prediction output, and evaluation.
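A common way to score a Text-to-SQL model during evaluation is execution accuracy. The predicted query and the reference query are run against the same database, and their results are compared. The sketch below assumes SQLite and compares result rows without regard to order. It is not DB-GPT-Hub's evaluation code, and `execution_match` is a hypothetical helper.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries return the same multiset of rows (row order ignored)."""
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("ann", 31), ("bob", 25), ("cy", 40)])
gold = "SELECT name FROM users WHERE age > 30"
print(execution_match(conn, "SELECT name FROM users WHERE age >= 31", gold))  # True
print(execution_match(conn, "SELECT name FROM users", gold))  # False
```

Because row order is ignored, a real evaluator would need an extra check for queries whose `ORDER BY` matters.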
CuMo
CuMo is a project focused on scaling multimodal Large Language Models (LLMs) with Co-Upcycled Mixture-of-Experts. It incorporates co-upcycled Top-K sparsely-gated Mixture-of-Experts blocks into both the vision encoder and the MLP connector, enhancing the capabilities of multimodal LLMs. The project adopts a three-stage training approach with auxiliary losses that stabilize training and keep the load balanced across experts. CuMo achieves performance comparable to other state-of-the-art multimodal LLMs on various Visual Question Answering (VQA) and visual-instruction-following benchmarks.
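A Top-K sparsely-gated Mixture-of-Experts block works in three steps. A router scores every expert, only the k best experts are evaluated, and their outputs are mixed with softmax weights renormalized over the selected experts. The sketch below shows this generic mechanism only. CuMo's co-upcycled initialization, its auxiliary load-balancing losses, and its actual layer shapes are not shown, and the toy experts are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_moe(x, gate_logits, experts, k=2):
    """Sparsely-gated MoE: evaluate only the k experts with the highest router scores.

    x: input vector; gate_logits: one router score per expert (in a real layer these
    are computed from x by a learned linear map); experts: callables vector -> vector.
    """
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])  # renormalize over chosen experts
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # unselected experts are never run
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top


# Four toy "experts" that scale their input by 1, 2, 3 and 4.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
out, chosen = top_k_moe([1.0, -1.0], gate_logits=[0.1, 2.0, 0.5, 1.0], experts=experts, k=2)
print(chosen, out)  # experts 1 and 3 are selected
```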
LongLoRA
LongLoRA is a tool for efficient fine-tuning of long-context large language models. It includes the LongAlpaca dataset, which combines collected long-context QA data with sampled short QA data, models from 7B to 70B parameters with context lengths from 8k to 100k, and support for GPTNeoX models. The tool supports supervised fine-tuning, context extension, and improved LoRA fine-tuning. It provides pre-trained weights, fine-tuning instructions, evaluation methods, local and online demos, streaming inference, and data generation via Pdf2text. LongLoRA is licensed under Apache License 2.0, while data and weights are under CC-BY-NC 4.0 License for research use only.
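The LoRA component can be sketched as a linear layer whose frozen weight W is augmented by a trainable low-rank product BA, scaled by alpha/r. The sketch below follows the standard LoRA formulation in plain Python. LongLoRA's context-extension techniques and its improvements to LoRA are not shown, and the class and function names are hypothetical.

```python
def matvec(M, v):
    """Dense matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Linear layer with a low-rank adapter: y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r) are trained.
    """

    def __init__(self, W, r, alpha):
        d_out, d_in = len(W), len(W[0])
        self.W = W
        self.A = [[0.01] * d_in for _ in range(r)]  # usually random; constant here for clarity
        self.B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
        self.scale = alpha / r

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


def lora_trainable_params(d_in, d_out, r):
    """Parameters in A and B (compare with d_in * d_out for full fine-tuning of W)."""
    return r * (d_in + d_out)


layer = LoRALinear(W=[[1.0, 0.0], [0.0, 1.0]], r=1, alpha=1.0)
print(layer([1.0, 2.0]))  # [1.0, 2.0]: identical to W x before any training
print(lora_trainable_params(4096, 4096, 8), 4096 * 4096)  # 65536 16777216
```

The parameter count shows why LoRA is cheap. For a 4096x4096 projection with rank 8, only 65,536 values are trained instead of about 16.8 million.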
Pearl
Pearl is a production-ready Reinforcement Learning AI agent library open-sourced by the Applied Reinforcement Learning team at Meta. It enables researchers and practitioners to develop Reinforcement Learning AI agents that prioritize cumulative long-term feedback over immediate feedback and can adapt to environments with limited observability, sparse feedback, and high stochasticity. Pearl offers a diverse set of unique features for production environments, including dynamic action spaces, offline learning, intelligent neural exploration, safe decision making, history summarization, and data augmentation.
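Prioritizing cumulative long-term feedback over immediate feedback is usually formalized as maximizing the discounted return, G_t = r_t + gamma * G_{t+1}. The sketch below computes that return for one episode. It is a generic reinforcement-learning helper, not part of Pearl's API.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# The first two actions earn no immediate reward, but they lead to a large delayed one.
print(discounted_returns([0.0, 0.0, 10.0], gamma=0.9))  # approximately [8.1, 9.0, 10.0]
```

An agent that looked only at immediate rewards would value the first two steps at zero. The discounted return credits them with the delayed payoff instead.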
RLHF-Reward-Modeling
This repository, RLHF-Reward-Modeling, is dedicated to training reward models for DRL-based RLHF (PPO), iterative SFT, and iterative DPO. It provides state-of-the-art reward models with base model sizes of up to 13B. Installation involves setting up the environment and installing the alignment handbook. Dataset preparation requires preprocessing conversations into a standard format. The code can be run with Gemma-2b-it, and evaluation results can be obtained using the provided datasets. The to-do list includes various reward models, such as Bradley-Terry, preference, regression-based, and multi-objective reward models. The repository is part of the work on iterative rejection sampling fine-tuning and iterative DPO.
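The Bradley-Terry model models the probability that a chosen response beats a rejected one as the sigmoid of their reward difference. A reward model is trained by minimizing the negative log of that probability over preference pairs. The sketch below computes this loss on scalar rewards and assumes the rewards have already been produced by some model. The repository's own training code will differ.

```python
import math


def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected), overflow-safe."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(margin), computed as softplus(-margin) to avoid overflow."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def batch_loss(pairs):
    """Mean loss over (reward_of_chosen, reward_of_rejected) pairs."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)


print(bradley_terry_loss(0.0, 0.0))  # log(2): the model cannot tell the pair apart
print(batch_loss([(2.0, -1.0), (0.5, 1.5)]))
```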
Awesome-LLM
Awesome-LLM is a curated list of resources related to large language models, focusing on papers, projects, frameworks, tools, tutorials, courses, opinions, and other useful resources in the field. It covers trending LLM projects, milestone papers, other papers, open LLM projects, LLM training frameworks, LLM evaluation frameworks, tools for deploying LLM, prompting libraries & tools, tutorials, courses, books, and opinions. The repository provides a comprehensive overview of the latest advancements and resources in the field of large language models.
pytorch-grad-cam
This repository provides advanced AI explainability for PyTorch, offering state-of-the-art methods for Explainable AI in computer vision. It includes a comprehensive collection of Pixel Attribution methods for various tasks like Classification, Object Detection, Semantic Segmentation, and more. The package supports high performance with full batch image support and includes metrics for evaluating and tuning explanations. Users can visualize and interpret model predictions, making it suitable for both production and model development scenarios.
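The computation at the heart of Grad-CAM, one of the pixel attribution methods such a library provides, is small. Each feature map's gradient is averaged into an importance weight, the feature maps are summed with those weights, and ReLU is applied. The plain-Python sketch below shows only that step, on toy 2x2 maps. Capturing activations and gradients from a real PyTorch model is the part the library handles, and its API is not reproduced here.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer.

    activations[k][i][j]: value of feature map k at position (i, j);
    gradients[k][i][j]: d(class score) / d(activations[k][i][j]), same shape.
    Returns an H x W map scaled so its maximum is 1 (left as zeros if nothing is positive).
    """
    K = len(activations)
    H, W = len(activations[0]), len(activations[0][0])
    # alpha_k: global-average-pooled gradient, i.e. how much map k matters for the class
    alphas = [sum(sum(row) for row in g) / (H * W) for g in gradients]
    # weighted sum of feature maps, then ReLU to keep only positive evidence
    cam = [[max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(K)))
            for j in range(W)] for i in range(H)]
    peak = max(max(row) for row in cam)
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


acts = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 0.0]]
```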
cambrian
Cambrian-1 is a fully open project focused on exploring multimodal Large Language Models (LLMs) with a vision-centric approach. It offers competitive performance across various benchmarks with models at different parameter levels. The project includes training configurations, model weights, instruction tuning data, and evaluation details. Users can interact with Cambrian-1 through a Gradio web interface for inference. The project is inspired by LLaVA and incorporates contributions from Vicuna, LLaMA, and Yi. Cambrian-1 is licensed under Apache 2.0 and utilizes datasets and checkpoints subject to their respective original licenses.
For similar tasks
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
onnxruntime-genai
ONNX Runtime Generative AI is a library that provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Users can call a high-level `generate()` method or run each iteration of the model in a loop. It supports greedy/beam search and top-p/top-k sampling to generate token sequences, has built-in logits processing such as repetition penalties, and allows easy custom scoring.
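The generation loop described above repeatedly computes logits, processes them, and picks the next token. The plain-Python sketch below implements that loop with top-k/top-p sampling and a CTRL-style repetition penalty, and setting `k=1` gives greedy decoding. This is a conceptual sketch, not onnxruntime-genai's API. `toy_model` is a hypothetical stand-in for an ONNX model, and beam search and KV caching are omitted.

```python
import math
import random


def apply_repetition_penalty(logits, generated, penalty):
    """Make already-generated tokens less likely (divide positive logits, multiply negative)."""
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def sample_top_k_top_p(logits, k, p, rng):
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]  # top-k
    m = max(logits[i] for i in ranked)
    probs = [math.exp(logits[i] - m) for i in ranked]
    total = sum(probs)
    kept, cum = [], 0.0
    for i, q in zip(ranked, probs):  # top-p: smallest prefix with cumulative prob >= p
        kept.append((i, q / total))
        cum += q / total
        if cum >= p:
            break
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(next_logits, prompt, max_new_tokens, eos_id, k=50, p=0.9, penalty=1.2, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = apply_repetition_penalty(next_logits(tokens), tokens, penalty)
        tok = sample_top_k_top_p(logits, k, p, rng)
        tokens.append(tok)
        if tok == eos_id:
            break
    return tokens


def toy_model(tokens, vocab_size=5):
    """Hypothetical stand-in for a language model: strongly prefers (last token + 1)."""
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 5.0
    return logits


print(generate(toy_model, prompt=[0], max_new_tokens=10, eos_id=4, k=1))  # [0, 1, 2, 3, 4]
```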
jupyter-ai
Jupyter AI connects generative AI with Jupyter notebooks. It provides a user-friendly and powerful way to explore generative AI models in notebooks and improve your productivity in JupyterLab and the Jupyter Notebook. Specifically, Jupyter AI offers:
* A `%%ai` magic that turns the Jupyter notebook into a reproducible generative AI playground. This works anywhere the IPython kernel runs (JupyterLab, Jupyter Notebook, Google Colab, Kaggle, VSCode, etc.).
* A native chat UI in JupyterLab that enables you to work with generative AI as a conversational assistant.
* Support for a wide range of generative model providers, including AI21, Anthropic, AWS, Cohere, Gemini, Hugging Face, NVIDIA, and OpenAI.
* Local model support through GPT4All, enabling easy and private use of generative AI models on consumer-grade machines.
khoj
Khoj is an open-source, personal AI assistant that extends your capabilities by creating always-available AI agents. You can share your notes and documents to extend your digital brain, and your AI agents have access to the internet, allowing you to incorporate real-time information. Khoj is accessible on Desktop, Emacs, Obsidian, Web, and WhatsApp, and you can share PDF, Markdown, org-mode, and Notion files as well as GitHub repositories. You'll get fast, accurate semantic search on top of your docs, and your agents can create deeply personal images and understand your speech. Khoj is self-hostable and always will be.
langchain_dart
LangChain.dart is a Dart port of the popular LangChain Python framework created by Harrison Chase. LangChain provides a set of ready-to-use components for working with language models and a standard interface for chaining them together to formulate more advanced use cases (e.g. chatbots, Q&A with RAG, agents, summarization, extraction, etc.). The components can be grouped into a few core modules:
* **Model I/O:** LangChain offers a unified API for interacting with various LLM providers (e.g. OpenAI, Google, Mistral, Ollama, etc.), allowing developers to switch between them with ease. Additionally, it provides tools for managing model inputs (prompt templates and example selectors) and parsing the resulting model outputs (output parsers).
* **Retrieval:** assists in loading user data (via document loaders), transforming it (with text splitters), extracting its meaning (using embedding models), storing (in vector stores) and retrieving it (through retrievers) so that it can be used to ground the model's responses (i.e. Retrieval-Augmented Generation or RAG).
* **Agents:** "bots" that leverage LLMs to make informed decisions about which available tools (such as web search, calculators, database lookup, etc.) to use to accomplish the designated task.

The different components can be composed together using the LangChain Expression Language (LCEL).
danswer
Danswer is an open-source Gen-AI Chat and Unified Search tool that connects to your company's docs, apps, and people. It provides a Chat interface and plugs into any LLM of your choice. Danswer can be deployed anywhere and for any scale - on a laptop, on-premise, or to cloud. Since you own the deployment, your user data and chats are fully in your own control. Danswer is MIT licensed and designed to be modular and easily extensible. The system also comes fully ready for production usage with user authentication, role management (admin/basic users), chat persistence, and a UI for configuring Personas (AI Assistants) and their Prompts. Danswer also serves as a Unified Search across all common workplace tools such as Slack, Google Drive, Confluence, etc. By combining LLMs and team specific knowledge, Danswer becomes a subject matter expert for the team. Imagine ChatGPT if it had access to your team's unique knowledge! It enables questions such as "A customer wants feature X, is this already supported?" or "Where's the pull request for feature Y?"
infinity
Infinity is an AI-native database designed for LLM applications, providing incredibly fast full-text and vector search capabilities. It supports a wide range of data types, including vectors, full-text, and structured data, and offers a fused search feature that combines multiple embeddings and full text. Infinity is easy to use, with an intuitive Python API and a single-binary architecture that simplifies deployment. It achieves high performance, with 0.1-millisecond query latency on million-scale vector datasets and up to 15K QPS.
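A fused search has to merge ranked lists from different retrievers. A common way to do this is reciprocal rank fusion (RRF), where each document's score is the sum of 1/(k + rank) over the lists it appears in. The sketch below is a generic RRF implementation using the conventional k = 60. It is not Infinity's API, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense = ["d3", "d1", "d2"]      # e.g. results of a vector search
full_text = ["d1", "d4", "d3"]  # e.g. results of a full-text search
print(reciprocal_rank_fusion([dense, full_text]))  # ['d1', 'd3', 'd4', 'd2']
```

RRF uses only ranks, so it needs no calibration between the incompatible score scales of vector similarity and full-text relevance.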
For similar jobs
ChatFAQ
ChatFAQ is an open-source comprehensive platform for creating a wide variety of chatbots: generic ones, business-trained, or even capable of redirecting requests to human operators. It includes a specialized NLP/NLG engine based on a RAG architecture and customized chat widgets, ensuring a tailored experience for users and avoiding vendor lock-in.
anything-llm
AnythingLLM is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting. This application allows you to pick and choose which LLM or Vector Database you want to use as well as supporting multi-user management and permissions.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
mikupad
mikupad is a lightweight and efficient language model front-end powered by ReactJS, all packed into a single HTML file. Inspired by the likes of NovelAI, it provides a simple yet powerful interface for generating text with the help of various backends.
glide
Glide is a cloud-native LLM gateway that provides a unified REST API for accessing various large language models (LLMs) from different providers. It handles LLMOps tasks such as model failover, caching, key management, and more, making it easy to integrate LLMs into applications. Glide supports popular LLM providers like OpenAI, Anthropic, Azure OpenAI, AWS Bedrock (Titan), Cohere, Google Gemini, OctoML, and Ollama. It offers high availability, performance, and observability, and provides SDKs for Python and NodeJS to simplify integration.
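Model failover, one of the LLMOps tasks listed above, means trying providers in priority order until one succeeds. The sketch below shows that pattern with hypothetical provider callables. It does not reflect Glide's REST API or configuration format.

```python
class AllProvidersFailed(Exception):
    """Raised when every provider in the failover chain has errored."""


def call_with_failover(providers, prompt):
    """Try (name, callable) providers in priority order and return the first success.

    Each callable stands in for a request to one LLM provider; any exception
    (timeout, rate limit, server error, ...) moves on to the next provider.
    """
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors[name] = exc
    raise AllProvidersFailed(errors)


def flaky_primary(prompt):
    raise TimeoutError("primary provider timed out")


def backup(prompt):
    return f"echo: {prompt}"


print(call_with_failover([("primary", flaky_primary), ("backup", backup)], "hi"))
# ('backup', 'echo: hi')
```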
firecrawl
Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown. It crawls all accessible subpages and provides clean markdown for each, without requiring a sitemap. The API is easy to use and can be self-hosted. It also integrates with Langchain and Llama Index. The Python SDK makes it easy to crawl and scrape websites in Python code.
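Discovering subpages without a sitemap generally means following links breadth-first while staying on the starting host. The sketch below does this with the standard library's `html.parser` and takes a caller-supplied `fetch(url)` function, so it can run offline. It is an illustration under those assumptions, not Firecrawl's implementation, and it skips Markdown conversion, robots.txt handling, and rate limiting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first discovery of same-host pages; fetch(url) returns HTML or None."""
    host = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link, _ = urldefrag(urljoin(url, href))  # resolve relative links, drop #fragments
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


site = {
    "https://ex.com/": '<a href="/a">A</a> <a href="https://other.com/x">external</a>',
    "https://ex.com/a": '<a href="/b#top">B</a> <a href="/">home</a>',
    "https://ex.com/b": "<p>leaf page</p>",
}
print(sorted(crawl("https://ex.com/", site.get)))
# ['https://ex.com/', 'https://ex.com/a', 'https://ex.com/b']
```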