farel-bench
Testing LLM reasoning abilities with family relationship quizzes.
The 'farel-bench' project is a benchmark tool for testing LLM reasoning abilities with family relationship quizzes. It generates quizzes based on family relationships of varying degrees and measures the accuracy of large language models in solving these quizzes. The project provides scripts for generating quizzes, running models locally or via APIs, and calculating benchmark metrics. The quizzes are designed to test logical reasoning skills using family relationship concepts, with the goal of evaluating the performance of language models in this specific domain.
README:
This project is a family relationship (FaRel) benchmark for testing LLM reasoning abilities with family relationship quizzes.
The table below presents the 20 best FaRel-3 (family relationships of degree up to 3) benchmark results. The benchmark result is the macro-averaged accuracy over all family relationship classes.
Nr | Model | FaRel | child | parent | grand-child | sibling | grand-parent | great grand-child | niece or nephew | aunt or uncle | great grand-parent |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | o1-mini | 99.78 | 100.00 | 100.00 | 100.00 | 100.00 | 98.00 | 100.00 | 100.00 | 100.00 | 100.00 |
2 | o1-preview | 98.89 | 100.00 | 100.00 | 96.00 | 100.00 | 100.00 | 100.00 | 94.00 | 100.00 | 100.00 |
3 | qwq-32b-preview | 96.67 | 100.00 | 100.00 | 100.00 | 98.00 | 100.00 | 98.00 | 90.00 | 88.00 | 96.00 |
4 | deepseek-v3 | 96.44 | 100.00 | 100.00 | 100.00 | 96.00 | 100.00 | 100.00 | 82.00 | 92.00 | 98.00 |
5 | claude-3.5-sonnet-1022 | 93.33 | 100.00 | 100.00 | 100.00 | 92.00 | 100.00 | 98.00 | 76.00 | 74.00 | 100.00 |
6 | qvq-72b-preview | 91.56 | 100.00 | 100.00 | 98.00 | 92.00 | 96.00 | 86.00 | 84.00 | 76.00 | 92.00 |
7 | Sky-T1-32B-Preview-Q8_0 | 88.44 | 100.00 | 100.00 | 80.00 | 96.00 | 98.00 | 90.00 | 82.00 | 50.00 | 100.00 |
7 | mistral-large-2411-Q8_0 | 88.44 | 100.00 | 100.00 | 94.00 | 92.00 | 100.00 | 90.00 | 70.00 | 54.00 | 96.00 |
9 | deepseek-v2-chat-0628-Q8_0 | 87.78 | 100.00 | 100.00 | 98.00 | 86.00 | 94.00 | 94.00 | 60.00 | 60.00 | 98.00 |
10 | gemini-pro-1.5-002 | 87.11 | 100.00 | 100.00 | 74.00 | 88.00 | 100.00 | 84.00 | 70.00 | 72.00 | 96.00 |
11 | claude-3.5-sonnet | 86.89 | 100.00 | 100.00 | 98.00 | 80.00 | 98.00 | 94.00 | 60.00 | 56.00 | 96.00 |
11 | mistral-large-2 | 86.89 | 100.00 | 100.00 | 70.00 | 92.00 | 100.00 | 94.00 | 60.00 | 66.00 | 100.00 |
11 | mistral-large-2411 | 86.89 | 100.00 | 100.00 | 68.00 | 88.00 | 98.00 | 96.00 | 64.00 | 68.00 | 100.00 |
14 | llama-3.3-70b-instruct | 86.44 | 100.00 | 100.00 | 90.00 | 92.00 | 100.00 | 76.00 | 68.00 | 56.00 | 96.00 |
15 | gpt-4-turbo | 86.22 | 100.00 | 100.00 | 92.00 | 84.00 | 96.00 | 90.00 | 56.00 | 60.00 | 98.00 |
16 | llama-3.1-405b-instruct | 85.78 | 100.00 | 100.00 | 88.00 | 92.00 | 98.00 | 88.00 | 54.00 | 52.00 | 100.00 |
17 | gpt-4o-2024-11-20 | 84.22 | 100.00 | 100.00 | 84.00 | 78.00 | 98.00 | 82.00 | 62.00 | 56.00 | 98.00 |
18 | gemini-2.0-flash-exp | 84.00 | 100.00 | 100.00 | 84.00 | 78.00 | 94.00 | 86.00 | 66.00 | 50.00 | 98.00 |
19 | gpt-4o | 83.11 | 100.00 | 100.00 | 84.00 | 82.00 | 98.00 | 74.00 | 62.00 | 52.00 | 96.00 |
20 | Phi-4-Q8_0 | 81.11 | 100.00 | 100.00 | 94.00 | 74.00 | 96.00 | 86.00 | 40.00 | 42.00 | 98.00 |
To see results for all models benchmarked so far, check the results.md file.
Notes:
- Models with a quantization suffix in their name were run locally with llama.cpp; the remaining models were run via the OpenAI or OpenRouter APIs.
- Models with the -sys suffix had the system prompt set to: 'You are a master of logical thinking. You carefully analyze the premises step by step, take detailed notes and draw intermediate conclusions based on which you can find the final answer to any question.'.
- Very low benchmark results for gemma-2b, qwen1_5-7b, and WizardLM-2-7B are caused by the inability of the models to mark the selected answer option as specified in the prompt.
- After running the benchmark for nemotron-4-340b-instruct, I noticed that in several cases the model response was cut short by the 10-minute inference timeout. If not for that, its score could have been even better.
- gemma-2-9b and gemma-2-27b model results are courtesy of Reddit user Healthy-Nebula-3603.
I also tested some models in Polish:
Nr | Model | FaRel | child | parent | grand-child | sibling | grand-parent | great grand-child | niece or nephew | aunt or uncle | great grand-parent |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | llama-3.1-405b-instruct-sys-pl | 92.89 | 100.00 | 100.00 | 98.00 | 94.00 | 100.00 | 94.00 | 72.00 | 80.00 | 98.00 |
2 | qwen-2.5-72b-instruct-pl | 88.44 | 100.00 | 100.00 | 98.00 | 90.00 | 100.00 | 88.00 | 76.00 | 46.00 | 98.00 |
3 | llama-3.1-405b-instruct-pl | 83.33 | 100.00 | 100.00 | 100.00 | 74.00 | 100.00 | 78.00 | 50.00 | 56.00 | 92.00 |
4 | Bielik-11B-v2.3-Instruct-Q8_0-sys-pl | 81.56 | 100.00 | 98.00 | 88.00 | 88.00 | 98.00 | 76.00 | 68.00 | 34.00 | 84.00 |
5 | qwen-2.5-72b-instruct-sys-pl | 80.00 | 100.00 | 100.00 | 96.00 | 92.00 | 100.00 | 82.00 | 52.00 | 32.00 | 66.00 |
6 | Bielik-11B-v2.3-Instruct-Q8_0-pl | 76.00 | 100.00 | 100.00 | 92.00 | 42.00 | 100.00 | 56.00 | 72.00 | 34.00 | 88.00 |
All tested models perform better in Polish than in English. Is it easier to "reason" in Polish?
The purpose of this project is to test LLM reasoning abilities with family relationship quizzes. Why use family relationships?
- Family relationships are commonly known concepts.
- They allow creating quizzes of scalable difficulty by increasing the relationship degree.
- The quizzes are easy to randomize by changing the names of family members and the order of the listed relations.
Consider a graph of family relationships in which family members are nodes and parental links are edges. We can observe that:
- child and parent relationships have distance 1 from self,
- grandchild, grandparent, and sibling relationships have distance 2 from self,
- great grandchild, niece or nephew, aunt or uncle, and great grandparent relationships have distance 3 from self, and so on.
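To make these distances concrete, here is a minimal sketch (hypothetical code, not taken from the repository) that computes the distance between two family members in the undirected graph induced by parental relations:

```python
from collections import deque

def relationship_distance(parent_edges, start, target):
    """BFS over the undirected graph induced by (parent, child) edges;
    returns the number of edges between start and target, or None."""
    graph = {}
    for parent, child in parent_edges:
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set()).add(parent)
    queue, visited = deque([(start, 0)]), {start}
    while queue:
        person, dist = queue.popleft()
        if person == target:
            return dist
        for neighbor in graph.get(person, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Madison is Wayne's parent and also Michael's parent, so Wayne and Michael
# are siblings -- distance 2, as stated above.
edges = [("Madison", "Wayne"), ("Madison", "Michael")]
print(relationship_distance(edges, "Wayne", "Michael"))  # 2
```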
We use such relationship graphs to programmatically generate family quizzes. Generated quizzes have the following properties:
- Connections between family members are specified by using only the parental relationship.
- Family member connections specify a graph of all family relationships of degree up to N.
- The quiz question is to differentiate between family relationships of degree N.
- The LLM is instructed to select the i-th answer option by enclosing the selected answer number in the <ANSWER> tag, for example <ANSWER>i</ANSWER>.
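As a rough illustration of these properties, the following sketch (hypothetical code; the actual farel_bench.py generator may differ) assembles a quiz prompt of this shape from (parent, child) pairs and a list of answer options:

```python
def format_quiz(relations, question, answers):
    """Build a FaRel-style quiz prompt from (parent, child) pairs,
    a question string, and a list of answer options."""
    lines = ["Given the family relationships:"]
    lines += [f"* {parent} is {child}'s parent." for parent, child in relations]
    lines += [question, "Select the correct answer:"]
    lines += [f"{i}. {answer}" for i, answer in enumerate(answers, start=1)]
    lines.append("Enclose the selected answer number in the <ANSWER> tag, "
                 "for example: <ANSWER>1</ANSWER>.")
    return "\n".join(lines)

print(format_quiz(
    relations=[("Ralph", "Anthony"), ("Albert", "Ralph")],
    question="What is Anthony's relationship to Ralph?",
    answers=["Anthony is Ralph's child.", "Anthony is Ralph's parent."],
))
```

Called as above, it reproduces the first of the three example quizzes below, which cover relationship degrees 1, 2, and 3.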
Given the family relationships:
* Ralph is Anthony's parent.
* Albert is Ralph's parent.
What is Anthony's relationship to Ralph?
Select the correct answer:
1. Anthony is Ralph's child.
2. Anthony is Ralph's parent.
Enclose the selected answer number in the <ANSWER> tag, for example: <ANSWER>1</ANSWER>.
Given the family relationships:
* Wayne is Brittany's parent.
* Billy is Madison's parent.
* Madison is Wayne's parent.
* Brittany is Amanda's parent.
* Madison is Michael's parent.
What is Amanda's relationship to Wayne?
Select the correct answer:
1. Amanda is Wayne's grandparent.
2. Amanda is Wayne's sibling.
3. Amanda is Wayne's grandchild.
Enclose the selected answer number in the <ANSWER> tag, for example: <ANSWER>1</ANSWER>.
Given the family relationships:
* Brittany is Jeremy's parent.
* Peter is Lauren's parent.
* Peter is Madison's parent.
* Brittany is Peter's parent.
* Madison is Betty's parent.
* Richard is Andrea's parent.
* Lauren is Gabriel's parent.
* Gabriel is Richard's parent.
* Janet is Brittany's parent.
What is Andrea's relationship to Lauren?
Select the correct answer:
1. Andrea is Lauren's niece or nephew.
2. Andrea is Lauren's aunt or uncle.
3. Andrea is Lauren's great grandchild.
4. Andrea is Lauren's great grandparent.
Enclose the selected answer number in the <ANSWER> tag, for example: <ANSWER>1</ANSWER>.
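The run scripts then presumably grade each reply by reading the number inside the <ANSWER> tag. A minimal extraction sketch (hypothetical code; the repository's parsing logic may differ):

```python
import re

def extract_answer(response: str):
    """Return the number from the last <ANSWER>i</ANSWER> tag, or None if absent."""
    matches = re.findall(r"<ANSWER>\s*(\d+)\s*</ANSWER>", response)
    return int(matches[-1]) if matches else None

print(extract_answer("Anthony is one generation below Ralph. <ANSWER>1</ANSWER>"))  # 1
print(extract_answer("I cannot decide."))                                           # None
```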
We measure the performance of the LLM by macro-averaging the classification accuracy over all family relationships present in the dataset. So, for example, if a given LLM has the following accuracy values for family relationship quizzes of degrees up to 3:
- child: 100.00
- parent: 100.00
- grandchild: 96.00
- sibling: 22.00
- grandparent: 72.00
- great grandchild: 46.00
- niece or nephew: 46.00
- aunt or uncle: 18.00
- great grandparent: 68.00
then the overall macro-averaged accuracy is (100 + 100 + 96 + 22 + 72 + 46 + 46 + 18 + 68) / 9 ≈ 63.11.
To differentiate between benchmark results calculated for datasets with different maximum family relationship lengths, we propose including the maximum family relationship length in the benchmark result label. So if an accuracy of 63.11 was computed for family relationships of length up to 3, the overall result would be labeled FaRel-3 and would have a value of 63.11.
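A quick check of the macro-averaging arithmetic above in Python:

```python
accuracies = {
    "child": 100.00, "parent": 100.00, "grandchild": 96.00,
    "sibling": 22.00, "grandparent": 72.00, "great grandchild": 46.00,
    "niece or nephew": 46.00, "aunt or uncle": 18.00, "great grandparent": 68.00,
}
farel_3 = sum(accuracies.values()) / len(accuracies)  # unweighted mean over classes
print(f"FaRel-3: {farel_3:.2f}")  # FaRel-3: 63.11
```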
There are five Python scripts in the FaRel benchmark:
- The farel_bench.py script generates family relationship quizzes in CSV format.
- The run_model.py script generates answers for the quizzes using llama.cpp and a selected LLM, and calculates the accuracy values for all family relationships.
- The run_openai.py script generates answers for the quizzes using the OpenAI API and calculates the accuracy values for all family relationships.
- The run_openrouter.py script generates answers for the quizzes using the OpenRouter API and calculates the accuracy values for all family relationships.
- The compute_metrics.py script analyzes log files in a given directory, calculates the macro-averaged accuracy value that is the FaRel benchmark result, and outputs a result table.
Generating quizzes and storing model answers:
./farel_bench.py --shuffle -l 1 -n 50 -r 42 | ./run_model.py -b ~/projects/llama.cpp/main -m ~/projects/llama.cpp/models/llama-2-7b-chat.Q8_0.gguf | tee ./results/llama-2-7b-chat.Q8_0.log
./farel_bench.py --shuffle -l 3 -n 50 -r 42 | ./run_openai.py -m gpt-4 | tee ./results/gpt-4.log
Calculating FaRel benchmark metrics:
./compute_metrics.py ./results/
farel_bench.py is the quiz generator script. It has the following options:
usage: farel_bench.py [-h] -l LENGTH [-p PROMPT] [-s] [-n NUMBER] [-r SEED]
options:
-h, --help show this help message and exit
-l LENGTH, --length LENGTH
Maximum length of family relationship paths.
-p PROMPT, --prompt PROMPT
Prompt template of the quiz. The default prompt template is: 'Given the family
relationships:\n$QUIZ_RELATIONS\n$QUIZ_QUESTION\nSelect the correct
answer:\n$QUIZ_ANSWERS\nEnclose the selected answer number in the <ANSWER> tag, for
example: <ANSWER>1</ANSWER>.'
-s, --shuffle Shuffle the order of parental relations and answer options in the quiz.
-n NUMBER, --number NUMBER
Number of quizzes generated for each family relationship.
-r SEED, --seed SEED Random seed value
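The $-prefixed placeholders in the default template suggest Python's string.Template-style substitution; the sketch below shows how a custom -p template could be filled in (an assumption about the mechanism, not necessarily how farel_bench.py implements it):

```python
from string import Template

# The default template from the --prompt help text above.
default_template = Template(
    "Given the family relationships:\n$QUIZ_RELATIONS\n$QUIZ_QUESTION\n"
    "Select the correct answer:\n$QUIZ_ANSWERS\n"
    "Enclose the selected answer number in the <ANSWER> tag, "
    "for example: <ANSWER>1</ANSWER>."
)
prompt = default_template.substitute(
    QUIZ_RELATIONS="* Ralph is Anthony's parent.\n* Albert is Ralph's parent.",
    QUIZ_QUESTION="What is Anthony's relationship to Ralph?",
    QUIZ_ANSWERS="1. Anthony is Ralph's child.\n2. Anthony is Ralph's parent.",
)
print(prompt)
```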
The run_model.py script uses llama.cpp to generate answers for family relationship quizzes generated by farel_bench.py for a selected LLM.
usage: run_model.py [-h] -b BINARY [-t TIMEOUT] -m MODEL [-s [SYSTEM_PROMPT]]
options:
-h, --help show this help message and exit
-b BINARY, --binary BINARY
Path to the llama.cpp executable binary.
-t TIMEOUT, --timeout TIMEOUT
llama.cpp execution timeout (seconds)
-m MODEL, --model MODEL
Path to the GGUF model file.
-s [SYSTEM_PROMPT], --system-prompt [SYSTEM_PROMPT]
Use given system prompt. By default, the system prompt is not used. When this option is
passed without a value, the default system prompt value is used: 'You are a master of
logical thinking. You carefully analyze the premises step by step, take detailed notes and
draw intermediate conclusions based on which you can find the final answer to any
question.'
The run_openai.py script uses the OpenAI API to generate answers for family relationship quizzes generated by farel_bench.py with a selected model.
usage: run_openai.py [-h] -m MODEL [-s [SYSTEM_PROMPT]]
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
OpenAI model name.
-s [SYSTEM_PROMPT], --system-prompt [SYSTEM_PROMPT]
Use given system prompt. By default, the system prompt is not used. When this option is
passed without a value, the default system prompt value is used: 'You are a master of
logical thinking. You carefully analyze the premises step by step, take detailed notes and
draw intermediate conclusions based on which you can find the final answer to any
question.'
The run_openrouter.py script uses the OpenRouter API to generate answers for family relationship quizzes generated by farel_bench.py with a selected model.
usage: run_openrouter.py [-h] -m MODEL [-s [SYSTEM_PROMPT]]
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
OpenRouter model name.
-s [SYSTEM_PROMPT], --system-prompt [SYSTEM_PROMPT]
Use given system prompt. By default, the system prompt is not used. When this option is
passed without a value, the default system prompt value is used: 'You are a master of
logical thinking. You carefully analyze the premises step by step, take detailed notes and
draw intermediate conclusions based on which you can find the final answer to any
question.'
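For reference, here is a minimal sketch of sending a single quiz through an OpenAI-compatible chat completions endpoint with the optional system prompt (hypothetical code using the openai Python client; run_openai.py and run_openrouter.py may be structured differently, and the model name is only an example):

```python
from openai import OpenAI

# For OpenRouter, an OpenAI-compatible base_url and its API key would be
# supplied here instead (assumption; check the scripts for the exact setup).
client = OpenAI()

SYSTEM_PROMPT = (
    "You are a master of logical thinking. You carefully analyze the premises "
    "step by step, take detailed notes and draw intermediate conclusions based "
    "on which you can find the final answer to any question."
)

def ask(quiz: str, model: str = "gpt-4o", use_system_prompt: bool = False) -> str:
    """Send one quiz prompt to the chat API and return the raw model reply."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] if use_system_prompt else []
    messages.append({"role": "user", "content": quiz})
    completion = client.chat.completions.create(model=model, messages=messages)
    return completion.choices[0].message.content
```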
The compute_metrics.py script reads .log files from a given directory, calculates the FaRel benchmark metrics, and prints a result table in Markdown format.
usage: compute_metrics.py [-h] dir
positional arguments:
dir Directory containing farel-bench log files.
options:
-h, --help show this help message and exit