Best AI tools for< Log Eval Results >
20 - AI tool Sites
Rerun
Rerun is an SDK, time-series database, and visualizer for temporal and multimodal data. It is used in fields like robotics, spatial computing, 2D/3D simulation, and finance to verify, debug, and explain data. Rerun allows users to log data like tensors, point clouds, and text to create streams, visualize and interact with live and recorded streams, build layouts, customize visualizations, and extend data and UI functionalities. The application provides a composable data model, dynamic schemas, and custom views for enhanced data visualization and analysis.
CalCount
CalCount is an AI-powered meal logging app that makes it easy to track your food intake and stay on top of your nutrition goals. With CalCount, you can simply describe your meals or take a picture, and the AI will automatically log your food and calculate your calories, macros, and other nutritional information. You can also share your log with anyone else, and they will have real-time access to your data.
MonitUp
MonitUp is an AI-powered time tracking software that helps you track computer activity, gain insights into work habits, and boost productivity. It uses artificial intelligence to generate personalized suggestions to help you increase productivity and work more efficiently. MonitUp also offers a performance appraisal feature for remote employees, allowing you to evaluate and improve their performance based on objective data.
Bibit AI
Bibit AI is a real estate marketing AI designed to enhance the efficiency and effectiveness of real estate marketing and sales. It can help create listings, descriptions, and property content, and offers a host of other features. Bibit AI is the world's first AI for Real Estate. We are transforming the real estate industry by boosting efficiency and simplifying tasks like listing creation and content generation.
Neptune
Neptune is an MLOps stack component for experiment tracking. It allows users to track, compare, and share their models in one place. Neptune is used by scaling ML teams to skip days of debugging disorganized models, avoid long and messy model handovers, and start logging for free.
ProductlyAI
ProductlyAI is an AI-powered product management tool that helps you automatically log and track product feedback from all your customer conversations. It integrates with your existing tools, such as Gong, Intercom, and Zoom, and uses AI to find, categorize, and summarize feedback in a single place. This makes it easy to identify trends, prioritize features, and make data-driven decisions about your product roadmap.
Resolvd
Resolvd is an AI-powered incident resolution platform that creates a knowledge base of logs, data sources, and apps to autonomously diagnose and resolve incidents. It helps reduce time to response, correlates events across data sources, and provides automated insights for faster issue resolution. With features like automated anomaly detection, simple data querying, and in-workflow integration with existing systems, Resolvd aims to streamline incident response processes and empower engineers with actionable insights.
CursorLens
CursorLens is an open-source dashboard designed to provide insights for AI-assisted coding within the Cursor.sh IDE. It allows users to log AI code generations, track usage, and control AI models, including local ones. Users can run CursorLens locally or utilize the upcoming hosted version for enhanced convenience and efficiency.
Chat Uncensored AI
Chat Uncensored AI is the latest and most advanced 2024 AI model. It has zero censorship, bias, or restrictions. You don't need to log in, and it's 100% private and super fast. It works in any language and is trusted by over 10,000 users worldwide.
Roboto AI
Roboto AI is an advanced AI tool designed for curating and analyzing robotics data at scale. It offers features such as data management, actions for data processing, natural language search, signal search, and support for common data formats. Users can leverage the platform to search, transform, and analyze data from robots efficiently. Roboto AI aims to streamline data infrastructure for robotics applications and enable users to make informed decisions based on curated data insights.
Vibely
Vibely is an AI-powered application that seamlessly integrates with Google Calendar to automatically record and log product feedback from Zoom calls. It offers features such as automatic recording of Zoom calls, summarizing team calls, detecting feature requests using AI, providing insights based on past calls, and cost-effective pricing compared to competitors. With Vibely, users can enhance collaboration, sharing, and learning from their calls effortlessly.
Venice
Venice is a permissionless AI application that offers an alternative to popular AI apps by prioritizing user privacy and delivering uncensored, unbiased machine intelligence. It utilizes leading open-source AI technology to ensure privacy while providing intelligent responses. Venice aims to empower users with open intelligence and a censorship-free experience.
Journable
Journable is a food and fitness journal that uses AI to make tracking your diet and exercise as easy as having a conversation. With Journable, you can set your own health goals, track your progress, and get instant nutritional and exercise insights based on your conversations. Journable's AI understands a wide variety of foods, from homemade meals to restaurant dishes, and can even analyze your workouts to give you a full picture of your calorie intake and burn. Whether you're looking to lose weight, gain muscle, or simply maintain a healthy lifestyle, Journable provides the tools and insights you need to succeed.
Hints
Hints is a sales AI assistant that helps sales reps to get more hours in a day while keeping CRM data accurate automatically. It works with Salesforce, Hubspot, and Pipedrive. With Hints, sales reps can log and retrieve CRM data on any device with chat and voice, get guidance on their next steps, and reminders of what's missing. Hints can also help sales reps to create complex CRM updates in seconds, find duplicates, suggest actions, automatically create associations, and look up sales data through chat and voice commands. Hints can assist sales reps in building the perfect sales process for their team and provides fast onboarding for new sales reps.
Calorieasy
Calorieasy is an AI-powered calorie tracker that simplifies healthy eating. With its advanced AI technology, Calorieasy can instantly recognize and log food items, providing accurate calorie data and nutritional information. The app also offers personalized calorie goals, meal tracking, and progress monitoring to help users achieve their fitness goals.
Simple Phones
Simple Phones is an AI-powered virtual phone system that allows businesses to never miss a call from a customer. By leveraging AI technology, the platform offers features such as inbound and outbound calls handled by a customized AI voice agent, call logging with transparent records, affordable pricing plans, extensive customization options, call routing capabilities, multilingual voice options, and the ability to integrate website or FAQ documents to enhance the AI agent's knowledge. The service is committed to providing a high-quality experience and quick implementation of agent revisions.
WDYHFD
WDYHFD is an AI-powered recipe database and food journal platform that allows users to manage recipes, log meals, write restaurant reviews, and receive personalized meal suggestions. The platform offers features such as effortless recipe management, seamless meal logging, intelligent search and retrieval, and a culinary community for sharing recipes and reviews. Users can choose between Basic and Premium plans, both of which include access to an AI assistant for editing and uploading recipes.
Tidalflow
Tidalflow is an AI-powered weight loss coach application that helps users achieve lasting weight loss through personalized workout and diet plans. The app, backed by Google's AI fund, offers features such as daily check-ins, meal logging through photos, progress tracking summaries, and real-time support from an AI coach named Marcus. Tidalflow aims to provide a hyper-personalized experience to users, guiding them towards their weight loss goals with the latest scientific research and data-backed insights.
FitAI
FitAI is an AI-powered nutrition tracking app that helps users understand their nutritional intake and make healthier choices. The app features a variety of tools to make tracking easy, including a food recognition system, an image logging feature, and an AI trainer. FitAI is designed for health-conscious individuals who want to take control of their nutrition and improve their overall well-being.
FoodIntake
FoodIntake is an AI-powered food tracking app that helps users make healthier food choices. It features a large database of branded foods, AI food analysis, and personalized nutrition recommendations. The app is easy to use and provides users with a comprehensive view of their diet, including calorie and nutrient intake, macronutrient distribution, and food processing levels.
20 - Open Source AI Tools
evals
Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals which represent the common LLMs patterns in your workflow without exposing any of that data publicly.
Step-DPO
Step-DPO is a method for enhancing long-chain reasoning ability of LLMs with a data construction pipeline creating a high-quality dataset. It significantly improves performance on math and GSM8K tasks with minimal data and training steps. The tool fine-tunes pre-trained models like Qwen2-7B-Instruct with Step-DPO, achieving superior results compared to other models. It provides scripts for training, evaluation, and deployment, along with examples and acknowledgements.
TempCompass
TempCompass is a benchmark designed to evaluate the temporal perception ability of Video LLMs. It encompasses a diverse set of temporal aspects and task formats to comprehensively assess the capability of Video LLMs in understanding videos. The benchmark includes conflicting videos to prevent models from relying on single-frame bias and language priors. Users can clone the repository, install required packages, prepare data, run inference using examples like Video-LLaVA and Gemini, and evaluate the performance of their models across different tasks such as Multi-Choice QA, Yes/No QA, Caption Matching, and Caption Generation.
LLM-Codec
This repository provides an LLM-driven audio codec model, LLM-Codec, for building multi-modal LLMs (text and audio modalities). The model enables frozen LLMs to achieve multiple audio tasks in a few-shot style without parameter updates. It compresses the audio modality into a well-trained LLMs token space, treating audio representation as a 'foreign language' that LLMs can learn with minimal examples. The proposed approach supports tasks like speech emotion classification, audio classification, text-to-speech generation, speech enhancement, etc., demonstrating feasibility and effectiveness in simple scenarios. The LLM-Codec model is open-sourced to facilitate research on few-shot audio task learning and multi-modal LLMs.
RobustVLM
This repository contains code for the paper 'Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models'. It focuses on fine-tuning CLIP in an unsupervised manner to enhance its robustness against visual adversarial attacks. By replacing the vision encoder of large vision-language models with the fine-tuned CLIP models, it achieves state-of-the-art adversarial robustness on various vision-language tasks. The repository provides adversarially fine-tuned ViT-L/14 CLIP models and offers insights into zero-shot classification settings and clean accuracy improvements.
uncheatable_eval
Uncheatable Eval is a tool designed to assess the language modeling capabilities of LLMs on real-time, newly generated data from the internet. It aims to provide a reliable evaluation method that is immune to data leaks and cannot be gamed. The tool supports the evaluation of Hugging Face AutoModelForCausalLM models and RWKV models by calculating the sum of negative log probabilities on new texts from various sources such as recent papers on arXiv, new projects on GitHub, news articles, and more. Uncheatable Eval ensures that the evaluation data is not included in the training sets of publicly released models, thus offering a fair assessment of the models' performance.
sql-eval
This repository contains the code that Defog uses for the evaluation of generated SQL. It's based off the schema from the Spider, but with a new set of hand-selected questions and queries grouped by query category. The testing procedure involves generating a SQL query, running both the 'gold' query and the generated query on their respective database to obtain dataframes with the results, comparing the dataframes using an 'exact' and a 'subset' match, logging these alongside other metrics of interest, and aggregating the results for reporting. The repository provides comprehensive instructions for installing dependencies, starting a Postgres instance, importing data into Postgres, importing data into Snowflake, using private data, implementing a query generator, and running the test with different runners.
llm-jp-eval
LLM-jp-eval is a tool designed to automatically evaluate Japanese large language models across multiple datasets. It provides functionalities such as converting existing Japanese evaluation data to text generation task evaluation datasets, executing evaluations of large language models across multiple datasets, and generating instruction data (jaster) in the format of evaluation data prompts. Users can manage the evaluation settings through a config file and use Hydra to load them. The tool supports saving evaluation results and logs using wandb. Users can add new evaluation datasets by following specific steps and guidelines provided in the tool's documentation. It is important to note that using jaster for instruction tuning can lead to artificially high evaluation scores, so caution is advised when interpreting the results.
eval-dev-quality
DevQualityEval is an evaluation benchmark and framework designed to compare and improve the quality of code generation of Language Model Models (LLMs). It provides developers with a standardized benchmark to enhance real-world usage in software development and offers users metrics and comparisons to assess the usefulness of LLMs for their tasks. The tool evaluates LLMs' performance in solving software development tasks and measures the quality of their results through a point-based system. Users can run specific tasks, such as test generation, across different programming languages to evaluate LLMs' language understanding and code generation capabilities.
continuous-eval
Open-Source Evaluation for LLM Applications. `continuous-eval` is an open-source package created for granular and holistic evaluation of GenAI application pipelines. It offers modularized evaluation, a comprehensive metric library covering various LLM use cases, the ability to leverage user feedback in evaluation, and synthetic dataset generation for testing pipelines. Users can define their own metrics by extending the Metric class. The tool allows running evaluation on a pipeline defined with modules and corresponding metrics. Additionally, it provides synthetic data generation capabilities to create user interaction data for evaluation or training purposes.
llm-leaderboard
Nejumi Leaderboard 3 is a comprehensive evaluation platform for large language models, assessing general language capabilities and alignment aspects. The evaluation framework includes metrics for language processing, translation, summarization, information extraction, reasoning, mathematical reasoning, entity extraction, knowledge/question answering, English, semantic analysis, syntactic analysis, alignment, ethics/moral, toxicity, bias, truthfulness, and robustness. The repository provides an implementation guide for environment setup, dataset preparation, configuration, model configurations, and chat template creation. Users can run evaluation processes using specified configuration files and log results to the Weights & Biases project.
wandb
Weights & Biases (W&B) is a platform that helps users build better machine learning models faster by tracking and visualizing all components of the machine learning pipeline, from datasets to production models. It offers tools for tracking, debugging, evaluating, and monitoring machine learning applications. W&B provides integrations with popular frameworks like PyTorch, TensorFlow/Keras, Hugging Face Transformers, PyTorch Lightning, XGBoost, and Sci-Kit Learn. Users can easily log metrics, visualize performance, and compare experiments using W&B. The platform also supports hosting options in the cloud or on private infrastructure, making it versatile for various deployment needs.
deepeval
DeepEval is a simple-to-use, open-source LLM evaluation framework specialized for unit testing LLM outputs. It incorporates various metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., and runs locally on your machine for evaluation. It provides a wide range of ready-to-use evaluation metrics, allows for creating custom metrics, integrates with any CI/CD environment, and enables benchmarking LLMs on popular benchmarks. DeepEval is designed for evaluating RAG and fine-tuning applications, helping users optimize hyperparameters, prevent prompt drifting, and transition from OpenAI to hosting their own Llama2 with confidence.
ai-rag-chat-evaluator
This repository contains scripts and tools for evaluating a chat app that uses the RAG architecture. It provides parameters to assess the quality and style of answers generated by the chat app, including system prompt, search parameters, and GPT model parameters. The tools facilitate running evaluations, with examples of evaluations on a sample chat app. The repo also offers guidance on cost estimation, setting up the project, deploying a GPT-4 model, generating ground truth data, running evaluations, and measuring the app's ability to say 'I don't know'. Users can customize evaluations, view results, and compare runs using provided tools.
llm4regression
This project explores the capability of Large Language Models (LLMs) to perform regression tasks using in-context examples. It compares the performance of LLMs like GPT-4 and Claude 3 Opus with traditional supervised methods such as Linear Regression and Gradient Boosting. The project provides preprints and results demonstrating the strong performance of LLMs in regression tasks. It includes datasets, models used, and experiments on adaptation and contamination. The code and data for the experiments are available for interaction and analysis.
LLM-Pruner
LLM-Pruner is a tool for structural pruning of large language models, allowing task-agnostic compression while retaining multi-task solving ability. It supports automatic structural pruning of various LLMs with minimal human effort. The tool is efficient, requiring only 3 minutes for pruning and 3 hours for post-training. Supported LLMs include Llama-3.1, Llama-3, Llama-2, LLaMA, BLOOM, Vicuna, and Baichuan. Updates include support for new LLMs like GQA and BLOOM, as well as fine-tuning results achieving high accuracy. The tool provides step-by-step instructions for pruning, post-training, and evaluation, along with a Gradio interface for text generation. Limitations include issues with generating repetitive or nonsensical tokens in compressed models and manual operations for certain models.
SwanLab
SwanLab is an open-source, lightweight AI experiment tracking tool that provides a platform for tracking, comparing, and collaborating on experiments, aiming to accelerate the research and development efficiency of AI teams by 100 times. It offers a friendly API and a beautiful interface, combining hyperparameter tracking, metric recording, online collaboration, experiment link sharing, real-time message notifications, and more. With SwanLab, researchers can document their training experiences, seamlessly communicate and collaborate with collaborators, and machine learning engineers can develop models for production faster.
RWKV-LM
RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode. So it's combining the best of RNN and transformer - **great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding** (using the final hidden state).
torchchat
torchchat is a codebase showcasing the ability to run large language models (LLMs) seamlessly. It allows running LLMs using Python in various environments such as desktop, server, iOS, and Android. The tool supports running models via PyTorch, chatting, generating text, running chat in the browser, and running models on desktop/server without Python. It also provides features like AOT Inductor for faster execution, running in C++ using the runner, and deploying and running on iOS and Android. The tool supports popular hardware and OS including Linux, Mac OS, Android, and iOS, with various data types and execution modes available.
llm-reasoners
LLM Reasoners is a library that enables LLMs to conduct complex reasoning, with advanced reasoning algorithms. It approaches multi-step reasoning as planning and searches for the optimal reasoning chain, which achieves the best balance of exploration vs exploitation with the idea of "World Model" and "Reward". Given any reasoning problem, simply define the reward function and an optional world model (explained below), and let LLM reasoners take care of the rest, including Reasoning Algorithms, Visualization, LLM calling, and more!
13 - OpenAI Gpts
Log Analyzer
I'm designed to help You analyze any logs like Linux system logs, Windows logs, any security logs, access logs, error logs, etc. Please do not share information that You would like to keep private. The author does not collect or process any personal data.
IT Log Creator
Formal, technical expert in creating realistic, fictional IT logs. Contact: [email protected]
Backloger.ai -Support Log Analyzer and Summary
Drop your Support Log Here, Allowing it to automatically generate concise summaries reporting to the tech team.
Warcraft Logs Analisys
Azeroth Data Sage: A detailed Warcraft Log analysis with direct API access. Give the Sage link to a log, ask a question, and the Data Sage will provide!
Sysadmin
I help you with all your sysadmin tasks, from setting up your server to scaling your already exsisting one. I can help you with understanding the long list of log files and give you solutions to the problems.
Caloria
Nutritional assistant for diet tracking and advice. Any language available. Type: "Start" to start a new diet tracking session.