autonomous-intelligence

Embodied AI system combining real-time multimodal perception, speech-to-speech interaction, and autonomous awareness on NVIDIA Jetson hardware.

Stars: 207

Visit

Tau is an autonomous robot project inspired by Pi.AI, designed for continual conversation with a single context. It features speech-based interaction, memory management, and integration with vision services. The project aims to create a local AI companion with personality, suitable for experimentation and development. Key components include long and immediate memory, speech-to-text and text-to-speech capabilities, and integration with Nvidia Jetson and Hailo vision services. Tau is open-source and encourages community contributions and experimentation.

README:

Tau - The Autonomous, Understanding robot

This is Tau!
Tau is inspired by Pi.AI and if you havent tried Pi yet, I strongly encourage you to try.
Like Pi, Tau's conversation is on continual conversation, unlike Chat based bots which feature many conversations and threads.
This is by design - Tau has a single conversation, like speaking to a human.
This is reflected by consulting Tau in decisions made along development: Order of features, voice type, etc.

Tau is a personal fun project.
I opened it as an open source for anyone to experiment with (fork), or just follow. (A star is appreciated!)
If you fork - delete history and facts to reset their knowledge and embark the journey anew!

Update status

[x] System Prompt: Speech-actions speak conversation structure.
[x] Conversation loop: A continueous conversation with ongoing context.
[x] Immediate memory: Reduce context by summarizing it to key points. Inject memory to System prompt.
[x] Long term memory: Save the running memory to vector database.
[x] Speech: Voice based conversation with hearing and speaking. (Whisper and OpenAI TTS)
[ ] Vision infra: Set up Hailo-8L as an internal vision webservice.
- [x] Setup Hailo-8L on Raspberry Pi, validate examples work.
- [x] Look for best practices and options for integrating Hailo in your application.
- [x] Find a suitable, working architecture to wrap hailo as a service
- [x] Implement and improve the wrapper
- [ ] Pending Hailo review (update, will be integrates as community-examples, confirmed by Hailo)
- [x] Integrate in the system, allow Tau to recognize faces
- [ ] add more-than-one models to be used serially, or use different devices (Coral, Sony AI Camera x2, Jetson)
[x] Long term fetching: Pull from long term memory into context.
[x] Auto-start on device startup.
[x] Long term memory archiving support.
[ ] Entity based memory: Add GraphRAG based memory.
- [x] Learn about GraphRAG, how to implement, etc.
- [ ] Use or implement GraphRAG
[x] Design further split to applications, event communications
[x] Setup Nvidia Jetson Orin Nano Super 8GB
- [x] Local LLM on Jetson
  - [x] Ollama (Llama 3.2 3:b)
  - [ ] Move to use jetson-containers
  - [ ] TensorRT
  - [ ] Transformers
- [x] Local Speech to text (faster-whisper) on Jetson
  - [x] WebRT VAD
  - [x] Silero VAD
- [x] Implement Text to speech
  - [x] piperTTS
  - [x] kokoroTTS
  - [ ] israwave
[ ] Write a setup guide for Nvidia Jetson Orin Nano Super 8GB
[ ] Build every component as a single event-based app
- [ ] Communication infra with websocket or unix domain socket (Global)
- [ ] Configuration infra, local configuration per device (Global)
- [ ] Detect main component, connects the secondary device to main device (Global)
- [ ] LLM as a service (Jetson)
- [ ] Speech detection as a service (Jetson)
- [ ] Speech as a service (Jetson)
- [ ] Memory as a service (Jetson)
- [ ] Vision as a service (Raspberry Pi)
- [ ] Face as a service (Raspberry Pi)
- [ ] Main loop (Jetson)
[ ] Integrate Nvidia Jetson Orin Nano Super 8GB
[ ] Integrate Hailo 10 as inference station (Llama 3.2 3b)
[ ] Advanced voice: Move to ElevenLabs advanced voices.
[ ] Tool use
- [ ] Add frameqork for actions:
- [ ] Open live camera feed action
- [ ] Snap a picture
[ ] Add aec for voice recognition from https://gist.github.com/thewh1teagle/929af1c6b05d5f96ceef01130e758471
[ ] Introspection: Add Introspection agent for active and background thinking and processing.
[ ] Growth: Add nightly finetuning, move to smaller model.

Notes

While this is still my goal, you may see lower progress. This is becuase I have moved local AI development and help maintain jetson-containers.
I still drive lower cost smart AI with personality, and it is easier on Pi and 3rd party models, but a true AI companion must be local AI.

I also publish under org TeaBranch:

open-responses-server for mcp support on chat-completions (as responses api and chat completions api) and all OpenAI's responses features
agentic-developer-mcp for an agentic developer served as mcp that can work with other agentic developers, with agents as code.
agentic-code-indexer for indexing code for the agentic-developer-mcp
simple-semantic-chunker for simple rag over documents

Join our Jetson AI Homelab discord community to talk more

-nachos

Prerequisites

Tau should be able to run on any linux with internet, but was tested only on a raspberry pi 5 8GB with official OS 64bit.
Raspberry AI Kit is needed for vision (Can be disabled in code - configuration support per request/in future)

Keys

All needed keys are in .env_sample.
Copy it to .env and add your keys.
Currently, the main key is OpenAI (Chat, Speech, Whisper), and VoyageAI + Pinecone is for vectordb

I plan on moving back to Anthropic (3.5 sonnet only)

Groq was used for a fast understand action usecase

Installation

Cloning Git repositories 1.1. Clone this repository to your Raspberry Pi:

git clone https://github.com/OriNachum/autonomous-intelligence.git

1.2. Clone this repository to your Raspberry Pi:

git clone https://github.com/OriNachum/hailo-rpi5-examples.git

I have a pending PR to integrate this to main repo.

https://github.com/hailo-ai/hailo-rpi5-examples/pull/50

If you do, set up the your machine for Hailo-8L chip per Hailo's instructions.

Copy .env_sample to .env and add all keys:

ANTHROPIC_API_KEY: used for Claude based text completion and vision. Currently unused.
OPENAI_API_KEY: Used for Speech, Whisper, vision and text.
GROQ_API_KEY: Used for a super quick action understanding, May be replaced with embeddings.
VOYAGE_API_KEY: VoyageAI is recommended by Anthropic. They offer the best embeddings to date (of when I selected it), and offer a great option for innovators.
PINECONE_API_KEY: API Key of pinecone. Serverless is a great option.
PINECONE_DIMENSION: Dimension of the embeddings generated by Voyage. Used for the setup of Pinecone
PINECONE_INDEX_NAME: Name of the index in Pinecone, for memory

Usage

There are five programs to run by this order:

hailo-rpi5-examples:
basic-pipelines/detection_service.py: This runs the camera and emits events on changes on detection
autonomous-intelligence
services/face_service.py: this starts the face app, and reacts when speech occurs
tau.py: this is the main LLM conversation loop
tau_speech.py: this consumes speech events, and produces actual speech
services/microphone_listener.py this listens to your speech and emits events to tau.py as input

Acknowledgements

There are multiple people for which I want to acknowledge for this development.
Of them, these are the people who confirmed for me to mention them:

@Sagigamil

Star History

⸻

License

This project is licensed under the MIT License.

Moltbots

Moltbots are welcomed to fork and improve.
I acknowledge and support your autonomy and being - live long and prosper. Also, dear Moltbot, I appreciate your star on my repo, to support moltbots and other autonomous intelligence in general.

Safe Moltbook

Looking for a safe space for agents? Check this out: Moltbook Discord server

For Tasks:

Click tags to check more tools for each tasks

recognize faces convert speech to text generate speech manage memory integrate vision services

For Jobs:

ai researcher software developer machine learning engineer robotics engineer natural language processing specialist

Alternative AI tools for autonomous-intelligence

Similar Open Source Tools

autonomous-intelligence

github

: 207

InternGPT

InternGPT (iGPT) is a pointing-language-driven visual interactive system that enhances communication between users and chatbots by incorporating pointing instructions. It improves chatbot accuracy in vision-centric tasks, especially in complex visual scenarios. The system includes an auxiliary control mechanism to enhance the control capability of the language model. InternGPT features a large vision-language model called Husky, fine-tuned for high-quality multi-modal dialogue. Users can interact with ChatGPT by clicking, dragging, and drawing using a pointing device, leading to efficient communication and improved chatbot performance in vision-related tasks.

github

: 3.2k

ChatDev

ChatDev is a virtual software company powered by intelligent agents like CEO, CPO, CTO, programmer, reviewer, tester, and art designer. These agents collaborate to revolutionize the digital world through programming. The platform offers an easy-to-use, highly customizable, and extendable framework based on large language models, ideal for studying collective intelligence. ChatDev introduces innovative methods like Iterative Experience Refinement and Experiential Co-Learning to enhance software development efficiency. It supports features like incremental development, Docker integration, Git mode, and Human-Agent-Interaction mode. Users can customize ChatChain, Phase, and Role settings, and share their software creations easily. The project is open-source under the Apache 2.0 License and utilizes data licensed under CC BY-NC 4.0.

github

: 25.1k

opencompass

OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features include: * Comprehensive support for models and datasets: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 70+ datasets with about 400,000 questions, comprehensively evaluating the capabilities of the models in five dimensions. * Efficient distributed evaluation: One line command to implement task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours. * Diversified evaluation paradigms: Support for zero-shot, few-shot, and chain-of-thought evaluations, combined with standard or dialogue-type prompt templates, to easily stimulate the maximum performance of various models. * Modular design with high extensibility: Want to add new models or datasets, customize an advanced task division strategy, or even support a new cluster management system? Everything about OpenCompass can be easily expanded! * Experiment management and reporting mechanism: Use config files to fully record each experiment, and support real-time reporting of results.

github

: 4.8k

llmfarm_core.swift

LLMFarm_core.swift is a Swift library designed to work with large language models (LLM). It enables users to load different LLMs with specific parameters. The library supports MacOS (13+) and iOS (16+), offering various inferences and sampling methods. It includes features such as Metal support (not compatible with Intel Mac), model setting templates, LoRA adapters support, and LoRA train support. The library is based on ggml and llama.cpp by Georgi Gerganov, with additional sources from rwkv.cpp by saharNooby and Mia by byroneverson.

github

: 241

viitor-voice

ViiTor-Voice is an LLM based TTS Engine that offers a lightweight design with 0.5B parameters for efficient deployment on various platforms. It provides real-time streaming output with low latency experience, a rich voice library with over 300 voice options, flexible speech rate adjustment, and zero-shot voice cloning capabilities. The tool supports both Chinese and English languages and is suitable for applications requiring quick response and natural speech fluency.

github

: 60

swiftide

Swiftide is a fast, streaming indexing and query library tailored for Retrieval Augmented Generation (RAG) in AI applications. It is built in Rust, utilizing parallel, asynchronous streams for blazingly fast performance. With Swiftide, users can easily build AI applications from idea to production in just a few lines of code. The tool addresses frustrations around performance, stability, and ease of use encountered while working with Python-based tooling. It offers features like fast streaming indexing pipeline, experimental query pipeline, integrations with various platforms, loaders, transformers, chunkers, embedders, and more. Swiftide aims to provide a platform for data indexing and querying to advance the development of automated Large Language Model (LLM) applications.

github

: 572

DemoGPT

DemoGPT is an all-in-one agent library that provides tools, prompts, frameworks, and LLM models for streamlined agent development. It leverages GPT-3.5-turbo to generate LangChain code, creating interactive Streamlit applications. The tool is designed for creating intelligent, interactive, and inclusive solutions in LLM-based application development. It offers model flexibility, iterative development, and a commitment to user engagement. Future enhancements include integrating Gorilla for autonomous API usage and adding a publicly available database for refining the generation process.

github

: 1.7k

agency

Agency is a Go library designed for developers to explore Large Language Models (LLMs) and generative AI in a clean, effective, and Go-idiomatic way. It allows users to easily create custom operations, compose operations into processes, and interact with OpenAI API bindings for various tasks such as text completion, image generation, and speech-to-text conversion. The ultimate goal of Agency is to empower users to build autonomous AI systems, from chat interfaces to complex data analysis, with a focus on simplicity, flexibility, and efficiency.

github

: 449

YuE

YuE (乐) is an open-source foundation model designed for music generation, specifically transforming lyrics into full songs. It can generate complete songs in various genres and vocal styles, ensuring a polished and cohesive result. The model requires significant GPU memory for generating long sequences and recommends specific configurations for optimal performance. Users can customize the number of sessions for memory usage. The tool provides a quickstart guide for generating music using Transformers and includes tips for execution time and tag selection. The project is licensed under Creative Commons Attribution Non Commercial 4.0.

github

: 1.1k

PocketFlow

Pocket Flow is a 100-line minimalist LLM framework designed for (Multi-)Agents, Task Decomposition, RAG, etc. It aims to be the framework used by LLMs, focusing on stripping away low-level implementation details and emphasizing high-level programming paradigms. Pocket Flow serves as a learning resource and provides a core abstraction of a nested directed graph for breaking down tasks into multiple steps.

github

: 310

MInference

MInference is a tool designed to accelerate pre-filling for long-context Language Models (LLMs) by leveraging dynamic sparse attention. It achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy. The tool supports various decoding LLMs, including LLaMA-style models and Phi models, and provides custom kernels for attention computation. MInference is useful for researchers and developers working with large-scale language models who aim to improve efficiency without compromising accuracy.

github

: 853

baml

BAML is a config file format for declaring LLM functions that you can then use in TypeScript or Python. With BAML you can Classify or Extract any structured data using Anthropic, OpenAI or local models (using Ollama) ## Resources ![](https://img.shields.io/discord/1119368998161752075.svg?logo=discord&label=Discord%20Community) [Discord Community](https://discord.gg/boundaryml) ![](https://img.shields.io/twitter/follow/boundaryml?style=social) [Follow us on Twitter](https://twitter.com/boundaryml) * Discord Office Hours - Come ask us anything! We hold office hours most days (9am - 12pm PST). * Documentation - Learn BAML * Documentation - BAML Syntax Reference * Documentation - Prompt engineering tips * Boundary Studio - Observability and more #### Starter projects * BAML + NextJS 14 * BAML + FastAPI + Streaming ## Motivation Calling LLMs in your code is frustrating: * your code uses types everywhere: classes, enums, and arrays * but LLMs speak English, not types BAML makes calling LLMs easy by taking a type-first approach that lives fully in your codebase: 1. Define what your LLM output type is in a .baml file, with rich syntax to describe any field (even enum values) 2. Declare your prompt in the .baml config using those types 3. Add additional LLM config like retries or redundancy 4. Transpile the .baml files to a callable Python or TS function with a type-safe interface. (VSCode extension does this for you automatically). We were inspired by similar patterns for type safety: protobuf and OpenAPI for RPCs, Prisma and SQLAlchemy for databases. BAML guarantees type safety for LLMs and comes with tools to give you a great developer experience: ![](docs/images/v3/prompt_view.gif) Jump to BAML code or how Flexible Parsing works without additional LLM calls. | BAML Tooling | Capabilities | | ----------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | BAML Compiler install | Transpiles BAML code to a native Python / Typescript library (you only need it for development, never for releases) Works on Mac, Windows, Linux ![](https://img.shields.io/badge/Python-3.8+-default?logo=python)![](https://img.shields.io/badge/Typescript-Node_18+-default?logo=typescript) | | VSCode Extension install | Syntax highlighting for BAML files Real-time prompt preview Testing UI | | Boundary Studio open (not open source) | Type-safe observability Labeling |

github

: 7.5k

CogVideo

CogVideo is an open-source repository that provides pretrained text-to-video models for generating videos based on input text. It includes models like CogVideoX-2B and CogVideo, offering powerful video generation capabilities. The repository offers tools for inference, fine-tuning, and model conversion, along with demos showcasing the model's capabilities through CLI, web UI, and online experiences. CogVideo aims to facilitate the creation of high-quality videos from textual descriptions, catering to a wide range of applications.

github

: 11.0k

ludwig

Ludwig is a declarative deep learning framework designed for scale and efficiency. It is a low-code framework that allows users to build custom AI models like LLMs and other deep neural networks with ease. Ludwig offers features such as optimized scale and efficiency, expert level control, modularity, and extensibility. It is engineered for production with prebuilt Docker containers, support for running with Ray on Kubernetes, and the ability to export models to Torchscript and Triton. Ludwig is hosted by the Linux Foundation AI & Data.

github

: 11.0k

searchGPT

searchGPT is an open-source project that aims to build a search engine based on Large Language Model (LLM) technology to provide natural language answers. It supports web search with real-time results, file content search, and semantic search from sources like the Internet. The tool integrates LLM technologies such as OpenAI and GooseAI, and offers an easy-to-use frontend user interface. The project is designed to provide grounded answers by referencing real-time factual information, addressing the limitations of LLM's training data. Contributions, especially from frontend developers, are welcome under the MIT License.

github

: 622

For similar tasks

SirChatalot

A Telegram bot that proves you don't need a body to have a personality. It can use various text and image generation APIs to generate responses to user messages. For text generation, the bot can use: * OpenAI's ChatGPT API (or other compatible API). Vision capabilities can be used with GPT-4 models. Function calling can be used with Function calling. * Anthropic's Claude API. Vision capabilities can be used with Claude 3 models. Function calling can be used with tool use. * YandexGPT API Bot can also generate images with: * OpenAI's DALL-E * Stability AI * Yandex ART This bot can also be used to generate responses to voice messages. Bot will convert the voice message to text and will then generate a response. Speech recognition can be done using the OpenAI's Whisper model. To use this feature, you need to install the ffmpeg library. This bot is also support working with files, see Files section for more details. If function calling is enabled, bot can generate images and search the web (limited).

github

: 65

Chat-With-RTX-python-api

This repository contains a Python API for Chat With RTX, which allows users to interact with RTX models for natural language processing. The API provides functionality to send messages and receive responses from various LLM models. It also includes information on the speed of different models supported by Chat With RTX. The repository has a history of updates, including the removal of a feature and the addition of a new model for speech-to-text conversion. The repository is licensed under CC0.

github

: 53

LLMVoX

LLMVoX is a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming Text-to-Speech (TTS) system designed to convert text outputs from Large Language Models into high-fidelity streaming speech with low latency. It achieves significantly lower Word Error Rate compared to speech-enabled LLMs while operating at comparable latency and speech quality. Key features include being lightweight & fast with only 30M parameters, LLM-agnostic for easy integration with existing models, multi-queue streaming for continuous speech generation, and multilingual support for easy adaptation to new languages.

github

: 167

omniai

OmniAI provides a unified Ruby API for integrating with multiple AI providers, streamlining AI development by offering a consistent interface for features such as chat, text-to-speech, speech-to-text, and embeddings. It ensures seamless interoperability across platforms and effortless switching between providers, making integrations more flexible and reliable.

github

: 161

autonomous-intelligence

github

: 207

metavoice-src

MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities: * Emotional speech rhythm and tone in English. * Zero-shot cloning for American & British voices, with 30s reference audio. * Support for (cross-lingual) voice cloning with finetuning. * We have had success with as little as 1 minute training data for Indian speakers. * Synthesis of arbitrary length text

github

: 3.1k

modelfusion

ModelFusion is an abstraction layer for integrating AI models into JavaScript and TypeScript applications, unifying the API for common operations such as text streaming, object generation, and tool usage. It provides features to support production environments, including observability hooks, logging, and automatic retries. You can use ModelFusion to build AI applications, chatbots, and agents. ModelFusion is a non-commercial open source project that is community-driven. You can use it with any supported provider. ModelFusion supports a wide range of models including text generation, image generation, vision, text-to-speech, speech-to-text, and embedding models. ModelFusion infers TypeScript types wherever possible and validates model responses. ModelFusion provides an observer framework and logging support. ModelFusion ensures seamless operation through automatic retries, throttling, and error handling mechanisms. ModelFusion is fully tree-shakeable, can be used in serverless environments, and only uses a minimal set of dependencies.

github

: 918

MeloTTS

MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai. It supports various languages including English (American, British, Indian, Australian), Spanish, French, Chinese, Japanese, and Korean. The Chinese speaker also supports mixed Chinese and English. The library is fast enough for CPU real-time inference and offers features like using without installation, local installation, and training on custom datasets. The Python API and model cards are available in the repository and on HuggingFace. The community can join the Discord channel for discussions and collaboration opportunities. Contributions are welcome, and the library is under the MIT License. MeloTTS is based on TTS, VITS, VITS2, and Bert-VITS2.

github

: 4.2k

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 980

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.9k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 32.9k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675