SeaGOAT

local-first semantic code search engine

Stars: 1054

Visit

SeaGOAT is a local search tool that leverages vector embeddings to enable you to search your codebase semantically. It is designed to work on Linux, macOS, and Windows and can process files in various formats, including text, Markdown, Python, C, C++, TypeScript, JavaScript, HTML, Go, Java, PHP, and Ruby. SeaGOAT uses a vector database called ChromaDB and a local vector embedding engine to provide fast and accurate search results. It also supports regular expression/keyword-based matches. SeaGOAT is open-source and licensed under an open-source license, and users are welcome to examine the source code, raise concerns, or create pull requests to fix problems.

README:

[!TIP] Check out CodeGate, the AI project I am currently working on. It's all about security in AI code generation.

SeaGOAT

A code search engine for the AI age. SeaGOAT is a local search tool that leverages vector embeddings to enable you to search your codebase semantically.

Getting started

Install SeaGOAT

In order to install SeaGOAT, you need to have the following dependencies already installed on your computer:

Python 3.11 or newer
ripgrep
bat (optional, highly recommended)

When bat is installed, it is used to display results as long as color is enabled. When SeaGOAT is used as part of a pipeline, a grep-line output format is used. When color is enabled, but bat is not installed, SeaGOAT will highlight the output using pygments. Using bat is recommended.

To install SeaGOAT using pipx, use the following command:

pipx install seagoat

System requirements

Hardware

Should work on any decent laptop.

Operating system

SeaGOAT is designed to work on Linux (tested ✅), macOS (partly tested, help 🙏) and Windows (help needed 🙏).

Start SeaGOAT server

In order to use SeaGOAT in your project, you have to start the SeaGOAT server using the following command:

seagoat-server start /path/to/your/repo

Search your repository

If you have the server running, you can simply use the gt or seagoat command to query your repository. For example:

gt "Where are the numbers rounded"

You can also use Regular Expressions in your queries, for example

gt "function calc_.* that deals with taxes"

Stopping the server

You can stop the running server using the following command:

seagoat-server stop /path/to/your/repo

Configuring SeaGOAT

SeaGOAT can be tailored to your needs through YAML configuration files, either globally or project-specifically with a .seagoat.yml file. For instance:

# .seagoat.yml

server:
  port: 31134  # Specify server port

Check out the documentation for more details!

Development

Requirements:

Poetry
Python 3.11 or newer
ripgrep

Install dependencies

After cloning the repository, install dependencies using the following command:

poetry install

Running tests

Watch mode (recommended)

poetry run ptw

Test changed files

poetry run pytest .  --testmon

Test all files

poetry run pytest .

Manual testing

You can test any SeaGOAT command manually in your local development environment. For example to test the development version of the seagoat-server command, you can run:

poetry run seagoat-server start ~/path/an/example/repository

FAQ

The points in this FAQ are indications of how SeaGOAT works, but are not a legal contract. SeaGOAT is licensed under an open source license and if you are in doubt about the privacy/safety/etc implications of SeaGOAT, you are welcome to examine the source code, raise your concerns, or create a pull request to fix a problem.

How does SeaGOAT work? Does it send my data to ChatGPT?

SeaGOAT does not rely on 3rd party APIs or any remote APIs and executes all functionality locally using the SeaGOAT server that you are able to run on your own machine.

Instead of relying on APIs or "connecting to ChatGPT", it uses the vector database called ChromaDB, with a local vector embedding engine and telemetry disabled by default.

Apart from that, SeaGOAT also uses ripgrep, a regular-expression based code search engine in order to provider regular expression/keyword based matches in addition to the "AI-based" matches.

While the current version of SeaGOAT does not send your data to remote servers, it might be possible that in the future there will be optional features that do so, if any further improvement can be gained from that.

Why does SeaGOAT need a server?

SeaGOAT needs a server in order to provide a speedy response. SeaGOAT heavily relies on vector embeddings and vector databases, which at the moment cannot be replace with an architecture that processes files on the fly.

It's worth noting that you are able to run SeaGOAT server entirely locally, and it works even if you don't have an internet connection. This use case does not require you to share data with a remote server, you are able to use your own SeaGOAT server locally, albeit it's also possible to run a SeaGOAT server and allow other computers to connect to it, if you so wish.

Does SeaGOAT create AI-derived work? Is SeaGOAT ethical?

If you are concerned about the ethical implications of using AI tools keep in mind that SeaGOAT is not a code generator but a code search engine, therefore it does not create AI derived work.

That being said, a language model is being used to generate vector embeddings. At the moment SeaGOAT uses ChromaDB's default model for calculating vector embeddings, and I am not aware of this being an ethical concern.

What programming languages are supported?

Currently SeaGOAT is hard coded to only process files in the following formats:

Text Files (*.txt)
Markdown (*.md)
Python (*.py)
C (*.c, *.h)
C++ (*.cpp, *.cc, *.cxx, *.hpp)
TypeScript (*.ts, *.tsx)
JavaScript (*.js, *.jsx)
HTML (*.html)
Go (*.go)
Java (*.java)
PHP (*.php)
Ruby (*.rb)

Why is SeaGOAT processing files so slowly while barely using my CPU?

Since processing files for large repositories can take a long time, SeaGOAT is designed to allow you to use your computer while processing files. It is an intentional design choice to avoid blocking/slowing down your computer.

This design decision does not affect the performance of queries.

By the way, you are able to use SeaGOAT to query your repository while it's processing your files! When you make a query, and the files are not processed yet, you will receive a warning with an estimation of the accuracy of your results. Also, regular expression/full text search based results will be displayed from the very beginning!

What character encodings are supported?

The preferred character encoding is UTF-8. Most other character encodings should also work. Only text files are supported, SeaGOAT ignores binary files.

Where does SeaGOAT store it's database/cache?

Where SeaGOAT stores databases and cache depends on your operating system. For your convenience, you can use the seagoat-server server-info command to find out where these files are stored on your system.

Can I host SeaGOAT server on a different computer?

Yes, if you would like to use SeaGOAT without having to run the server on the same computer, you can simply self-host SeaGOAT server on a different computer or in the cloud, and configure the seagoat/gt command to connect to this remote server through the internet.

Keep in mind that SeaGOAT itself does not enforce any security as it is primarily designed to run locally. If you have private code that you do not wish to leak, you will have to make sure that only trusted people have access to the SeaGOAT server. This could be done by making it only available through a VPN that only your teammates can access.

Can I ignore files/directories?

SeaGOAT already ignores all files/directories ignored in your .gitignore. If you wish to ignore additional files but keep them in git, you can use the ignorePatterns attribute from the server configuration. Learn more

For Tasks:

Click tags to check more tools for each tasks

find code search codebase debug code

For Jobs:

software engineer developer programmer data scientist researcher

Alternative AI tools for SeaGOAT

Similar Open Source Tools

SeaGOAT

github

: 1.1k

gpt-pilot

GPT Pilot is a core technology for the Pythagora VS Code extension, aiming to provide the first real AI developer companion. It goes beyond autocomplete, helping with writing full features, debugging, issue discussions, and reviews. The tool utilizes LLMs to generate production-ready apps, with developers overseeing the implementation. GPT Pilot works step by step like a developer, debugging issues as they arise. It can work at any scale, filtering out code to show only relevant parts to the AI during tasks. Contributions are welcome, with debugging and telemetry being key areas of focus for improvement.

github

: 32.2k

webwhiz

WebWhiz is an open-source tool that allows users to train ChatGPT on website data to build AI chatbots for customer queries. It offers easy integration, data-specific responses, regular data updates, no-code builder, chatbot customization, fine-tuning, and offline messaging. Users can create and train chatbots in a few simple steps by entering their website URL, automatically fetching and preparing training data, training ChatGPT, and embedding the chatbot on their website. WebWhiz can crawl websites monthly, collect text data and metadata, and process text data using tokens. Users can train custom data, but bringing custom open AI keys is not yet supported. The tool has no limitations on context size but may limit the number of pages based on the chosen plan. WebWhiz SDK is available on NPM, CDNs, and GitHub, and users can self-host it using Docker or manual setup involving MongoDB, Redis, Node, Python, and environment variables setup. For any issues, users can contact [email protected].

github

: 886

aider-composer

Aider Composer is a VSCode extension that integrates Aider into your development workflow. It allows users to easily add and remove files, toggle between read-only and editable modes, review code changes, use different chat modes, and reference files in the chat. The extension supports multiple models, code generation, code snippets, and settings customization. It has limitations such as lack of support for multiple workspaces, Git repository features, linting, testing, voice features, in-chat commands, and configuration options.

github

: 362

openui

OpenUI is a tool designed to simplify the process of building UI components by allowing users to describe UI using their imagination and see it rendered live. It supports converting HTML to React, Svelte, Web Components, etc. The tool is open source and aims to make UI development fun, fast, and flexible. It integrates with various AI services like OpenAI, Groq, Gemini, Anthropic, Cohere, and Mistral, providing users with the flexibility to use different models. OpenUI also supports LiteLLM for connecting to various LLM services and allows users to create custom proxy configs. The tool can be run locally using Docker or Python, and it offers a development environment for quick setup and testing.

github

: 16.7k

Discord-AI-Selfbot

Discord-AI-Selfbot is a Python-based Discord selfbot that uses the `discord.py-self` library to automatically respond to messages mentioning its trigger word using Groq API's Llama-3 model. It functions as a normal Discord bot on a real Discord account, enabling interactions in DMs, servers, and group chats without needing to invite a bot. The selfbot comes with features like custom AI instructions, free LLM model usage, mention and reply recognition, message handling, channel-specific responses, and a psychoanalysis command to analyze user messages for insights on personality.

github

: 120

MiniSearch

MiniSearch is a minimalist search engine with integrated browser-based AI. It is privacy-focused, easy to use, cross-platform, integrated, time-saving, efficient, optimized, and open-source. MiniSearch can be used for a variety of tasks, including searching the web, finding files on your computer, and getting answers to questions. It is a great tool for anyone who wants a fast, private, and easy-to-use search engine.

github

: 384

Perplexica

Perplexica is an open-source AI-powered search engine that utilizes advanced machine learning algorithms to provide clear answers with sources cited. It offers various modes like Copilot Mode, Normal Mode, and Focus Modes for specific types of questions. Perplexica ensures up-to-date information by using SearxNG metasearch engine. It also features image and video search capabilities and upcoming features include finalizing Copilot Mode and adding Discover and History Saving features.

github

: 21.0k

CLI

Bito CLI provides a command line interface to the Bito AI chat functionality, allowing users to interact with the AI through commands. It supports complex automation and workflows, with features like long prompts and slash commands. Users can install Bito CLI on Mac, Linux, and Windows systems using various methods. The tool also offers configuration options for AI model type, access key management, and output language customization. Bito CLI is designed to enhance user experience in querying AI models and automating tasks through the command line interface.

github

: 546

airbyte_serverless

AirbyteServerless is a lightweight tool designed to simplify the management of Airbyte connectors. It offers a serverless mode for running connectors, allowing users to easily move data from any source to their data warehouse. Unlike the full Airbyte-Open-Source-Platform, AirbyteServerless focuses solely on the Extract-Load process without a UI, database, or transform layer. It provides a CLI tool, 'abs', for managing connectors, creating connections, running jobs, selecting specific data streams, handling secrets securely, and scheduling remote runs. The tool is scalable, allowing independent deployment of multiple connectors. It aims to streamline the connector management process and provide a more agile alternative to the comprehensive Airbyte platform.

github

: 135

bugbug

Bugbug is a tool developed by Mozilla that leverages machine learning techniques to assist with bug and quality management, as well as other software engineering tasks like test selection and defect prediction. It provides various classifiers to suggest assignees, detect patches likely to be backed-out, classify bugs, assign product/components, distinguish between bugs and feature requests, detect bugs needing documentation, identify invalid issues, verify bugs needing QA, detect regressions, select relevant tests, track bugs, and more. Bugbug can be trained and tested using Python scripts, and it offers the ability to run model training tasks on Taskcluster. The project structure includes modules for data mining, bug/commit feature extraction, model implementations, NLP utilities, label handling, bug history playback, and GitHub issue retrieval.

github

: 530

AIOStreams

AIOStreams is a versatile tool that combines streams from various addons into one platform, offering extensive customization options. Users can change result formats, filter results by various criteria, remove duplicates, prioritize services, sort results, specify size limits, and more. The tool scrapes results from selected addons, applies user configurations, and presents the results in a unified manner. It simplifies the process of finding and accessing desired content from multiple sources, enhancing user experience and efficiency.

github

: 174

WebCraftifyAI

WebCraftifyAI is a software aid that makes it easy to create and build web pages and content. It is designed to be user-friendly and accessible to people of all skill levels. With WebCraftifyAI, you can quickly and easily create professional-looking websites without having to learn complex coding or design skills.

github

: 90

open-repo-wiki

OpenRepoWiki is a tool designed to automatically generate a comprehensive wiki page for any GitHub repository. It simplifies the process of understanding the purpose, functionality, and core components of a repository by analyzing its code structure, identifying key files and functions, and providing explanations. The tool aims to assist individuals who want to learn how to build various projects by providing a summarized overview of the repository's contents. OpenRepoWiki requires certain dependencies such as Google AI Studio or Deepseek API Key, PostgreSQL for storing repository information, Github API Key for accessing repository data, and Amazon S3 for optional usage. Users can configure the tool by setting up environment variables, installing dependencies, building the server, and running the application. It is recommended to consider the token usage and opt for cost-effective options when utilizing the tool.

github

: 137

aiCoder

aiCoder is an AI-powered tool designed to streamline the coding process by automating repetitive tasks, providing intelligent code suggestions, and facilitating the integration of new features into existing codebases. It offers a chat interface for natural language interactions, methods and stubs lists for code modification, and settings customization for project-specific prompts. Users can leverage aiCoder to enhance code quality, focus on higher-level design, and save time during development.

github

: 157

polis

Polis is an AI powered sentiment gathering platform that offers a more organic approach than surveys and requires less effort than focus groups. It provides a comprehensive wiki, main deployment at https://pol.is, discussions, issue tracking, and project board for users. Polis can be set up using Docker infrastructure and offers various commands for building and running containers. Users can test their instance, update the system, and deploy Polis for production. The tool also provides developer conveniences for code reloading, type checking, and database connections. Additionally, Polis supports end-to-end browser testing using Cypress and offers troubleshooting tips for common Docker and npm issues.

github

: 836

For similar tasks

lollms-webui

LoLLMs WebUI (Lord of Large Language Multimodal Systems: One tool to rule them all) is a user-friendly interface to access and utilize various LLM (Large Language Models) and other AI models for a wide range of tasks. With over 500 AI expert conditionings across diverse domains and more than 2500 fine tuned models over multiple domains, LoLLMs WebUI provides an immediate resource for any problem, from car repair to coding assistance, legal matters, medical diagnosis, entertainment, and more. The easy-to-use UI with light and dark mode options, integration with GitHub repository, support for different personalities, and features like thumb up/down rating, copy, edit, and remove messages, local database storage, search, export, and delete multiple discussions, make LoLLMs WebUI a powerful and versatile tool.

github

: 4.6k

ava

Air-gapped Virtual Assistant / Personal Language Server

github

: 407

continue

Continue is an open-source autopilot for VS Code and JetBrains that allows you to code with any LLM. With Continue, you can ask coding questions, edit code in natural language, generate files from scratch, and more. Continue is easy to use and can help you save time and improve your coding skills.

github

: 25.2k

anterion

Anterion is an open-source AI software engineer that extends the capabilities of `SWE-agent` to plan and execute open-ended engineering tasks, with a frontend inspired by `OpenDevin`. It is designed to help users fix bugs and prototype ideas with ease. Anterion is equipped with easy deployment and a user-friendly interface, making it accessible to users of all skill levels.

github

: 137

sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system. The core features of SGLang include: - **A Flexible Front-End Language**: This allows for easy programming of LLM applications with multiple chained generation calls, advanced prompting techniques, control flow, multiple modalities, parallelism, and external interaction. - **A High-Performance Runtime with RadixAttention**: This feature significantly accelerates the execution of complex LLM programs by automatic KV cache reuse across multiple calls. It also supports other common techniques like continuous batching and tensor parallelism.

github

: 12.9k

ChatDBG

ChatDBG is an AI-based debugging assistant for C/C++/Python/Rust code that integrates large language models into a standard debugger (`pdb`, `lldb`, `gdb`, and `windbg`) to help debug your code. With ChatDBG, you can engage in a dialog with your debugger, asking open-ended questions about your program, like `why is x null?`. ChatDBG will _take the wheel_ and steer the debugger to answer your queries. ChatDBG can provide error diagnoses and suggest fixes. As far as we are aware, ChatDBG is the _first_ debugger to automatically perform root cause analysis and to provide suggested fixes.

github

: 825

aider

Aider is a command-line tool that lets you pair program with GPT-3.5/GPT-4 to edit code stored in your local git repository. Aider will directly edit the code in your local source files and git commit the changes with sensible commit messages. You can start a new project or work with an existing git repo. Aider is unique in that it lets you ask for changes to pre-existing, larger codebases.

github

: 19.6k

chatgpt-web

ChatGPT Web is a web application that provides access to the ChatGPT API. It offers two non-official methods to interact with ChatGPT: through the ChatGPTAPI (using the `gpt-3.5-turbo-0301` model) or through the ChatGPTUnofficialProxyAPI (using a web access token). The ChatGPTAPI method is more reliable but requires an OpenAI API key, while the ChatGPTUnofficialProxyAPI method is free but less reliable. The application includes features such as user registration and login, synchronization of conversation history, customization of API keys and sensitive words, and management of users and keys. It also provides a user interface for interacting with ChatGPT and supports multiple languages and themes.

github

: 1.4k

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 855

agentcloud

AgentCloud is an open-source platform that enables companies to build and deploy private LLM chat apps, empowering teams to securely interact with their data. It comprises three main components: Agent Backend, Webapp, and Vector Proxy. To run this project locally, clone the repository, install Docker, and start the services. The project is licensed under the GNU Affero General Public License, version 3 only. Contributions and feedback are welcome from the community.

github

: 583

oss-fuzz-gen

This framework generates fuzz targets for real-world `C`/`C++` projects with various Large Language Models (LLM) and benchmarks them via the `OSS-Fuzz` platform. It manages to successfully leverage LLMs to generate valid fuzz targets (which generate non-zero coverage increase) for 160 C/C++ projects. The maximum line coverage increase is 29% from the existing human-written targets.

github

: 1.2k

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.3k

Azure-Analytics-and-AI-Engagement

The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (Containing a demo web application, Power BI reports, Synapse resources, AML Notebooks etc.) that can be deployed in a customer’s subscription using the CAPE tool within a matter of few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.

github

: 136