xGitGuard
AI based Secrets Detection Python Framework
xGitGuard is an AI-based system developed by Comcast Cybersecurity Research and Development team to detect secrets (e.g., API tokens, usernames, passwords) exposed on GitHub repositories. It uses advanced Natural Language Processing to detect secrets at scale and with appropriate velocity. The tool provides workflows for detecting credentials and keys/tokens in both enterprise and public GitHub accounts. Users can set up search patterns, configure API access, run detections with or without ML filters, and train ML models for improved detection accuracy. xGitGuard also supports custom keyword scans for targeted organizations or repositories. The tool is licensed under Apache 2.0.
AI-Based Secrets Detection
Detect Secrets (API Tokens, Usernames, Passwords, etc.) Exposed on GitHub Repositories
Designed and Developed by Comcast Cybersecurity Research and Development Team
-
Detecting Publicly Exposed Secrets on GitHub at Scale
- xGitGuard is an AI-based system designed and developed by the Comcast Cybersecurity Research and Development team that detects secrets (e.g., API tokens, usernames, passwords, etc.) exposed on GitHub. xGitGuard uses advanced Natural Language Processing to detect secrets at scale and with appropriate velocity in GitHub repositories.
- What are Secrets?
- Credentials: usernames & passwords, server credentials, account credentials, etc.
- Keys/Tokens: service API tokens (AWS, Azure, etc.), encryption keys, etc.
Credentials
- Enterprise Credential Secrets Detection - Run secret detection on the given GitHub Enterprise account
- Public Credential Secrets Detection - Run secret detection on the GitHub Public account
Keys & Tokens
- Enterprise Keys and Tokens Secrets Detection - Run secret detection on the given GitHub Enterprise account
- Public Keys and Tokens Secrets Detection - Run secret detection on the GitHub Public account
- Install Python >= v3.6
- Clone/Download the repository from GitHub
- Traverse into the cloned xGitGuard folder
cd xGitGuard
- Install Python Dependency Packages
python -m pip install -r requirements.txt
- Check for Outdated Packages
pip list --outdated
There are two ways to define configurations in xGitGuard
- Config Files
- Command Line Inputs
For Enterprise GitHub Detection (Secondary Keyword + Extension), under the config directory:
- Secondary Keyword: secondary_keys.csv file or User Feed - list of Keys & Tokens
- Secondary Keyword: secondary_creds.csv file or User Feed - list of Credentials
- Extension: extensions.csv file or User Feed - list of file Extensions
For Public GitHub Detection (Primary Keyword + Secondary Keyword + Extension), under the config directory:
- Primary Keyword: primary_keywords.csv file or User Feed - list of primary Keys
- Secondary Keyword: secondary_keys.csv file or User Feed - list of Keys & Tokens
- Secondary Keyword: secondary_creds.csv file or User Feed - list of Credentials
- Extension: extensions.csv file or User Feed - list of file Extensions
- Set up the system environment variable below for accessing GitHub:
- GITHUB_ENTERPRISE_TOKEN - Enterprise GitHub API Token with full scopes of repository and user.
- Refer to the GitHub documentation How To Get GitHub API Token for help
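As a minimal sketch of how a script could pick up this token, the snippet below reads GITHUB_ENTERPRISE_TOKEN from the environment and builds a request header. The helper name `build_auth_headers` and its parameters are assumptions made for illustration and are not taken from the xGitGuard code base; the `Authorization: token ...` header form is one that the GitHub REST API accepts for personal access tokens.

```python
import os


def build_auth_headers(env_var="GITHUB_ENTERPRISE_TOKEN", environ=None):
    """Return HTTP headers carrying the GitHub token (hypothetical helper)."""
    # Allow a custom mapping to be passed in so the function is easy to test.
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of sending unauthenticated calls.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"Authorization": f"token {token}"}
```

The same helper would work for public GitHub scans by passing `env_var="GITHUB_TOKEN"`.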
- Update the following configs with your Enterprise Name in the config file xgg_configs.yaml in the config data folder xgitguard\config\*
- enterprise_api_url: https://github.<<Enterprise_Name>>.com/api/v3/search/code
- enterprise_pre_url: https://github.<<Enterprise_Name>>.com/api/v3/repos/
- url_validator: https://github.<<Enterprise_Name>>.com/api/v3/search/code
- enterprise_commits_url: https://github.<<Enterprise_Name>>.com/api/v3/repos/{user_name}/{repo_name}/commits?path={file_path}
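The substitution described above can be sketched in a few lines. The templates mirror the four config keys listed in this section, with `{name}` standing in for the `<<Enterprise_Name>>` placeholder; the function `render_enterprise_urls` is a hypothetical helper for illustration, not part of xGitGuard, and the real values still belong in xgg_configs.yaml.

```python
# Templates for the four config values; {name} replaces <<Enterprise_Name>>.
# Doubled braces keep {user_name}, {repo_name} and {file_path} as literal
# placeholders, since those are filled in later at scan time.
URL_TEMPLATES = {
    "enterprise_api_url": "https://github.{name}.com/api/v3/search/code",
    "enterprise_pre_url": "https://github.{name}.com/api/v3/repos/",
    "url_validator": "https://github.{name}.com/api/v3/search/code",
    "enterprise_commits_url": (
        "https://github.{name}.com/api/v3/repos/"
        "{{user_name}}/{{repo_name}}/commits?path={{file_path}}"
    ),
}


def render_enterprise_urls(enterprise_name):
    """Fill the enterprise name into every URL template (hypothetical helper)."""
    return {
        key: template.format(name=enterprise_name)
        for key, template in URL_TEMPLATES.items()
    }
```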
- Traverse into the github-enterprise script folder
cd github-enterprise
By default, the Credential Secrets Detection script runs for given Secondary Keywords and extensions without ML Filter.
# Run with Default configs
python enterprise_cred_detections.py
xGitGuard also has an additional ML filter where users can collect their organization/targeted data and train their model. Having this ML filter helps to reduce the false positives from the detection.
The user needs to follow the process below to collect data and train the model before using the ML filter.
- Follow ML Model Training
NOTE:
- To use the ML filter, ML training is mandatory. This includes data collection, feature engineering & model persisting.
- This process depends on user requirements. It can be done once, or repeated periodically if the user needs to improve the data.
# Run for given Secondary Keyword and extension with ML model,
python enterprise_cred_detections.py -m Yes
# Run for targeted org,
python enterprise_cred_detections.py -o org_name #Ex: python enterprise_cred_detections.py -o test_org
# Run for targeted repo,
python enterprise_cred_detections.py -r org_name/repo_name #Ex: python enterprise_cred_detections.py -r test_org/public_docker
Run usage:
enterprise_cred_detections.py [-h] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-s Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keywords list as a comma-separated string
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is No
-u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
Pass the flag as Yes or No. Default is No
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
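The numeric values accepted by -l match the standard level constants in Python's `logging` module (CRITICAL is 50, ERROR 40, WARNING 30, INFO 20, DEBUG 10). The sketch below shows one way to validate such a value; `resolve_log_level` is a hypothetical helper and not the script's actual argument handling.

```python
import logging


def resolve_log_level(value=20):
    """Map a numeric CLI value to a (level, name) pair; the default 20 is INFO."""
    level = int(value)
    name = logging.getLevelName(level)
    # For unregistered numbers, getLevelName returns a string like "Level 25".
    if name.startswith("Level "):
        raise ValueError(f"Unsupported log level: {value}")
    return level, name
```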
-
Inputs used for search and scan
Note: Command-line argument keywords have precedence over config files (Default). If no keywords are passed in cli, data from config files will be used for the search.
- secondary_creds.csv file has a default list of credential relevant patterns for search, which can be updated by users based on their requirement.
- extensions.csv file has a default list of file extensions to be searched, which can be updated by the users based on their requirement.
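The precedence rule in the note above can be sketched as follows. The -s and -e flags are the ones listed in the usage text; the functions `parse_args` and `resolve_keywords` and the way the config lists are passed in are assumptions for illustration only.

```python
import argparse


def parse_args(argv):
    """Parse only the two keyword-related flags from the usage text."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--secondary_keywords", default="")
    parser.add_argument("-e", "--extensions", default="")
    return parser.parse_args(argv)


def resolve_keywords(cli_value, config_keywords):
    """Prefer the comma-separated CLI string; fall back to the config list."""
    if cli_value:
        return [item.strip() for item in cli_value.split(",") if item.strip()]
    return list(config_keywords)
```

With `-s "password, pwd"` on the command line and no -e, the secondary keywords come from the CLI while the extensions come from the config list (for example, the rows of extensions.csv).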
-
GitHub search pattern for above examples:
password +extension:py
By default, the Keys and Tokens Secrets Detection script runs for given Secondary Keywords and the extensions without ML Filter.
# Run with Default configs
python enterprise_key_detections.py
# Run for targeted org,
python enterprise_key_detections.py -o org_name #Ex: python enterprise_key_detections.py -o test_org
# Run for targeted repo,
python enterprise_key_detections.py -r org_name/repo_name #Ex: python enterprise_key_detections.py -r test_org/public_docker
xGitGuard also has an additional ML filter where users can collect their organization/targeted data and train their model. Having this ML filter helps in reducing the false positives from the detection.
The user needs to follow the process below to collect data and train the model before using the ML filter.
- Follow ML Model Training
NOTE:
- To use the ML filter, ML training is mandatory. It includes data collection, feature engineering & model persisting.
- This process depends on user requirements. It can be done once, or repeated periodically if the user needs to improve the data.
# Run for given Secondary Keyword and extension with ML model
python enterprise_key_detections.py -m Yes
Run usage:
enterprise_key_detections.py [-h] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-s Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keywords list as a comma-separated string
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is No
-u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
Pass the flag as Yes or No. Default is No
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
-
Inputs used for search and scan
Note: Command-line argument keywords have precedence over config files (Default). If no keywords are passed in cli, data from the config files will be used for search.
- secondary_keys.csv file will have a default list of key relevant patterns for search, which can be updated by the users based on their requirement.
- extensions.csv file has a default list of file extensions to be searched, which can be updated by the users based on their requirement.
-
GitHub search pattern for above examples:
api_key +extension:py
- Credentials
1. Hashed Url Files: xgitguard\output\*_enterprise_hashed_url_creds.csv - List of previously processed search URLs. URLs stored here will be skipped in the next run to avoid reprocessing.
2. Secrets Detected: xgitguard\output\*_xgg_enterprise_creds_detected.csv
3. Log File: xgitguard\logs\enterprise_key_detections_*yyyymmdd_hhmmss*.log
- Keys & Tokens
1. Hashed Url Files: xgitguard\output\*_enterprise_hashed_url_keys.csv - List of previously processed search URLs. URLs stored here will be skipped in the next run to avoid reprocessing.
2. Secrets Detected: xgitguard\output\*_xgg_enterprise_keys_detected.csv
3. Log File: xgitguard\logs\enterprise_key_detections_*yyyymmdd_hhmmss*.log
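The skip-on-rerun behaviour of the hashed URL files can be illustrated with a small sketch. The hash function (SHA-256) and the helper names `url_hash` and `filter_new_urls` are assumptions; the README does not state which hash xGitGuard uses or how the CSV is laid out.

```python
import hashlib


def url_hash(url):
    """Hash a URL so the stored file does not need to hold raw URLs (SHA-256 is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def filter_new_urls(urls, processed_hashes):
    """Return the URLs not seen before, plus the updated set of hashes."""
    seen = set(processed_hashes)
    fresh = []
    for url in urls:
        digest = url_hash(url)
        if digest in seen:
            continue  # processed in an earlier run (or earlier in this one)
        seen.add(digest)
        fresh.append(url)
    return fresh, seen
```

Persisting the returned set after each run and loading it before the next one gives the behaviour described above: URLs already processed are skipped.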
- Set up the environment variable below for accessing GitHub:
- GITHUB_TOKEN - Public GitHub API Token with full scopes of the repository and user.
- Refer to GitHub Docs How To Get GitHub API Token for help
- Config data folder: xgitguard\config\*
- Traverse into the github-public script folder
cd github-public
Note: User needs to remove the sample content from primary_keywords.csv and add primary keywords like targeted domain names to be searched in public GitHub.
By default, Credential Secrets Detection script runs for given Primary Keyword, Secondary Keyword, and extension without ML Filter.
# Run with Default configs
python public_cred_detections.py
# Run for targeted org,
python public_cred_detections.py -o org_name #Ex: python public_cred_detections.py -o test_org
# Run for targeted repo,
python public_cred_detections.py -r org_name/repo_name #Ex: python public_cred_detections.py -r test_org/public_docker
xGitGuard also has an additional ML filter, where users can collect their organization/targeted data and train their model. Having this ML filter helps in reducing the false positives from the detection.
The user needs to follow the below process to collect data and train the model to use ML filter.
- Follow ML Model Training
NOTE :
- To use ML Feature, ML training is mandatory. It includes data collection, feature engineering & model persisting.
# Run for given Primary Keyword, Secondary Keyword, and extension with ML model
python public_cred_detections.py -m Yes
Run usage:
usage: public_cred_detections.py [-h] [-p Primary Keywords] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-p Primary Keywords, --primary_keywords Primary Keywords
Pass the Primary Keywords list as a comma-separated string
-s Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keywords list as a comma-separated string
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is No
-u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
Pass the flag as Yes or No. Default is No
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
-
Inputs used for search and scan
Note: Command line argument keywords have precedence over config files (Default). If no keywords are passed in cli, config files data will be used for search.
- primary_keywords.csv file will have a default list of primary keyword-relevant patterns for search
- secondary_creds.csv file will have a default list of credential relevant patterns for search, which can be updated by the users based on their requirement.
- extensions.csv file has a default list of file extensions to be searched, which can be updated by the users based on their requirement.
-
GitHub search pattern for above examples:
abc.xyz.com password +extension:py
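The search patterns shown for the enterprise scan (`password +extension:py`) and the public scan (`abc.xyz.com password +extension:py`) can be generated by combining the keyword lists. This is a sketch under the assumption that every combination is searched; `build_search_queries` is a hypothetical helper, not the function xGitGuard uses.

```python
from itertools import product


def build_search_queries(secondary_keywords, extensions, primary_keywords=None):
    """Combine keywords and extensions into GitHub code-search strings."""
    # Enterprise scans have no primary keyword, so use a single empty prefix.
    primaries = primary_keywords or [""]
    queries = []
    for primary, secondary, extension in product(primaries, secondary_keywords, extensions):
        prefix = f"{primary} " if primary else ""
        queries.append(f"{prefix}{secondary} +extension:{extension}")
    return queries
```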
By default, Keys and Tokens Secret Detection script runs for given Primary Keyword, Secondary Keyword and extension without ML Filter.
# Run with Default configs
python public_key_detections.py
# Run for targeted org,
python public_key_detections.py -o org_name #Ex: python public_key_detections.py -o test_org
# Run for targeted repo,
python public_key_detections.py -r org_name/repo_name #Ex: python public_key_detections.py -r test_org/public_docker
xGitGuard also has an additional ML filter, where users can collect their organization/targeted data and train their model. Having this ML filter helps in reducing the false positives from the detection.
The user needs to follow the below process to collect data and train the model to use ML filter.
- Follow ML Model Training
NOTE: To use the ML feature, ML training is mandatory. It includes data collection, feature engineering & model persisting.
# Run for given Primary Keyword, Secondary Keyword, and extension with ML model,
python public_key_detections.py -m Yes
usage:
public_key_detections.py [-h] [-s Secondary Keywords] [-e Extensions] [-m Ml prediction] [-u Unmask Secret] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-s Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keywords list as a comma-separated string
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is No
-u Set Unmask, --unmask_secret To write secret unmasked, then set Yes
Pass the flag as Yes or No. Default is No
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
-
Inputs used for search and scan
Note: Command line argument keywords have precedence over config files (Default). If no keywords are passed in cli, config files data will be used for search.
- primary_keywords.csv file will have a default list of primary keyword-relevant patterns for search, which can be updated by the users based on their requirement.
- secondary_keys.csv file will have a default list of tokens & keys relevant patterns for search, which can be updated by the users based on their requirement.
- extensions.csv file has a default list of file extensions to be searched, which can be updated by the users based on their requirement.
-
GitHub search pattern for above examples:
abc.xyz.com api_key +extension:py
- Credentials
1. Hashed Url Files: xgitguard\output\*_public_hashed_url_creds.csv - List of previously processed search URLs. URLs stored here will be skipped in the next run to avoid reprocessing.
2. Secrets Detected: xgitguard\output\*_xgg_public_creds_detected.csv
3. Log File: xgitguard\logs\public_key_detections_*yyyymmdd_hhmmss*.log
- Keys & Tokens
1. Hashed Url Files: xgitguard\output\*_public_hashed_url_keys.csv - List of previously processed search URLs. URLs stored here will be skipped in the next run to avoid reprocessing.
2. Secrets Detected: xgitguard\output\*_xgg_public_keys_detected.csv
3. Log File: xgitguard\logs\public_key_detections_*yyyymmdd_hhmmss*.log
Note: By default, the detected secrets will be masked to hide sensitive data. If needed, the user can skip the masking and write the raw secret using the command line argument -u Yes or --unmask_secret Yes. Refer to the command line options for more details.
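To make the masking option concrete, here is a minimal sketch of one possible masking scheme: keep the first few characters and replace the rest. The scheme itself, the `#` mask character, and the helper names are assumptions; the README only states that secrets are masked by default and written raw when -u Yes is passed.

```python
def mask_secret(secret, visible=4, mask_char="#"):
    """Keep the first `visible` characters and mask the rest (assumed scheme)."""
    if len(secret) <= visible:
        # Very short values are masked completely.
        return mask_char * len(secret)
    return secret[:visible] + mask_char * (len(secret) - visible)


def format_secret(secret, unmask="No"):
    """Return the raw secret only when the flag value is Yes (case-insensitive)."""
    if unmask.strip().lower() == "yes":
        return secret
    return mask_secret(secret)
```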
To use ML Feature, ML training is mandatory. It includes data collection, feature engineering & model persisting.
Note: Labelling the collected secret is an important process to improve the ML prediction.
-
Traverse into the "ml_training" folder
cd ml_training
Traverse into the "data collector" folder under ml_training
cd ml_data_collector\github-enterprise-ml-data-collector
- Credentials
- Run for given Secondary Keywords and extensions
python enterprise_cred_data_collector.py
- To run with other parameters, please use help.
python enterprise_cred_data_collector.py -h
- Training data collected for Enterprise Creds will be placed in xgitguard\output\cred_train_source.csv
- Keys & Tokens
- Run for given Secondary Keywords and extensions
python enterprise_key_data_collector.py
- To run with other parameters, please use help.
python enterprise_key_data_collector.py -h
- Training data collected for Enterprise Keys and Tokens will be placed in xgitguard\output\key_train_source.csv
- By default, all collected data is labeled 1 in the "Label" column of the collected training data, marking each collected secret as valid.
- The user needs to review each row in the collected data and update the label value, i.e., if the user thinks a collected row is not a secret, change the value to 0 for that row.
- By doing this, the ML model gets quality data, which helps reduce false positives.
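The review step above amounts to editing the "Label" column of the collected CSV. A minimal sketch using only the standard library follows; the "Label" column name comes from the text, while the "Secret" column in the usage example and the helper `relabel_rows` are hypothetical.

```python
import csv
import io


def relabel_rows(csv_text, false_positive_rows):
    """Set Label to 0 for the given zero-based data row indexes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    for index in false_positive_rows:
        rows[index]["Label"] = "0"  # reviewer decided this row is not a secret
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

For example, with the input `"Secret,Label\nhunter2,1\nexample_value,1\n"`, calling `relabel_rows(text, [1])` leaves the first row labeled 1 and sets the second to 0.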
Traverse into the "ml_training" folder
- Credentials
- Run with option cred for engineering collected cred data
python ml_feature_engineering.py cred
- By default in Enterprise mode, input will be cred_train_source.csv
- Engineered data for Enterprise Creds will be placed in xgitguard\output\cred_train.csv
- Keys & Tokens
- Run with option key for engineering collected keys & tokens data
python ml_feature_engineering.py key
- By default in Enterprise mode, input will be key_train_source.csv
- Engineered data for Enterprise Keys & Tokens will be placed in xgitguard\output\key_train.csv
Traverse into the "ml_training" folder
-
Run training with Cred Training Data and persist model
python model.py cred
-
Run training with Key Training Data and persist model
python model.py key
-
For help on command line arguments, run
python model.py -h
Note: If persisted model xgitguard\output\xgg_*.pickle is not present in the output folder, then use engineered data to create a model and persist it.
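The load-or-create behaviour in the note above follows a common pattern, sketched here with the standard `pickle` module. The function `load_or_train` and the `train_fn` callback are assumptions for illustration; they stand in for whatever model.py does with the engineered data.

```python
import os
import pickle


def load_or_train(model_path, train_fn):
    """Load a persisted model if present; otherwise train, persist, and return it."""
    if os.path.exists(model_path):
        with open(model_path, "rb") as handle:
            return pickle.load(handle)
    model = train_fn()  # e.g. fit a classifier on the engineered training data
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    return model
```

Because unpickling can execute arbitrary code, only load pickle files that were produced by your own training runs.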
To use ML Feature, ML training is mandatory. It includes data collection, feature engineering & model persisting.
Note: Labelling the collected secret is an important process to use the ML effectively.
-
Traverse into the "ml_training" folder
cd ml_training
Traverse into the "data collector" folder
cd ml_training\ml_data_collector\github-public-ml-data-collector
Note: User needs to remove the sample content from primary_keywords.csv and add primary keywords like targeted domain names to be searched in public GitHub.
- Credentials
- Run for given Primary Keywords, Secondary Keywords, and extensions
python public_cred_data_collector.py
- To run with other parameters, please use help.
python public_cred_data_collector.py -h
- Training data collected for Public Creds will be placed in xgitguard\output\public_cred_train_source.csv
- Keys & Tokens
- Run for given Primary Keywords, Secondary Keywords, and extensions
python public_key_data_collector.py
- To run with other parameters, please use help.
python public_key_data_collector.py -h
- Training data collected for Public Keys and Tokens will be placed in xgitguard\output\public_key_train_source.csv
Note: The data collection for public GitHub is optional.
- If the targeted data collected from Enterprise is enough, the data collection & label review process can be skipped.
- By default, all collected data is labeled 1 in the "Label" column of the collected training data, marking each collected secret as valid.
- The user needs to review each row in the collected data and update the label value, i.e., if the user thinks a collected row is not a secret, change the value to 0 for that row.
- By doing this, the ML model gets quality data, which helps reduce false positives.
Note: Labelling the collected secret is an important process to use the ML effectively.
Traverse into the "ml_training" folder
- Credentials
- Run with option cred for engineering collected cred data with public source data.
python ml_feature_engineering.py cred -s public
- In public mode, input will be public_cred_train_source.csv
- Engineered data for Public Creds will be placed in xgitguard\output\public_cred_train.csv
- Keys & Tokens
- Run with option key for engineering collected keys & tokens data with public source data.
python ml_feature_engineering.py key -s public
- In public mode, input will be public_key_train_source.csv
- Engineered data for Public Keys & Tokens will be placed in xgitguard\output\public_key_train.csv
Note:
- Data collection & feature engineering for public GitHub scan is optional.
- When public training data is not available, feature engineering will use enterprise source data.
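The fallback described in the note can be sketched as a simple file check. The file names follow the ones given in this section (cred_train_source.csv, public_cred_train_source.csv, and the key equivalents); the function `pick_training_source` and its parameters are hypothetical and do not reflect the internals of ml_feature_engineering.py.

```python
from pathlib import Path


def pick_training_source(output_dir, secret_type, source="enterprise"):
    """Choose the training source file, falling back to enterprise data."""
    base = Path(output_dir)
    enterprise_file = base / f"{secret_type}_train_source.csv"
    public_file = base / f"public_{secret_type}_train_source.csv"
    # Use public data only when it was requested and actually collected.
    if source == "public" and public_file.exists():
        return public_file
    return enterprise_file
```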
Traverse into the "ml_training" folder
-
Run training with Cred Training Data and persist model with public source data
python model.py cred -s public
-
Run training with Key Training Data and persist model with public source data
python model.py key -s public
-
For help on command line arguments, run
python model.py -h
Note:
- If persisted model xgitguard\output\public_*xgg*.pickle is not present in the output folder, then use feature engineered data to create a model and persist it.
- By default, when feature engineered data collected in Public mode is not available, model creation will use the enterprise-based engineered data.
- Traverse into the custom-keyword-search script folder
cd custom-keyword-search
Please add the required keywords to be searched into config/enterprise_keywords.csv
# Run with given configs,
python enterprise_keyword_search.py
# Run for targeted org,
python enterprise_keyword_search.py -o org_name #Ex: python enterprise_keyword_search.py -o test_ccs
# Run for targeted repo,
python enterprise_keyword_search.py -r org_name/repo_name #Ex: python enterprise_keyword_search.py -r test_ccs/ccs_repo_1
Run usage:
enterprise_keyword_search.py [-h] [-e Enterprise Keywords] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-e Enterprise Keywords, --enterprise_keywords Enterprise Keywords
Pass the Enterprise Keywords list as a comma-separated string. This is an optional argument. Keywords can also be provided in the `enterprise_keywords.csv` file located in the `configs` directory.
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
Please add the required keywords to be searched into config/public_keywords.csv
# Run with given configs,
python public_keyword_search.py
# Run for targeted org,
python public_keyword_search.py -o org_name #Ex: python public_keyword_search.py -o test_org
# Run for targeted repo,
python public_keyword_search.py -r org_name/repo_name #Ex: python public_keyword_search.py -r test_org/public_docker
Run usage:
public_keyword_search.py [-h] [-p Public Keywords] [-o org_name] [-r repo_name] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-e Public Keywords, --public_keywords Public Keywords
Pass the Public Keywords list as a comma-separated string. This is an optional argument. Keywords can also be provided in the `public_keywords.csv` file located in the `configs` directory.
-o pass org name, --org Pass the targeted org list as a comma-separated string
-r pass repo name, --repo Pass the targeted repo list as a comma-separated string
-l Logger Level, --log_level Logger Level
Pass the Logging level as for CRITICAL - 50, ERROR - 40 WARNING - 30 INFO - 20 DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
- Users can update confidence_values.csv based on secondary_keys, secondary_creds, extensions value and give scoring from level 0 (lowest) to 5 (highest) to denote associated keyword suspiciousness.
- If users need to add any custom/new secondary creds/keys or extensions to the config files, then the same has to be added in the confidence_values.csv file with respective score level.
- Stop Words provided in config files are very limited and generic. Users need to update stop_words.csv with keywords considered false positives to filter them out of the detections.
- Users can add additional extensions to extensions.csv to search types of files other than the default list.
- Users can enhance secondary_creds.csv/secondary_keys.csv by adding new patterns to do searches other than the default list.
- Users need to add primary keywords for public search in primary_keywords.csv after removing the sample content.
- If GitHub API calls result in 403 responses due to the API rate limit, increase the throttle timeout (github.throttle_time: 10) in the config ("config/xgg_configs.yaml").
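The effect of the throttle setting can be illustrated with a small wait-and-retry sketch. The function `call_with_throttle`, its parameters, and the retry count are assumptions for illustration; only the 403 status and the default throttle value of 10 come from the text above, and the value is treated as seconds here.

```python
import time


def call_with_throttle(request_fn, throttle_time=10, max_attempts=3, sleep=time.sleep):
    """Call request_fn, pausing throttle_time between attempts while it returns 403.

    request_fn is any callable returning a (status_code, body) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = request_fn()
        if status != 403:
            return status, body
        if attempt < max_attempts:
            sleep(throttle_time)  # wait before retrying the rate-limited call
    return status, body
```

A larger `throttle_time` spaces the calls further apart, which is the change the config setting makes.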
Licensed under the Apache 2.0 license.
search_with_ai
Build your own conversation-based search with AI, a simple implementation with Node.js & Vue3. Live Demo Features: * Built-in support for LLM: OpenAI, Google, Lepton, Ollama(Free) * Built-in support for search engine: Bing, Sogou, Google, SearXNG(Free) * Customizable pretty UI interface * Support dark mode * Support mobile display * Support local LLM with Ollama * Support i18n * Support Continue Q&A with contexts.
rclip
rclip is a command-line photo search tool powered by the OpenAI's CLIP neural network. It allows users to search for images using text queries, similar image search, and combining multiple queries. The tool extracts features from photos to enable searching and indexing, with options for previewing results in supported terminals or custom viewers. Users can install rclip on Linux, macOS, and Windows using different installation methods. The repository follows the Conventional Commits standard and welcomes contributions from the community.
rag-gpt
RAG-GPT is a tool that allows users to quickly launch an intelligent customer service system with Flask, LLM, and RAG. It includes frontend, backend, and admin console components. The tool supports cloud-based and local LLMs, enables deployment of conversational service robots in minutes, integrates diverse knowledge bases, offers flexible configuration options, and features an attractive user interface.
OpenAI-Api-Unreal
The OpenAIApi Plugin provides access to the OpenAI API in Unreal Engine, allowing users to generate images, transcribe speech, and power NPCs using advanced AI models. It offers blueprint nodes for making API calls, setting parameters, and accessing completion values. Users can authenticate using an API key directly or as an environment variable. The plugin supports various tasks such as generating images, transcribing speech, and interacting with NPCs through chat endpoints.
alog
ALog is an open-source project designed to facilitate the deployment of server-side code to Cloudflare. It provides a step-by-step guide on creating a Cloudflare worker, configuring environment variables, and updating API base URL. The project aims to simplify the process of deploying server-side code and interacting with OpenAI API. ALog is distributed under the GNU General Public License v2.0, allowing users to modify and distribute the app while adhering to App Store Review Guidelines.
morphic
Morphic is an AI-powered answer engine with a generative UI. It utilizes a stack of Next.js, Vercel AI SDK, OpenAI, Tavily AI, shadcn/ui, Radix UI, and Tailwind CSS. To get started, fork and clone the repo, install dependencies, fill out secrets in the .env.local file, and run the app locally using 'bun dev'. You can also deploy your own live version of Morphic with Vercel. Verified models that can be specified to writers include Groq, LLaMA3 8b, and LLaMA3 70b.
horde-worker-reGen
This repository provides the latest implementation for the AI Horde Worker, allowing users to utilize their graphics card(s) to generate, post-process, or analyze images for others. It offers a platform where users can create images and earn 'kudos' in return, granting priority for their own image generations. The repository includes important details for setup, recommendations for system configurations, instructions for installation on Windows and Linux, basic usage guidelines, and information on updating the AI Horde Worker. Users can also run the worker with multiple GPUs and receive notifications for updates through Discord. Additionally, the repository contains models that are licensed under the CreativeML OpenRAIL License.
openai-kotlin
OpenAI Kotlin API client is a Kotlin client for OpenAI's API with multiplatform and coroutines capabilities. It allows users to interact with OpenAI's API using Kotlin programming language. The client supports various features such as models, chat, images, embeddings, files, fine-tuning, moderations, audio, assistants, threads, messages, and runs. It also provides guides on getting started, chat & function call, file source guide, and assistants. Sample apps are available for reference, and troubleshooting guides are provided for common issues. The project is open-source and licensed under the MIT license, allowing contributions from the community.
RA.Aid
RA.Aid is an AI software development agent powered by `aider` and advanced reasoning models like `o1`. It combines `aider`'s code editing capabilities with LangChain's agent-based task execution framework to provide an intelligent assistant for research, planning, and implementation of multi-step development tasks. It breaks complex programming tasks into manageable steps, runs shell commands automatically, and draws on expert reasoning models such as OpenAI's o1. RA.Aid is designed for everyday software development, with multi-step task planning and automated command execution that go beyond single-shot code edits.
web-ui
WebUI is a user-friendly tool built on Gradio that enhances website accessibility for AI agents. It supports various Large Language Models (LLMs) and allows custom browser integration for seamless interaction. The tool removes the need to log in again or deal with other authentication challenges, and it offers high-definition screen recording.
xlang
XLang™ is a language designed for AI and IoT applications, offering dynamic and high-performance capabilities. It excels in distributed computing and integrates with popular languages like C++, Python, and JavaScript. It is notably efficient, running 3 to 5 times faster than Python in AI and deep learning contexts. It features an optimized tensor computing architecture for constructing neural networks through tensor expressions, and it automates tensor data flow graph generation and compilation for specific targets, enhancing GPU performance by 6 to 10 times in CUDA environments.
any-parser
AnyParser provides an API to accurately extract unstructured data (e.g., PDFs, images, charts) into a structured format. Users can set up their API key, run synchronous and asynchronous extractions, and perform batch extraction. The tool is useful for extracting text, numbers, and symbols from various sources like PDFs and images. It offers flexibility in processing data and provides immediate results for synchronous extraction while allowing users to fetch results later for asynchronous and batch extraction. AnyParser is designed to simplify data extraction tasks and enhance data processing efficiency.
XcodeLLMEligible
XcodeLLMEligible is a project that provides ways to enjoy Xcode LLM on ChinaSKU Mac without disabling SIP. It offers methods for script execution and manual execution, allowing users to override eligibility service features. The project is for learning and research purposes only, and users are responsible for compliance with applicable laws. The author disclaims any responsibility for consequences arising from the use of the project.
preswald
Preswald is a full-stack platform for building, deploying, and managing interactive data applications in Python. It simplifies the process by combining ingestion, storage, transformation, and visualization into one lightweight SDK. With Preswald, users can connect to various data sources, customize app themes, and easily deploy apps locally. The platform focuses on code-first simplicity, end-to-end coverage, and efficiency by design, making it suitable for prototyping internal tools or deploying production-grade apps with reduced complexity and cost.
phospho
Phospho is a text analytics platform for LLM apps. It helps you detect issues and extract insights from text messages of your users or your app. You can gather user feedback, measure success, and iterate on your app to create the best conversational experience for your users.
For similar tasks
xGitGuard
xGitGuard is an AI-based system developed by Comcast Cybersecurity Research and Development team to detect secrets (e.g., API tokens, usernames, passwords) exposed on GitHub repositories. It uses advanced Natural Language Processing to detect secrets at scale and with appropriate velocity. The tool provides workflows for detecting credentials and keys/tokens in both enterprise and public GitHub accounts. Users can set up search patterns, configure API access, run detections with or without ML filters, and train ML models for improved detection accuracy. xGitGuard also supports custom keyword scans for targeted organizations or repositories. The tool is licensed under Apache 2.0.
For similar jobs
promptflow
Prompt flow is a suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications, from ideation, prototyping, testing, and evaluation to production deployment and monitoring. It makes prompt engineering much easier and enables you to build LLM apps with production quality.
deepeval
DeepEval is a simple-to-use, open-source LLM evaluation framework specialized for unit testing LLM outputs. It incorporates various metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., and runs locally on your machine for evaluation. It provides a wide range of ready-to-use evaluation metrics, allows for creating custom metrics, integrates with any CI/CD environment, and enables benchmarking LLMs on popular benchmarks. DeepEval is designed for evaluating RAG and fine-tuning applications, helping users optimize hyperparameters, prevent prompt drifting, and transition from OpenAI to hosting their own Llama2 with confidence.
MegaDetector
MegaDetector is an AI model that identifies animals, people, and vehicles in camera trap images (which also makes it useful for eliminating blank images). This model is trained on several million images from a variety of ecosystems. MegaDetector is just one of many tools that aim to make conservation biologists more efficient with AI. If you want to learn about other ways to use AI to accelerate camera trap workflows, check out our overview of the field, affectionately titled "Everything I know about machine learning and camera traps".
leapfrogai
LeapfrogAI is a self-hosted AI platform designed to be deployed in air-gapped resource-constrained environments. It brings sophisticated AI solutions to these environments by hosting all the necessary components of an AI stack, including vector databases, model backends, API, and UI. LeapfrogAI's API closely matches that of OpenAI, allowing tools built for OpenAI/ChatGPT to function seamlessly with a LeapfrogAI backend. It provides several backends for various use cases, including llama-cpp-python, whisper, text-embeddings, and vllm. LeapfrogAI leverages Chainguard's apko to harden base python images, ensuring the latest supported Python versions are used by the other components of the stack. The LeapfrogAI SDK provides a standard set of protobufs and python utilities for implementing backends and gRPC. LeapfrogAI offers UI options for common use cases like chat, summarization, and transcription. It can be deployed and run locally via UDS and Kubernetes, built out using Zarf packages. LeapfrogAI is supported by a community of users and contributors, including Defense Unicorns, Beast Code, Chainguard, Exovera, Hypergiant, Pulze, SOSi, United States Navy, United States Air Force, and United States Space Force.
llava-docker
This Docker image for LLaVA (Large Language and Vision Assistant) provides a convenient way to run LLaVA locally or on RunPod. LLaVA is a powerful AI tool that combines natural language processing and computer vision capabilities. With this Docker image, you can easily access LLaVA's functionalities for various tasks, including image captioning, visual question answering, text summarization, and more. The image comes pre-installed with LLaVA v1.2.0, Torch 2.1.2, xformers 0.0.23.post1, and other necessary dependencies. You can customize the model used by setting the MODEL environment variable. The image also includes a Jupyter Lab environment for interactive development and exploration. Overall, this Docker image offers a comprehensive and user-friendly platform for leveraging LLaVA's capabilities.
carrot
The 'carrot' repository on GitHub provides a list of free and user-friendly ChatGPT mirror sites for easy access. The repository includes sponsored sites offering various GPT models and services. Users can find and share sites, report errors, and access stable and recommended sites for ChatGPT usage. The repository also includes a detailed list of ChatGPT sites, their features, and accessibility options, making it a valuable resource for ChatGPT users seeking free and unlimited GPT services.
TrustLLM
TrustLLM is a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, evaluation and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. The document explains how to use the trustllm python package to help you assess the trustworthiness of your LLM more quickly. For more details about TrustLLM, please refer to the project website.
AI-YinMei
AI-YinMei is a development tool for AI virtual anchors (VTubers), built for NVIDIA GPUs. It supports knowledge-base chat through fastgpt as part of a complete LLM stack ([fastgpt] + [one-api] + [Xinference]); replying to bilibili live-stream bullet comments (danmaku) and greeting viewers as they enter the stream; speech synthesis with Microsoft edge-tts, Bert-VITS2, and GPT-SoVITS; expression control through Vtuber Studio; image generation with stable-diffusion-webui, output to the OBS live room; NSFW screening of generated images with public-NSFW-y-distinguish; search and image search through duckduckgo (requires a proxy or VPN) and image search through Baidu (no proxy required); an AI reply chat box and a playlist, both as HTML plug-ins; and AI singing with Auto-Convert-Music. Avatar actions include dancing, expression video playback, head patting, and reacting to thrown gifts; the avatar starts dancing automatically when singing and sways in a loop while chatting or singing. It also supports multi-scene switching, background music switching, and automatic day and night scene changes, and it accepts open-ended singing and drawing requests, letting the AI judge the content automatically.