aibolit

A Static Analyzer for Java Powered by Machine Learning: Identifies Anti-Patterns Begging for Refactoring

Aibolit is a machine learning-based static analyzer for Java that helps identify patterns contributing to Cyclomatic Complexity in Java source code. It provides recommendations for fixing identified issues and allows users to suppress certain patterns. Aibolit can analyze individual Java files or entire folders of Java source code. Users can customize the output format and exclude specific files from analysis. The tool also supports training custom models for analyzing Java code. Aibolit is designed to help developers improve code quality and maintainability by identifying and addressing potential issues in Java code.

README:

ML-Based Static Analyzer for Java

Learn how Aibolit works in our White Paper.

First, install it (Python 3.11+ and pip are required):

pip3 install aibolit~=1.3.0

To analyze your Java sources, located at src/java (for example), run:

aibolit check --filenames src/java/File.java src/java/AnotherFile.java

aibolit recommend --filenames src/java/File.java src/java/AnotherFile.java

Also, you can set a folder with Java files:

aibolit recommend --folder src/java

This runs the recommendation function of the model (the model is located in aibolit/binary_files/model.pkl). The model finds the pattern whose contribution to Cyclomatic Complexity is the largest. If anything is found, you will see recommendations for the patterns mentioned. The full list of patterns is in Patterns.md. The recommendations are written to stdout. The exit code reports the result:

0 means that none of the analyzed files has any issues
1 means that at least one analyzed file has an issue
2 means that the program crashed
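The exit codes described above can be checked from a script. The following is a minimal sketch, assuming aibolit is installed and available on PATH; the helper names `describe_exit_code` and `run_recommend` are hypothetical and not part of aibolit.

```python
import subprocess

# Exit-code meanings as described in the README text above.
EXIT_CODE_MEANINGS = {
    0: "no issues found in the analyzed files",
    1: "at least one analyzed file has an issue",
    2: "aibolit crashed",
}


def describe_exit_code(code: int) -> str:
    """Translate an aibolit exit code into a human-readable message."""
    return EXIT_CODE_MEANINGS.get(code, f"unexpected exit code {code}")


def run_recommend(folder: str) -> int:
    """Run `aibolit recommend --folder <folder>` and return its exit code.

    Hypothetical helper: it only wraps the command shown in the README.
    check=False keeps exit codes 1 and 2 from raising an exception.
    """
    completed = subprocess.run(
        ["aibolit", "recommend", "--folder", folder],
        check=False,
    )
    return completed.returncode
```

For example, `print(describe_exit_code(run_recommend("src/java")))` would print the message for whatever exit code the run produced.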

You can suppress certain patterns (as a comma-separated list) and they will be ignored. They won't be included in the report, and their importance will be set to 0.

aibolit recommend --folder src/java --suppress=P12,P13

You can change the output format with the --format parameter. The default value is --format=compact.

aibolit recommend --folder src/java --format=compact --full

The output lists the patterns sorted by importance in descending order and grouped by pattern name:

Show all patterns
Configuration.java score: 127.67642529949538
Configuration.java[3840]: Var in the middle (P21: 30.95612931128819 1/4)
Configuration.java[3844]: Var in the middle (P21: 30.95612931128819 1/4)
Configuration.java[3848]: Var in the middle (P21: 30.95612931128819 1/4)
Configuration.java[2411]: Null Assignment (P28: 10.76 2/4)
Configuration.java[826]: Many primary constructors (P9: 10.76 3/4)
Configuration.java[840]: Many primary constructors (P9: 10.76 3/4)
Configuration.java[829]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[841]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[865]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[2586]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[3230]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[3261]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[3727]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[3956]: Partial synchronized (P14: 0.228 4/4)
ErrorExample.java: error when calculating patterns: Can't count P1 metric:
Total score: 127.67642529949538

(P21: 30.95612931128819 1/4) means the following:

30.95612931128819 is the score of this pattern
1 is the position of this pattern in the total list of patterns found in the file
4 is the total number of found patterns
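A report line in this layout can be split into its parts with a regular expression. This is a sketch that assumes the line layout shown in the sample output above; the `Finding` type and `parse_finding` function are hypothetical names, not part of aibolit.

```python
import re
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    filename: str
    line: int
    description: str
    code: str
    score: float
    position: int  # rank of this pattern among the patterns found in the file
    total: int     # total number of patterns found in the file


# Layout assumed from the sample output:
#   <file>[<line>]: <description> (<code>: <score> <position>/<total>)
LINE_RE = re.compile(
    r"^(?P<filename>.+?)\[(?P<line>\d+)\]: (?P<description>.+) "
    r"\((?P<code>P\w+): (?P<score>[\d.]+) (?P<position>\d+)/(?P<total>\d+)\)$"
)


def parse_finding(text: str) -> Optional[Finding]:
    """Parse one report line; return None for lines that are not findings."""
    match = LINE_RE.match(text.strip())
    if match is None:
        return None
    return Finding(
        filename=match["filename"],
        line=int(match["line"]),
        description=match["description"],
        code=match["code"],
        score=float(match["score"]),
        position=int(match["position"]),
        total=int(match["total"]),
    )
```

Lines such as the per-file score or the total score do not match the pattern and yield `None`, so they can be filtered out when reading a whole report.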

You can use --format=long. In this case all results are sorted by line number:

Show all patterns
Configuration.java: some issues found
Configuration.java score: 127.67642529949538
Configuration.java[826]: Many primary constructors (P9: 10.76 3/4)
Configuration.java[829]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[840]: Many primary constructors (P9: 10.76 3/4)
Configuration.java[841]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[865]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[2411]: Null Assignment (P28: 10.76 2/4)
Configuration.java[2586]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[3230]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[3261]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[3727]: Partial synchronized (P14: 0.228 4/4)
Configuration.java[3840]: Var in the middle (P21: 30.95612931128819 1/4)
Configuration.java[3844]: Var in the middle (P21: 30.95612931128819 1/4)
Configuration.java[3848]: Var in the middle (P21: 30.95612931128819 1/4)
Configuration.java[3956]: Partial synchronized (P14: 0.228 4/4)
ErrorExample.java: error when calculating patterns: Can't count P1 metric:
MavenSlice.java: your code is perfect in aibolit's opinion
Total score: 127.67642529949538

You can also choose the xml format. The report has the same content as in compact mode, but it is written as XML:

<report>
  <score>127.67642529949538</score>
  <!--Show all patterns-->
  <files>
    <file>
      <path>Configuration.java</path>
      <summary>Some issues found</summary>
      <score>127.67642529949538</score>
      <patterns>
        <pattern code="P13">
          <details>Null check</details>
          <lines>
            <number>294</number>
            <number>391</number>
          </lines>
          <score>30.95612931128819</score>
          <order>1/4</order>
        </pattern>
        <pattern code="P12">
          <details>Non final attribute</details>
          <lines>
            <number>235</number>
          </lines>
          <score>10.76</score>
          <order>2/4</order>
        </pattern>
        <pattern code="P21">
          <details>Var in the middle</details>
          <lines>
            <number>235</number>
          </lines>
          <score>2.056</score>
          <order>3/4</order>
        </pattern>
        <pattern code="P28">
          <details>Null Assignment</details>
          <lines>
            <number>2411</number>
          </lines>
          <score>0.228</score>
          <order>4/4</order>
        </pattern>
      </patterns>
    </file>
    <file>
      <path>ErrorExample.java</path>
      <summary>Error when calculating patterns: Can't count P1 metric:</summary>
    </file>
    <file>
      <path>MavenSlice.java</path>
      <summary>Your code is perfect in aibolit's opinion</summary>
    </file>
  </files>
</report>
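A report in this XML layout can be read with Python's standard library. The sketch below assumes the element names shown in the sample above (report, files, file, path, patterns, pattern, lines, number, score); the function name `summarize_report` and the shortened `SAMPLE` document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same layout as the report above.
SAMPLE = """<report>
  <score>127.67642529949538</score>
  <!--Show all patterns-->
  <files>
    <file>
      <path>Configuration.java</path>
      <summary>Some issues found</summary>
      <score>127.67642529949538</score>
      <patterns>
        <pattern code="P13">
          <details>Null check</details>
          <lines>
            <number>294</number>
            <number>391</number>
          </lines>
          <score>30.95612931128819</score>
          <order>1/4</order>
        </pattern>
      </patterns>
    </file>
    <file>
      <path>MavenSlice.java</path>
      <summary>Your code is perfect in aibolit's opinion</summary>
    </file>
  </files>
</report>"""


def summarize_report(xml_text: str) -> dict:
    """Map each file path to a list of (pattern code, score, line numbers)."""
    root = ET.fromstring(xml_text)
    summary = {}
    for file_el in root.iter("file"):
        patterns = []
        for pattern_el in file_el.iter("pattern"):
            lines = [int(n.text) for n in pattern_el.iter("number")]
            score = float(pattern_el.findtext("score"))
            patterns.append((pattern_el.get("code"), score, lines))
        summary[file_el.findtext("path")] = patterns
    return summary
```

Files without a patterns element, such as clean files or files that failed with an error, end up with an empty list.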

The score is the relative importance of the pattern (it has no fixed range). The larger the score, the more important the pattern. E.g., if you have several patterns, you should first fix the pattern with the score 5.45:

SampleTests.java[43]: Non final attribute (P12: 5.45 1/10)
SampleTests.java[44]: Non final attribute (P12: 5.45 1/10)
SampleTests.java[80]: Var in the middle (P21: 3.71 2/10)
SampleTests.java[121]: Var in the middle (P21: 3.71 2/10)
SampleTests.java[122]: Var declaration distance for 5 lines (P20_5: 2.13 3/10)
SampleTests.java[41]: Non final class (P24: 1.95 4/10)
SampleTests.java[59]: Force Type Casting (P5: 1.45 5/10)
SampleTests.java[122]: Var declaration distance for 7 lines (P20_7: 1.07 6/10)
SampleTests.java[122]: Var declaration distance for 11 lines (P20_11: 0.78 7/10)
SampleTests.java[51]: Protected Method (P30: 0.60 8/10)
SampleTests.java[52]: Super Method (P18: 0.35 9/10)
SampleTests.java[100]: Partial synchronized (P14: 0.08 10/10)
SampleTests.java[106]: Partial synchronized (P14: 0.08 10/10)
SampleTests.java[113]: Partial synchronized (P14: 0.08 10/10)

The score per class is the sum of all pattern scores.

SampleTests.java score: 17.54698560768407

The total score is the average over all Java files in a project (the folder you've set to analyze):

Total average score: 4.0801854775508914

If you compare the scores of two different projects, the worse project is the one with the higher score.
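The arithmetic behind the two scores can be sketched as follows. The pattern scores are the rounded values from the SampleTests.java listing above, and the sketch assumes each distinct pattern is counted once, as those numbers suggest. The rounded values add up to 17.57, close to the reported 17.54698560768407; the gap is presumably due to rounding in the listing. The function names and the second file's score are hypothetical.

```python
from statistics import mean

# Rounded per-pattern scores taken from the SampleTests.java listing above.
SAMPLE_TESTS_PATTERNS = {
    "P12": 5.45,
    "P21": 3.71,
    "P20_5": 2.13,
    "P24": 1.95,
    "P5": 1.45,
    "P20_7": 1.07,
    "P20_11": 0.78,
    "P30": 0.60,
    "P18": 0.35,
    "P14": 0.08,
}


def file_score(pattern_scores: dict) -> float:
    """Score of one class: the sum of its pattern scores (each pattern once)."""
    return sum(pattern_scores.values())


def project_score(file_scores: list) -> float:
    """Total score of a project: the average of the per-file scores."""
    return mean(file_scores)


sample_tests_score = file_score(SAMPLE_TESTS_PATTERNS)
# Hypothetical second file with a made-up score, only to show the averaging.
average = project_score([sample_tests_score, 3.0])
```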

The model is installed automatically with the aibolit package, but you can also try your own model:

aibolit recommend --folder src/java --model /mnt/d/some_folder/model.pkl

You can get the full report with the --full option; then all patterns will be included in the output:

aibolit recommend --folder src/java --full

You can exclude files with the --exclude option. You need to set the glob patterns to ignore:

aibolit recommend --folder src/java \
  --exclude=**/*Test*.java --exclude=**/*Impl*.java
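To get a feel for which files such glob patterns would cover, the sketch below approximates the matching with Python's `fnmatch` module. This is only an illustration: aibolit's own glob handling may differ in details, and `is_excluded` is a hypothetical helper, not part of aibolit.

```python
from fnmatch import fnmatchcase

# The two patterns from the command above.
EXCLUDES = ["**/*Test*.java", "**/*Impl*.java"]


def is_excluded(path: str, patterns: list) -> bool:
    """Return True if the path matches at least one glob pattern.

    fnmatchcase lets `*` match any characters, including `/`, so this is
    an approximation of shell-style globbing, not an exact equivalent.
    """
    return any(fnmatchcase(path, pattern) for pattern in patterns)


candidates = [
    "src/java/FooTest.java",
    "src/java/service/UserImpl.java",
    "src/java/Foo.java",
]
kept = [path for path in candidates if not is_excluded(path, EXCLUDES)]
```

With these patterns only `src/java/Foo.java` stays in `kept`.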

If you need help, run:

aibolit recommend --help

How to retrain it?

The train command does the following:

Calculates patterns and metrics
Creates a dataset
Trains the model and saves it

Training works only with a cloned git repository.

1. Clone the aibolit repository.
2. Go to cloned_aibolit_path.
3. Run pip install .
4. Set the environment variable: export HOME_AIBOLIT=cloned_aibolit_path (example for Linux).
5. Set the environment variable TARGET_FOLDER if you need to save all dataset files to another directory.
6. Specify the train and test datasets: set the HOME_TRAIN_DATASET environment variable for the train dataset and the HOME_TEST_DATASET environment variable for the test dataset.

Usually, these files are in the scripts/target/08 directory after dataset collection (if you have not skipped it). But you can use your own datasets.

Please note that if you set TARGET_FOLDER, your dataset files will be in TARGET_FOLDER/target. That is why it is necessary to set HOME_TRAIN_DATASET=TARGET_FOLDER\target\08\08-train.csv and HOME_TEST_DATASET=TARGET_FOLDER\target\08\08-test.csv.
7. If you need your own directory for the saved model, also set the SAVE_MODEL_FOLDER environment variable. Otherwise the model will be saved to cloned_aibolit_path/aibolit/binary_files/model.pkl.
8. If you need your own folder with Java files, use the --java_folder parameter; the default value is scripts/target/01 of the cloned aibolit repo.

Or you can use our Docker image (a link will be added here soon).
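The environment variables from the setup steps above can be assembled as in the following sketch. The variable names and the target/08 layout come from the text above; the helper `build_train_env` is hypothetical, and reusing the file names 08-train.csv and 08-test.csv for the default scripts/target/08 location is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional


def build_train_env(home_aibolit: str, target_folder: Optional[str] = None) -> dict:
    """Build the environment variables used by the train pipeline.

    Hypothetical helper. If target_folder is given, the datasets are expected
    under <target_folder>/target/08; otherwise under
    <home_aibolit>/scripts/target/08 (file names assumed to be the same).
    """
    env = {"HOME_AIBOLIT": home_aibolit}
    if target_folder is not None:
        env["TARGET_FOLDER"] = target_folder
        base = PurePosixPath(target_folder) / "target"
    else:
        base = PurePosixPath(home_aibolit) / "scripts" / "target"
    env["HOME_TRAIN_DATASET"] = str(base / "08" / "08-train.csv")
    env["HOME_TEST_DATASET"] = str(base / "08" / "08-test.csv")
    return env
```

The resulting dictionary could be merged into `os.environ` before invoking `aibolit train`, for example through the `env` argument of `subprocess.run`.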

Run the train pipeline:

aibolit train --java_folder=src/java [--max_classes=100] [--dataset_file]

If you need to save the dataset with all calculated metrics to a different directory, use the --dataset_file parameter:

aibolit train --java_folder=src/java --dataset_file /mnt/d/new_dir/dataset.csv

You can skip dataset collection with the --skip_collect_dataset parameter. In this case the model will be trained with the predefined dataset (see point 5):

aibolit train --java_folder=src/java --skip_collect_dataset

How to contribute?

First, install the following packages if you don't have them:

apt-get install ruby-dev libz-dev libxml2

This project does not include a virtual environment by default. If you're using one (e.g., .venv, venv), update the .xcop file to exclude it:

--exclude=.venv/**

After forking and editing the repo, verify the build is clean by running:

make

To build the white paper:

cd wp
latexmk -c && latexmk -pdf wp.tex

If everything is fine, submit a pull request.

Using the Docker recommendation pipeline

docker run --rm -it \
  -v <absolute_path_to_folder_with_classes>:/in \
  -v <absolute_path_to_out_dir>:/out \
  cqfn/aibolit-image
