CodeLLMPaper

A continuously updated collection of CodeLLM papers maintained by PurCL group @ Purdue

Stars: 356

Visit

CodeLLM Paper repository provides a curated list of research papers focused on Large Language Models (LLMs) for code. It aims to facilitate researchers and practitioners in exploring the rapidly growing body of literature on this topic. The papers are systematically collected from various top-tier venues, categorized, and labeled for easier navigation. The selection strategy involves abstract extraction, keyword matching, relevance check using LLMs, and manual labeling. The papers are categorized based on Application, Principle, and Research Paradigm dimensions. Contributions to expand the repository are welcome through PR submission, issue submission, or request for batch updates. The repository is intended solely for research purposes, with raw data sourced from publicly available information on ACM, IEEE, and corresponding conference websites.

README:

CodeLLM Paper

This repository provides a curated list of research papers focused on Large Language Models (LLMs) for code. It aims to facilitate researchers and practitioners in exploring the rapidly growing body of literature on this topic. The papers are systematically collected from various top-tier venues, categorized, and labeled for easier navigation.

A. Venues
B. Selection Strategy
C. Taxonomy
D. How to Contribute
E. Disclaimer and Contract

A. Venues

We have systematically selected papers from the following venues, which are top-tier conferences and journals in SE/PL/Sec/NLP communities.

Software Engineering (SE)
- ICSE2023, FSE2023, ASE2023, ISSTA2023, TSE2023, TOSEM2023
- ICSE2024, FSE2024, ASE2024, ISSTA2024, TSE2024, TOSEM2024
Programming Languages (PL)
- PLDI2023, OOPSLA2023
- OOPSLA2024
- POPL2025, CC2025
Security (Sec)
- S&P2023, USENIXSec2023, CCS2023, NDSS2023
- S&P2024, USENIXSec2024, NDSS2024, CCS2024
- NDSS2025
Natural Language Processing (NLP)
- ACL2023, EMNLP2023, NAACL2023
- ACL2024, EMNLP2024, NAACL2024

Due to the large volume, we do not systematically collect the papers published in top-tier ML conferences (ICML, NeurIPS, and ICLR) and arXiv. However, we are keeping manually adding important works published in these venues. We plan to expand the collection over time, and contributions are welcome. For details, see the section How to Contribute.

B. Selection Strategy

Abstract Extraction: Extract the abstracts from bib files or HTML files. The bib and HTML files of the above listed venues are stored in the directory data/rawdata.
Keyword Matching: Filter abstracts that meet both of the following conditions:
- Contains at least one keyword from: {"pretrain", "LLM", "large language model", "transformer", "code model"}.
- Contains the keyword "code" or "program".
Relevance Check Using LLMs: Use LLMs to verify if the papers obtained in Step 2 are related to LLMs for code.
Manual Labeling: Manually assign labels to the papers based on domain knowledge.

All the selected papers along with the labels are maintained in the json file data/labeldata/labeldata.json. src/process.py is the python script used for selecting and labeling papers.

C. Taxonomy

The papers in this repository are categorized along three dimensions: Application, Principle, and Research Paradigm. Each paper is assigned multiple labels based on these categories. Note that categories are not necessarily disjoint.

C.1. Application

This category focuses on typical tasks in Software Engineering (SE) and Programming Languages (PL).

General Coding Task (32)
Code Generation (199)
- Program Synthesis (83)
- Code Completion (23)
- Program Repair (41)
- Program Transformation (32)
Program Testing (55)
- General Testing (1)
- Fuzzing (24)
- Library Testing (1)
- DBMS Testing (1)
- Compiler Testing (4)
- Protocol Fuzzing (1)
- Mutation Testing (2)
- Unit Testing (7)
- Differential Testing (2)
- Debugging (9)
- Bug Reproduction (2)
- Vulnerability Exploitation (6)
Static Analysis (149)
- Syntactic Analysis (1)
- Pointer Analysis (3)
- Call Graph Analysis (2)
- Data-flow Analysis (8)
- Type Inference (3)
- Specification Inference (17)
- Equivalence Checking (1)
- Code Similarity Analysis (5)
- Bug Detection (77)
- Program Verification (20)
- Program Optimization (4)
- Program Decompilation (9)
- Code Summarization (10)
- Code Search (5)
- Software Composition Analysis (3)
Software Maintenance and Deployment (18)
- Code Review (6)
- Documentation Generation (2)
- Commit Message Generation (4)
- Software Configuration (1)
- System Log Analysis (4)

C.2. Principle

This category concentrates on the LLMs' ability in understanding different forms of code and the non-functional properties of the LLMs (e.g., security and robustness). We also consider how to utilize the LLMs for general reasoning problems, such as typical agent-centric designs and specific PL designs for LLMs.

Code Model (112)
- Code Model Training (84)
  - Source Code Model (64)
  - IR Code Model (5)
  - Binary Code Model (15)
- Code Model Security (23)
- Code Model Robustness (4)
Hallucination In Reasoning (12)
PL Design For LLMs (3)
Agent Design (58)
- Prompt Strategy (39)
  - Retrieval-augmented Generation (9)
  - Reason With Code (17)
  - Sampling And Ranking (3)
- Planning (9)

C.3. Research Paradigm

This category includes studies on benchmarks, empirical evaluations, and surveys. The papers that do not belong to the following three categories are purely technical papers.

D. How to Contribute

D.1. PR Submission

We welcome contributions to expand this repository. If you want to add new papers to the list, please follow these steps:

Prepare a JSON File: Format the file like data/labeldata/patch/example.json. Each paper should include:
- title, authors, abstract, url, venue, and labels (aligned with the taxonomy in data/labeldata/patch).
Upload the File: Place the JSON file in the data/labeldata/patch directory.
Update Markdown Files: Run the following command to update the repository:
```
cd src && python patch.py
```

If you want to add new labels and change the current taxonomy, please post an issue first and suggest your taxonomy (See below).

D.2. Issue Submission

Another option is to post the papers you wish to add in an issue. Please include a permanently valid link to the paper and specify the venue. If you'd like, you can also categorize the paper based on your understanding of the work by attaching appropriate labels from the existing options in data/category.json or by creating new ones. We will add the paper to our repository very soon.

D.3. Request for Batch Updates

To facilitate timely batch updates to the paper repository, we prefer to utilize the proceedings of various conferences and journals. Here are several examples: ASE2024, OOPSLA2023, S&P2023, and ACL2024. By parsing and extracting information from bib files and HTML files (See data/rawdata), including abstracts, we can semi-automatically classify papers based on the aforementioned selection strategy. If the conference or journal you are following has recently released its complete proceedings, please notify us by submitting an issue. We will prioritize the batch update and add the corresponding conference or journal name to the venue list.

E. Disclaimer and Contact

This paper repository is intended solely for research purposes. All raw data is sourced from publicly available information on ACM, IEEE, and corresponding conference websites. Any content involving additional copyright information, including full PDF versions of the papers, is not disclosed in this repository.

For any questions or suggestions, please contact [email protected] or [email protected]

For Tasks:

Click tags to check more tools for each tasks

explore papers label papers contribute papers update taxonomy research code models

For Jobs:

researcher data scientist software engineer machine learning engineer academic

Alternative AI tools for CodeLLMPaper

Similar Open Source Tools

CodeLLMPaper

github

: 356

ExtractThinker

ExtractThinker is a library designed for extracting data from files and documents using Language Model Models (LLMs). It offers ORM-style interaction between files and LLMs, supporting multiple document loaders such as Tesseract OCR, Azure Form Recognizer, AWS TextExtract, and Google Document AI. Users can customize extraction using contract definitions, process documents asynchronously, handle various document formats efficiently, and split and process documents. The project is inspired by the LangChain ecosystem and focuses on Intelligent Document Processing (IDP) using LLMs to achieve high accuracy in document extraction tasks.

github

: 1.1k

glossAPI

The glossAPI project aims to develop a Greek language model as open-source software, with code licensed under EUPL and data under Creative Commons BY-SA. The project focuses on collecting and evaluating open text sources in Greek, with efforts to prioritize and gather textual data sets. The project encourages contributions through the CONTRIBUTING.md file and provides resources in the wiki for viewing and modifying recorded sources. It also welcomes ideas and corrections through issue submissions. The project emphasizes the importance of open standards, ethically secured data, privacy protection, and addressing digital divides in the context of artificial intelligence and advanced language technologies.

github

: 101

AIL-framework

AIL framework is a modular framework to analyze potential information leaks from unstructured data sources like pastes from Pastebin or similar services or unstructured data streams. AIL framework is flexible and can be extended to support other functionalities to mine or process sensitive information (e.g. data leak prevention).

github

: 1.3k

ail-framework

github

: 793

swiftide

Swiftide is a fast, streaming indexing and query library tailored for Retrieval Augmented Generation (RAG) in AI applications. It is built in Rust, utilizing parallel, asynchronous streams for blazingly fast performance. With Swiftide, users can easily build AI applications from idea to production in just a few lines of code. The tool addresses frustrations around performance, stability, and ease of use encountered while working with Python-based tooling. It offers features like fast streaming indexing pipeline, experimental query pipeline, integrations with various platforms, loaders, transformers, chunkers, embedders, and more. Swiftide aims to provide a platform for data indexing and querying to advance the development of automated Large Language Model (LLM) applications.

github

: 572

cleanlab

Cleanlab helps you **clean** data and **lab** els by automatically detecting issues in a ML dataset. To facilitate **machine learning with messy, real-world data** , this data-centric AI package uses your _existing_ models to estimate dataset problems that can be fixed to train even _better_ models.

github

: 10.2k

mobius

Mobius is an AI infra platform including realtime computing and training. It is built on Ray, a distributed computing framework, and provides a number of features that make it well-suited for online machine learning tasks. These features include: * **Cross Language**: Mobius can run in multiple languages (only Python and Java are supported currently) with high efficiency. You can implement your operator in different languages and run them in one job. * **Single Node Failover**: Mobius has a special failover mechanism that only needs to rollback the failed node itself, in most cases, to recover the job. This is a huge benefit if your job is sensitive about failure recovery time. * **AutoScaling**: Mobius can generate a new graph with different configurations in runtime without stopping the job. * **Fusion Training**: Mobius can combine TensorFlow/Pytorch and streaming, then building an e2e online machine learning pipeline. Mobius is still under development, but it has already been used to power a number of real-world applications, including: * A real-time recommendation system for a major e-commerce company * A fraud detection system for a large financial institution * A personalized news feed for a major news organization If you are interested in using Mobius for your own online machine learning projects, you can find more information in the documentation.

github

: 78

learn-agentic-ai

Learn Agentic AI is a repository that is part of the Panaversity Certified Agentic and Robotic AI Engineer program. It covers AI-201 and AI-202 courses, providing fundamentals and advanced knowledge in Agentic AI. The repository includes video playlists, projects, and project submission guidelines for students to enhance their understanding and skills in the field of AI engineering.

github

: 3.7k

SLAM-LLM

SLAM-LLM is a deep learning toolkit designed for researchers and developers to train custom multimodal large language models (MLLM) focusing on speech, language, audio, and music processing. It provides detailed recipes for training and high-performance checkpoints for inference. The toolkit supports tasks such as automatic speech recognition (ASR), text-to-speech (TTS), visual speech recognition (VSR), automated audio captioning (AAC), spatial audio understanding, and music caption (MC). SLAM-LLM features easy extension to new models and tasks, mixed precision training for faster training with less GPU memory, multi-GPU training with data and model parallelism, and flexible configuration based on Hydra and dataclass.

github

: 343

FinRobot

FinRobot is an open-source AI agent platform designed for financial applications using large language models. It transcends the scope of FinGPT, offering a comprehensive solution that integrates a diverse array of AI technologies. The platform's versatility and adaptability cater to the multifaceted needs of the financial industry. FinRobot's ecosystem is organized into four layers, including Financial AI Agents Layer, Financial LLMs Algorithms Layer, LLMOps and DataOps Layers, and Multi-source LLM Foundation Models Layer. The platform's agent workflow involves Perception, Brain, and Action modules to capture, process, and execute financial data and insights. The Smart Scheduler optimizes model diversity and selection for tasks, managed by components like Director Agent, Agent Registration, Agent Adaptor, and Task Manager. The tool provides a structured file organization with subfolders for agents, data sources, and functional modules, along with installation instructions and hands-on tutorials.

github

: 1.0k

starwhale

Starwhale is an MLOps/LLMOps platform that brings efficiency and standardization to machine learning operations. It streamlines the model development lifecycle, enabling teams to optimize workflows around key areas like model building, evaluation, release, and fine-tuning. Starwhale abstracts Model, Runtime, and Dataset as first-class citizens, providing tailored capabilities for common workflow scenarios including Models Evaluation, Live Demo, and LLM Fine-tuning. It is an open-source platform designed for clarity and ease of use, empowering developers to build customized MLOps features tailored to their needs.

github

: 196

Mooncake

Mooncake is a serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates prefill and decoding clusters, leveraging underutilized CPU, DRAM, and SSD resources of the GPU cluster. Mooncake's scheduler balances throughput and latency-related SLOs, with a prediction-based early rejection policy for highly overloaded scenarios. It excels in long-context scenarios, achieving up to a 525% increase in throughput while handling 75% more requests under real workloads.

github

: 4.0k

ai-optimizer

The Oracle AI Optimizer and Toolkit provides a streamlined environment for developers and data scientists to explore Generative Artificial Intelligence (GenAI) and Retrieval-Augmented Generation (RAG) capabilities. It integrates Oracle Database 23ai AI VectorSearch and SelectAI to enhance Large Language Models (LLMs) through RAG.

github

: 53

agentcloud

AgentCloud is an open-source platform that enables companies to build and deploy private LLM chat apps, empowering teams to securely interact with their data. It comprises three main components: Agent Backend, Webapp, and Vector Proxy. To run this project locally, clone the repository, install Docker, and start the services. The project is licensed under the GNU Affero General Public License, version 3 only. Contributions and feedback are welcome from the community.

github

: 583

AIforEarthDataSets

The Microsoft AI for Earth program hosts geospatial data on Azure that is important to environmental sustainability and Earth science. This repo hosts documentation and demonstration notebooks for all the data that is managed by AI for Earth. It also serves as a "staging ground" for the Planetary Computer Data Catalog.

github

: 263

For similar tasks

CodeLLMPaper

github

: 356

For similar jobs

asreview

The ASReview project implements active learning for systematic reviews, utilizing AI-aided pipelines to assist in finding relevant texts for search tasks. It accelerates the screening of textual data with minimal human input, saving time and increasing output quality. The software offers three modes: Oracle for interactive screening, Exploration for teaching purposes, and Simulation for evaluating active learning models. ASReview LAB is designed to support decision-making in any discipline or industry by improving efficiency and transparency in screening large amounts of textual data.

github

: 709

NewEraAI-Papers

The NewEraAI-Papers repository provides links to collections of influential and interesting research papers from top AI conferences, along with open-source code to promote reproducibility and provide detailed implementation insights beyond the scope of the article. Users can stay up to date with the latest advances in AI research by exploring this repository. Contributions to improve the completeness of the list are welcomed, and users can create pull requests, open issues, or contact the repository owner via email to enhance the repository further.

github

: 65

cltk

The Classical Language Toolkit (CLTK) is a Python library that provides natural language processing (NLP) capabilities for pre-modern languages. It offers a modular processing pipeline with pre-configured defaults and supports almost 20 languages. Users can install the latest version using pip and access detailed documentation on the official website. The toolkit is designed to meet the unique needs of researchers working with historical languages, filling a void in the NLP landscape that often neglects non-spoken languages and different research goals.

github

: 819

Conference-Acceptance-Rate

The 'Conference-Acceptance-Rate' repository provides acceptance rates for top-tier AI-related conferences in the fields of Natural Language Processing, Computational Linguistics, Computer Vision, Pattern Recognition, Machine Learning, Learning Theory, Artificial Intelligence, Data Mining, Information Retrieval, Speech Processing, and Signal Processing. The data includes acceptance rates for long papers and short papers over several years for each conference, allowing researchers to track trends and make informed decisions about where to submit their work.

github

: 4.0k

pdftochat

PDFToChat is a tool that allows users to chat with their PDF documents in seconds. It is powered by Together AI and Pinecone, utilizing a tech stack including Next.js, Mixtral, M2 Bert, LangChain.js, MongoDB Atlas, Bytescale, Vercel, Clerk, and Tailwind CSS. Users can deploy the tool to Vercel or any other host by setting up Together.ai, MongoDB Atlas database, Bytescale, Clerk, and Vercel. The tool enables users to interact with PDFs through chat, with future tasks including adding features like trash icon for deleting PDFs, exploring different embedding models, implementing auto scrolling, improving replies, benchmarking accuracy, researching chunking and retrieval best practices, adding demo video, upgrading to Next.js 14, adding analytics, customizing tailwind prose, saving chats in postgres DB, compressing large PDFs, implementing custom uploader, session tracking, error handling, and support for images in PDFs.

github

: 916

tods-arxiv-daily-paper

This repository provides a tool for fetching and summarizing daily papers from the arXiv repository. It allows users to stay updated with the latest research in various fields by automatically retrieving and summarizing papers on a daily basis. The tool simplifies the process of accessing and digesting academic papers, making it easier for researchers and enthusiasts to keep track of new developments in their areas of interest.

github

: 125

Awesome-LLM-Strawberry

Awesome LLM Strawberry is a collection of research papers and blogs related to OpenAI Strawberry(o1) and Reasoning. The repository is continuously updated to track the frontier of LLM Reasoning.

github

: 6.3k

Call-for-Reviewers

The `Call-for-Reviewers` repository aims to collect the latest 'call for reviewers' links from various top CS/ML/AI conferences/journals. It provides an opportunity for individuals in the computer/ machine learning/ artificial intelligence fields to gain review experience for applying for NIW/H1B/EB1 or enhancing their CV. The repository helps users stay updated with the latest research trends and engage with the academic community.

github

: 688

CodeLLMPaper

README:

CodeLLM Paper

Table of Contents

A. Venues

B. Selection Strategy

C. Taxonomy

C.1. Application

C.2. Principle

C.3. Research Paradigm

D. How to Contribute

D.1. PR Submission

D.2. Issue Submission

D.3. Request for Batch Updates

E. Disclaimer and Contact

For Tasks:

For Jobs:

Alternative AI tools for CodeLLMPaper

Similar Open Source Tools

CodeLLMPaper

ExtractThinker

glossAPI

AIL-framework

ail-framework

swiftide

cleanlab

mobius

learn-agentic-ai

SLAM-LLM

FinRobot

starwhale

Mooncake

ai-optimizer

agentcloud

AIforEarthDataSets

For similar tasks

CodeLLMPaper

For similar jobs

asreview

NewEraAI-Papers

cltk

Conference-Acceptance-Rate

pdftochat

tods-arxiv-daily-paper

Awesome-LLM-Strawberry

Call-for-Reviewers