can-ai-code

Self-evaluating interview for AI coders

Stars: 511

Can AI Code is a self-evaluating interview tool for AI coding models. It includes interview questions written by humans and tests taken by AI, inference scripts for common API providers and CUDA-enabled quantization runtimes, a Docker-based sandbox environment for validating untrusted Python and NodeJS code, and the ability to evaluate the impact of prompting techniques and sampling parameters on large language model (LLM) coding performance. Users can also assess LLM coding performance degradation due to quantization. The tool provides test suites for evaluating LLM coding performance, a webapp for exploring results, and comparison scripts for evaluations. It supports multiple interviewers for API and CUDA runtimes, with detailed instructions on running the tool in different environments. The repository structure includes folders for interviews, prompts, parameters, evaluation scripts, comparison scripts, and more.

README:

Can AI Code?

A self-evaluating interview for AI coding models.

Key Ideas

  • Interview questions written by humans, test taken by AI
  • Inference scripts for all common API providers and CUDA-enabled quantization runtimes
  • Sandbox environment (Docker-based) for untrusted Python and NodeJS code validation
  • Evaluate effects of prompting techniques and sampling parameters on LLM coding performance
  • Evaluate LLM coding performance degradation due to quantization

News

9/04 Evaluate Command-R and Command-R Plus (API).

8/25 Evaluate NTQAI/Nxcode-CQ-7B-orpo (FP16), PHI 3.5 Mini and MoE (FP16), codefuse-ai/CodeFuse-StarCoder2-15B (FP16), internlm/AlchemistCoder-CL-7B (FP16), InternLM2.5 Chat 7B and 20B (FP16).

8/11 Evaluate Llama-3.1-Instruct 8B HQQ.

8/10 Evaluate Llama-3.1-Instruct 8B and 70B EXL2 and some low-bit GGUFs.

8/1 Evaluate Llama-3-Instruct 8B and 70B with AQLM-2bit. Very slow. 8B is badly damaged.

Test Suites

junior-v2 is a multi-language (Python, JavaScript) suite of 12 tests created for this project to test small LLM coding performance. This project provides all necessary components to execute this evaluation.

🚧 humaneval is a Python-only suite of 164 tests created by OpenAI. This project provides template scripts to prepare and execute the humaneval interview, as well as result extraction scripts to help their evaluator. See https://github.com/openai/human-eval for more information.

Results data

All model answers and evaluation results are now included in this repository. To run the webapps locally, install a recent release of streamlit (pip install streamlit==1.23), then run streamlit run app.py or streamlit run compare-app.py.

Results HumanEval

🚧 humaneval/ development work is currently paused; there are other projects that are much further along.

See https://github.com/my-other-github-account/llm-humaneval-benchmarks and https://github.com/abacaj/code-eval for large lists of Humaneval LLM benchmark results.

Repository Structure

Interviews

  • junior-v2/*.yaml - junior coder interview questions (stable)
  • senior/*.yaml - senior coder interview questions (WIP)

Prepare

  • prompts/*.txt - LLM prompt templates for the various models
  • prepare.py - Applies templates to questions, turning them into language- and model-specific prompts suitable for interview

Prompts

See prompts/ for all prompts referenced in the leaderboard.

Interview

  • params/*.json - Sampling hyper-parameter sets (used by all interview scripts)
  • interview-*.py - Interview scripts

Parameters

See params/ for all params referenced in the leaderboard.

Evaluate

Compare

Interviewers: API

API Runtime              | Script
LiteLLM (OpenAI, etc.)   | interview-litellm.py
OobaBooga/KoboldCpp      | interview-oobabooga.py
Huggingface Inference    | interview-hfinference.py
Gradio (HF Spaces)       | interview-gradio.py

Interviewers: CUDA (Local)

Quantization Type       | Script                 | Dependency
GGUF                    | interview-llamacpp.py  | llamacpp or ggml binary
GPTQ (AutoGPTQ)         | interview-cuda.py      | auto-gptq==0.6.0
GPTQ (ExLlama)          | interview-cuda.py      | exllama @ 3b013cd53c7d413cf99ca04c7c28dd5c95117c0d
EXL2, GPTQ (ExLlama2)   | interview-cuda.py      | exllamav2 @ 0.0.12
HQQ                     | interview-cuda.py      | hqq @ 0.1.1
AWQ, FP16 (vLLM)        | interview-cuda.py      | vllm==0.3.0
CTranslate2             | interview-cuda.py      | ctranslate2>=3.16.0
bitsandbytes            | interview-cuda.py      | bitsandbytes==0.41.3
FP16 (Transformers)     | interview-cuda.py      | transformers==4.37.2

Running on Modal

The recommended Modal wrapper is interview_modal_cuda11.py, which builds a CUDA 11.8-based container with all of the above dependencies working. An interview_modal_cuda12.py is also provided, but AutoGPTQ and CTranslate2 are not compatible with it.

Unfortunately, the nature of Modal does not allow command-line selection of either the LLM model or the runtime engine.

To select models, open the script and uncomment the .run_function(download...) line of choice. Note that only one model can be selected at a time. To add a new model, implement a new download... function.

To select runtime, open the script and uncomment one of the RUNTIME options. Note that for transformers you must also specify QUANT.

Question Format

A set of interview questions is a folder of .yaml files. Each Question is a top-level key:

SanityList:
    Signature: "things()"
    Input: "with no inputs"
    Output: "a list with three values: the number 5, the string 'foobar', the capital city of Spain"
    Fact: "the capital city of Spain is Madrid"
    Description: "List function, see if the model can combine input facts with internal knowledge."
    Checks:
        input_name:
            assert: "f.name"
            eq: "things"

In this example SanityList is the name of the interview question.

The first four fields are used by prepare.py to create the interview:

  • Signature is the desired function signature
  • Input describes the function inputs
  • Output describes the function outputs
  • Fact is optional and provides any context that is required to correctly perform the task

These 4 variables along with language (either python or javascript) are used to expand templates in prompts/.
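The expansion step can be pictured with a small sketch. The template text below is hypothetical (the real templates live in prompts/*.txt and may use a different placeholder syntax), and expand_prompt is an illustrative helper, not a function from prepare.py.

```python
# Hypothetical template text; the real files in prompts/ may look different.
TEMPLATE = (
    "Write a {language} function {Signature} {Input} "
    "that returns {Output}. {Fact}"
)

# The SanityList question from above, as it would look after YAML parsing.
question = {
    "Signature": "things()",
    "Input": "with no inputs",
    "Output": "a list with three values: the number 5, the string 'foobar', "
              "the capital city of Spain",
    "Fact": "the capital city of Spain is Madrid",
}


def expand_prompt(template: str, question: dict, language: str) -> str:
    """Fill the four prompt fields plus the language into a template."""
    fields = {
        "Signature": question["Signature"],
        "Input": question["Input"],
        "Output": question["Output"],
        "Fact": question.get("Fact", ""),  # Fact is optional
        "language": language,
    }
    return template.format(**fields).strip()


prompt = expand_prompt(TEMPLATE, question, "python")
print(prompt)
```

Treating Fact as an empty string when it is missing keeps a single template usable for questions with and without extra context.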

The last two fields are used by evaluate.py to judge the results:

  • Description is a human-readable explanation of why this test is useful
  • Checks defines the expected behavior of the output.

Checks and the 'f' object

Each check has a name, some assert value (python code) and an expected eq value.

The f object represents the sandbox view of the function. Static analysis is performed on the function signature to extract the f.name and f.args fields, while f.call allows for function evaluation.
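A minimal sketch of how a check might be resolved against f, assuming a Python answer. FunctionView and run_check are hypothetical names, f.args is assumed to be a list of argument names, and the f.call() check is an illustrative addition to the f.name check shown earlier. The project evaluates untrusted code inside a Docker sandbox; this sketch calls exec() in the current process only to stay short.

```python
import ast


class FunctionView:
    """Hypothetical stand-in for the `f` object described above."""

    def __init__(self, source: str):
        self.source = source
        tree = ast.parse(source)
        # Static analysis: take the first function definition in the answer.
        func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
        self.name = func.name
        self.args = [a.arg for a in func.args.args]

    def call(self, *args):
        # WARNING: exec() runs the answer in this process. Untrusted model
        # output belongs in a sandbox, not here.
        namespace = {}
        exec(self.source, namespace)
        return namespace[self.name](*args)


def run_check(f: FunctionView, check: dict) -> bool:
    """Evaluate the `assert` expression against f and compare it with `eq`."""
    actual = eval(check["assert"], {"f": f})
    return actual == check["eq"]


# An illustrative model answer to the SanityList question.
answer = "def things():\n    return [5, 'foobar', 'Madrid']\n"
f = FunctionView(answer)

print(run_check(f, {"assert": "f.name", "eq": "things"}))
print(run_check(f, {"assert": "f.call()", "eq": [5, "foobar", "Madrid"]}))
```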

Output formats

All scripts output automatically named .ndjson files to the results/ directory.

Each stage outputs a super-set of fields from the stage before it, so it's possible to feed eval/interview back to interview (to re-run the questions) or back to eval (to re-run the eval).
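The super-set property can be illustrated with a short round-trip through the .ndjson format (one JSON object per line). The helper names and the field values here are illustrative assumptions, not taken from the project's scripts.

```python
import io
import json


def write_ndjson(rows, handle):
    """Write one JSON object per line."""
    for row in rows:
        handle.write(json.dumps(row) + "\n")


def read_ndjson(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


# A prepare-stage row (values are illustrative).
prepare_row = {"name": "SanityList", "language": "python", "prompt": "..."}

# The interview stage adds its own fields on top of the prepare row.
interview_row = {**prepare_row, "model": "example-model", "answer": "def things(): ..."}

# An in-memory buffer stands in for a file under results/.
buffer = io.StringIO()
write_ndjson([interview_row], buffer)
buffer.seek(0)
rows = read_ndjson(buffer)

# Every prepare field survives, so the row can be fed back to an earlier stage.
print(set(prepare_row) <= set(rows[0]))
```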

prepare

results/prepare_{interview}_{languages}_{template}.ndjson

Fields:

  • all Question fields (Signature, Input, Output, Fact, Description)
  • name
  • language
  • prompt

interview

results/interview_{interview}_{languages}_{template}_{templateout}_{params}_{model}_{timestamp}.ndjson

Fields:

  • all prepare fields
  • model
  • params
  • answer
  • runtime

eval

results/eval_{interview}_{languages}_{template}_{templateout}_{params}_{model}_{timestamp}.ndjson

Fields:

  • all interview fields
  • status
  • passed
  • total
  • checks
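As a rough sketch of how eval output could be consumed, the snippet below totals passed and total per language. It assumes both fields are integer counts of checks, which the field list above does not state explicitly; the sample rows and the summarize helper are made up for illustration.

```python
import json
from collections import defaultdict

# Illustrative eval rows; real files under results/ carry many more fields.
SAMPLE = "\n".join(json.dumps(r) for r in [
    {"name": "SanityList", "language": "python", "passed": 3, "total": 3},
    {"name": "SanityList", "language": "javascript", "passed": 2, "total": 3},
    {"name": "OtherTest", "language": "python", "passed": 1, "total": 4},
])


def summarize(ndjson_text: str) -> dict:
    """Sum passed/total per language across all eval rows."""
    totals = defaultdict(lambda: [0, 0])
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        totals[row["language"]][0] += row["passed"]
        totals[row["language"]][1] += row["total"]
    return {lang: (p, t) for lang, (p, t) in totals.items()}


summary = summarize(SAMPLE)
for language, (passed, total) in summary.items():
    print(f"{language}: {passed}/{total}")
```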

Roadmap / Future Work
