
web-codegen-scorer
Web Codegen Scorer is a tool for evaluating the quality of web code generated by LLMs.
Stars: 237

Web Codegen Scorer is a tool designed to evaluate the quality of web code generated by Large Language Models (LLMs). It allows users to make evidence-based decisions related to AI-generated code by iterating on system prompts, comparing code quality from different models, and monitoring code quality over time. The tool focuses specifically on web code and offers various features such as configuring evaluations, specifying system instructions, using built-in checks for code quality, automatically repairing issues, and viewing results with an intuitive report viewer UI.
README:
Web Codegen Scorer is a tool for evaluating the quality of web code generated by Large Language Models (LLMs).
You can use this tool to make evidence-based decisions relating to AI-generated code. For example:
- 🔄 Iterate on a system prompt to find the most effective instructions for your project.
- ⚖️ Compare the code quality of code produced by different models.
- 📈 Monitor generated code quality over time as models and agents evolve.
Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on web code and relies primarily on well-established measures of code quality.
- ⚙️ Configure your evaluations with different models, frameworks, and tools.
- ✍️ Specify system instructions and add MCP servers.
- 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and coding best practices. (More built-in checks coming soon!)
- 🔧 Automatically attempt to repair issues detected during code generation.
- 📊 View and compare results with an intuitive report viewer UI.
- Install the package:
npm install -g web-codegen-scorer
- Set up your API keys:
In order to run an eval, you have to specify API keys for the relevant providers as environment variables:
export GEMINI_API_KEY="YOUR_API_KEY_HERE" # If you're using Gemini models
export OPENAI_API_KEY="YOUR_API_KEY_HERE" # If you're using OpenAI models
export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
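Optionally, to avoid re-exporting keys in every shell session, you can persist them in your shell profile (a minimal sketch assuming zsh; use the startup file for your own shell):
echo 'export GEMINI_API_KEY="YOUR_API_KEY_HERE"' >> ~/.zshrc # persist the key
source ~/.zshrc # reload the profile in the current session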
- Run an eval:
You can run your first eval using our Angular example with the following command:
web-codegen-scorer eval --env=angular-example
- (Optional) Set up your own eval:
If you want to set up a custom eval instead of using our built-in examples, you can run the following command, which will guide you through the process:
web-codegen-scorer init
- (Optional) Run an evaluated app locally:
Once you've evaluated an app, you can run it locally with the following command:
web-codegen-scorer run --env=angular-example --prompt=<name of the prompt you want to run>
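For example, assuming the angular-example environment contains a prompt named tic-tac-toe (the same prompt name used in the --prompt-filter example below), the command could look like:
web-codegen-scorer run --env=angular-example --prompt=tic-tac-toe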
You can customize the web-codegen-scorer eval script with the following flags:
- --env=<path> (alias: --environment): (Required) Specifies the path from which to load the environment config.
  Example: web-codegen-scorer eval --env=foo/bar/my-env.mjs
- --model=<name>: Specifies the model to use when generating code. Defaults to the value of DEFAULT_MODEL_NAME.
  Example: web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>
- --runner=<name>: Specifies the runner to use to execute the eval. Supported runners are genkit (default) or gemini-cli.
- --local: Runs the script in local mode for the initial code generation request. Instead of calling the LLM, it will attempt to read the initial code from a corresponding file in the .web-codegen-scorer/llm-output directory (e.g., .web-codegen-scorer/llm-output/todo-app.ts). This is useful for re-running assessments or debugging the build/repair process without incurring LLM costs for the initial generation.
  Note: You typically need to run web-codegen-scorer eval once without --local to generate the initial files in .web-codegen-scorer/llm-output (see the workflow sketch after this list). The web-codegen-scorer eval:local script is a shortcut for web-codegen-scorer eval --local.
- --limit=<number>: Specifies the number of application prompts to process. Defaults to 5.
  Example: web-codegen-scorer eval --limit=10 --env=<config path>
- --output-directory=<name> (alias: --output-dir): Specifies the directory to output the generated code under, which is useful for debugging. By default, the code will be generated in a temporary directory.
  Example: web-codegen-scorer eval --output-dir=test-output --env=<config path>
- --concurrency=<number>: Sets the maximum number of concurrent AI API requests. Defaults to 5 (as defined by DEFAULT_CONCURRENCY in src/config.ts).
  Example: web-codegen-scorer eval --concurrency=3 --env=<config path>
- --report-name=<name>: Sets the name for the generated report directory. Defaults to a timestamp (e.g., 2023-10-27T10-30-00-000Z). The name will be sanitized (non-alphanumeric characters replaced with hyphens).
  Example: web-codegen-scorer eval --report-name=my-custom-report --env=<config path>
- --rag-endpoint=<url>: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The URL must contain a PROMPT substring, which will be replaced with the user prompt.
  Example: web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>
- --prompt-filter=<name>: String used to filter which prompts should be run. By default, a random sample (controlled by --limit) will be taken from the prompts in the current environment. Setting this can be useful for debugging a specific prompt.
  Example: web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>
- --skip-screenshots: Whether to skip taking screenshots of the generated app. Defaults to false.
  Example: web-codegen-scorer eval --skip-screenshots --env=<config path>
- --labels=<label1> <label2>: Metadata labels that will be attached to the run.
  Example: web-codegen-scorer eval --labels my-label another-label --env=<config path>
- --mcp: Whether to start an MCP server for the evaluation. Defaults to false.
  Example: web-codegen-scorer eval --mcp --env=<config path>
- --help: Prints out usage information about the script.
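Putting a few of these together, here is a hedged sketch of a fuller invocation (the flag values are illustrative, not defaults), followed by the two-step --local workflow referenced above:
web-codegen-scorer eval \
  --env=foo/bar/my-env.mjs \
  --model=gemini-2.5-flash \
  --limit=10 \
  --concurrency=3 \
  --report-name=nightly-run \
  --labels nightly smoke-test
# First run calls the LLM and populates .web-codegen-scorer/llm-output
web-codegen-scorer eval --env=foo/bar/my-env.mjs
# Later runs re-read that output instead of calling the LLM again
web-codegen-scorer eval --env=foo/bar/my-env.mjs --local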
If you've cloned this repo and want to work on the tool, you have to install its dependencies by running pnpm install.
Once they're installed, you can run the following commands:
- pnpm run release-build - Creates a release build of the package in the dist directory.
- pnpm run npm-publish - Builds the package and publishes it to npm.
- pnpm run eval - Runs an eval from source.
- pnpm run report - Runs the report app from source.
- pnpm run init - Runs the init script from source.
- pnpm run format - Formats the source code using Prettier.
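As a minimal sketch of a typical contributor loop using only the scripts above:
pnpm install    # install dependencies
pnpm run eval   # run an eval from source
pnpm run report # inspect the results in the report viewer UI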
Who builds this tool?
This tool is built by the Angular team at Google.
Does it only work with Angular and Google models?
No! You can use this tool with any web library or framework (or none at all), as well as any model.
Why did you build it?
As more and more developers reach for LLM-based tools to create and modify code, we wanted to be able to empirically measure the effect of different factors on the quality of generated code. While many LLM coding benchmarks exist, we found that these were often too broad and didn't measure the specific quality metrics we cared about.
In the absence of such a tool, we found that many developers based their judgements about codegen with different models, frameworks, and tools on loosely structured trial-and-error. In contrast, Web Codegen Scorer gives us a platform to measure codegen across different configurations with consistency and repeatability.
Are more features planned?
Yes! We plan to expand both the number of built-in checks and the variety of codegen scenarios.
Our roadmap includes:
- Including interaction testing in the rating, to ensure the generated code performs any requested behaviors.
- Measuring Core Web Vitals.
- Measuring the effectiveness of LLM-driven edits on an existing codebase.
Similar Open Source Tools


reader
Reader is a tool that converts any URL to an LLM-friendly input with a simple prefix `https://r.jina.ai/`. It improves the output for your agent and RAG systems at no cost. Reader supports image reading, captioning all images at the specified URL and adding `Image [idx]: [caption]` as an alt tag. This enables downstream LLMs to interact with the images in reasoning, summarizing, etc. Reader offers a streaming mode, useful when the standard mode provides an incomplete result. In streaming mode, Reader waits a bit longer until the page is fully rendered, providing more complete information. Reader also supports a JSON mode, which contains three fields: `url`, `title`, and `content`. Reader is backed by Jina AI and licensed under Apache-2.0.

sage
Sage is a tool that allows users to chat with any codebase, providing a chat interface for code understanding and integration. It simplifies the process of learning how a codebase works by offering heavily documented answers sourced directly from the code. Users can set up Sage locally or on the cloud with minimal effort. The tool is designed to be easily customizable, allowing users to swap components of the pipeline and improve the algorithms powering code understanding and generation.

llm-finetuning
llm-finetuning is a repository that provides a serverless twist to the popular axolotl fine-tuning library using Modal's serverless infrastructure. It allows users to quickly fine-tune any LLM model with state-of-the-art optimizations like Deepspeed ZeRO, LoRA adapters, Flash attention, and Gradient checkpointing. The repository simplifies the fine-tuning process by not exposing all CLI arguments, instead allowing users to specify options in a config file. It supports efficient training and scaling across multiple GPUs, making it suitable for production-ready fine-tuning jobs.

aides-jeunes
The user interface (and main server) of a simulator of aid and social benefits for young people. It is based on Openfisca, the free socio-fiscal simulator.

ai-starter-kit
SambaNova AI Starter Kits is a collection of open-source examples and guides designed to facilitate the deployment of AI-driven use cases for developers and enterprises. The kits cover various categories such as Data Ingestion & Preparation, Model Development & Optimization, Intelligent Information Retrieval, and Advanced AI Capabilities. Users can obtain a free API key using SambaNova Cloud or deploy models using SambaStudio. Most examples are written in Python but can be applied to any programming language. The kits provide resources for tasks like text extraction, fine-tuning embeddings, prompt engineering, question-answering, image search, post-call analysis, and more.

contoso-chat
Contoso Chat is a Python sample demonstrating how to build, evaluate, and deploy a retail copilot application with Azure AI Studio using Promptflow with Prompty assets. The sample implements a Retrieval Augmented Generation approach to answer customer queries based on the company's product catalog and customer purchase history. It utilizes Azure AI Search, Azure Cosmos DB, Azure OpenAI, text-embeddings-ada-002, and GPT models for vectorizing user queries, AI-assisted evaluation, and generating chat responses. By exploring this sample, users can learn to build a retail copilot application, define prompts using Prompty, design, run & evaluate a copilot using Promptflow, provision and deploy the solution to Azure using the Azure Developer CLI, and understand Responsible AI practices for evaluation and content safety.

gpt-cli
gpt-cli is a command-line interface tool for interacting with various chat language models like ChatGPT, Claude, and others. It supports model customization, usage tracking, keyboard shortcuts, multi-line input, markdown support, predefined messages, and multiple assistants. Users can easily switch between different assistants, define custom assistants, and configure model parameters and API keys in a YAML file for easy customization and management.

torchchat
torchchat is a codebase showcasing the ability to run large language models (LLMs) seamlessly. It allows running LLMs using Python in various environments such as desktop, server, iOS, and Android. The tool supports running models via PyTorch, chatting, generating text, running chat in the browser, and running models on desktop/server without Python. It also provides features like AOT Inductor for faster execution, running in C++ using the runner, and deploying and running on iOS and Android. The tool supports popular hardware and OS including Linux, Mac OS, Android, and iOS, with various data types and execution modes available.

repo-to-text
The `repo-to-text` tool converts a directory's structure and contents into a single text file. It generates a formatted text representation that includes the directory tree and file contents, making it easy to share code with LLMs for development and debugging. Users can customize the tool's behavior with various options and settings, including output directory specification, debug logging, and file inclusion/exclusion rules. The tool supports Docker usage for containerized environments and provides detailed instructions for installation, usage, settings configuration, and contribution guidelines. It is a versatile tool for converting repository contents into text format for easy sharing and documentation.

telemetry-airflow
This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow". Some links relevant to users and developers of WTMO: the `dags` directory in this repository contains some custom DAG definitions; many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl; and the Data SRE team maintains a WTMO Developer Guide (behind SSO).

safety-tooling
This repository, safety-tooling, is designed to be shared across various AI Safety projects. It provides an LLM API with a common interface for OpenAI, Anthropic, and Google models. The aim is to facilitate collaboration among AI Safety researchers, especially those with limited software engineering backgrounds, by offering a platform for contributing to a larger codebase. The repo can be used as a git submodule for easy collaboration and updates. It also supports pip installation for convenience. The repository includes features for installation, secrets management, linting, formatting, Redis configuration, testing, dependency management, inference, finetuning, API usage tracking, and various utilities for data processing and experimentation.

mlx-lm
MLX LM is a Python package designed for generating text and fine-tuning large language models on Apple silicon using MLX. It offers integration with the Hugging Face Hub for easy access to thousands of LLMs, support for quantizing and uploading models to the Hub, low-rank and full model fine-tuning capabilities, and distributed inference and fine-tuning with `mx.distributed`. Users can interact with the package through command line options or the Python API, enabling tasks such as text generation, chatting with language models, model conversion, streaming generation, and sampling. MLX LM supports various Hugging Face models and provides tools for efficient scaling to long prompts and generations, including a rotating key-value cache and prompt caching. It requires macOS 15.0 or higher for optimal performance.

TypeGPT
TypeGPT is a Python application that enables users to interact with ChatGPT or Google Gemini from any text field in their operating system using keyboard shortcuts. It provides global accessibility, keyboard shortcuts for communication, and clipboard integration for larger text inputs. Users need to have Python 3.x installed along with specific packages and API keys from OpenAI for ChatGPT access. The tool allows users to run the program normally or in the background, manage processes, and stop the program. Users can use keyboard shortcuts like `/ask`, `/see`, `/stop`, `/chatgpt`, `/gemini`, `/check`, and `Shift + Cmd + Enter` to interact with the application in any text field. Customization options are available by modifying files like `keys.txt` and `system_prompt.txt`. Contributions are welcome, and future plans include adding support for other APIs and a user-friendly GUI.

generative-models
Generative Models by Stability AI is a repository that provides various generative models for research purposes. It includes models like Stable Video 4D (SV4D) for video synthesis, Stable Video 3D (SV3D) for multi-view synthesis, SDXL-Turbo for text-to-image generation, and more. The repository focuses on modularity and implements a config-driven approach for building and combining submodules. It supports training with PyTorch Lightning and offers inference demos for different models. Users can access pre-trained models like SDXL-base-1.0 and SDXL-refiner-1.0 under a CreativeML Open RAIL++-M license. The codebase also includes tools for invisible watermark detection in generated images.

open-deep-research
Open Deep Research is an open-source project that serves as a clone of Open AI's Deep Research experiment. It utilizes Firecrawl's extract and search method along with a reasoning model to conduct in-depth research on the web. The project features Firecrawl Search + Extract, real-time data feeding to AI via search, structured data extraction from multiple websites, Next.js App Router for advanced routing, React Server Components and Server Actions for server-side rendering, AI SDK for generating text and structured objects, support for various model providers, styling with Tailwind CSS, data persistence with Vercel Postgres and Blob, and simple and secure authentication with NextAuth.js.
For similar tasks


evalplus
EvalPlus is a rigorous evaluation framework for LLM4Code, providing HumanEval+ and MBPP+ tests to evaluate large language models on code generation tasks. It offers precise evaluation and ranking, coding rigorousness analysis, and pre-generated code samples. Users can use EvalPlus to generate code solutions, post-process code, and evaluate code quality. The tool includes tools for code generation and test input generation using various backends.
For similar jobs

sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.

teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.

ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.

BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: setting LLM usage limits for users on different pricing tiers; tracking LLM usage on a per user and per organization basis; blocking or redacting requests containing PIIs; improving LLM reliability with failovers, retries and caching; distributing API keys with rate limits and cost limits for internal development/production use cases; and distributing API keys with rate limits and cost limits for students.

uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.

griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.