llama3.java

Practical Llama 3 inference in Java

Stars: 471

Llama3.java is a practical Llama 3 inference tool implemented in a single Java file. It serves as the successor of llama2.java and is designed for testing and tuning compiler optimizations and features on the JVM, especially for the Graal compiler. The tool features a GGUF format parser, Llama 3 tokenizer, Grouped-Query Attention inference, support for Q8_0 and Q4_0 quantizations, fast matrix-vector multiplication routines using Java's Vector API, and a simple CLI with 'chat' and 'instruct' modes. Users can download quantized .gguf files from huggingface.co for model usage and can also manually quantize to pure 'Q4_0'. The tool requires Java 21+ and supports running from source or building a JAR file for execution. Performance benchmarks show varying tokens/s rates for different models and implementations on different hardware setups.

README:

Llama3.java

Practical Llama 3, 3.1 and 3.2 inference implemented in a single Java file.

This project is the successor of llama2.java based on llama2.c by Andrej Karpathy and his excellent educational videos.

Besides the educational value, this project will be used to test and tune compiler optimizations and features on the JVM, particularly for the Graal compiler.

Features

  • Single file, no dependencies
  • GGUF format parser
  • Llama 3 tokenizer based on minbpe
  • Llama 3 inference with Grouped-Query Attention
  • Support Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tie word embeddings)
  • Support for Q8_0 and Q4_0 quantizations
  • Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API
  • Simple CLI with --chat and --instruct modes.
  • GraalVM's Native Image support (EA builds here)
  • AOT model pre-loading for instant time-to-first-token
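
The Grouped-Query Attention item in the list above can be illustrated with a short sketch. This is not the code from Llama3.java: the class and method names are hypothetical, and the head counts in `main` (32 query heads sharing 8 key/value heads) are those of Llama 3 8B. The point is only that several query heads read the same cached key/value head, which shrinks the KV cache.

```java
/** Hypothetical sketch of Grouped-Query Attention head sharing; not taken from Llama3.java. */
final class GqaSketch {

    /** Index of the key/value head that a given query head reads from. */
    static int kvHeadFor(int queryHead, int nHeads, int nKvHeads) {
        if (nHeads % nKvHeads != 0) {
            throw new IllegalArgumentException("nHeads must be a multiple of nKvHeads");
        }
        int groupSize = nHeads / nKvHeads; // number of query heads per shared KV head
        return queryHead / groupSize;
    }

    /**
     * Scaled dot-product scores of one query head against all cached keys.
     * q has nHeads * headSize entries; each keyCache row has nKvHeads * headSize entries.
     */
    static float[] scores(float[] q, float[][] keyCache, int head,
                          int nHeads, int nKvHeads, int headSize) {
        int kvHead = kvHeadFor(head, nHeads, nKvHeads);
        float[] out = new float[keyCache.length];
        for (int t = 0; t < keyCache.length; t++) {
            float dot = 0f;
            for (int i = 0; i < headSize; i++) {
                dot += q[head * headSize + i] * keyCache[t][kvHead * headSize + i];
            }
            out[t] = dot / (float) Math.sqrt(headSize);
        }
        return out;
    }

    public static void main(String[] args) {
        // Print which KV head each group of four query heads maps to.
        for (int h = 0; h < 32; h += 4) {
            System.out.println("query heads " + h + ".." + (h + 3)
                    + " -> kv head " + kvHeadFor(h, 32, 8));
        }
    }
}
```

A real implementation would follow the scores with a softmax and a weighted sum over the cached values, indexed with the same shared head.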

Interactive --chat mode in action:

Presented at Devoxx Belgium, 2024

Setup

Download pure Q4_0 and (optionally) Q8_0 quantized .gguf files from huggingface.co.

The pure Q4_0 quantized models are recommended, except for the very small models (1B). Please be gentle with the huggingface.co servers:

# Llama 3.2 (3B)
curl -L -O https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf

# Llama 3.2 (1B)
curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf

# Llama 3.1 (8B)
curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf

# Llama 3 (8B)
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

# Optionally download the Q8_0 quantized models
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf

Optional: quantize to pure Q4_0 manually

In the wild, Q8_0 quantizations are fine, but Q4_0 quantizations are rarely pure, e.g. the token_embd.weight/output.weight tensors are quantized with Q6_K instead of Q4_0.
A pure Q4_0 quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source with the llama-quantize utility from llama.cpp as follows:

./llama-quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
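
To make "pure Q4_0" more concrete, here is a minimal sketch of decoding a single Q4_0 block. It assumes the ggml block layout: 32 weights per 18-byte block, stored as a little-endian float16 scale followed by 16 bytes of packed 4-bit values, each decoded as (nibble - 8) * scale. The class and method names are hypothetical, and this scalar loop is not the vectorized routine used by Llama3.java.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical scalar decoder for one Q4_0 block (assumed ggml layout). */
final class Q4_0Sketch {
    static final int BLOCK_SIZE = 32;      // weights per block
    static final int BYTES_PER_BLOCK = 18; // 2-byte float16 scale + 16 bytes of nibbles

    /** Decodes block number blockIndex from a buffer of tightly packed Q4_0 blocks. */
    static float[] dequantizeBlock(ByteBuffer data, int blockIndex) {
        ByteBuffer le = data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int base = blockIndex * BYTES_PER_BLOCK;
        float scale = Float.float16ToFloat(le.getShort(base)); // available since Java 20
        float[] out = new float[BLOCK_SIZE];
        for (int j = 0; j < BLOCK_SIZE / 2; j++) {
            int packed = le.get(base + 2 + j) & 0xFF;
            out[j] = ((packed & 0x0F) - 8) * scale;                 // low nibble: weights 0..15
            out[j + BLOCK_SIZE / 2] = ((packed >>> 4) - 8) * scale; // high nibble: weights 16..31
        }
        return out;
    }

    public static void main(String[] args) {
        // Build one block by hand: scale 0.5, first byte has low nibble 0 and high nibble 15.
        ByteBuffer block = ByteBuffer.allocate(BYTES_PER_BLOCK).order(ByteOrder.LITTLE_ENDIAN);
        block.putShort(0, Float.floatToFloat16(0.5f));
        block.put(2, (byte) 0xF0);
        float[] w = dequantizeBlock(block, 0);
        System.out.println("w[0] = " + w[0] + ", w[16] = " + w[16]);
    }
}
```

Under this layout a Q4_0 tensor costs 18 bytes per 32 weights, i.e. 4.5 bits per weight. A tensor stored as Q6_K uses a different block format, so a decoder like this one cannot read it.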

Build and run

Java 21+ is required, in particular for the MemorySegment mmap-ing feature.

jbang is a perfect fit for this use case, just:

jbang Llama3.java --help

Or execute directly, also via jbang:

chmod +x Llama3.java
./Llama3.java --help

Run from source

java --enable-preview --source 21 --add-modules jdk.incubator.vector Llama3.java -i --model Meta-Llama-3-8B-Instruct-Q4_0.gguf

Optional: Makefile + manually build and run

A simple Makefile is provided; run make to produce llama3.jar, or build it manually:

javac -g --enable-preview -source 21 --add-modules jdk.incubator.vector -d target/classes Llama3.java
jar -cvfe llama3.jar com.llama4j.Llama3 LICENSE -C target/classes .

Run the resulting llama3.jar as follows:

java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar --help

GraalVM Native Image

Compile to native via make (recommended):

make native

Or directly:

native-image \
  -H:+UnlockExperimentalVMOptions \
  -H:+VectorAPISupport \
  -H:+ForeignAPISupport \
  -O3 \
  -march=native \
  --enable-preview \
  --add-modules jdk.incubator.vector \
  --initialize-at-build-time=com.llama4j.FloatTensor \
  -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 \
  -jar llama3.jar \
  -o llama3

Run as Native Image:

./llama3 --model Llama-3.2-1B-Instruct-Q8_0.gguf --chat

AOT model preloading

Llama3.java supports AOT model preloading, enabling 0-overhead, instant inference, with minimal TTFT (time-to-first-token).

To AOT pre-load a GGUF model:

PRELOAD_GGUF=/path/to/model.gguf make native

A specialized, larger binary will be generated, with no parsing overhead for that particular model. It can still run other models, although incurring the usual parsing overhead.
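
For a sense of what the parsing that preloading skips looks like, here is a minimal sketch that reads only the fixed-size GGUF header. It assumes the GGUF v2/v3 layout: a 4-byte magic "GGUF", a uint32 version, a uint64 tensor count and a uint64 metadata key/value count, all little-endian. The class, record and method names are hypothetical. A full parser such as the one in Llama3.java must also read the metadata entries and tensor descriptors that follow the header.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical reader for the fixed 24-byte GGUF header (assumed v2/v3 layout). */
final class GgufHeaderSketch {
    static final int GGUF_MAGIC = 0x46554747; // bytes 'G','G','U','F' read as a little-endian int
    static final int HEADER_BYTES = 24;

    record Header(int version, long tensorCount, long metadataKvCount) {}

    /** Parses the header from the buffer's current position without moving it. */
    static Header read(ByteBuffer data) {
        ByteBuffer le = data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        if (le.remaining() < HEADER_BYTES) {
            throw new IllegalArgumentException("truncated GGUF header");
        }
        if (le.getInt() != GGUF_MAGIC) {
            throw new IllegalArgumentException("not a GGUF file (bad magic)");
        }
        int version = le.getInt();
        long tensorCount = le.getLong();
        long metadataKvCount = le.getLong();
        return new Header(version, tensorCount, metadataKvCount);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: GgufHeaderSketch <model.gguf>");
            return;
        }
        try (FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES);
            while (buf.hasRemaining() && ch.read(buf) >= 0) {
                // keep reading until the header is complete or the file ends
            }
            buf.flip();
            System.out.println(read(buf));
        }
    }
}
```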

Performance

GraalVM now supports more Vector API operations. To give it a try, you need GraalVM for JDK 24 – get the EA builds from oracle-graalvm-ea-builds or sdkman: sdk install java 24.ea.15-graal.

llama.cpp

Vanilla llama.cpp built with make.

./llama-cli --version
version: 3862 (3f1ae2e3)
built with cc (GCC) 14.2.1 20240805 for x86_64-pc-linux-gnu

Executed as follows:

./llama-bench -m Llama-3.2-1B-Instruct-Q4_0.gguf -p 0 -n 128

Llama3.java

taskset -c 0-15 ./llama3 \
  --model ./Llama-3.2-1B-Instruct-Q4_0.gguf \
  --max-tokens 128 \
  --seed 42 \
  --stream false \
  --prompt "Why is the sky blue?"

Hardware specs: 2019 AMD Ryzen 3950X 16C/32T 64GB (3800) Linux 6.6.47.

Notes: running on a single CCD, e.g. taskset -c 0-15 ./llama3 ..., since inference is constrained by memory bandwidth.
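
A rough way to see why memory bandwidth is the constraint: generating one token reads roughly every weight once, so bandwidth divided by model size gives a ceiling on tokens/s. The sketch below works through that arithmetic. The parameter count and bandwidth in `main` are illustrative assumptions, not measurements from this project. The bits-per-weight figures assume the ggml block layouts (Q4_0: 18 bytes per 32 weights, Q8_0: 34 bytes per 32 weights). All names are hypothetical.

```java
/** Hypothetical back-of-the-envelope estimate; the inputs in main are assumptions, not benchmarks. */
final class BandwidthBoundSketch {

    /** Approximate size in bytes of the weights at a given quantization. */
    static double modelBytes(double parameters, double bitsPerWeight) {
        return parameters * bitsPerWeight / 8.0;
    }

    /** Upper bound on tokens/s if every weight is read once per generated token. */
    static double maxTokensPerSecond(double bandwidthBytesPerSecond, double modelBytes) {
        return bandwidthBytesPerSecond / modelBytes;
    }

    public static void main(String[] args) {
        double parameters = 8e9; // assumed: roughly an 8B-parameter model
        double bandwidth = 50e9; // assumed: 50 GB/s, a hypothetical memory bandwidth

        double q4Bytes = modelBytes(parameters, 4.5); // Q4_0: 18 bytes * 8 / 32 weights
        double q8Bytes = modelBytes(parameters, 8.5); // Q8_0: 34 bytes * 8 / 32 weights

        System.out.printf("Q4_0: %.1f GB, at most %.1f tokens/s%n",
                q4Bytes / 1e9, maxTokensPerSecond(bandwidth, q4Bytes));
        System.out.printf("Q8_0: %.1f GB, at most %.1f tokens/s%n",
                q8Bytes / 1e9, maxTokensPerSecond(bandwidth, q8Bytes));
    }
}
```

The bound ignores KV-cache traffic and compute, so measured throughput falls below it. It does show why halving the bits per weight roughly doubles the ceiling.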

Results

License

MIT
