Jlama

Jlama is a modern LLM inference engine for Java

Stars: 499

Jlama is a modern Java inference engine for large language models. It supports model families such as Gemma, Llama, Mistral, Mixtral, GPT-2, and BERT. It implements Paged Attention and Mixture of Experts, and supports F32, F16, and BF16 types as well as Q8 and Q4 quantization. Jlama requires Java 20 or later and uses the new Vector API for faster inference, so LLM inference can be added directly to Java applications. The project includes a CLI for running models, a simple UI for chatting with LLMs, and an OpenAI-compatible REST API.

README:

🦙 Jlama: A modern LLM inference engine for Java

🚀 Features

Model Support:

  • Gemma Models
  • Llama & Llama2 & Llama3 Models
  • Mistral & Mixtral Models
  • GPT-2 Models
  • BERT Models
  • BPE Tokenizers
  • WordPiece Tokenizers

Implements:

  • Paged Attention
  • Mixture of Experts
  • Tool Calling
  • Generate Embeddings
  • Classifier Support
  • Huggingface SafeTensors model and tokenizer format
  • Support for F32, F16, BF16 types
  • Support for Q8, Q4 model quantization
  • Fast GEMM operations
  • Distributed Inference!

Jlama requires Java 20 or later and utilizes the new Vector API for faster inference.

🤔 What is it used for?

Add LLM Inference directly to your Java application.

🔬 Quick Start

🕵️‍♀️ How to use as a local client (with jbang!)

Jlama includes a command line tool that makes it easy to use.

The CLI can be run with jbang.

#Install jbang (or https://www.jbang.dev/download/)
curl -Ls https://sh.jbang.dev | bash -s - app setup

#Install Jlama CLI (will ask if you trust the source)
jbang app install --force jlama@tjake

Now that you have Jlama installed, you can download a model from Hugging Face and chat with it. Note: pre-quantized models are available at https://hf.co/tjake

# Run the openai chat api and UI on a model
jlama restapi tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4 --auto-download

Then open a browser to http://localhost:8080/ to use the chat UI.
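
Because the `restapi` command exposes an OpenAI-compatible API, the server can also be called from Java code rather than the browser UI. The following is a minimal sketch using the JDK's built-in `java.net.http` classes. The endpoint path `/chat/completions`, the JSON fields, and the class and method names are assumptions based on the usual OpenAI conventions, not taken from the Jlama documentation, so check them against the running server before relying on them.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ChatRequestSketch {
    // Host and port come from the quick start above.
    static final String BASE_URL = "http://localhost:8080";
    // ASSUMPTION: OpenAI-style path; the real server may use a different prefix.
    static final String CHAT_PATH = "/chat/completions";

    // Minimal JSON string escaping: handles backslash, double quote and newline only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // Builds an OpenAI-style chat request body with a single user message.
    static String chatBody(String model, String userMessage) {
        return "{\"model\":\"" + escape(model) + "\","
             + "\"messages\":[{\"role\":\"user\",\"content\":\"" + escape(userMessage) + "\"}]}";
    }

    // Builds (but does not send) the POST request.
    static HttpRequest chatRequest(String model, String userMessage) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + CHAT_PATH))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        chatBody(model, userMessage), StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = chatRequest(
                "tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4",
                "What is the best season to plant avocados?");
        System.out.println(request.method() + " " + request.uri());

        // Only contact the server when explicitly asked, so the sketch runs offline.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }
}
```

Run it with `--send` while `jlama restapi` is running to see the raw JSON response. A real application would use a JSON library instead of the hand-written escaping shown here.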

Usage:

jlama [COMMAND]

Description:

Jlama is a modern LLM inference engine for Java!
Quantized models are maintained at https://hf.co/tjake

Choose from the available commands:

Inference:
  chat                 Interact with the specified model
  restapi              Starts a openai compatible rest api for interacting with this model
  complete             Completes a prompt using the specified model

Distributed Inference:
  cluster-coordinator  Starts a distributed rest api for a model using cluster workers
  cluster-worker       Connects to a cluster coordinator to perform distributed inference

Other:
  download             Downloads a HuggingFace model - use owner/name format
  quantize             Quantize the specified model

👨‍💻 How to use in your Java project

The main purpose of Jlama is to provide a simple way to use large language models in Java.

The simplest way to embed Jlama in your app is with the Langchain4j Integration.

If you would like to embed Jlama without Langchain4j, add the following Maven dependencies to your project:

<dependency>
  <groupId>com.github.tjake</groupId>
  <artifactId>jlama-core</artifactId>
  <version>${jlama.version}</version>
</dependency>

<dependency>
  <groupId>com.github.tjake</groupId>
  <artifactId>jlama-native</artifactId>
  <!-- supports linux-x86_64, macos-x86_64/aarch_64, windows-x86_64 
       Use https://github.com/trustin/os-maven-plugin to detect os and arch -->
  <classifier>${os.detected.name}-${os.detected.arch}</classifier>
  <version>${jlama.version}</version>
</dependency>

Then you can use the Model classes to run models:

 public void sample() throws IOException {
    String model = "tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4";
    String workingDirectory = "./models";

    String prompt = "What is the best season to plant avocados?";

    // Downloads the model or just returns the local path if it's already downloaded
    File localModelPath = SafeTensorSupport.maybeDownloadModel(workingDirectory, model);
    
    // Loads the quantized model, using F32 for working memory and I8 for quantized working memory
    AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

    PromptContext ctx;
    // Checks if the model supports chat prompting and adds prompt in the expected format for this model
    if (m.promptSupport().isPresent()) {
        ctx = m.promptSupport()
                .get()
                .builder()
                .addSystemMessage("You are a helpful chatbot who writes short responses.")
                .addUserMessage(prompt)
                .build();
    } else {
        ctx = PromptContext.of(prompt);
    }

    System.out.println("Prompt: " + ctx.getPrompt() + "\n");
    // Generates a response to the prompt and prints it
    // The api allows for streaming or non-streaming responses
    // The response is generated with a temperature of 0.0 and a max token length of 256
    Generator.Response r = m.generate(UUID.randomUUID(), ctx, 0.0f, 256, (s, f) -> {});
    System.out.println(r.responseText);
 }
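
The sample above passes an empty callback, `(s, f) -> {}`, so it only uses the final response text. To stream tokens as they are produced, a callback that does some work can be passed instead. The sketch below assumes the callback is a `BiConsumer<String, Float>` that receives each token's text plus a timing value; that shape is inferred from the lambda above rather than confirmed from the API. `TokenCollector` and `fakeGenerate` are hypothetical names, and `fakeGenerate` is only a stand-in for the model so that the example runs without Jlama on the classpath.

```java
import java.util.List;
import java.util.function.BiConsumer;

public class StreamingCallbackSketch {

    // Echoes each token as it arrives and keeps the full text for later use.
    static final class TokenCollector implements BiConsumer<String, Float> {
        private final StringBuilder text = new StringBuilder();
        private int tokens = 0;

        @Override
        public void accept(String token, Float timing) {
            text.append(token);
            tokens++;
            System.out.print(token); // stream to the console immediately
        }

        String text() {
            return text.toString();
        }

        int tokenCount() {
            return tokens;
        }
    }

    // HYPOTHETICAL stand-in for the model: feeds fixed tokens to the callback.
    static void fakeGenerate(List<String> tokens, BiConsumer<String, Float> onToken) {
        for (String token : tokens) {
            onToken.accept(token, 0.0f); // timing value is a placeholder
        }
    }

    public static void main(String[] args) {
        TokenCollector collector = new TokenCollector();
        fakeGenerate(List.of("Spring", " is", " a", " good", " time", "."), collector);
        System.out.println();
        System.out.println("tokens received: " + collector.tokenCount());
    }
}
```

With the real model, the collector would be passed as the last argument of `m.generate(...)` in place of the empty lambda, assuming the callback type matches.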

⭐ Give us a Star!

If you like or are using this project to build your own, please give us a star. It's a free way to show your support.

🗺️ Roadmap

  • Support more and more models
  • Add pure java tokenizers
  • Support Quantization (e.g. k-quantization)
  • Add LoRA support
  • GraalVM support
  • Add distributed inference

🏷️ License and Citation

The code is available under the Apache License.

If you find this project helpful in your research, please cite this work as:

@misc{jlama2024,
    title = {Jlama: A modern Java inference engine for large language models},
    url = {https://github.com/tjake/jlama},
    author = {T Jake Luciani},
    month = {January},
    year = {2024}
}
