ppl.llm.serving

ppl.llm.serving is a serving component for Large Language Models (LLMs) within the PPL.LLM system. It provides a gRPC-based server and supports inference for LLaMA. The repository includes instructions covering prerequisites, a quick start, model exporting, server setup, client usage, benchmarking, and offline inference. Refer to the LLaMA Guide for more details on using this serving component.

README:

PPL LLM Serving

Overview

ppl.llm.serving is part of the PPL.LLM system.

[SYSTEM_OVERVIEW diagram]

We recommend that users who are new to this project read the Overview of the system first.

ppl.llm.serving is a serving system based on ppl.nn for various Large Language Models (LLMs). This repository contains a gRPC-based server and inference support for LLaMA.

Prerequisites

  • Linux running on x86_64 or arm64 CPUs
  • GCC >= 9.4.0
  • CMake >= 3.18
  • Git >= 2.7.0
  • CUDA Toolkit >= 11.4, 11.6 recommended (for CUDA builds)
  • Rust & cargo >= 1.8.0 (for the Hugging Face tokenizer)
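
A quick way to check an existing machine against these requirements (standard version commands, not specific to this project):

    gcc --version      # expect >= 9.4.0
    cmake --version    # expect >= 3.18
    git --version      # expect >= 2.7.0
    nvcc --version     # CUDA Toolkit, expect >= 11.4 (11.6 recommended)
    cargo --version    # Rust toolchain, used by the Hugging Face tokenizer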

PPL Server Quick Start

Here is a brief tutorial; refer to the LLaMA Guide for more details.

  • Installing Prerequisites (on Debian or Ubuntu, for example)

    apt-get install build-essential cmake git
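
    The command above does not install the CUDA Toolkit or Rust. Rust and cargo (needed for the Hugging Face tokenizer) are commonly installed through rustup, for example:

    # install Rust and cargo via rustup (one common approach, not specific to this repo)
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    cargo --version    # confirm the version requirement is met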
  • Cloning Source Code

    git clone https://github.com/openppl-public/ppl.llm.serving.git
  • Building from Source

    ./build.sh  -DPPLNN_USE_LLM_CUDA=ON  -DPPLNN_CUDA_ENABLE_NCCL=ON -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'" -DPPL_LLM_ENABLE_GRPC_SERVING=ON

    NCCL is required if multiple GPU devices are used.

    We support a Sync Decode feature (mainly for offline_inference), which runs model forward and decoding in the same thread. To enable it, compile with the macro -DPPL_LLM_SERVING_SYNC_DECODE=ON.
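
    For example, the build command from above with this macro appended (same flags; only the last one is new):

    ./build.sh -DPPLNN_USE_LLM_CUDA=ON \
        -DPPLNN_CUDA_ENABLE_NCCL=ON \
        -DPPLNN_ENABLE_CUDA_JIT=OFF \
        -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" \
        -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'" \
        -DPPL_LLM_ENABLE_GRPC_SERVING=ON \
        -DPPL_LLM_SERVING_SYNC_DECODE=ON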

  • Exporting Models

    Refer to ppl.pmx for details.

  • Running Server

    ./ppl_llm_server \
        --model-dir /data/model \
        --model-param-path /data/model/params.json \
        --tokenizer-path /data/tokenizer.model \
        --tensor-parallel-size 1 \
        --top-p 0.0 \
        --top-k 1 \
        --max-tokens-scale 0.94 \
        --max-input-tokens-per-request 4096 \
        --max-output-tokens-per-request 4096 \
        --max-total-tokens-per-request 8192 \
        --max-running-batch 1024 \
        --max-tokens-per-step 8192 \
        --host 127.0.0.1 \
        --port 23333 

    Set these options to the correct values for your model and environment before running the server.

    • model-dir: path to the model exported by ppl.pmx.
    • model-param-path: the model's parameter file, i.e. $model_dir/params.json.
    • tokenizer-path: path to the SentencePiece tokenizer file.
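
    Once the server starts, a quick way to confirm it is listening on the configured port (assuming the iproute2 ss utility is available):

    # expect a LISTEN entry for the --port given above
    ss -lnt | grep 23333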
  • Running Client: send requests through gRPC to query the model

    ./ppl-build/client_sample 127.0.0.1:23333

    See tools/client_sample.cc for more details.
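
    To exercise the server with several concurrent clients, the sample can simply be launched multiple times in parallel (a sketch; client_sample takes only the server address, as above):

    # start four sample clients against the local server and wait for them to finish
    for i in 1 2 3 4; do
        ./ppl-build/client_sample 127.0.0.1:23333 &
    done
    wait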

  • Benchmarking

    ./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=inf

    See tools/client_qps_measure.cc for more details. --request_rate is the number of requests sent per second; the value inf means all client requests are sent with no interval between them.
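
    For example, to pace the benchmark at a fixed rate instead of sending everything at once (the rate of 10 requests per second here is only illustrative):

    ./ppl-build/client_qps_measure --target=127.0.0.1:23333 \
        --tokenizer=/path/to/tokenizer/path \
        --dataset=tools/samples_1024.json \
        --request_rate=10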

  • Running Offline Inference

    ./offline_inference \
        --model-dir /data/model \
        --model-param-path /data/model/params.json \
        --tokenizer-path /data/tokenizer.model \
        --tensor-parallel-size 1 \
        --top-p 0.0 \
        --top-k 1 \
        --max-tokens-scale 0.94 \
        --max-input-tokens-per-request 4096 \
        --max-output-tokens-per-request 4096 \
        --max-total-tokens-per-request 8192 \
        --max-running-batch 1024 \
        --max-tokens-per-step 8192 \
        --host 127.0.0.1 \
        --port 23333 

    See tools/offline_inference.cc for more details.

License

This project is distributed under the Apache License, Version 2.0.
