AI

Version: 25.03

Container Images for Large Language Models

openEuler provides container images to support large language models (LLMs) such as Baichuan, ChatGLM, and iFLYTEK Spark.

The provided container images come with pre-installed dependencies for both CPU and GPU environments, ensuring a seamless out-of-the-box experience.

Pulling the Image (CPU Version)

```bash
docker pull openeuler/llm-server:1.0.0-oe2203sp3
```

Pulling the Image (GPU Version)

```bash
docker pull icewangds/llm-server:1.0.0
```
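
After pulling, confirm that the image is available locally. This is a quick sanity check; the grep pattern matches both the CPU and GPU image names used above.

```bash
# List local images and confirm the pulled image appears.
docker images | grep llm-server
```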

Downloading the Model

Download the model and convert it to GGUF format.

```bash
# Install Hugging Face Hub.
pip install huggingface-hub

# Download the model you want to deploy.
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download baichuan-inc/Baichuan2-13B-Chat --local-dir /root/models/Baichuan2-13B-Chat --local-dir-use-symlinks False

# Convert the model to GGUF format.
cd /root/models/
git clone https://github.com/ggerganov/llama.cpp.git
pip install -r llama.cpp/requirements.txt  # Dependencies of the conversion script (torch, transformers, and so on).
python llama.cpp/convert-hf-to-gguf.py ./Baichuan2-13B-Chat
# Path to the generated GGUF model: /root/models/Baichuan2-13B-Chat/ggml-model-f16.gguf
```
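
Before moving on, it is worth checking that the conversion actually produced the file. A 13B model in f16 precision is roughly 26 GB, so a much smaller file suggests the conversion failed partway.

```bash
# Confirm the GGUF file exists and has a plausible size (about 26 GB for f16).
ls -lh /root/models/Baichuan2-13B-Chat/ggml-model-f16.gguf
```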

Launch

Docker v25.0.0 or above is required.

To use a GPU image, you must install nvidia-container-toolkit. Detailed installation instructions are available in the official NVIDIA documentation: Installing the NVIDIA Container Toolkit.
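
Once the toolkit is installed, a common way to verify that Docker can access the GPU is to run nvidia-smi inside a CUDA base container. This is a sketch; the CUDA image tag below is an example and should match your installed driver version.

```bash
# Sanity check: the GPU list printed here should match the host's nvidia-smi output.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```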

Create a docker-compose.yaml file with the following content:

```yaml
version: '3'
services:
  model:
    image: <image>:<tag>   # Image name and tag (CPU: openeuler/llm-server:1.0.0-oe2203sp3; GPU: icewangds/llm-server:1.0.0)
    restart: on-failure:5
    ports:
      - 8001:8000    # Listening port number. Change "8001" to modify the port.
    volumes:
      - /root/models:/models  # LLM mount directory
    environment:
      - MODEL=/models/Baichuan2-13B-Chat/ggml-model-f16.gguf  # Model file path inside the container
      - MODEL_NAME=baichuan13b  # Custom model name
      - KEY=sk-12345678  # Custom API Key
      - CONTEXT=8192  # Context size
      - THREADS=8    # Number of CPU threads, required only for CPU deployment
    deploy: # GPU resources, required only for GPU deployment
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Start the service with Docker Compose:

```bash
docker-compose -f docker-compose.yaml up
```
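
If the container fails to start or the API does not respond, the service logs are the first place to look. This assumes the compose service name model from the file above.

```bash
# Follow the service logs to confirm the model loaded successfully.
docker-compose -f docker-compose.yaml logs -f model
```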

Alternatively, start the container directly with docker run:

```bash
# For CPU deployment
docker run -d --restart on-failure:5 -p 8001:8000 -v /root/models:/models -e MODEL=/models/Baichuan2-13B-Chat/ggml-model-f16.gguf -e MODEL_NAME=baichuan13b -e KEY=sk-12345678 -e CONTEXT=8192 -e THREADS=8 openeuler/llm-server:1.0.0-oe2203sp3

# For GPU deployment
docker run -d --gpus all --restart on-failure:5 -p 8001:8000 -v /root/models:/models -e MODEL=/models/Baichuan2-13B-Chat/ggml-model-f16.gguf -e MODEL_NAME=baichuan13b -e KEY=sk-12345678 -e CONTEXT=8192 icewangds/llm-server:1.0.0
```
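
Either way, confirm that the container is running and that port 8001 is mapped before testing the API.

```bash
# The container should show a status of "Up" with 0.0.0.0:8001->8000/tcp.
docker ps
# If it is not running, inspect its startup logs (replace <container_id>).
docker logs <container_id>
```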

Testing

Call the LLM API to test the deployment. A successful response indicates that the LLM service has been deployed correctly.

```bash
curl -X POST http://127.0.0.1:8001/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer sk-12345678" \
     -d '{
           "model": "baichuan13b",
           "messages": [
             {"role": "system", "content": "You are an openEuler community assistant, please answer the following question."},
             {"role": "user", "content": "Who are you?"}
           ],
           "stream": false,
           "max_tokens": 1024
         }'
```
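
Since the endpoint follows the OpenAI chat-completions convention, setting "stream" to true should return the answer incrementally as server-sent events. A minimal sketch, reusing the model name and API key from above:

```bash
# Request a streamed response; -N disables curl's buffering so tokens print as they arrive.
curl -N -X POST http://127.0.0.1:8001/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer sk-12345678" \
     -d '{
           "model": "baichuan13b",
           "messages": [{"role": "user", "content": "Introduce openEuler in one sentence."}],
           "stream": true,
           "max_tokens": 256
         }'
```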