# Self-installation of an AI model

This tutorial will guide you through the process of configuring the environment, running a selected Large Language Model (LLM) on an HPC computing cluster, and exposing it via a gateway (LiteLLM proxy) to make it accessible outside the cluster's internal network.

# 1. Prerequisites

Before you start deploying your own model, you must ensure access to the following resources:

  • HPC Cluster Access Service
  • LLM Model Access Service

Both services can be requested via the portal: https://pcss.plcloud.pl (opens new window)

After creating the LLM service, you will receive a unique Service ID (e.g., pl0001-01) and an API Key. The Service ID will serve as your access group, and the API Key will be used for public LiteLLM authorization.

# Singularity & Model Cache Configuration

By default, Singularity saves cache in the home directory, which quickly leads to disk quota exhaustion. To avoid this, run the following commands once to move the cache to the project storage:

# Create directory and link for Singularity
mkdir -p /mnt/storage_3/home/$USER/{grantid}/project_data/.singularity
ln -s /mnt/storage_3/home/$USER/{grantid}/project_data/.singularity ~/.singularity

# Create directory and link for HuggingFace
mkdir -p /mnt/storage_3/home/$USER/{grantid}/project_data/.huggingface
ln -s /mnt/storage_3/home/$USER/{grantid}/project_data/.huggingface ~/.cache/huggingface

# Create directory and link for vLLM
mkdir -p /mnt/storage_3/home/$USER/{grantid}/project_data/.vllm
ln -s /mnt/storage_3/home/$USER/{grantid}/project_data/.vllm ~/.cache/vllm

# Create directory and link for Triton
mkdir -p /mnt/storage_3/home/$USER/{grantid}/project_data/.triton
ln -s /mnt/storage_3/home/$USER/{grantid}/project_data/.triton ~/.triton

# 2. Running the Model on the HPC Cluster

Once you have the resources, you must launch the model on one of the servers.

  1. Submit a job to a computing node with a GPU (e.g., using the SLURM system).
  2. Start the inference engine (backend), such as vLLM, and point it to your chosen model.

While starting the engine, note three key pieces of information:

  • IP Address (<ip>): The internal IP address of the computing node (GPU machine) where the model is running.
  • Port (<port>): The port your inference engine is listening on.
  • Model Name (<model>): The exact path or name of the model accepted by your backend (e.g., openai/gpt-oss-20b).

WARNING

Ensure the inference engine has started correctly and responds to local queries before proceeding to the next step.

# Example SLURM Launch Script

Below is a ready-to-use script that submits a job to a GPU node and starts an OpenAI-compatible server using vLLM and a Singularity container.

#!/bin/bash
#SBATCH --job-name=gpt-oss
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=16
#SBATCH --mem-per-gpu=150G
#SBATCH --partition=proxima
#SBATCH --account=<your_account_id>
#SBATCH --output=logs/vllm_%j_node_%N
#SBATCH --error=logs/vllm_%j_node_%N
#SBATCH --time=168:00:00

singularity run --nv --bind /mnt:/mnt docker://vllm/vllm-openai:latest \
    "openai/gpt-oss-20b" \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --enable-auto-tool-choice \
    --tool-call-parser openai \
    --port 8001

# 3. Exposing the Model Externally (Public Endpoint)

Your model is running, but it is currently restricted to the cluster's internal network. To expose it to the world (connect it to the public LiteLLM gateway), use the registration script. This script performs a Health Check, and if successful, connects your model to your Service ID.

Log in to the cluster's login node and execute the following command, replacing the placeholders with your data:

sudo /opt/exp_soft/scripts/add_model.py \
  --ip <gpu_node_ip> \
  --port <engine_port> \
  --model <full_model_name> \
  --service-id <llm_service_id>

# Script Parameter Description

  • --ip: The internal IP address of the computing node where your model is currently running.
  • --port: The specific port number on which the model engine (e.g., vLLM) is listening.
  • --model: The exact identifier or path of the model, for example: openai/gpt-oss-20b.
  • --service-id: Your unique LLM access service identifier, which was generated during Step 1.
What the script does internally

The registration script automates several validation steps to ensure a stable connection:

  1. Health Check: It sends a diagnostic request to the provided IP and Port to verify that the inference engine is active and responding.
  2. Team Verification: It communicates with the main LiteLLM server to confirm that the provided --service-id is valid and active.
  3. Registration: Once validated, it creates a unique alias in the format <service_id>/<model_name> and registers it on the public gateway at https://llm.hpc.psnc.pl.

# 4. Verifying Model Availability

Now that the model is successfully registered on the public gateway (LiteLLM proxy), you should check if it is visible from your account. Since the gateway follows the OpenAI standard, the simplest test is to query the /v1/models endpoint.

curl -X GET "[https://llm.hpc.psnc.pl/v1/models](https://llm.hpc.psnc.pl/v1/models)" \
     -H "Authorization: Bearer <YOUR_API_KEY>"