# Self-installation of an AI model
This tutorial will guide you through the process of configuring the environment, running a selected Large Language Model (LLM) on an HPC computing cluster, and exposing it via a gateway (LiteLLM proxy) to make it accessible outside the cluster's internal network.
# 1. Prerequisites
Before you start deploying your own model, you must ensure access to the following resources:
- HPC Cluster Access Service
- LLM Model Access Service
Both services can be requested via the portal: https://pcss.plcloud.pl (opens new window)
Relevant Documentation
After creating the LLM service, you will receive a unique Service ID (e.g., pl0001-01) and an API Key. The Service ID will serve as your access group, and the API Key will be used for public LiteLLM authorization.
# Singularity & Model Cache Configuration
By default, Singularity saves cache in the home directory, which quickly leads to disk quota exhaustion. To avoid this, run the following commands once to move the cache to the project storage:
# Create directory and link for Singularity
mkdir -p /mnt/storage_3/home/$USER/{grantid}/project_data/.singularity
ln -s /mnt/storage_3/home/$USER/{grantid}/project_data/.singularity ~/.singularity
# Create directory and link for HuggingFace
mkdir -p /mnt/storage_3/home/$USER/{grantid}/project_data/.huggingface
ln -s /mnt/storage_3/home/$USER/{grantid}/project_data/.huggingface ~/.cache/huggingface
# Create directory and link for vLLM
mkdir -p /mnt/storage_3/home/$USER/{grantid}/project_data/.vllm
ln -s /mnt/storage_3/home/$USER/{grantid}/project_data/.vllm ~/.cache/vllm
# Create directory and link for Triton
mkdir -p /mnt/storage_3/home/$USER/{grantid}/project_data/.triton
ln -s /mnt/storage_3/home/$USER/{grantid}/project_data/.triton ~/.triton
# 2. Running the Model on the HPC Cluster
Once you have the resources, you must launch the model on one of the servers.
- Submit a job to a computing node with a GPU (e.g., using the SLURM system).
- Start the inference engine (backend), such as vLLM, and point it to your chosen model.
While starting the engine, note three key pieces of information:
- IP Address (
<ip>): The internal IP address of the computing node (GPU machine) where the model is running. - Port (
<port>): The port your inference engine is listening on. - Model Name (
<model>): The exact path or name of the model accepted by your backend (e.g.,openai/gpt-oss-20b).
WARNING
Ensure the inference engine has started correctly and responds to local queries before proceeding to the next step.
# Example SLURM Launch Script
Below is a ready-to-use script that submits a job to a GPU node and starts an OpenAI-compatible server using vLLM and a Singularity container.
#!/bin/bash
#SBATCH --job-name=gpt-oss
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=16
#SBATCH --mem-per-gpu=150G
#SBATCH --partition=proxima
#SBATCH --account=<your_account_id>
#SBATCH --output=logs/vllm_%j_node_%N
#SBATCH --error=logs/vllm_%j_node_%N
#SBATCH --time=168:00:00
singularity run --nv --bind /mnt:/mnt docker://vllm/vllm-openai:latest \
"openai/gpt-oss-20b" \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser openai \
--port 8001
# 3. Exposing the Model Externally (Public Endpoint)
Your model is running, but it is currently restricted to the cluster's internal network. To expose it to the world (connect it to the public LiteLLM gateway), use the registration script. This script performs a Health Check, and if successful, connects your model to your Service ID.
Log in to the cluster's login node and execute the following command, replacing the placeholders with your data:
sudo /opt/exp_soft/scripts/add_model.py \
--ip <gpu_node_ip> \
--port <engine_port> \
--model <full_model_name> \
--service-id <llm_service_id>
# Script Parameter Description
--ip: The internal IP address of the computing node where your model is currently running.--port: The specific port number on which the model engine (e.g., vLLM) is listening.--model: The exact identifier or path of the model, for example:openai/gpt-oss-20b.--service-id: Your unique LLM access service identifier, which was generated during Step 1.
What the script does internally
The registration script automates several validation steps to ensure a stable connection:
- Health Check: It sends a diagnostic request to the provided IP and Port to verify that the inference engine is active and responding.
- Team Verification: It communicates with the main LiteLLM server to confirm that the provided
--service-idis valid and active. - Registration: Once validated, it creates a unique alias in the format
<service_id>/<model_name>and registers it on the public gateway athttps://llm.hpc.psnc.pl.
# 4. Verifying Model Availability
Now that the model is successfully registered on the public gateway (LiteLLM proxy), you should check if it is visible from your account. Since the gateway follows the OpenAI standard, the simplest test is to query the /v1/models endpoint.
curl -X GET "[https://llm.hpc.psnc.pl/v1/models](https://llm.hpc.psnc.pl/v1/models)" \
-H "Authorization: Bearer <YOUR_API_KEY>"