# Data Management and Transfer
# Data Storage Policies
Home Directory: This space is limited to 1 GB. In practice, it serves as a container (a "bag") for project directories.
Project Directory: Every user with an active computational grant (scientific or commercial) on the cluster automatically gets a directory in their Home Directory for each grant in which they participate. The directory is named after the service in the portal, e.g., "pl0001-01". Inside this directory, there are three important subdirectories:
archive: This space is intended for storing data that is not actively used, such as processed calculation results. It is slower than "project_data" and "scratch" but has a significantly larger capacity—it can store tens or, if necessary, hundreds of TBs of data.
project_data: This directory is shared among all users of a given grant. Data in all subdirectories of this directory are protected from accidental deletion by automatic backup mechanisms. PCSS guarantees the amount of data space allocated according to the grant application or commercial contract. This allocation is reserved exclusively for the users of the grant. Users can freely create and delete files and directories and manage access rights within this directory.
scratch: This space is dedicated to the grant and is used for performing computations/storing input data. The usage rules are the same as for the current scratch space.
Data in the "project_data" directory will be available for the entire duration of the grant, extended by some additional time, allowing for data archiving or transfer to a new grant space. Currently, data is available for 6 months after the grant ends, with an additional 6 months during which data can be retrieved on request. The new data structure is as follows (example for a service/grant with identifier pl0001-01):
~<username> : home directory
--> pl0001-01
---> project_data : shared data space for grant pl0001-01
---> scratch : data space for computations for pl0001-01
---> archive : shared archival data space for grant pl0001-01
In the future, additional directories will be introduced, providing easier access to the archiving system or the ability to exchange data with the service box.pionier.net.pl.
Note: The project_data and scratch directories are symbolic links to the actual mounted storage system location. Users are asked to use only paths relative to the home directory (e.g., ~username/grant_id/project_data) instead of physical mount paths, as the latter may change. The new data structure allows dynamic transfer of grants between different storage systems "on the fly," so there is no guarantee that data will be stored on the same physical system for the entire duration of the grant.
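As a minimal sketch of the note above, a script can build project paths from the home directory and the grant identifier rather than hard-coding a mount point. The function name `grant_path` is hypothetical, and `pl0001-01` is the example grant used on this page.

```shell
#!/bin/bash
# Hypothetical helper: build a home-relative path to a grant subdirectory.
# Usage: grant_path <grant_id> <project_data|scratch|archive>
grant_path() {
    local grant_id="$1" subdir="$2"
    case "$subdir" in
        project_data|scratch|archive)
            printf '%s/%s/%s\n' "$HOME" "$grant_id" "$subdir"
            ;;
        *)
            echo "unknown subdirectory: $subdir" >&2
            return 1
            ;;
    esac
}

# Refer to data through the home directory, not through the mount point.
DATA_DIR="$(grant_path pl0001-01 project_data)"
echo "Project data path: $DATA_DIR"

# For diagnostics only: show where the symbolic link currently resolves.
# Do not store this value in scripts, since the physical location may change.
if [ -e "$DATA_DIR" ]; then
    readlink -f "$DATA_DIR"
fi
```

Keeping the path construction in one place means a job script keeps working if the grant is moved to a different storage system.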
# Local Storage
Most of the servers that make up the Eagle supercomputer do not have local disks installed, so data should be stored on shared file systems according to the storage policies described above.
Since shared disk resources are not the best solution for certain types of tasks (e.g., AI training), some servers have been equipped with fast local NVMe disks. Currently, these are the servers in the proxima partition (equipped with H100 GPUs) and in the proxima-cpu partition. These disks are available under the path /mnt/local and are writable by all users.
Note 1: Local disks have a limited capacity of either 8 TB or 15 TB.
Note 2: The contents of local disks are only visible on the specific servers they are installed on. Users must remember to copy important data to one of the shared directories (archive, project_data, scratch) after completing their computations. Local disk contents are automatically deleted when the server restarts.
Note 3: Data stored on local disks may be deleted after the job completes, regardless of whether the server is restarted or not.
How to submit a job on a server with local disks:
- Submit the job as usual (e.g., to the standard or tesla partition), adding the --constraint=local_ssd option.
- Submit the job directly to the proxima (if a GPU is required) or proxima-cpu partition by adding the -p <partition name> option.
# Local NVMe SSD Storage on Compute Nodes
We would like to inform you about the deployment of a new automated local NVMe SSD space management mechanism on compute nodes. These changes are intended to improve node stability and ensure fair access to resources for all users.
# What Has Changed?
From now on, in order to use the local SSD storage available on compute nodes, users must explicitly request this resource in their SLURM job script and declare the required temporary storage size.
The system will automatically:
- allocate a node equipped with a local SSD,
- create a dedicated isolated working directory for the job,
- apply the appropriate storage quota,
- clean up the directory after the job finishes.
For each job, a dedicated directory is created:
/mnt/local/job_${SLURM_JOB_ID}_scratch/
This directory is automatically protected by the quota mechanism and should be used for temporary data generated during computations.
# Requesting Local SSD Storage
To use local SSD storage, add the following directives to your sbatch script:
#SBATCH --constraint=local_ssd
#SBATCH --tmp=500G
Where:
- --constraint=local_ssd requests a node equipped with local SSD storage,
- --tmp=500G specifies the required temporary storage size on the local SSD.
# Limits and Rules
# Shared-Node Jobs (Default Mode)
If the job does not reserve the entire node:
- the maximum available temporary space is 1 TB,
- valid examples include: --tmp=500G, --tmp=1T, --tmp=1000G.
If the --tmp parameter is omitted, the system automatically assigns:
--tmp=100G
# Exclusive-Node Jobs
If the job uses:
#SBATCH --exclusive
then the full local SSD capacity can be requested, up to:
7 TB
Requests larger than 1 TB are allowed only together with the --exclusive flag.
# Job Submission Validation
Jobs that do not meet the above requirements will be automatically rejected during the sbatch submission stage.
Examples of invalid submissions include:
- requesting more than 1 TB without --exclusive,
- invalid --tmp values,
- requests exceeding allowed limits.
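The rules above can be checked locally before calling sbatch. The following is only a sketch of such a check, not the validation that the cluster itself runs: the function name `check_tmp_request` is hypothetical, sizes are assumed to be written as an integer followed by G or T, and 1T is assumed to equal 1024G.

```shell
#!/bin/bash
# Hypothetical pre-submission check mirroring the limits described above.
# Assumptions: size is <integer>G or <integer>T, and 1T = 1024G.
# Usage: check_tmp_request <size> [exclusive]
check_tmp_request() {
    local size="$1" mode="${2:-shared}"
    local unit number gib limit

    unit="${size#"${size%?}"}"   # last character of the size string
    number="${size%?}"           # everything before the last character

    # Reject empty, zero-prefixed, or non-numeric values.
    case "$number" in
        ""|0*|*[!0-9]*)
            echo "invalid --tmp value: $size" >&2
            return 1
            ;;
    esac

    case "$unit" in
        G) gib="$number" ;;
        T) gib=$((number * 1024)) ;;
        *)
            echo "invalid --tmp unit in: $size" >&2
            return 1
            ;;
    esac

    # 1 TB for shared-node jobs, 7 TB for exclusive-node jobs.
    if [ "$mode" = "exclusive" ]; then
        limit=$((7 * 1024))
    else
        limit=1024
    fi

    if [ "$gib" -gt "$limit" ]; then
        echo "request of ${gib}G exceeds the ${limit}G limit for $mode jobs" >&2
        return 1
    fi
    echo "ok: ${gib}G requested (limit ${limit}G, $mode)"
}

check_tmp_request 500G
check_tmp_request 2T || echo "a 2T request without --exclusive would be rejected"
```

A check like this only saves a round trip to the scheduler; the submission-time validation on the cluster remains authoritative.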
# Important Notes
Local SSD storage is intended only for temporary job data.
- Data stored under /mnt/local/ is not persistent and may be removed after the job finishes.
- Users are strongly encouraged to copy important results to persistent storage such as project_data or archive.
- The quota mechanism is enforced automatically and prevents exceeding the declared storage limit.
# Examples
# Example 1 — Shared-Node Job with 500 GB SSD Space
```bash
#!/bin/bash
#SBATCH --partition=proxima
#SBATCH --nodelist=gpu73
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=local_ssd
#SBATCH --tmp=500G

echo "SSD working directory:"
echo "$TMPDIR"

srun ./app
```
# Example 2 — Exclusive-Node Job with 4 TB SSD Space
```bash
#!/bin/bash
#SBATCH --partition=proxima
#SBATCH --nodelist=gpu73
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=local_ssd
#SBATCH --exclusive
#SBATCH --tmp=4T

echo "SSD working directory:"
echo "$TMPDIR"

srun ./large_simulation
```
# Example 3 — Testing Local SSD Functionality
The following example can be used to verify that the local SSD mechanism, quota handling, and automatic scratch directory creation work correctly on the compute node.
```bash
#!/bin/bash
#SBATCH --job-name=test_local_ssd
#SBATCH --partition=proxima        # Partition with local SSD nodes
#SBATCH --nodelist=gpu73           # Force execution on node gpu73
#SBATCH --nodes=1                  # Minimum number of requested nodes
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1               # Required by the job submission validation rules
#SBATCH --constraint=local_ssd     # Enable local SSD allocation logic
#SBATCH --tmp=5G                   # Request 5 GB of local SSD space (small, safe test value)
#SBATCH --time=00:05:00
#SBATCH --reservation=devel

echo "=== SSD TEST START ==="
echo "Running on node: $SLURMD_NODENAME"
echo "Job ID: $SLURM_JOB_ID"

# Scratch directory automatically created by the Prolog
SCRATCH_DIR="/mnt/local/job_${SLURM_JOB_ID}_scratch"
echo "Checking scratch directory: $SCRATCH_DIR"

if [ -d "$SCRATCH_DIR" ]; then
    echo "Directory exists. Attempting to create a 1 GB test file..."
    # Create a 1 GB test file to verify write permissions and quota handling
    if dd if=/dev/zero of="${SCRATCH_DIR}/test_file.bin" bs=1M count=1024; then
        echo "SUCCESS: File written successfully!"
        ls -lh "${SCRATCH_DIR}/test_file.bin"
    else
        echo "WRITE ERROR: Please verify permissions or quota configuration."
    fi
else
    echo "FATAL ERROR: Directory $SCRATCH_DIR does not exist."
    echo "The Prolog mechanism may have failed."
fi

echo "=== SSD TEST END ==="
# After the job finishes, the Epilog mechanism should automatically
# remove the scratch directory and its contents.
```
# What this test verifies
This script validates:
- automatic creation of the scratch directory,
- write access to the local SSD,
- quota enforcement mechanism,
- correct operation of the Prolog/Epilog cleanup workflow,
- local filesystem availability on the selected compute node.
# TaskProlog Mechanism — Automatic LOCAL_SCRATCH Variable
To simplify the usage of local SSD storage, the cluster now uses Slurm's built-in TaskProlog mechanism to automatically create and export a dedicated environment variable for each job.
TaskProlog is a special Slurm mechanism that executes a script with regular user privileges immediately before the job starts. Its primary purpose is to modify the job environment dynamically.
As part of the local SSD workflow, the TaskProlog mechanism automatically defines:
$LOCAL_SCRATCH
which points to the isolated local SSD directory assigned to the job, for example:
/mnt/local/job_<JOB_ID>_scratch/
This means users no longer need to manually construct paths based on SLURM_JOB_ID.
Recommended usage:
cp input.dat $LOCAL_SCRATCH/
cd $LOCAL_SCRATCH
Using $LOCAL_SCRATCH makes job scripts cleaner, more portable, and resistant to future changes in scratch directory naming conventions.
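A typical job combines the recommended usage above with a copy-back step, since the local directory is cleaned up after the job. The following is a sketch of that stage-in, compute, stage-out pattern under stated assumptions: `RESULT_DIR` and the `output/` subdirectory are hypothetical names, and when `LOCAL_SCRATCH` is not set (outside a job) a temporary directory is used so the pattern can be tried on any machine.

```shell
#!/bin/bash
# Sketch of a stage-in / compute / stage-out pattern for local SSD storage.
# Assumptions: RESULT_DIR is a hypothetical destination on persistent storage
# (e.g. a directory under project_data); output/ is a hypothetical layout.
WORK_DIR="${LOCAL_SCRATCH:-$(mktemp -d)}"
RESULT_DIR="${RESULT_DIR:-$PWD/results}"

stage_out() {
    mkdir -p "$RESULT_DIR"
    # Copy everything under output/ back before the local directory is removed.
    if [ -d "$WORK_DIR/output" ]; then
        cp -r "$WORK_DIR/output/." "$RESULT_DIR/"
    fi
}
# Run stage_out when the script exits, including after a failed command.
trap stage_out EXIT

mkdir -p "$WORK_DIR/output"
cd "$WORK_DIR" || exit 1

# Placeholder for the real computation, e.g. `srun ./app`.
echo "result computed in $WORK_DIR" > output/result.txt
```

The trap does not run if the script is killed abruptly, so long jobs may also want to copy intermediate results at regular intervals.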
# Summary
| Scenario | Maximum SSD Space |
|---|---|
| Shared-node job | 1 TB |
| Exclusive-node job | 7 TB |
| Default allocation without --tmp | 100 GB |
# Recycle Bin Mechanism
A recycle bin mechanism is enabled for user home directories and project directories located on storage_6.
The recycle bin directory is named:
.recyclebininternal
and is automatically created inside:
- user home directories,
- project_data directories located on storage_6.
# How It Works
When files or directories are deleted, the data is not removed immediately from the filesystem. Instead, it is moved to the recycle bin and retained there for a defined retention period:
- Home directories (home) → 2 days
- Project directories (project_data) → 7 days
After this retention period expires, the data is automatically and permanently removed from the recycle bin.
# Important Notes
Storage Usage
Files stored in the recycle bin still count toward the user's quota.
This applies both to:
- home directory quotas,
- project directory quotas.
As a result, deleting files may not immediately reduce the reported disk usage.
# Immediate Space Recovery
If you need to quickly free up storage space, you must manually remove files from the recycle bin.
Example:
rm -rf ~/.recyclebininternal/*
or for a project directory:
rm -rf ~/<grant_id>/project_data/.recyclebininternal/*
Warning
Files removed from .recyclebininternal are permanently deleted and cannot be recovered.
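Note that a plain `*` glob does not match hidden files, so the commands above can leave dot-files behind in the recycle bin. A possible alternative, shown here only as a sketch, reports the space used and then removes every entry. The function name `empty_recycle_bin` is hypothetical, and the sketch keeps the `.recyclebininternal` directory itself on the assumption that the system expects it to exist.

```shell
#!/bin/bash
# Hypothetical helper: report the size of a recycle bin, then empty it.
# Usage: empty_recycle_bin <directory containing .recyclebininternal>
empty_recycle_bin() {
    local bin="$1/.recyclebininternal"
    if [ ! -d "$bin" ]; then
        echo "no recycle bin in $1" >&2
        return 1
    fi
    echo "Space used before cleanup:"
    du -sh "$bin"
    # Remove every entry, including hidden files that a plain `*` glob skips,
    # while keeping the .recyclebininternal directory itself.
    find "$bin" -mindepth 1 -delete
}

# Example (permanently deletes data, so it is left commented out):
# empty_recycle_bin "$HOME"
```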
# Example Workflow
```bash
# Remove a file from the home directory
rm ~/large_file.dat

# The file is moved automatically to ~/.recyclebininternal/

# Check recycle bin usage
du -sh ~/.recyclebininternal/
```
# Data Transfer Methods
Please copy data to project_data in interactive mode on a compute node:
srun --pty /bin/bash
Large operations should not be performed on the access node, because they slow down interactive work for all users.
In addition, compute nodes used in interactive mode are connected via the InfiniBand network, which significantly speeds up file transfers.
The operation can be accelerated further with the rclone utility installed on the cluster, which copies data in parallel.
Example usage:
rclone copy <source path> <target path> --progress --multi-thread-streams=N
During testing, we found that values of N above 8 bring no additional speed-up.
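The rclone command above can be wrapped so that the stream count never exceeds 8. This is a sketch only: the function name `rclone_copy_capped` and the `RCLONE` override variable are hypothetical, and the wrapper uses only the flags shown on this page.

```shell
#!/bin/bash
# Hypothetical wrapper around the rclone command shown above.
# It caps the number of streams at 8, the value beyond which the tests
# described on this page showed no further speed-up.
# RCLONE can be overridden (e.g. RCLONE=echo) to preview the command.
# Usage: rclone_copy_capped <source path> <target path> [streams]
rclone_copy_capped() {
    local src="$1" dst="$2" streams="${3:-8}"
    if [ "$streams" -gt 8 ]; then
        streams=8
    fi
    "${RCLONE:-rclone}" copy "$src" "$dst" --progress --multi-thread-streams="$streams"
}

# Preview the command without transferring anything:
RCLONE=echo rclone_copy_capped /path/to/source "$HOME/pl0001-01/project_data" 16
```

Run the function inside an interactive session on a compute node (started with srun --pty /bin/bash) rather than on the access node.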