GPU Workstation – Administration and User Guide

This document provides a comprehensive overview of the GPU workstation’s administration, configuration, and user guidelines.

Administrators:

  • Pascal Tribel: pascal.tribel@ulb.be
  • Gian Marco Paldino: gian.marco.paldino@ulb.be
  • Cédric Simar: cedric.simar@ulb.be

  • 1. Onboarding and First-Time Access

Before you can connect to the server, you must have an account created for you. Please follow these steps:

  1. Contact the Administrators: Send an email to one of the administrators listed above to request a username.
  2. Provide Your Public SSH Key: In the same email, you must include your public SSH key (id_rsa.pub or similar). Password-based login is disabled for security, so key-based authentication is required. For instructions on how to create a key-pair, you can look here.

Once your account has been created, you can proceed to the next step.

  • 2. Announcement and Issue Reporting

All announcements regarding the GPU workstation (e.g., maintenance, reboots) are posted on the GPU Workstation Teams channel. This channel is also the designated place for reporting any issues you encounter.

Link: https://teams.microsoft.com/l/channel/19%3a91677eba27174f00bc02a1e6dd82675d%40thread.tacv2/Gpu%2520Workstation?groupId=99445ba5-ff9f-4bab-a0f3-ecd5d73dd8fe&tenantId=30a5145e-75bd-4212-bb02-8ff9c0ea4ae9

  • 3. Connecting to the Workstation

Before connecting for the first time, make sure the administrators have created your account and received your public SSH key (see Section 1).

  • 3.1. VPN Connection Requirement (Off-Campus Access)

To access the GPU workstation from outside the ULB campus, you must first connect to the university’s VPN. A valid ULB NETID is required for authentication.

  • Windows & macOS

  1. Navigate to vpn.ulb.ac.be in your web browser.
  2. Log in using your ULB NETID and password.
  3. Download and install the VPN client software provided for your operating system.
  4. Launch the client and connect using the following details:
  • Address: vpn.ulb.ac.be
  • Login: Your NETID

Once connected, you will be able to access all protected ULB resources, including the GPU workstation.

  • Linux

The recommended client for Linux is openconnect.

  1. Install the client:

sudo apt update && sudo apt install openconnect -y

  2. Connect to the VPN:

sudo openconnect --protocol=gp vpn.ulb.ac.be

You will be prompted to enter your NETID and password.

Important: You must keep this terminal window open for the duration of your session. Closing it will terminate the VPN connection.

  • 3.2. Standard SSH Connection

You can connect to the workstation via SSH using its fully qualified domain name (recommended) or its IP address.

Using Hostname (Recommended):

ssh yourlogin@ulb-gpu.info.ulb.ac.be

Using IP Address:

ssh yourlogin@10.149.16.180
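
If a connection attempt hangs or fails, it can help to check whether the SSH port is reachable at all before looking at keys or usernames. The following is a minimal sketch, not an official tool: can_reach is a hypothetical helper, and port 22 is assumed to be the SSH port, as in the configuration example in Section 3.3.

```python
import socket


def can_reach(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling can_reach("ulb-gpu.info.ulb.ac.be") from your own machine returns False when the host cannot be resolved or contacted, which from off campus usually points to the VPN rather than to your SSH setup.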

  • 3.3. Streamlining Connections with an SSH Config File

For a more efficient workflow, define an alias in your SSH client configuration file so that connection settings such as hostname, user, and port forwarding are applied automatically.

Create or edit the file at ~/.ssh/config and add an entry for the GPU server.

Example Configuration:

# GPU Workstation Alias
Host gpu-workstation
    HostName ulb-gpu.info.ulb.ac.be
    User your_login
    Port 22
    ForwardX11 yes
    Compression yes

    # Forward your assigned Jupyter port (e.g., 9120) to your local machine
    LocalForward 9120 localhost:9120

    # Example: Forward Spark UI port
    LocalForward 4040 localhost:4040

How to Use:

  • With the configuration above, you can connect by typing: ssh gpu-workstation
  • This configuration also works seamlessly with scp and sftp for file transfers.
  • LocalForward maps a port on your local machine to a port on the remote server, which is essential for accessing services like Jupyter notebooks in your local browser.
  • ForwardX11 yes enables GUI application forwarding, and Compression yes can speed up the connection over slower networks.

  • 3.4. Port Redirection (Manual)

If you are not using an SSH config file, you can forward ports manually with the -L flag. This example makes port 8388 on the server reachable at port 9999 on your local machine:

ssh -L 9999:localhost:8388 yourlogin@ulb-gpu.info.ulb.ac.be
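
The order of the fields after -L is easy to get backwards. As a minimal sketch, the hypothetical helper below (build_forward_command is not part of any tool installed on the workstation) assembles the command from named arguments, so the local and remote ports cannot be swapped by accident.

```python
import shlex


def build_forward_command(local_port, remote_port, login,
                          host="ulb-gpu.info.ulb.ac.be"):
    """Return the ssh argument list forwarding local_port to remote_port on the server."""
    for port in (local_port, remote_port):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
    # -L takes local_port:destination_host:destination_port; "localhost" is
    # resolved on the server side, so it refers to the workstation itself.
    forward = f"{local_port}:localhost:{remote_port}"
    return ["ssh", "-L", forward, f"{login}@{host}"]


if __name__ == "__main__":
    # Prints the same command as the manual example above.
    print(shlex.join(build_forward_command(9999, 8388, "yourlogin")))
```

The list form can be passed directly to subprocess.run if you prefer to start the tunnel from a script.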

  • 4. System Specifications

  • 4.1. Hardware

  • Processors: 2 x Intel Xeon E5-2640 V4 (10 cores, 2.4 GHz, up to 3.4 GHz Turbo)
  • Motherboard: Asus Z10PG-D24 Server Board (Chipset Intel® C612 PCH)
  • RAM: 24 x 32GB DDR4 ECC (Total 768GB) – Quad channel
  • GPUs: 8 x Asus GeForce GTX 1080 Ti (Ref: TURBO-GTX1080TI-11G)
  • Storage Disks:
    • 1 x 2TB NVMe SSD (boot drive)
    • 6 x 2TB Seagate Enterprise HDDs (ST2000NX0403)
  • Networking:
    • 2 x Intel I210AT Gigabit ports
    • PEB-10G/57840-2T network card – 2 x 10G LAN ports
  • Power Supply: 3 x 1600W Redundant Power Supply
  • Chassis: ESC8000 G3 4U Cover

  • 4.2. Software

  • Operating System: Ubuntu Desktop 22.04.4 LTS
  • Linux Kernel: 5.15.0-113-generic x86_64
  • NVIDIA Driver: 470.161.03
  • CUDA Version: 11.4
  • GCC Version: 9.4.0

  • 5. User Management

  • 5.1. Adding New Users

New user creation requires a desired username (typically the NetID) and a public SSH key.

Onboarding Commands:

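username="[NetID]"

# Create the account (adduser prompts for the temporary password)
sudo adduser "$username"
# Allow the user to run Docker
sudo usermod -aG docker "$username"
# Force a password change at first login
sudo chage -d 0 "$username"
# Install the user's public SSH key
sudo -u "$username" mkdir -p "/home/$username/.ssh"
echo "[SSH key]" | sudo -u "$username" tee "/home/$username/.ssh/authorized_keys"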

The administrator will set a temporary password, which the user must change upon first login.

  • 5.2. Active Users

  • nversbra: Nassim Versbraegen
  • gbonte: Gianluca Bontempi
  • gpaldino: Gian Marco Paldino
  • dlunghi: Daniele Lunghi
  • yamoling: Yannick Molinghen
  • ptribel: Pascal Tribel (added 2023-10-14)
  • amor0060: Alejandro Morales Hernández (added 2023-10-24)
  • lcordeir: Loïc Cordeiro Fonseca (added 2025-02-16)

  • 6. Docker Usage Guide

To maintain a stable environment and prevent configuration conflicts, all computations should be run inside a dedicated Docker container. Each user must build a personal image on top of the base gpu_ubuntu1804 image.

  • 6.1. Base Image Configuration

The base image includes pre-configured Python 3.6 and R 3.6.3 environments with common data science and deep learning libraries.

  • Python: Includes Miniconda, Keras 2.4.0, TensorFlow 2.3, PyTorch, and libraries like pandas, scikit-learn, and matplotlib.
  • R: Includes r-base, r-base-dev, and libraries like ggplot2, randomForest, and the IRKernel for Jupyter.

  • 6.2. Building Your Personal Container

Create a Dockerfile from the following template, replacing each … placeholder with your own value. Obtain your userID and userGID by running the id command on the workstation.

FROM gpu_ubuntu1804:base

LABEL maintainer="Theo Verhelst <tverhels@ulb.ac.be>"

ARG userPort=…
ARG userName=…
ARG userGID=…
ARG userID=…

# Create a non-root user for security
RUN groupadd -g $userGID $userName
# The login name is the last argument; the primary group is set with -g
RUN useradd -u $userID -d /home/$userName -ms /bin/bash -g $userGID -G sudo $userName
USER $userName
WORKDIR /home/$userName
ENV PATH=/opt/miniconda3/bin:$PATH

# Create and set permissions for a shared data volume
RUN mkdir /home/$userName/shared_data
RUN chown $userName:$userName /home/$userName/shared_data
VOLUME /home/$userName/shared_data
EXPOSE $userPort

Build the image with this command, replacing $USER with your username:

nvidia-docker build --rm=true -t gpu_ubuntu1804:$USER .
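
Instead of editing the ARG lines by hand, the values can also be supplied at build time with Docker's --build-arg flag, which overrides the default of the matching ARG. The sketch below is one possible way to automate this, not the procedure used by the administrators: build_image_command is a hypothetical helper, and the port in the example is a placeholder for the one assigned to you.

```python
import os
import shlex


def build_image_command(user_name, user_id, group_id, jupyter_port):
    """Return an nvidia-docker build command that fills the Dockerfile ARGs."""
    build_args = {
        "userName": user_name,
        "userID": user_id,
        "userGID": group_id,
        "userPort": jupyter_port,
    }
    command = ["nvidia-docker", "build", "--rm=true"]
    for name, value in build_args.items():
        # Each --build-arg overrides the matching ARG line in the Dockerfile.
        command += ["--build-arg", f"{name}={value}"]
    command += ["-t", f"gpu_ubuntu1804:{user_name}", "."]
    return command


if __name__ == "__main__":
    # os.getuid() and os.getgid() return the same numbers as the `id` command.
    # 9120 is only a placeholder; use the Jupyter port assigned to you.
    login = os.environ.get("USER", "your_login")
    print(shlex.join(build_image_command(login, os.getuid(), os.getgid(), 9120)))
```

Run on the workstation, the script prints a command you can review before executing it in the directory that contains your Dockerfile.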

  • 6.3. Running Your Container

Use this command to launch your container in interactive mode:

nvidia-docker run --rm -it \
    --gpus "device=#GPU_ID#" \
    --memory=#Memory_limit# \
    --memory-swap=#Swap_limit# \
    -p #Jupyter_port#:#Jupyter_port# \
    -v /path/to/host/data:/home/#userName#/shared_data \
    gpu_ubuntu1804:$USER

Parameters:

  • #GPU_ID#: The numeric ID of the GPU allocated to you.
  • #Memory_limit#: The amount of RAM for the container (e.g., 16G).
  • #Swap_limit#: Total RAM + swap (e.g., 17G for 16G RAM and 1G swap).
  • #Jupyter_port#: Your assigned Jupyter port.
  • /path/to/host/data: An absolute path on the host machine to mount inside the container.
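
Because the swap limit is the sum of RAM and swap rather than the swap alone, it is easy to set it too low. A small sketch, using a hypothetical helper named memory_flags, derives both flags from the two quantities:

```python
def memory_flags(ram_gb, swap_gb):
    """Return the --memory and --memory-swap flags for the given sizes in GB."""
    if ram_gb <= 0 or swap_gb < 0:
        raise ValueError("ram_gb must be positive and swap_gb non-negative")
    # Docker expects --memory-swap to be the total of RAM plus swap.
    total_gb = ram_gb + swap_gb
    return [f"--memory={ram_gb}G", f"--memory-swap={total_gb}G"]


if __name__ == "__main__":
    # 16G of RAM plus 1G of swap gives a swap limit of 17G.
    print(memory_flags(16, 1))
```

Setting swap_gb to 0 yields identical values for both flags, which leaves the container with no swap.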

  • 6.4. GPU and Jupyter Port Allocations

GPU ID  | User(s)                     | Jupyter Port(s)
0       | gbonte                      | 8890
1       | jdestefa, cnachteg, ptribel | 9017, 9208, 9119
2       | blebicho, ibosch            | 9010
3       | tverhels, jsal0013          | 8989, 9120
4       | csimar                      | -
5, 6, 7 | Reserved for experiments    | -
-       | yleborgn                    | 8889
-       | blextrait                   | 9002
-       | bvanderp                    | 8123
-       | lcordeir                    | 9121

  • 6.5. Optimizing GPU Memory Allocation (TensorFlow)

By default, TensorFlow pre-allocates nearly all available GPU memory, which leaves little for other users of the same GPU. To make it allocate memory on demand instead, add the following snippet at the beginning of your scripts, before any GPU operation runs.

Python:

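import tensorflow as tf

# Memory growth must be configured before any GPU has been initialized
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Raised if the GPUs were already initialized
        print(e)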

  • 7. Storage and Data Management

  • 7.1. Disk Partitions and Mounts

Docker data resides on /media/hdd4. The other large disks are for general data storage.

Device    | Mount point | Filesystem | Size
/dev/sda1 | /media/hdd1 | ext4       | 1.8T
/dev/sdb1 | /media/hdd2 | ext4       | 1.8T
/dev/sdc1 | /media/hdd3 | ext4       | 1.8T
/dev/sdd  | /media/hdd4 | ext4       | 1.8T

  • 7.2. Accessing Large Datasets

For optimal performance, store large datasets on the /media/ drives rather than your home directory. Create a symbolic link for easy access:

# Example: Link a dataset on hdd1 to your home directory

ln -s /media/hdd1/my_large_dataset ~/datasets
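
When datasets are set up from a script, the same link can be created with a couple of safety checks. The sketch below is only an illustration: link_dataset is a hypothetical helper that refuses to overwrite an existing path and reports how much space is left on the drive holding the data.

```python
import os
import shutil


def link_dataset(dataset_dir, link_path):
    """Create link_path as a symbolic link to dataset_dir; return free space in GB."""
    dataset_dir = os.path.abspath(dataset_dir)
    link_path = os.path.expanduser(link_path)
    if not os.path.isdir(dataset_dir):
        raise FileNotFoundError(f"no such dataset directory: {dataset_dir}")
    if os.path.lexists(link_path):
        raise FileExistsError(f"refusing to overwrite: {link_path}")
    os.symlink(dataset_dir, link_path)
    # Free space is measured on the drive that actually holds the data.
    return shutil.disk_usage(dataset_dir).free / 1e9
```

For instance, link_dataset("/media/hdd1/my_large_dataset", "~/datasets") creates the same link as the ln -s command above.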

 

  • 8. Troubleshooting

  • 8.1. SSH Connection Issues

Error: ssh: Could not resolve hostname XXX: Name or service not known

Cause: SSH found no Host entry for the alias you typed, usually because the ~/.ssh/config file was not read or does not define that alias.

Solution: Specify the path to your config file explicitly using the -F flag:

ssh -F ~/.ssh/config gpu-workstation

  • 8.2. NVIDIA Driver Mismatch

Error Symptom: NVRM: API mismatch error in dmesg or /var/log/syslog.

Cause: The NVIDIA driver and kernel module versions are out of sync, typically after a system update.

Solution:

  1. Update the system: sudo apt update && sudo apt upgrade
  2. Regenerate the initramfs: sudo update-initramfs -u -k all
  3. Reboot: sudo reboot now

  • 8.3. Docker Errors

Error: “Got permission denied while trying to connect to the Docker daemon socket”

Solution: Your user account is not in the docker group. Contact an administrator to be added.

Error: “Error response from daemon: driver failed programming external connectivity”

Solution: The port you are trying to use is already allocated. Stop the existing container with nvidia-docker stop <container_name> or choose a different port.
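
One way to tell whether your port is already taken before launching a container is to try binding it yourself. The sketch below is illustrative only: is_port_free is a hypothetical helper, and it assumes Docker publishes ports through listening sockets on the host, which may not hold for every daemon configuration.

```python
import socket


def is_port_free(port, host=""):
    """Return True if nothing on this machine is bound to the TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            # Binding fails with OSError when another socket holds the port.
            probe.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    port = 9120  # placeholder: replace with your assigned Jupyter port
    print(f"port {port} is {'free' if is_port_free(port) else 'in use'}")
```

Run it on the workstation itself, not on your local machine, since the conflict occurs on the host where the container is started.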

  • 9. System Monitoring

Use these commands to check the status of system hardware:

  • CPU Temperature: sensors (from the lm-sensors package)
  • GPU Temperature and Usage: nvidia-smi
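
For logging or scripted checks, nvidia-smi can also emit machine-readable CSV through its --query-gpu and --format options. The sketch below shows one way to turn that output into Python dictionaries; parse_gpu_status and query_gpu_status are hypothetical helper names, and the selection of fields is an arbitrary choice.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["index", "temperature.gpu", "utilization.gpu",
                "memory.used", "memory.total"]


def parse_gpu_status(csv_text):
    """Turn nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    # Values are kept as strings; empty lines are skipped.
    return [dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
            for row in rows if row]


def query_gpu_status():
    """Run nvidia-smi on the current machine and parse its output."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return parse_gpu_status(result.stdout)
```

On the workstation, query_gpu_status() returns one dictionary per GPU, with temperatures in degrees Celsius, utilization in percent, and memory in MiB.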
  • 10. Maintenance Log Highlights

  • 09/12/2022: Upgraded NVIDIA driver to 470.161.03 to resolve API mismatch.
  • 08/12/2022: Removed a faulty RAM stick that was preventing boot.
  • 12/2021: Upgraded total system RAM from 256GB to 768GB.
  • 06/2021: Upgraded primary NVMe disk from 512GB to 2TB and installed Ubuntu 20.04.