Linux for Production Support Ch 31 / 32 Advanced

Kavitha turns a GPU server into an ML platform

nvidia-smi, CUDA, Ollama, Jupyter — running AI models on your own hardware

In this chapter

Kavitha

ML engineer, setting up a GPU server

The story

Kavitha had been an ML engineer for three years, working mostly in Google Colab and Jupyter notebooks. She could train models, run experiments, and deploy to cloud services. But when her company decided to run models on its own GPU servers to cut costs, she had to learn Linux GPU management from scratch.

Her first attempt: 4 hours to get a model running. Her tenth attempt: 8 minutes. Here is what she learned.

CHECKING GPU STATUS

# Is a GPU present?
lspci | grep -i nvidia          # NVIDIA GPUs
lspci | grep -i amd             # AMD GPUs

# Are NVIDIA drivers installed?
nvidia-smi                      # if this works, the driver is installed and loaded
# Shows: GPU name, memory, driver version, running processes, and the highest
# CUDA version the driver supports (not the version of any installed toolkit)

# Detailed GPU info:
nvidia-smi -q                   # full details
nvidia-smi --query-gpu=name,memory.total,memory.free,utilization.gpu --format=csv

# Watch GPU usage live (like top for GPUs):
watch -n 1 nvidia-smi

# Check CUDA version:
nvcc --version                  # CUDA toolkit version, if the toolkit is installed
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"   # CUDA version PyTorch was built with
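
The CSV query above is easy to consume from a script. Here is a minimal sketch, assuming the output was captured with --format=csv,noheader,nounits so that memory is in MiB and utilisation in percent with no unit suffixes. The function name parse_gpu_csv and the sample line are illustrative only, not real output from Kavitha's server.

```python
import csv
import io


def parse_gpu_csv(text):
    # Expects the output of:
    #   nvidia-smi --query-gpu=name,memory.total,memory.free,utilization.gpu
    #              --format=csv,noheader,nounits
    # Returns one dict per GPU.
    gpus = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        name, total, free, util = (field.strip() for field in row)
        gpus.append({
            "name": name,
            "memory_total_mib": int(total),
            "memory_free_mib": int(free),
            "utilization_pct": int(util),
        })
    return gpus


# Hypothetical sample line, for illustration only
sample = "NVIDIA A100-SXM4-40GB, 40960, 38912, 7\n"
for gpu in parse_gpu_csv(sample):
    used = gpu["memory_total_mib"] - gpu["memory_free_mib"]
    print(f'{gpu["name"]}: {used} MiB used, {gpu["utilization_pct"]}% busy')
```

In a real script the text would come from running nvidia-smi through subprocess. Some fields can come back as a placeholder such as [N/A] on certain systems, in which case the int() conversions here would need a guard.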

INSTALLING NVIDIA DRIVERS

# Check what driver version your GPU needs:
ubuntu-drivers devices          # recommends the right driver

# Install recommended driver:
sudo ubuntu-drivers install     # older Ubuntu releases use: sudo ubuntu-drivers autoinstall

# Or install a specific version (535 is an example; pick one listed by ubuntu-drivers devices):
sudo apt install nvidia-driver-535

# Reboot after installing so the new kernel module is loaded:
sudo reboot

# Verify after reboot:
nvidia-smi

INSTALLING CUDA AND PYTORCH

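# Install PyTorch with CUDA support. The pip wheels ship their own CUDA runtime,
# so only the NVIDIA driver is required; the full CUDA toolkit is needed only to compile CUDA code.
# cu118 (CUDA 11.8) is an example; check pytorch.org for the current command:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118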

# Verify GPU is usable in PyTorch:
python3 << 'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():   # the calls below raise an error when no GPU is visible
    print("GPU name:", torch.cuda.get_device_name(0))
    print("GPU memory:", torch.cuda.get_device_properties(0).total_memory // 1024**3, "GiB")
EOF
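
If torch.cuda.is_available() prints False even though nvidia-smi works, one thing worth checking is whether the driver is new enough for the wheel you installed. The sketch below applies a conservative rule of thumb: the CUDA version shown in the nvidia-smi header should be at least the CUDA version in the wheel tag. The helper names are hypothetical, and the tag parsing assumes the cuXYZ pattern (last digit is the minor version), which matches tags such as cu118 and cu121.

```python
def wheel_cuda_version(tag):
    # Turn a PyTorch wheel tag such as "cu118" into the tuple (11, 8).
    # Assumption: the last digit is the minor version, the rest is the major.
    digits = tag[2:]
    if not tag.startswith("cu") or not digits.isdigit() or len(digits) < 2:
        raise ValueError(f"unexpected wheel tag: {tag!r}")
    return int(digits[:-1]), int(digits[-1])


def driver_supports(driver_cuda, tag):
    # driver_cuda is the "CUDA Version" string from the nvidia-smi header,
    # for example "12.2". Returns True when it is at least the wheel's version.
    major, minor = (int(part) for part in driver_cuda.split("."))
    return (major, minor) >= wheel_cuda_version(tag)


print(driver_supports("12.2", "cu118"))  # True: a 12.2 driver covers an 11.8 wheel
print(driver_supports("11.4", "cu121"))  # False: the driver is older than the wheel
```

The rule is deliberately strict. NVIDIA's compatibility guarantees allow some combinations that this check rejects, so treat a False as a prompt to investigate rather than a verdict.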

RUNNING MODELS WITH OLLAMA (EASIEST PATH)

# Install Ollama (automatically uses GPU if available):
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model:
ollama run llama3.2             # downloads and runs — GPU used automatically

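# Run as a service (on Linux the install script normally sets up and starts this systemd service already):
sudo systemctl enable --now ollama

# Use the API ("stream": false returns one JSON object instead of a stream of chunks):
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Hello","stream":false}'

# Check if Ollama is using GPU (while a model is loaded):
ollama ps                       # lists loaded models and whether they run on GPU or CPU
nvidia-smi                      # should show ollama using GPU memory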
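
The same API call can be made from Python with only the standard library. This is a minimal sketch, assuming Ollama is listening on its default address localhost:11434 and that llama3.2 has already been pulled. The function names are illustrative; with "stream" set to false the server returns a single JSON object whose "response" field holds the generated text.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_request(model, prompt, url=OLLAMA_URL):
    # Build (but do not send) a POST request for Ollama's generate endpoint.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    # Send the request and return the generated text.
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Usage (needs a running Ollama server):
#   print(generate("llama3.2", "Hello"))
```

Splitting request construction from sending keeps the payload easy to inspect and test without a running server.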

GPU MEMORY MANAGEMENT

# See what is using GPU memory:
nvidia-smi

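# Kill a stuck process using GPU:
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv   # find the PID
kill <PID>                      # ask it to exit cleanly first (add sudo if another user owns it)
kill -9 <PID>                   # force-kill only if it ignores the first signal
# Avoid parsing the default nvidia-smi table with awk: its column layout differs between driver versions

# Clear GPU memory (if a Python script crashed and left memory allocated):
# The only reliable way is to kill the process that owns the memory
# nvidia-smi shows the PID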

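# Out of memory? Reduce batch size or use quantisation (8-bit loading needs the bitsandbytes package):
# Instead of loading in full precision and calling model.to('cuda'), use:
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))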
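
To find which process is holding GPU memory without reading the table by eye, the per-process query can be parsed in a few lines. This is a sketch under assumptions: the input comes from nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits, the function name is hypothetical, and the sample text is made up for illustration.

```python
def parse_compute_apps(text):
    # Expects one "pid, process_name, used_memory" line per process,
    # with memory in MiB (nounits). Returns a list of dicts.
    procs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)  # tolerate commas inside the name
        procs.append({
            "pid": int(pid.strip()),
            "name": name.strip(),
            "used_mib": int(mem.strip()),
        })
    return procs


# Hypothetical sample output, for illustration only
sample = "4242, python3, 15360\n5151, /usr/local/bin/ollama, 5120\n"
biggest = max(parse_compute_apps(sample), key=lambda p: p["used_mib"])
print(f'Largest user: PID {biggest["pid"]} ({biggest["name"]}), {biggest["used_mib"]} MiB')
```

The script only reports the PID. Deciding whether to kill it stays with the operator, which is safer than killing everything that matches a name. On some systems the memory column is reported as a placeholder rather than a number, and the int() conversion would then need a guard.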

MULTIPLE GPUS

# See all GPUs:
nvidia-smi --list-gpus

# Run a Python script on a specific GPU:
CUDA_VISIBLE_DEVICES=0 python3 train.py    # use only GPU 0
CUDA_VISIBLE_DEVICES=1 python3 serve.py    # use only GPU 1 (inside this process it appears as cuda:0)
CUDA_VISIBLE_DEVICES=0,1 python3 train.py  # use GPU 0 and 1

# In Python (indices count only the GPUs visible to the process):
import torch
device = torch.device('cuda:0')    # first visible GPU
# device = torch.device('cuda:1')  # second visible GPU, if there is one
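
A common source of confusion is that CUDA renumbers devices inside the process: whatever GPUs are listed in CUDA_VISIBLE_DEVICES become cuda:0, cuda:1, and so on, in the order listed. The small sketch below makes that mapping explicit. The function name is hypothetical and it handles only integer indices, not GPU UUIDs.

```python
def logical_index(visible, physical):
    # Return the cuda:N index that GPU `physical` gets inside a process
    # started with CUDA_VISIBLE_DEVICES=`visible`, or None if it is hidden.
    ids = [int(part) for part in visible.split(",") if part.strip()]
    return ids.index(physical) if physical in ids else None


print(logical_index("1", 1))    # 0: the only visible GPU is cuda:0
print(logical_index("0,1", 1))  # 1
print(logical_index("1", 0))    # None: GPU 0 is hidden from the process
```

So a script started with CUDA_VISIBLE_DEVICES=1 should use cuda:0, not cuda:1. Note also that nvidia-smi and CUDA may order GPUs differently by default; setting CUDA_DEVICE_ORDER=PCI_BUS_ID makes CUDA follow the PCI bus order.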

RUNNING MODELS AS SERVICES

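# /etc/systemd/system/ollama-serve.service
# Note: the Ollama install script normally creates its own ollama.service.
# Run only one of the two, or both will compete for port 11434.
[Unit]
Description=Ollama LLM Server
After=network.target

[Service]
Type=simple
# The ollama user must exist (the install script normally creates it)
User=ollama
# 0.0.0.0 exposes the unauthenticated API to the whole network;
# use 127.0.0.1 or a firewall rule if only local clients need access
Environment=OLLAMA_HOST=0.0.0.0:11434
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload    # make systemd read the new unit file
sudo systemctl enable --now ollama-serve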
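
After starting the service it helps to confirm that something is actually accepting connections before pointing clients at it. The sketch below polls a TCP port until it opens or a timeout passes. The function name and the timeout values are illustrative assumptions; port 11434 is the one configured in the unit file above.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    # Return True as soon as a TCP connection to host:port succeeds,
    # or False if none succeeds within `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not up yet; wait and retry
    return False


# Usage (after `systemctl start`):
#   if not wait_for_port("127.0.0.1", 11434):
#       print("Ollama did not come up; check: journalctl -u ollama-serve")
```

An open port only shows that the server process is listening. A real request, such as the generate call shown earlier in the chapter, is still the definitive test that a model loads and answers.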

JUPYTER ON A REMOTE SERVER

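# Install and start Jupyter (bind to localhost only; the SSH tunnel below handles remote access):
pip3 install jupyterlab
jupyter lab --no-browser --port=8888 --ip=127.0.0.1

# Access from your laptop via SSH tunnel:
ssh -L 8888:localhost:8888 user@gpu-server
# Open http://localhost:8888 on your laptop and enter the token Jupyter printed at startup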

MONITORING GPU TEMPERATURE AND POWER

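# GPU temperature (important: many GPUs start throttling at around 83C; the exact threshold depends on the model):
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
nvidia-smi -q -d TEMPERATURE    # shows this GPU's slowdown and shutdown thresholds

# Power usage:
nvidia-smi --query-gpu=power.draw --format=csv,noheader

# Set power limit (useful for servers where you pay per watt):
nvidia-smi -q -d POWER          # shows the allowed min/max power limit for this GPU
sudo nvidia-smi -pl 200         # limit to 200W (default may be 300W+); the setting resets at reboot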

# Watch everything:
watch -n 1 'nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw --format=csv'
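
To put a number on the power limit, here is a rough worked calculation using the 300W and 200W figures from the example above. It assumes the GPU would otherwise draw its full default limit around the clock, so the result is an upper bound; real savings depend on how busy the GPU actually is. The function name is illustrative.

```python
def kwh_saved(default_watts, limit_watts, hours):
    # Upper bound on energy saved by capping one GPU, in kWh.
    return max(default_watts - limit_watts, 0) * hours / 1000


# One GPU capped from 300 W to 200 W, running flat out for 30 days
print(kwh_saved(300, 200, 24 * 30))  # 72.0
```

Multiply the result by your electricity price and GPU count to estimate the monthly saving, and weigh it against the longer training times a lower power cap can cause.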

Kavitha's GPU server went from a confusing box to a production ML platform in one week. She wrote a setup script that installs drivers, CUDA, PyTorch, and Ollama on a fresh Ubuntu server in under 20 minutes. That script is now in the company's runbook.

Key takeaways

nvidia-smi is your GPU equivalent of top — shows memory usage, GPU utilisation, temperature, and running processes

CUDA_VISIBLE_DEVICES=0 python3 script.py pins a process to a specific GPU — essential on multi-GPU servers

Ollama is the fastest path to running LLMs on a GPU — install in one command, automatically uses GPU if available

watch -n 1 nvidia-smi gives a live GPU dashboard — use during model runs to catch memory or temperature issues

SSH tunnel (ssh -L 8888:localhost:8888 user@server) lets you use Jupyter on a remote GPU from your laptop browser

Commands from this chapter

$ nvidia-smi

GPU status dashboard — memory, utilisation, temperature, running processes

$ watch -n 1 nvidia-smi

Live GPU monitoring — refresh every second

$ CUDA_VISIBLE_DEVICES=0 python3 train.py

Run script on GPU 0 only

$ curl -fsSL https://ollama.ai/install.sh | sh && ollama run llama3.2

Install Ollama and run a model in two commands

$ python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

Verify PyTorch can see the GPU

$ sudo nvidia-smi -pl 200

Set GPU power limit to 200W — reduce electricity cost