Kavitha turns a GPU server into an ML platform
nvidia-smi, CUDA, Ollama, Jupyter — running AI models on your own hardware
Kavitha had been an ML engineer for 3 years using Google Colab and Jupyter notebooks. She could train models, run experiments, and deploy to cloud services. But when her company wanted to run models on their own GPU servers to save costs, she had to learn Linux GPU management from scratch.
Her first attempt: 4 hours to get a model running. Her tenth attempt: 8 minutes. Here is what she learned.
CHECKING GPU STATUS
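The --query-gpu command in this section prints CSV, which makes it easy to read from a script. The following is a minimal sketch of that idea, not a definitive tool: the function names and the SAMPLE text are made up for illustration (the sample is not captured from a real machine), and the header format is assumed to match what nvidia-smi prints with --format=csv.

```python
import csv
import io
import subprocess

QUERY = "name,memory.total,memory.free,utilization.gpu"


def parse_gpu_csv(text):
    """Turn `nvidia-smi --query-gpu=... --format=csv` output into a list of dicts."""
    # skipinitialspace drops the blank that nvidia-smi puts after each comma
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


def query_gpus():
    """Run nvidia-smi and parse the result (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Illustrative text only; the GPU name and numbers are invented
SAMPLE = (
    "name, memory.total [MiB], memory.free [MiB], utilization.gpu [%]\n"
    "Example GPU, 24576 MiB, 24000 MiB, 0 %\n"
)

if __name__ == "__main__":
    for gpu in parse_gpu_csv(SAMPLE):
        print(gpu["name"], gpu["memory.free [MiB]"])
```

On a real server, query_gpus() would replace the SAMPLE text; the parsing step stays the same.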
# Is a GPU present?
lspci | grep -i nvidia # NVIDIA GPUs
lspci | grep -i amd # AMD GPUs

# Are NVIDIA drivers installed?
nvidia-smi # if this works, drivers are installed
# Shows: GPU name, memory, driver version, highest CUDA version the driver supports, running processes

# Detailed GPU info:
nvidia-smi -q # full details
nvidia-smi --query-gpu=name,memory.total,memory.free,utilization.gpu --format=csv

# Watch GPU usage live (like top for GPUs):
watch -n 1 nvidia-smi

# Check CUDA version:
nvcc --version # if CUDA toolkit is installed
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

INSTALLING NVIDIA DRIVERS
# Check what driver version your GPU needs:
ubuntu-drivers devices # recommends the right driver

# Install recommended driver:
sudo ubuntu-drivers autoinstall

# Or install a specific version:
sudo apt install nvidia-driver-535

# After install, reboot is required:
sudo reboot

# Verify after reboot:
nvidia-smi

INSTALLING CUDA AND PYTORCH
# Install PyTorch with CUDA support (check pytorch.org for latest command):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify GPU is usable in PyTorch:
python3 << 'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("GPU name:", torch.cuda.get_device_name(0))
print("GPU memory:", torch.cuda.get_device_properties(0).total_memory // 1024**3, "GB")
EOF

RUNNING MODELS WITH OLLAMA (EASIEST PATH)
# Install Ollama (automatically uses GPU if available):
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model:
ollama run llama3.2 # downloads and runs; GPU used automatically

# Run as a service:
sudo systemctl enable ollama && sudo systemctl start ollama

# Use the API:
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Hello"}'

# Check if Ollama is using GPU:
nvidia-smi # should show ollama using GPU memory

GPU MEMORY MANAGEMENT
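Everything in this section comes down to finding the PID that owns the GPU memory. As a rough sketch, the process list can be read with nvidia-smi's --query-compute-apps option instead of being picked out of the default table by eye. The helper names here are hypothetical, and the sample rows in the usage note are invented.

```python
import subprocess


def parse_compute_apps(text):
    """Parse rows of `pid, process_name, used_memory` (csv,noheader,nounits)."""
    procs = []
    for line in text.strip().splitlines():
        # Split from both ends so a comma inside the process name cannot shift fields
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        procs.append({
            "pid": int(pid),
            "name": name.strip(),
            # Some setups report a non-numeric value here; keep None in that case
            "used_mib": int(used) if used.isdigit() else None,
        })
    return procs


def pids_matching(procs, needle):
    """Return the PIDs whose process name contains `needle`."""
    return [p["pid"] for p in procs if needle in p["name"]]


def list_gpu_processes():
    """Ask nvidia-smi for compute processes (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

For example, parse_compute_apps("4242, /usr/bin/python3, 10240\n") yields one entry with pid 4242, which can then be checked by hand before any kill command is run.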
# See what is using GPU memory:
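nvidia-smi

# Kill a stuck process using GPU:
# (the PID column in the default table moves between driver versions, so query it explicitly;
#  this kills every GPU process whose name contains "python")
sudo kill -9 $(nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | awk -F', ' '/python/{print $1}')

# Clear GPU memory (if a Python script crashed and left memory allocated):
# The only reliable way is to kill the process that owns the memory
# nvidia-smi shows the PID

# Out of memory? Reduce batch size or use quantisation:
# Instead of model.to('cuda'), use:
# model = AutoModelForCausalLM.from_pretrained(name, load_in_8bit=True)
# (needs the bitsandbytes package; recent transformers releases prefer
#  quantization_config=BitsAndBytesConfig(load_in_8bit=True))

MULTIPLE GPUS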
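On a shared multi-GPU server, the number to put in CUDA_VISIBLE_DEVICES is often just "whichever GPU has the most free memory right now". A minimal sketch of that selection, assuming nvidia-smi is queried with index,memory.free in csv,noheader,nounits format; the function names are hypothetical.

```python
import os
import subprocess


def pick_freest_gpu(csv_text):
    """Return the index of the GPU with the most free memory, or None if no rows."""
    best_index, best_free = None, -1
    for line in csv_text.strip().splitlines():
        index, free_mib = (int(part) for part in line.split(","))
        if free_mib > best_free:
            best_index, best_free = index, free_mib
    return best_index


def pin_to_freest_gpu():
    """Set CUDA_VISIBLE_DEVICES for this process (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    index = pick_freest_gpu(result.stdout)
    if index is not None:
        # Must happen before torch (or any CUDA library) initialises the GPU
        os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return index
```

Calling pin_to_freest_gpu() at the very top of a script, before importing torch, has the same effect as prefixing the command with CUDA_VISIBLE_DEVICES by hand.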
# See all GPUs:
nvidia-smi --list-gpus

# Run a Python script on a specific GPU:
CUDA_VISIBLE_DEVICES=0 python3 train.py # use only GPU 0
CUDA_VISIBLE_DEVICES=1 python3 serve.py # use only GPU 1
CUDA_VISIBLE_DEVICES=0,1 python3 train.py # use GPU 0 and 1
# Note: inside the process the visible GPUs are renumbered from 0,
# so with CUDA_VISIBLE_DEVICES=1 the script sees that GPU as cuda:0

# In Python:
import torch
device = torch.device('cuda:0') # use GPU 0
device = torch.device('cuda:1') # use GPU 1

RUNNING MODELS AS SERVICES
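Once Ollama runs as a service, other programs reach it over HTTP on port 11434, through the same /api/generate endpoint the curl example used earlier. By default that endpoint streams its answer as one JSON object per line, each carrying a "response" fragment and a "done" flag. Here is a small standard-library client as a sketch; the function names are hypothetical and error handling is left out.

```python
import json
import urllib.request


def build_generate_request(model, prompt, host="http://localhost:11434"):
    """Build the POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines):
    """Concatenate the 'response' fragments of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(model, prompt, host="http://localhost:11434"):
    """Send a prompt to a running Ollama server and return the full answer."""
    request = build_generate_request(model, prompt, host)
    with urllib.request.urlopen(request) as response:
        # An HTTP response object can be iterated line by line
        return join_stream(response)
```

With the service up, generate("llama3.2", "Hello") would return the complete text instead of the raw stream that curl prints.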
# /etc/systemd/system/ollama-serve.service
[Unit]
Description=Ollama LLM Server
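After=network.target

[Service]
Type=simple
User=ollama
# 0.0.0.0 exposes the API on every network interface; restrict access with a firewall
Environment=OLLAMA_HOST=0.0.0.0:11434
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

# If the ollama service from the install step is already running, stop it first:
# both would compete for port 11434
sudo systemctl daemon-reload
sudo systemctl enable ollama-serve && sudo systemctl start ollama-serve

JUPYTER ON A REMOTE SERVER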
# Install and start Jupyter:
pip3 install jupyterlab
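jupyter lab --no-browser --port=8888 --ip=0.0.0.0
# --ip=0.0.0.0 listens on every interface; with the SSH tunnel below,
# --ip=127.0.0.1 is enough and keeps Jupyter off the network

# Access from your laptop via SSH tunnel:
ssh -L 8888:localhost:8888 user@gpu-server
# Open http://localhost:8888 on your laptop

MONITORING GPU TEMPERATURE AND POWER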
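Watching the numbers in this section by eye works during a single run; for anything longer, a script can flag hot GPUs for you. This is a minimal sketch that reuses the 83C figure from this section as its default limit. It assumes nvidia-smi is queried with index,temperature.gpu,power.draw in csv,noheader,nounits format, and the function names are hypothetical.

```python
import subprocess


def find_hot_gpus(csv_text, limit_c=83):
    """Return (index, temp_c, power_w) for every GPU at or above `limit_c`."""
    hot = []
    for line in csv_text.strip().splitlines():
        index, temp, power = [part.strip() for part in line.split(",")]
        # Some GPUs do not report power draw; keep None instead of failing
        power_w = float(power) if power.replace(".", "", 1).isdigit() else None
        if int(temp) >= limit_c:
            hot.append((int(index), int(temp), power_w))
    return hot


def check_gpus(limit_c=83):
    """Query nvidia-smi once and report hot GPUs (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return find_hot_gpus(result.stdout, limit_c)
```

Run from cron or a loop, check_gpus() returning a non-empty list is the signal to lower the power limit or look at cooling.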
# GPU temperature (important: many GPUs start to throttle at around 83C; the exact limit depends on the model):
nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader

# Power usage:
nvidia-smi --query-gpu=power.draw --format=csv,noheader

# Set power limit (useful for servers where you pay per watt):
sudo nvidia-smi -pl 200 # limit to 200W (default may be 300W+)

# Watch everything:
watch -n 1 'nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total,power.draw --format=csv'

Kavitha's GPU server went from a confusing box to a production ML platform in one week. She wrote a setup script that installs drivers, CUDA, PyTorch, and Ollama on a fresh Ubuntu server in under 20 minutes. That script is now in the company's runbook.
nvidia-smi is your GPU equivalent of top — shows memory usage, GPU utilisation, temperature, and running processes
CUDA_VISIBLE_DEVICES=0 python3 script.py pins a process to a specific GPU — essential on multi-GPU servers
Ollama is the fastest path to running LLMs on a GPU — install in one command, automatically uses GPU if available
watch -n 1 nvidia-smi gives a live GPU dashboard — use during model runs to catch memory or temperature issues
SSH tunnel (ssh -L 8888:localhost:8888 user@server) lets you use Jupyter on a remote GPU from your laptop browser