The Definitive Guide to Running Ollama (DeepSeek & Llama 3) on Amazon Linux 2023 (GPU Enabled)

Introduction

Running your own Large Language Model (LLM) is the ultimate power move for data privacy and control. No more API rate limits, no more monthly subscription fees per seat, and no data leaving your VPC.

While most guides focus on Ubuntu, Amazon Linux 2023 (AL2023) is a strong choice for production workloads on AWS: it is built and maintained by AWS and integrates closely with the rest of the platform. However, installing the proprietary Nvidia drivers on AL2023 can be tricky, because the amazon-linux-extras repository from Amazon Linux 2 no longer exists.

In this guide, we will set up a private AI server running Ollama with the latest DeepSeek-R1 and Llama 3 models on an AWS EC2 instance with GPU acceleration.

Choosing the Right EC2 Instance

The most critical decision is matching your instance VRAM (Video RAM) to the model size you want to run. If the model doesn’t fit in VRAM, Ollama offloads part of it to the CPU and system RAM, making generation painfully slow.

The Budget King: g4dn.xlarge
- Specs: 4 vCPUs, 16GB RAM, 16GB NVIDIA T4 GPU.
- Cost: ~$0.526/hour (On-Demand US-East-1).
- Best For: Llama 3 (8B), DeepSeek-R1 Distill (7B, 8B, 14B).
- Verdict: This is the sweet spot for most users. The 16GB VRAM comfortably fits the 7B and 8B models with plenty of room for context.

The Performance Pick: g5.xlarge
- Specs: 4 vCPUs, 16GB RAM, 24GB NVIDIA A10G GPU.
- Cost: ~$1.006/hour.
- Best For: DeepSeek-R1 Distill (32B).
- Verdict: If you need the smarter 32B parameter model (approx 20GB size), you need this instance.

Storage: Do not stick with the default 8GB root volume. Change it to 50GB gp3 during launch.
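
The sizing advice above can be turned into a quick sanity check. This is only a rough sketch: fits_in_vram is a hypothetical helper, and the default 2GB of headroom for the context cache and CUDA overhead is an assumed figure, not a measured one. It compares a model's download size plus headroom against the GPU's VRAM.

```shell
# Rough check: does a model of a given download size fit in a GPU's VRAM?
# $1 = model size in GB, $2 = VRAM in GB, $3 = headroom in GB (assumed default: 2)
fits_in_vram() {
  awk -v m="$1" -v v="$2" -v h="${3:-2}" \
    'BEGIN { if (m + h <= v) print "fits"; else print "too large" }'
}

fits_in_vram 4.7 16   # deepseek-r1:7b on the T4 (g4dn.xlarge)
fits_in_vram 20 16    # deepseek-r1:32b on the T4
fits_in_vram 20 24    # deepseek-r1:32b on the A10G (g5.xlarge)
```

Longer context windows need more headroom, so treat a borderline "fits" as a reason to test on the real instance rather than as a guarantee.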

Step 1: Environment Setup (Amazon Linux 2023)

Once you have launched your instance and SSH’d in, the first task is preparing the OS. AL2023 requires specific commands to get the Nvidia drivers working.

Update the system packages:

sudo dnf update -y

Add the Nvidia CUDA Repository: Unlike Amazon Linux 2, we use config-manager to add the repo directly.

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo

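Install the Nvidia Drivers: This command installs the DKMS kernel module drivers and the necessary dependencies. DKMS compiles the module against the kernel headers on the system, so if the update above installed a new kernel, the build can fail until you reboot into that kernel; reboot first if in doubt.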

sudo dnf module install nvidia-driver:latest-dkms -y

Install CUDA Toolkit (Optional but Recommended):

sudo dnf install cuda-toolkit -y

Critical Step: Reboot your instance now to load the drivers.

sudo reboot

After the reboot, reconnect and verify the GPU is active:

nvidia-smi

You should see a table listing your Tesla T4 or A10G GPU.
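
If you want this check in a script rather than reading the table by eye, nvidia-smi can also print selected fields as CSV. The sketch below assumes the --query-gpu and --format options available in current drivers; summarise_gpu is a hypothetical helper name used only for illustration.

```shell
# Turn one CSV line "name, total_mib, used_mib" into a short summary.
summarise_gpu() {
  awk -F', ' '{ printf "%s: %d MiB free of %d MiB\n", $1, $2 - $3, $2 }'
}

# Only query the GPU if the driver tools are actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total,memory.used \
             --format=csv,noheader,nounits | summarise_gpu
else
  echo "nvidia-smi not found: the driver is not installed or not on PATH" >&2
fi
```

An empty result or the "not found" message after the reboot usually means the kernel module did not build or load.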

Step 2: Installing Ollama

Ollama makes running LLMs as easy as running Docker containers. The installation script works perfectly on AL2023.

Run the install script:

curl -fsSL https://ollama.com/install.sh | sh

Verify the service is running:

sudo systemctl status ollama

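Verify GPU Recognition: Check the logs to ensure Ollama found your Nvidia GPU.

journalctl -u ollama --no-pager | grep -iE "nvidia|cuda"

You want to see messages confirming that an Nvidia GPU was detected, such as “Nvidia GPU detected”. The exact wording varies between Ollama versions, which is why the search is case-insensitive.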

Step 3: Running DeepSeek & Llama 3

Now for the fun part. Let’s pull and run the models.

Option A: DeepSeek R1 (The Reasoning Model) DeepSeek R1 has taken the world by storm with its reasoning capabilities. We will use the 7B “distilled” version, which is fast and efficient.

ollama run deepseek-r1:7b

Wait for the download (~4.7GB) to finish. You will be dropped into a chat prompt.

Option B: Llama 3 (The Standard) Meta’s Llama 3 8B is a fantastic general-purpose model.

ollama run llama3

Pro Tip: To exit the chat, type /bye.
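
Besides the interactive prompt, a pulled model can also be queried over Ollama's local HTTP API on port 11434. The following is a minimal sketch, assuming the /api/generate endpoint with "stream": false; json_escape, build_payload and ask are hypothetical helper names, and the escaping only covers single-line prompts.

```shell
# Escape backslashes and double quotes so a single-line prompt is safe inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a non-streaming request body: $1 = model name, $2 = prompt.
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$(json_escape "$2")"
}

# Send the prompt to the local Ollama server and print the raw JSON reply.
ask() {
  curl -s http://localhost:11434/api/generate -d "$(build_payload "$1" "$2")"
}

# Example (run on the instance after the model has been pulled):
# ask deepseek-r1:7b "Why is the sky blue?"
```

The reply is a JSON object whose response field holds the generated text.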

Step 4: Exposing Ollama to the Internet (Securely)

By default, Ollama only listens on localhost (127.0.0.1). If you want to connect to this API from your laptop or a custom UI, you need to expose it.

Edit the systemd service:

sudo systemctl edit ollama.service

Add the Environment Variable: Add these lines in the editor window that opens:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Apply changes and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Update Security Groups (Crucial Security Step):
- Go to your AWS Console > EC2 > Security Groups.
- Add an Inbound Rule for Custom TCP Port 11434.
- Source: Select “My IP”. Ollama's API has no built-in authentication, so do not open this port to 0.0.0.0/0 unless you want hackers using your expensive GPU to generate spam.
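
The same rule can be added from your laptop with the AWS CLI instead of the console. This is a sketch under a few assumptions: the AWS CLI is installed and configured, checkip.amazonaws.com is used to look up your public IP, and cidr_for_ip and allow_my_ip are hypothetical helper names. The security group ID in the example is a placeholder.

```shell
# Turn a single IPv4 address into a /32 CIDR.
# Rejects empty input and anything containing characters other than digits and dots.
cidr_for_ip() {
  case "$1" in
    ""|*[!0-9.]*) return 1 ;;
  esac
  printf '%s/32\n' "$1"
}

# Allow inbound TCP on one port from your current public IP only.
# $1 = security group ID, $2 = port
allow_my_ip() {
  local sg_id="$1" port="$2" my_ip cidr
  my_ip="$(curl -fsS https://checkip.amazonaws.com)" || return 1
  cidr="$(cidr_for_ip "$my_ip")" || return 1
  aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" --protocol tcp --port "$port" --cidr "$cidr"
}

# Example, with a placeholder security group ID:
# allow_my_ip sg-0123456789abcdef0 11434
```

Because the rule is pinned to a single /32 address, it needs to be updated whenever your public IP changes.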

Optional: Adding a UI (Open WebUI)

If you prefer a ChatGPT-like interface over the terminal, Open WebUI is the best choice. It runs in Docker.

Install Docker (if you haven’t yet):

sudo dnf install docker -y
sudo systemctl enable --now docker
sudo usermod -aG docker ec2-user
newgrp docker

Run Open WebUI:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

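You can now access the UI at http://<your-ec2-public-ip>:3000. As with port 11434, this requires an inbound rule for TCP port 3000 in your security group, restricted to your own IP.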

Conclusion & Cost Management

You now have a powerful, private AI server running on Amazon Linux 2023. This setup gives you complete data privacy and the ability to switch between models like DeepSeek and Llama 3 in seconds.
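
On the cost side, the hourly prices quoted earlier add up quickly if the instance runs around the clock. The sketch below uses the g4dn.xlarge rate from this guide; monthly_cost is a hypothetical helper, and the 730 and 176 hour figures are assumed usage patterns (always on versus 8 hours on 22 working days).

```shell
# Estimate monthly compute cost: $1 = hourly rate in USD, $2 = hours running per month.
monthly_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

monthly_cost 0.526 730   # g4dn.xlarge, always on       -> 383.98
monthly_cost 0.526 176   # g4dn.xlarge, working hours   -> 92.58

# Stopping the instance when idle ends the compute charge (placeholder instance ID):
# aws ec2 stop-instances --instance-ids i-0123456789abcdef0
```

A stopped instance still pays for its EBS volume, but that is small next to the GPU hourly rate.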

FAQ

1. Can I run DeepSeek-R1 (671B) on these instances? No. The full 671B parameter model needs hundreds of gigabytes of GPU memory, which means a multi-GPU cluster of H100-class cards rather than a single T4 or A10G. However, Ollama offers “Distilled” versions (7B, 8B, 14B, 32B) which perform well and fit on the EC2 instances recommended in this guide.

2. Why use Amazon Linux 2023 instead of Ubuntu? Amazon Linux 2023 (AL2023) is optimized for AWS infrastructure, offering faster boot times and better security integration than generic Ubuntu images. While Ubuntu has more tutorials, AL2023 is the “native” choice for serious AWS engineers.

3. How much disk space do I need? Allocating at least 50GB is recommended. The OS and Nvidia drivers take ~10GB. The base DeepSeek-R1 (Distill-Llama-8B) model is ~5.2GB, and the 32B model is ~20GB. You need extra headroom for swap space and logs.

4. My ollama run command is slow. Why? Check if you are running in CPU mode. Run nvidia-smi to ensure your GPU is detected. If you see the GPU but Ollama is still slow, check the logs (journalctl -u ollama) to see if it failed to load the CUDA runner and fell back to CPU.
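
A quicker way to check where a loaded model is running is ollama ps, which lists loaded models along with a processor column. The sketch below assumes that column uses values such as "100% GPU", "100% CPU" or a CPU/GPU split, as in recent Ollama versions; classify_processor is a hypothetical helper for interpreting it.

```shell
# Interpret the processor value shown by `ollama ps` for a loaded model.
classify_processor() {
  case "$1" in
    "100% GPU") echo "fully on GPU" ;;
    "100% CPU") echo "CPU only: check the driver and the Ollama logs" ;;
    *CPU/GPU)   echo "split between CPU and GPU: the model does not fit in VRAM" ;;
    *)          echo "unrecognised: $1" ;;
  esac
}

# Show loaded models if Ollama is installed; run this while a chat session is open.
if command -v ollama >/dev/null 2>&1; then
  ollama ps
fi
```

A split result on a g4dn.xlarge usually means the model or its context is too large for 16GB of VRAM, so a smaller model or the g5.xlarge is the fix.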