How to Run Local LLM Models on Your Machine — Ollama Guide
A practical guide to running powerful AI models locally with Ollama, covering installation, model selection, API usage, and real-world workflows for developers and businesses.
By Keegan Kelly
Why Run LLMs Locally?
Cloud-based AI services like ChatGPT and Claude are powerful, but they come with trade-offs: recurring API costs, data leaving your network, rate limits, and dependency on third-party uptime. For many use cases — code generation, document analysis, internal tooling — running a model on your own hardware makes more sense.
Local LLMs give you full control over your data, zero per-token costs after setup, and the ability to run inference offline. With tools like Ollama, getting started takes less than five minutes.
What Is Ollama?
Ollama is an open-source tool that makes it dead simple to download, run, and manage large language models on your local machine. It handles model downloading, quantization formats, GPU acceleration, and exposes a local API — all from a single command-line interface.
Think of it as Docker for LLMs. You pull a model, run it, and interact with it through your terminal or any HTTP client.
Installation
macOS
Download the installer from ollama.com/download or install via Homebrew:
brew install ollama
Linux
Run the one-line install script:
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com/download. Ollama runs natively on Windows with GPU support for NVIDIA cards.
After installation, verify it's working:
ollama --version
Pulling and Running Your First Model
Ollama's model library includes most of the major open-weight models. To download and start chatting with one:
ollama run llama3.2
This pulls the Llama 3.2 model (if not already downloaded) and drops you into an interactive chat session. Type your prompt, hit enter, and the model responds directly in your terminal.
Recommended Models by Use Case
| Model | Size | Best For |
|---|---|---|
| llama3.2 | 3B | General chat, lightweight tasks |
| llama3.3 | 70B | High-quality reasoning, complex tasks |
| codellama | 7B–34B | Code generation and review |
| mistral | 7B | Fast general-purpose inference |
| deepseek-coder-v2 | 16B–236B | Advanced code generation |
| gemma2 | 9B–27B | Google's efficient general model |
| phi3 | 3.8B–14B | Microsoft's compact powerhouse |
| qwen2.5-coder | 7B–32B | Strong multilingual code model |
Pull any model with:
ollama pull mistral
ollama pull codellama:13b
The tag after the colon specifies the parameter size. Larger models produce better output but require more RAM and VRAM.
Hardware Requirements
Local LLMs are memory-hungry. Here's a rough guide:
- 7B models — 8GB RAM minimum, runs on most modern laptops
- 13B models — 16GB RAM recommended
- 34B models — 32GB+ RAM or a GPU with 24GB VRAM
- 70B models — 64GB+ RAM or multiple GPUs
If you have an Apple Silicon Mac (M1/M2/M3/M4), you're in a great position — Ollama leverages the unified memory architecture and Metal GPU acceleration out of the box. A MacBook Pro with 32GB of unified memory comfortably runs 13B–34B models.
For NVIDIA GPUs, Ollama uses CUDA automatically. AMD GPU support is available on Linux via ROCm.
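Those numbers follow from simple arithmetic: Ollama's default downloads are roughly 4-bit quantized, so the weights take about half a byte per parameter, plus extra room for the context (KV cache) and runtime. A back-of-the-envelope estimator; the 20% overhead figure is a rough assumption, not a measured constant:

```python
def approx_model_memory_gb(params_billion: float, bits_per_weight: int = 4,
                           overhead: float = 1.2) -> float:
    """Estimate RAM/VRAM: weight bytes at the given quantization width,
    padded ~20% for KV cache and runtime overhead (rough assumption)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"7B at 4-bit:  ~{approx_model_memory_gb(7):.1f} GB")
print(f"70B at 4-bit: ~{approx_model_memory_gb(70):.1f} GB")
```

That lines up with the table above: a 4-bit 7B model fits in 8GB with room left for the OS, while a 70B model pushes you into the 64GB tier.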
Using the Ollama API
Ollama exposes a local REST API on http://localhost:11434. Its native endpoints live under /api, and it also serves an OpenAI-compatible API under /v1, so existing OpenAI client libraries can point at it unchanged. This is where things get powerful for developers.
Chat Completions
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Explain Docker in 3 sentences." }
  ]
}'
Without "stream": false in the body, the response arrives as a stream of newline-delimited JSON chunks rather than a single object.
Generate Endpoint
For simpler single-turn prompts:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a Python function to reverse a linked list",
  "stream": false
}'
Using with Python
import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Summarize this document..."}],
    "stream": False,
})
print(response.json()["message"]["content"])
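When streaming (the default), each line of the response is a JSON chunk carrying a fragment of the reply. A minimal sketch of consuming that stream with requests; the helper names here are illustrative, not part of any Ollama library:

```python
import json
import requests

def extract_content(line: bytes):
    """Parse one NDJSON chunk from /api/chat; returns None on the final chunk."""
    chunk = json.loads(line)
    if chunk.get("done"):
        return None
    return chunk["message"]["content"]

def stream_chat(prompt, model="llama3.2", host="http://localhost:11434"):
    """Yield the reply piece by piece as the model generates it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    with requests.post(f"{host}/api/chat", json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            piece = extract_content(line)
            if piece is None:
                break
            yield piece
```

With the server running, `for piece in stream_chat("Explain Docker in 3 sentences."): print(piece, end="", flush=True)` prints the answer as it arrives instead of waiting for the full response.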
Or use the official Ollama Python library:
pip install ollama
import ollama

response = ollama.chat(model="llama3.2", messages=[
    {"role": "user", "content": "What are the benefits of TypeScript over JavaScript?"}
])
print(response["message"]["content"])
Practical Workflows
Code Review Assistant
Run a code-specialized model and pipe files directly to it:
cat app/page.tsx | ollama run codellama "Review this React component for performance issues"
Local RAG (Retrieval-Augmented Generation)
Combine Ollama with an embedding model and a vector database like ChromaDB to build a private search-and-answer system over your own documents:
- Generate embeddings with an embedding model: ollama pull nomic-embed-text
- Store the embeddings in ChromaDB or pgvector
- Query relevant chunks and pass them as context to your chat model
This is ideal for internal knowledge bases, documentation search, or customer support tools — all without sending a single byte to the cloud.
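For a small corpus, the three steps above can be sketched end to end without a vector database at all; brute-force cosine similarity over a list is enough for a few hundred chunks. This is an illustrative sketch rather than the ChromaDB integration itself, and the helper names are mine; the endpoints (/api/embeddings, /api/chat) and the nomic-embed-text model come from Ollama:

```python
import math
import requests

OLLAMA = "http://localhost:11434"

def embed(text, model="nomic-embed-text"):
    """Fetch an embedding vector from Ollama's embeddings endpoint."""
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": model, "prompt": text})
    return r.json()["embedding"]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, store, k=2):
    """store is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def answer(question, docs, model="llama3.2"):
    """Index docs, retrieve context for the question, and ask the chat model."""
    store = [(d, embed(d)) for d in docs]
    context = "\n".join(retrieve(embed(question), store))
    resp = requests.post(f"{OLLAMA}/api/chat", json={
        "model": model, "stream": False,
        "messages": [{"role": "user",
                      "content": f"Answer using this context:\n{context}\n\nQuestion: {question}"}],
    })
    return resp.json()["message"]["content"]
```

For anything beyond a handful of documents, swap the in-memory list for ChromaDB or pgvector; the retrieve-then-ask shape stays the same.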
IDE Integration
Most modern editors support Ollama as a backend for AI-assisted coding:
- Cursor — Supports local models via OpenAI-compatible endpoints
- Continue.dev — Open-source VS Code/JetBrains extension built for local models
- Cody — Sourcegraph's coding assistant supports Ollama
Managing Models
A few essential commands for day-to-day use:
ollama list # Show downloaded models
ollama show llama3.2 # View model details and parameters
ollama rm mistral # Delete a model to free disk space
ollama cp llama3.2 my-model # Copy a model for customization
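The same inventory that ollama list prints is also available programmatically through the API's /api/tags endpoint, which is handy for scripts that clean up disk space or check that a model is installed before using it. A small sketch (the helper names are mine, not Ollama's):

```python
import requests

def installed_models(host="http://localhost:11434"):
    """Return (name, size_gb) pairs for every locally downloaded model."""
    data = requests.get(f"{host}/api/tags").json()
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in data["models"]]

def format_listing(pairs):
    """Render the pairs as a simple aligned table."""
    return "\n".join(f"{name:<28}{size:>6} GB" for name, size in pairs)
```

With the server running, `print(format_listing(installed_models()))` produces an ollama list-style table you can feed into other tooling.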
Custom Modelfiles
Create a Modelfile to customize model behavior:
FROM llama3.2
SYSTEM "You are a senior full-stack developer. Respond with concise, production-ready code. Always use TypeScript."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
Then build and run it:
ollama create my-dev-assistant -f Modelfile
ollama run my-dev-assistant
This lets you create purpose-built assistants for specific tasks (code review bots, writing editors, data analysis helpers), each with its own system prompt and parameters.
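If you only need different parameters per request rather than a whole new model, the same settings a Modelfile bakes in can be passed in the API's options field. A sketch, assuming the /api/chat endpoint shown earlier; the helper names are illustrative:

```python
import requests

def build_payload(prompt, model="llama3.2", temperature=0.3, num_ctx=4096):
    """Mirror the Modelfile PARAMETER values as per-request options."""
    return {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }

def chat(prompt, host="http://localhost:11434", **kwargs):
    """Send the prompt and return the model's reply text."""
    resp = requests.post(f"{host}/api/chat", json=build_payload(prompt, **kwargs))
    return resp.json()["message"]["content"]
```

This is useful when one downloaded model serves several tasks with different settings, without creating a named variant for each.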
Key Takeaways
- Ollama makes local AI accessible — One command to install, one command to run a model
- Your data stays private — No API calls, no cloud dependencies, no per-token billing
- Start small — A 7B model on a modern laptop is surprisingly capable for most tasks
- Use the API — The real power is integrating local models into your existing tools and workflows
- Custom Modelfiles — Build specialized assistants tailored to your team's needs
Local AI is no longer a novelty — it's a practical tool for developers and businesses who care about privacy, cost, and control. If you're looking to integrate AI into your workflow or build custom AI-powered tools, get in touch to discuss your project. You can also explore the tools I use for development and AI workflows.