How to Run Local LLM Models on Your Machine — Ollama Guide
A practical guide to running powerful AI models locally with Ollama, covering installation, model selection, API usage, and real-world workflows for developers and businesses.
By Keegan Kelly
Why Run LLMs Locally?
Cloud-based AI services like ChatGPT and Claude are powerful, but they come with trade-offs: recurring API costs, data leaving your network, rate limits, and dependency on third-party uptime. For many use cases — code generation, document analysis, internal tooling — running a model on your own hardware makes more sense.
Local LLMs give you full control over your data, zero per-token costs after setup, and the ability to run inference offline. With tools like Ollama, getting started takes less than five minutes.
What Is Ollama?
Ollama is an open-source tool that makes it dead simple to download, run, and manage large language models on your local machine. It handles model downloading, quantization formats, GPU acceleration, and exposes a local API — all from a single command-line interface.
Think of it as Docker for LLMs. You pull a model, run it, and interact with it through your terminal or any HTTP client.
Installation
macOS
Download the installer from ollama.com/download or install via Homebrew:
brew install ollama
Linux
Run the one-line install script:
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com/download. Ollama runs natively on Windows with GPU support for NVIDIA cards.
After installation, verify it's working:
ollama --version
Pulling and Running Your First Model
Ollama's model library includes most of the major open-weight models. To download and start chatting with one:
ollama run llama3.2
This pulls the Llama 3.2 model (if not already downloaded) and drops you into an interactive chat session. Type your prompt, hit enter, and the model responds directly in your terminal.
Recommended Models by Use Case
| Model | Size | Best For |
|---|---|---|
| llama3.2 | 3B | General chat, lightweight tasks |
| llama3.3 | 70B | High-quality reasoning, complex tasks |
| codellama | 7B–34B | Code generation and review |
| mistral | 7B | Fast general-purpose inference |
| deepseek-coder-v2 | 16B–236B | Advanced code generation |
| gemma2 | 9B–27B | Google's efficient general model |
| phi3 | 3.8B–14B | Microsoft's compact powerhouse |
| qwen2.5-coder | 7B–32B | Strong multilingual code model |
Pull any model with:
ollama pull mistral
ollama pull codellama:13b
The tag after the colon specifies the parameter size. Larger models produce better output but require more RAM and VRAM.
Hardware Requirements
Local LLMs are memory-hungry. Here's a rough guide:
- 7B models — 8GB RAM minimum, runs on most modern laptops
- 13B models — 16GB RAM recommended
- 34B models — 32GB+ RAM or a GPU with 24GB VRAM
- 70B models — 64GB+ RAM or multiple GPUs
If you have an Apple Silicon Mac (M1/M2/M3/M4), you're in a great position — Ollama leverages the unified memory architecture and Metal GPU acceleration out of the box. A MacBook Pro with 32GB of unified memory comfortably runs 13B–34B models.
For NVIDIA GPUs, Ollama uses CUDA automatically. AMD GPU support is available on Linux via ROCm.
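Those numbers follow from simple arithmetic: Ollama's default downloads are roughly 4-bit quantized, so the weights take about half a byte per parameter, plus extra room for the context (KV cache) and runtime. A back-of-the-envelope estimator; the 20% overhead figure is a rough assumption, not a measured constant:

```python
def approx_model_memory_gb(params_billion: float, bits_per_weight: int = 4,
                           overhead: float = 1.2) -> float:
    """Estimate RAM/VRAM: weight bytes at the given quantization width,
    padded ~20% for KV cache and runtime overhead (rough assumption)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"7B at 4-bit:  ~{approx_model_memory_gb(7):.1f} GB")
print(f"70B at 4-bit: ~{approx_model_memory_gb(70):.1f} GB")
```

That lines up with the table above: a 4-bit 7B model fits in 8GB with room left for the OS, while a 70B model pushes you into the 64GB tier.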
Using the Ollama API
Ollama exposes a local REST API on http://localhost:11434. Its native endpoints live under /api, and it also serves an OpenAI-compatible API under /v1, so existing OpenAI client libraries can point at it unchanged. This is where things get powerful for developers.
Chat Completions
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Explain Docker in 3 sentences." }
  ]
}'
Without "stream": false in the body, the response arrives as a stream of newline-delimited JSON chunks rather than a single object.
Generate Endpoint
For simpler single-turn prompts:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a Python function to reverse a linked list",
  "stream": false
}'
Using with Python
import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Summarize this document..."}],
    "stream": False,
})
print(response.json()["message"]["content"])
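When streaming (the default), each line of the response is a JSON chunk carrying a fragment of the reply. A minimal sketch of consuming that stream with requests; the helper names here are illustrative, not part of any Ollama library:

```python
import json
import requests

def extract_content(line: bytes):
    """Parse one NDJSON chunk from /api/chat; returns None on the final chunk."""
    chunk = json.loads(line)
    if chunk.get("done"):
        return None
    return chunk["message"]["content"]

def stream_chat(prompt, model="llama3.2", host="http://localhost:11434"):
    """Yield the reply piece by piece as the model generates it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    with requests.post(f"{host}/api/chat", json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            piece = extract_content(line)
            if piece is None:
                break
            yield piece
```

With the server running, `for piece in stream_chat("Explain Docker in 3 sentences."): print(piece, end="", flush=True)` prints the answer as it arrives instead of waiting for the full response.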
Or use the official Ollama Python library:
pip install ollama
import ollama

response = ollama.chat(model="llama3.2", messages=[
    {"role": "user", "content": "What are the benefits of TypeScript over JavaScript?"}
])
print(response["message"]["content"])
Practical Workflows
Code Review Assistant
Run a code-specialized model and pipe files directly to it:
cat app/page.tsx | ollama run codellama "Review this React component for performance issues"
Local RAG (Retrieval-Augmented Generation)
Combine Ollama with an embedding model and a vector database like ChromaDB to build a private search-and-answer system over your own documents:
- Generate embeddings with an embedding model: ollama pull nomic-embed-text
- Store the embeddings in ChromaDB or pgvector
- Query relevant chunks and pass them as context to your chat model
This is ideal for internal knowledge bases, documentation search, or customer support tools — all without sending a single byte to the cloud.
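For a small corpus, the three steps above can be sketched end to end without a vector database at all; brute-force cosine similarity over a list is enough for a few hundred chunks. This is an illustrative sketch rather than the ChromaDB integration itself, and the helper names are mine; the endpoints (/api/embeddings, /api/chat) and the nomic-embed-text model come from Ollama:

```python
import math
import requests

OLLAMA = "http://localhost:11434"

def embed(text, model="nomic-embed-text"):
    """Fetch an embedding vector from Ollama's embeddings endpoint."""
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": model, "prompt": text})
    return r.json()["embedding"]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, store, k=2):
    """store is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def answer(question, docs, model="llama3.2"):
    """Index docs, retrieve context for the question, and ask the chat model."""
    store = [(d, embed(d)) for d in docs]
    context = "\n".join(retrieve(embed(question), store))
    resp = requests.post(f"{OLLAMA}/api/chat", json={
        "model": model, "stream": False,
        "messages": [{"role": "user",
                      "content": f"Answer using this context:\n{context}\n\nQuestion: {question}"}],
    })
    return resp.json()["message"]["content"]
```

For anything beyond a handful of documents, swap the in-memory list for ChromaDB or pgvector; the retrieve-then-ask shape stays the same.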
IDE Integration
Most modern editors support Ollama as a backend for AI-assisted coding:
- Cursor — Supports local models via OpenAI-compatible endpoints
- Continue.dev — Open-source VS Code/JetBrains extension built for local models
- Cody — Sourcegraph's coding assistant supports Ollama
Managing Models
A few essential commands for day-to-day use:
ollama list # Show downloaded models
ollama show llama3.2 # View model details and parameters
ollama rm mistral # Delete a model to free disk space
ollama cp llama3.2 my-model # Copy a model for customization
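The same inventory that ollama list prints is also available programmatically through the API's /api/tags endpoint, which is handy for scripts that clean up disk space or check that a model is installed before using it. A small sketch (the helper names are mine, not Ollama's):

```python
import requests

def installed_models(host="http://localhost:11434"):
    """Return (name, size_gb) pairs for every locally downloaded model."""
    data = requests.get(f"{host}/api/tags").json()
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in data["models"]]

def format_listing(pairs):
    """Render the pairs as a simple aligned table."""
    return "\n".join(f"{name:<28}{size:>6} GB" for name, size in pairs)
```

With the server running, `print(format_listing(installed_models()))` produces an ollama list-style table you can feed into other tooling.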
Custom Modelfiles
Create a Modelfile to customize model behavior:
FROM llama3.2
SYSTEM "You are a senior full-stack developer. Respond with concise, production-ready code. Always use TypeScript."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
Then build and run it:
ollama create my-dev-assistant -f Modelfile
ollama run my-dev-assistant
This lets you create purpose-built assistants for specific tasks (code review bots, writing editors, data analysis helpers), each with its own system prompt and parameters.
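If you only need different parameters per request rather than a whole new model, the same settings a Modelfile bakes in can be passed in the API's options field. A sketch, assuming the /api/chat endpoint shown earlier; the helper names are illustrative:

```python
import requests

def build_payload(prompt, model="llama3.2", temperature=0.3, num_ctx=4096):
    """Mirror the Modelfile PARAMETER values as per-request options."""
    return {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }

def chat(prompt, host="http://localhost:11434", **kwargs):
    """Send the prompt and return the model's reply text."""
    resp = requests.post(f"{host}/api/chat", json=build_payload(prompt, **kwargs))
    return resp.json()["message"]["content"]
```

This is useful when one downloaded model serves several tasks with different settings, without creating a named variant for each.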
Key Takeaways
- Ollama makes local AI accessible — One command to install, one command to run a model
- Your data stays private — No API calls, no cloud dependencies, no per-token billing
- Start small — A 7B model on a modern laptop is surprisingly capable for most tasks
- Use the API — The real power is integrating local models into your existing tools and workflows
- Custom Modelfiles — Build specialized assistants tailored to your team's needs
Local AI is no longer a novelty — it's a practical tool for developers and businesses who care about privacy, cost, and control. If you're looking to integrate AI into your workflow or build custom AI-powered tools, get in touch to discuss your project. You can also explore the tools I use for development and AI workflows.