Published on

Local AI with Ollama — when it actually makes sense

A short field note on where local models do and don’t belong in a product or client setting.

“Run it locally” sounds appealing: no API bills, no data leaving your servers, no rate limits. But running locally isn’t a free pass. You trade those worries for others: hardware, maintenance, and weaker models.

This short piece covers three scenarios where Ollama (or a comparable local runtime) genuinely adds value, and three where it mostly hurts.

When local works well

  1. Sensitive data that can’t leave the building. For example, a support bot over internal docs. The latency of an 8B model on an M-series Mac is typically fine for one user at a time.
  2. Fast iteration during development. Using a local model for prompt tweaks and chain testing saves a lot of cloud round-trips and money. Switch to a stronger model only for the regression tests you take seriously.
  3. Edge scenarios, such as an agent running offline on a fieldworker’s laptop. Here you lose little with a smaller model and gain a lot from being able to work without a connection.
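For the development-iteration case above, a minimal sketch of calling a local Ollama server from Python could look like the following. It assumes Ollama is running on its default port (11434) and that a model has already been pulled. The model tag `llama3.1:8b` and the helper names `build_generate_request` and `generate` are illustrative choices, not something the original note prescribes.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Assumption: swap in whichever model tag you have pulled locally.
MODEL = "llama3.1:8b"


def build_generate_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # With "stream": False the server returns one JSON object
        # whose "response" field holds the full completion.
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Classify this ticket as billing or technical: my invoice is wrong.")` returns the model's answer as a string. Keeping the request builder separate from the network call makes prompt tweaks easy to inspect without hitting the model at all.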

When local hurts you

  1. When you actually need top-tier reasoning. No local model in the 7B to 8B range today matches a large frontier model. For complex extraction, long context, or multi-step reasoning, cloud usually remains the right call.
  2. When you need to scale throughput. One user at a time is doable; ten parallel users quickly require GPU hardware you don’t want to operate.
  3. When the operational burden shifts to you. Updating models, watching RAM, keeping drivers happy — if your organization can’t or won’t do this, a hosted model is cheaper than it looks.

A pragmatic pattern

What All Open often does in client work: hybrid. Local for classification, routing, retrieval and short summaries. Cloud for the final, accountable step. That saves meaningful cost and latency without giving up reasoning where it really matters.
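The hybrid split described above can be sketched as a small routing function. This is an illustration under assumptions, not All Open's actual implementation: the task labels, `choose_backend`, and `run_task` are hypothetical names, and the two backends are passed in as plain callables so that any local or cloud client can be plugged in.

```python
from typing import Callable

# Hypothetical task labels, taken from the steps the text assigns to the local model.
LOCAL_TASKS = {"classification", "routing", "retrieval", "short_summary"}


def choose_backend(task_kind: str) -> str:
    """Return "local" for cheap, low-stakes steps and "cloud" for everything else."""
    return "local" if task_kind in LOCAL_TASKS else "cloud"


def run_task(
    task_kind: str,
    prompt: str,
    local: Callable[[str], str],
    cloud: Callable[[str], str],
) -> str:
    """Dispatch the prompt to the local or cloud model based on the task kind."""
    backend = local if choose_backend(task_kind) == "local" else cloud
    return backend(prompt)
```

Defaulting unknown task kinds to the cloud model is a deliberate choice in this sketch: anything not explicitly marked as low-stakes goes to the stronger, accountable step.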

No revolution, just solid engineering.