
Code, Content, and Career with Brian Hogan

March 31, 2026, 5:49 p.m.

Issue 51 - Run a Local LLM, and discover why LLMs are unpredictable

Dive into running local large language models with Ollama for data privacy and cost efficiency, and prep for their unpredictability.


LLMs like Claude and ChatGPT can be costly and require you to trust others with your data. You can cut costs and keep your data private by running LLMs on your own computer, and you can integrate them directly into your apps and workflows.

Before you build those workflows, you should understand how unpredictable LLMs can be, so you can protect yourself from side effects and strange behavior.

This issue explores both running local LLMs and understanding their unpredictability.

Run LLMs Locally from the Command Line with Ollama

Online AI tools like ChatGPT and Claude require accounts and subscriptions, and they send your data to external servers. If you want to experiment with large language models (LLMs) privately, on your own hardware, and without ongoing costs, Ollama lets you run open source models yourself, directly from the command line or through its API.

In this tutorial, you'll install Ollama, download and run models, have interactive conversations, use models to process files and text from the command line, and manage the models on your machine. To complete this tutorial, you need at least 8GB of RAM for smaller models and 16GB or more for larger models.

Installing Ollama

On macOS, install Ollama with Homebrew:

$ brew install ollama

Then start the Ollama service:

$ brew services start ollama

On Linux, run the official install script:

$ curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama and sets it up as a systemd service that starts automatically.

On Windows, download and run the installer from Ollama's download page. Ollama runs as a background service automatically after installation. Use Ollama from PowerShell, Command Prompt, or Windows Terminal.

The Ollama commands in the rest of this tutorial work the same on all platforms.

Verify Ollama is running:

$ ollama --version

You'll see output showing the version number:

ollama version is 0.18.2

With Ollama running, download and run your first model.

Running Your First Model

Explore the Ollama library of open source models to download and run. For your first test, use Meta's Llama 3.2 model. This general-purpose model runs well on most machines.

Run the following command to download and run the model:

$ ollama run llama3.2

You'll see a progress bar as it downloads:

pulling manifest
pulling dde5aa3fc5ff: 100% ▕█████████████████████▏ 2.0 GB
pulling 966de95ca8a6: 100% ▕█████████████████████▏ 1.4 KB
pulling fcc5a6bec9da: 100% ▕█████████████████████▏ 7.7 KB
pulling a70ff7e570d9: 100% ▕█████████████████████▏ 6.0 KB
pulling 56bb8bd477a5: 100% ▕█████████████████████▏   96 B
pulling 34bb5ab01051: 100% ▕█████████████████████▏  561 B
verifying sha256 digest
writing manifest
success

Once the download finishes, you'll see an interactive prompt where you can start chatting with the model:

>>> Send a message (/? for help)

Type a question or prompt:

>>> What is a shell pipeline?

The model responds directly in your terminal:

A shell pipeline is a sequence of commands connected by the pipe operator (|),
where the output of one command becomes the input of the next. For example,
`ls -la | grep ".txt" | wc -l` lists files, filters for text files, and
counts them.

Type /bye to exit the interactive session.

Interactive mode is useful, but the real power of a local LLM is integrating it into your command-line workflow. You can pipe text directly to the ollama run command along with a prompt, which makes it easy to process files. To summarize a README file:

$ cat README.md | ollama run llama3.2 "Summarize this document in three bullet points"

You can also generate content. For example, to generate a commit message from a diff:

$ git diff --staged | ollama run llama3.2 "Write a concise git commit message for these changes"

Since Ollama runs locally, none of this data leaves your machine.

Managing Models

Models often come in different size variants. For example, llama3.2 defaults to its 3-billion parameter version, but you can run the smaller 1-billion parameter version with llama3.2:1b. Larger parameter counts produce better output but need more memory. Browse all available variants on a model's page in the Ollama library. Smaller models respond faster but produce less nuanced output.

To download a model without starting a conversation, use ollama pull. The following command downloads the codellama model:

$ ollama pull codellama

pulling manifest
pulling 3a43f93b78ec: 100% ▕███████████████████████████▏ 3.8 GB
pulling 8c17c2ebb0ea: 100% ▕███████████████████████████▏ 7.0 KB
pulling 590d74a5569b: 100% ▕███████████████████████████▏ 4.8 KB
pulling 2e0493f67d0c: 100% ▕███████████████████████████▏   59 B
pulling 7f6a57943a88: 100% ▕███████████████████████████▏  120 B
pulling 316526ac7323: 100% ▕███████████████████████████▏  529 B
verifying sha256 digest
writing manifest
success

Models take up disk space, so you'll want to manage them. Use ollama list to see the models you've downloaded:

$ ollama list

Your models display in a list:

NAME               ID              SIZE      MODIFIED
codellama:latest   8fdf8f752f6e    3.8 GB    2 minutes ago
llama3.2:latest    a80c4f17acd5    2.0 GB    5 minutes ago

To review detailed information about a model, including its parameters and license, run the ollama show command and specify the model:

$ ollama show llama3.2

You'll see output similar to the following:

  Model
    architecture        llama
    parameters          3.2B
    context length      131072
    embedding length    3072
    quantization        Q4_K_M

  Parameters
    stop    "<|start_header_id|>"
    stop    "<|end_header_id|>"
    stop    "<|eot_id|>"

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT
    ...

Remove a model you no longer need using ollama rm:

$ ollama rm codellama

To determine how much space your models use, check the models directory. On macOS and Linux, use the du command:

$ du -sh ~/.ollama/models

On Windows, you'll find models in %USERPROFILE%\.ollama\models. Use File Explorer to determine the size.

You're not limited to using the CLI to interact with your models. There's an API you can use as well.

Using the API

Ollama runs a local REST API server on port 11434, which means you can interact with it using curl or any HTTP client. This is useful for scripting and integrating Ollama with other tools.

Send a prompt and get a response by using the /api/generate endpoint:

$ curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What does chmod 755 do?", 
  "stream": false
}'

The -s flag for curl silences the progress output from curl, and the stream: false parameter tells the Ollama API to return the full response at once instead of streaming tokens. The response comes back as JSON with the model's output in the response field.
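If you'd rather call the endpoint from code than from curl, here's a minimal Python sketch using only the standard library. It assumes the Ollama server is running on the default port; the `build_payload` and `ask` helper names are mine, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    """Build the JSON body for a one-off /api/generate request."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model, prompt):
    """Send a prompt to a local model and return its text response."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (with the server running):
#   print(ask("llama3.2", "What does chmod 755 do?"))
```

Because the response comes back as ordinary JSON, any language with an HTTP client can do the same thing.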

The /api/generate endpoint handles one-off prompts. Ollama also provides a /api/chat endpoint for conversations. Unlike ollama run, the API does not remember previous messages on its own. To have a back-and-forth conversation, you send the full message history with each request.

Start with a single question:

$ curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "stream": false,
  "messages": [
    {"role": "user", "content": "What is a FIFO in Unix?"}
  ]
}'

The response includes the model's answer in a message object. To ask a follow-up, you include your original question, the model's answer, and your new question. You set the role to user for your messages and assistant for the model's previous responses:

$ curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "stream": false,
  "messages": [
    {"role": "user", "content": "What is a FIFO in Unix?"},
    {"role": "assistant", "content": "A FIFO is a named pipe..."},
    {"role": "user", "content": "How do I create one?"}
  ]
}'

Because you sent the earlier exchange, the model knows what "one" refers to and can give a relevant answer. Each request builds on the messages you include. In practice, you wouldn't assemble this history by hand with curl; you'd interact with the API programmatically.
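Managing that growing message list by hand gets tedious, so in code you'd typically wrap it in a small helper. Here's a hypothetical sketch (the `Chat` class is mine, not part of Ollama) that appends each exchange to the history and sends the full list with every request, assuming the server is on the default port:

```python
import json
import urllib.request

class Chat:
    """Accumulates message history so each request carries full context."""

    def __init__(self, model, url="http://localhost:11434/api/chat"):
        self.model = model
        self.url = url
        self.messages = []

    def add(self, role, content):
        """Record a message with its role: 'user' or 'assistant'."""
        self.messages.append({"role": role, "content": content})
        return self.messages

    def send(self, user_message):
        """Append the user message, call the API, and record the reply."""
        self.add("user", user_message)
        body = json.dumps(
            {"model": self.model, "stream": False, "messages": self.messages}
        ).encode("utf-8")
        req = urllib.request.Request(
            self.url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            reply = json.loads(resp.read())["message"]["content"]
        self.add("assistant", reply)
        return reply

# Usage (with the server running):
#   chat = Chat("llama3.2")
#   chat.send("What is a FIFO in Unix?")
#   chat.send("How do I create one?")  # follow-up carries the earlier exchange
```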

Any tool or programming language that can make HTTP requests can use your local models through this API.

Ollama gives you a way to run large language models locally, from the command line, without accounts or API keys. Your prompts and data stay on your machine.

Explore the full model library at Ollama's website, or run ollama help to see all available commands.

Things To Explore

  • LazyDocker is a terminal UI for managing Docker images and containers. It's especially helpful for those tasks you don't do very often and have to look up.
  • How to turn anything into a router is a response to a recent ban in the US on foreign-made routers.

Your LLM is Unpredictable

Predictability is crucial when you're building workflows. Consider a tool like Vale, which lets you define style rules for your content and check your documentation against those rules. If you run Vale on your document today and then run it again tomorrow, you'll get the same results, as long as the content hasn't changed. You'll see the same warnings, the same suggestions, and the same line numbers. This predictability is so fundamental to how "linting" tools like Vale work that you never think about it.

But if you swap in an LLM for that same task, that predictability disappears. And if you don't understand why, you'll build workflows you can't trust.

Two concepts from distributed systems explain this gap: determinism and idempotence. They're the reason your traditional tools feel reliable, and your LLM-powered tools sometimes don't. Understanding both concepts will help you make better decisions about when and how to use LLMs in your workflows.

Determinism: same input, same output

A deterministic process produces the same result every time you give it the same input. A calculator always returns 4 when you type 2 + 2. A Markdown linter always flags the same heading-level violation; markdownlint doesn't wake up one morning and decide that your level 2 heading is fine today but not tomorrow. The rules are fixed. The output is predictable.

LLMs don't work this way. Ask an LLM to "rewrite this paragraph in active voice," and you'll get a good result. Ask the same LLM the same question with the same paragraph again, and you'll get a different suggestion. Both versions are correct, but they won't be the same.

This happens because of how LLMs generate text. At each step, the model picks from a set of probable next words. A setting called temperature controls how much randomness goes into that selection. Higher temperature means more variation. Lower temperature means more consistency. But even at the lowest settings, identical output across runs isn't guaranteed. Model updates, API changes, and other factors behind the scenes can shift results without warning.
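You can see how temperature reshapes the next-word selection with a toy softmax calculation. The word scores below are made up for illustration; the point is the math: dividing scores by a low temperature makes the most likely word dominate, while a high temperature flattens the distribution so less likely words get picked more often.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate next words

low = softmax_with_temperature(scores, 0.2)   # nearly deterministic
high = softmax_with_temperature(scores, 2.0)  # much flatter, more variation
```

At temperature 0.2, the top candidate gets almost all of the probability mass; at 2.0, the alternatives stay in play, which is where run-to-run variation comes from.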

LLMs are designed to work this way. That variation is what makes them useful for creative tasks. But it creates real problems when you need consistency.

Idempotence: safe to repeat

Determinism is about getting the same output. Idempotence is about getting the same effect. An idempotent operation can run multiple times without changing the result beyond the first execution. The system's state stays the same whether the operation runs once or ten times.

Think of a light switch with an explicit "on" button. Press it once, and the light turns on. Press it five more times, and nothing else happens. The light doesn't get brighter. It doesn't turn on a second light. No side effects pile up. The operation is safe to repeat.

In writing and coding tools, this shows up everywhere. Run a code formatter on an already-formatted file, and the file doesn't change. Run Vale on a clean document, and you get the same empty report. These tools are safe to re-run because repeating the operation doesn't alter anything beyond what the first run already did.
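The light-switch idea fits in a few lines of Python. This is a hypothetical sketch, not from any library: setting state to a fixed value is idempotent, while appending to a list is not.

```python
def set_status(page, status):
    """Idempotent: running this any number of times yields the same state."""
    page["status"] = status
    return page

def add_comment(page, comment):
    """Not idempotent: each run adds another copy of the comment."""
    page["comments"].append(comment)
    return page

page = {"status": "draft", "comments": []}
for _ in range(3):
    set_status(page, "published")    # state settles after the first call
    add_comment(page, "Looks good")  # side effects pile up on every call
```

After three runs the status is still just "published", but the page has three identical comments.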

LLMs can't guarantee idempotence because they are, by design, non-deterministic. You have no guarantee that running the same action over and over gives the same result. On top of that, tool calls might rely on APIs that change data in other systems. Rerunning the request might cause the LLM to create duplicate data or cause other side effects you don't find until it's too late.

Once you see how traditional tools rely on determinism and idempotence, the cracks in LLM-powered workflows become easier to spot.

Editing and linting require determinism

Imagine that your team implements an LLM-based review step for documentation, and each draft requires two people to review it. Both reviewers run the same draft through the LLM. The first reviewer's report flags passive voice in paragraph three and recommends shortening the conclusion. The second reviewer's report says the intro needs a stronger hook and that paragraph three is fine. Same document, same prompt, different feedback. Now the two reviewers are working from different checklists, and the writer has no way to tell which set of suggestions reflects the actual standard.

Scale this across a full doc set and things get even more inconsistent. Every reviewer gets different guidance on different runs, and over time, the docs drift apart in style without anyone making a conscious decision about the style or consistency.

With a deterministic tool like Vale, you write a rule once, and the tool enforces it the same way for everyone, every time. That shared consistency is what keeps documentation feeling like it was written by one team instead of six strangers.

Agents and automation need idempotence and determinism

Idempotence becomes critical when you move beyond single prompts into automated workflows and LLM-powered agents. Agents are systems where an LLM makes decisions and takes actions on your behalf: editing files, publishing pages, creating issues in a project tracker, and sending notifications.

In any automated system, retries happen. A network request fails, so the system tries again. A process crashes halfway through and restarts from the beginning. A queue delivers the same message twice. This is normal. Distributed systems are built with the expectation that things will fail and need to be retried.

When the operation being retried is idempotent, retries are harmless. Setting a page's status to "published" ten times still results in one published page. The system's state doesn't change after the first run. But LLM-powered actions often lack this safety. An agent that "adds an editorial comment to this doc" might add the same comment three times if the task gets retried. An agent that "creates a JIRA ticket for this issue" might create duplicate tickets. An agent that sends a Slack notification might spam a channel. Each retry produces additional side effects.

Non-determinism makes this worse. If the agent's LLM re-evaluates a situation during a retry, it might reach a different conclusion than it did the first time. Now you don't just have duplicate actions; you have contradictory actions. The agent might approve a change on the first run and request revisions on the retry.

The more autonomous your agent is, the more these issues compound. An agent that reads and summarizes is low-risk. An agent that edits, publishes, or communicates on your behalf needs careful design.

You can still use LLMs effectively in your workflows. You just need to build in the guardrails now that you know about the issues you'll face.

Build the safeguards

When it comes to editing and linting tasks, try the following approaches:

  • Treat LLM suggestions as drafts. A human still needs to review and approve changes. This catches the determinism problem before it reaches your published docs.
  • Use deterministic tools for enforcement and LLMs for exploration. Let tools like Vale and markdownlint handle the rules. Use LLMs when you want creative rewrites, alternative phrasings, or help thinking through structure.
  • Lower the temperature when consistency matters. If you're calling an LLM through an API, reducing the temperature setting gets you closer to repeatable output. It won't be perfectly deterministic, but it narrows the range of variation.
  • Pin your model version when possible. If the API lets you specify an exact model version, do it. This prevents surprise changes when the provider updates the model behind the scenes.
  • Don't re-run the LLM on already-edited content. If the first pass was good, stop there. Running the LLM again will produce new changes, not confirm the old ones. Track what's been reviewed and approved so your workflow doesn't loop back over finished work.
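With Ollama specifically, you can pass these settings through the `options` object on an API request. A minimal sketch of building such a request body; `temperature` and `seed` are documented Ollama options, and a fixed seed narrows (but doesn't fully guarantee) run-to-run variation:

```python
import json

def build_repeatable_payload(model, prompt):
    """Request body that pushes Ollama toward consistent output."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0,  # always favor the most likely token
            "seed": 42,        # fix the sampler's random seed
        },
    }

body = json.dumps(build_repeatable_payload("llama3.2", "Rewrite this in active voice."))
# POST this body to http://localhost:11434/api/generate with curl or any HTTP client.
```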

When you're developing agents, try putting guardrails in place:

  • Check before acting. Before the agent creates a ticket, have it check whether one already exists. Before it posts a comment, verify that one hasn't already been posted. This can help make the operation idempotent by preventing duplicate side effects.
  • Use unique identifiers. Assign an ID to each task so the system can recognize duplicates. If a retry comes through with the same ID, skip it.
  • Separate the "decide" step from the "do" step. Let the LLM analyze and recommend. Then use deterministic code to execute the action. This way, retries on the execution side don't re-trigger the LLM's decision-making, which could produce a different recommendation each time. You get idempotent execution even when the decision-maker isn't deterministic.
  • Log what happened. Keep a record of actions taken so you can detect and clean up duplicate side effects when they inevitably slip through.
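The first two guardrails combine naturally into a small dedupe layer. A hypothetical sketch: the ticket tracker is stubbed out with a dict, and the idempotency key is whatever uniquely identifies the task.

```python
created = {}  # stands in for a real ticket tracker, keyed by task ID

def create_ticket(task_id, title):
    """Check before acting: a retry with the same task_id is a no-op."""
    if task_id in created:
        return created[task_id]  # duplicate request: return the existing ticket
    ticket = {"id": task_id, "title": title}
    created[task_id] = ticket
    return ticket

first = create_ticket("doc-123", "Fix passive voice in intro")
retry = create_ticket("doc-123", "Fix passive voice in intro")  # safe to repeat
```

However many times the retry fires, only one ticket exists, which is exactly the property the non-deterministic LLM can't give you on its own.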

The biggest safeguard you can put in place is to avoid using an LLM for tasks that require determinism and idempotence.

Instead of turning an LLM loose on your review process, use an LLM coding assistant to design and build a deterministic system that you then hook into your pipeline. You could even use something like Temporal to create deterministic workflows. They even have a skill that helps you write Temporal applications.

Determinism and idempotence aren't abstract theories. They explain a very practical gap between the tools you've trusted for years and the LLM-powered tools you're adopting now. LLMs have neither of these properties by default, which means you need to build those guarantees into the systems around them.

Use deterministic tools for enforcement. Use LLMs for creativity, exploration, and first drafts. Or use them to build your own deterministic tools and workflows and avoid the issues entirely. When you do automate with agents, design your operations so retries are safe. And keep a human in the loop for anything that matters.

Parting Thoughts

Here are some things to try or think about over the next month:

  1. Research various local models and understand how their sizes and quantization affect results, speed, and performance.
  2. Explore Open WebUI as an interface for Ollama.
  3. Identify non-deterministic issues in a workflow you've developed. How would you fix them?

That's all for this month. Thanks for reading.

I'd love to talk with you about this issue on BlueSky, Mastodon, Twitter, or LinkedIn. Let's connect!

Please support this newsletter and my work by encouraging others to subscribe and by buying a friend a copy of Write Better with Vale, tmux 3, Exercises for Programmers, Small, Sharp Software Tools, or any of my other books.

You just read issue #51 of Code, Content, and Career with Brian Hogan. You can also browse the full archives of this newsletter.
