Posted on

Playing with open source LLMs

Every 6 months or so, I decide to leave my cave and check out what the cool kids are doing with AI. Apparently the latest trend is to use fancy command line tools to write code using LLMs. This is a very nice change, since it suddenly makes AI compatible with my allergy to getting out of the terminal.

A low poly AI art of an hermit in a cave
Me, browsing HN from my cave (by Stable Diffusion)

The most popular of these tools seems to be Claude Code. It promises to be able to build in total autonomy, being able to use search code, write code, run tests, lint, and commit the changes. While this sounds great on paper, I’m not keen on getting locked into vendor tools from an unprofitable company. At some point, they will either need to raise their prices, enshittify their product, or most likely do both.

So I went looking for what the free and open source alternatives are.

Picking a model

There’s a large amount of open source large language models on the market, with new ones getting released all the time. However, they are not all ready to be used locally in coding tasks, so I had to try a bunch of them before settling on one.

deepseek-r1:8b

Deepseek is the most popular open source model right now. It was created by the eponymous Chinese company. It made the news by beating numerous benchmarks while being trained on a budget that is probably lower than the compensation of some OpenAI workers. The 8b variant only weights 5.2 GB and runs decently on limited hardware, like my three years old Mac.

This model is famous for forgetting about world events from 1989, but also seems to have a few issues when faced with concrete coding tasks. It is a reasoning model, meaning it “thinks” before acting, which should lead to improved accuracy. In practice, it regularly gets stuck indefinitely searching where it should start and jumping from one problem to the other in a loop. This can happen even on simple problems, and made it unusable for me.

mistral:7b

Mistral is the French alternative to American and Chinese models. I have already talked about their 7b model on this blog. It is worth noting that they have kept updating their models, and it should now be much more accurate than two years ago.

Mistral is not a reasoning model, so it will jump straight to answering. This is very good if you’re working with tasks where speed and low compute use are a priority. Sadly, the accuracy doesn’t seem good enough for coding. Even on simple tasks, it will hallucinate functions or randomly delete parts of the code I didn’t want to touch.

qwen3:8b

Another model from China, qwen3 was created by the folks at Alibaba. It also claims impressive benchmark results, and can work as both a reasoning or non-thinking model. It was made with modern AI tooling in mind, by supporting MCPs and a framework for agentic development.

This model actually seems to work as expected, providing somewhat accurate code output while not hanging in the reasoning part. Since it runs decently on my local setup, I decided to stick to that model for now.

Setting up a local API with Ollama

Ollama is now the default way to download and run local LLMs. It can be simply installed by downloading it from their website.

Once installed, it works like Docker for models, by giving us access to commands like pull, run, or rm. Ollama will expose an API on localhost which can be used by other programs. For example, you can use it from your Python programs through ollama-python.

A low poly AI art of a llama
My new pet (by Stable Diffusion)

Pair programming with aider

The next piece of software I installed is aider. I assume it’s pronounced like the French word, but I could not confirm that. Aider describes itself as a “pair programming” application. Its main job is to pass context to the model, let it write the output to files, run linters, and commit the changes.

Getting started

It can be installed using the official Python package or via Homebrew if you use Mac. Once it is installed, just navigate to your code repository and launch it:

export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama_chat/qwen3:8b

The CLI should automatically create some configuration files and add them to the repo’s .gitignore.

Usage

Aider isn’t meant to be left alone in complete autonomy. You’ll have to guide the AI through the process of making changes to your repository.

To start, use the /add command to add files you want to focus on. Those files will be passed entirely to the model’s context and the model will be able to write in them.

You can then ask questions using the /ask command. If you want to generate code, a good strategy can be to starting by requesting a plan of actions.

When you want it to actually write to the files, you can prompt it using the /code command. This is also the default mode. There’s no absolute guarantee that it will follow a plan if you agreed on one previously, but it is still a good idea to have one.

The /architect command seems to automatically ask for a plan, accept it, and write the code. The specificity of this command is that it lets you use different models to plan and write the changes.

Refactoring

I tried coding with aider in a few situations to see how it performs in practice.

First, I tried making it do a simple refactoring on Itako, which is a project of average complexity. When pointed to the exact part of code where the issues happened, and explained explicitly what to do, the model managed to change the target struct according to the instructions. It did unexpectedly change a function that was outside the scope of what I asked, but this was easy to spot.

On paper, this looks like a success. In practice, the time spent crafting a prompt, waiting for the AI to run and fixing the small issue that came up immensely exceeds the 10 minutes it would have taken me to edit the file myself. I don’t think coding that way would lead me to a massive performance improvement for now.

Greenfield project

For a second scenario, I wanted to see how it would perform on a brand-new project. I quickly set up a Python virtual environment, and asked aider to work with me at building a simple project. We would be opening a file containing Japanese text, parsing it with fugashi, and counting the words.

To my surprise, this was a disaster. All I got was a bunch on hallucination riddled python that wouldn’t run under any circumstances. It may be that the lack of context actually made it harder for the model to generate code.

Troubleshooting

Finally, I went back to Itako, and decided to check how it would perform on common troubleshooting tasks. I introduced a few bugs to my code and gathered some error messages. I then proceeded to simply give aider the files mentioned by the error message and just use /ask to have it explain the errors to me, without requiring it to implement the code.

This part did work very well. If I compare it with Googling unknown error messages, I think this can cut the time spent on the issue by half This is not just because Google is getting worse every day, but the model having access to the actual code does give it a massive advantage.

I do think this setup is something I can use instead of the occasional frustration of scrolling through StackOverflow threads when something unexpected breaks.

What about the Qwen CLI?

With everyone jumping on the trend of CLI tools for LLMs, the Qwen team released its own Qwen Code. It can be installed using npm, and connects to a local model if configured like this:

export OPENAI_API_KEY="ollama"
export OPENAI_BASE_URL="http://localhost:11434/v1/"
export OPENAI_MODEL="qwen3:8b"

Compared to aider, it aims at being fully autonomous. For example, it will search your repository using grep. However, I didn’t manage to get it to successfully write any code.

The tool seems optimized for larger, online models, with context sizes up to 1M tokens. Our local qwen3 context only has a 40k tokens context size, which can get overwhelmed very quickly when browsing entire code repositories.

Even when I didn’t run out of context, the tool mysteriously failed when trying to write files. It insists it can only write to absolute paths, which the model doesn’t seem to agree with providing. I did not investigate the issue further.