Self-hosting GenAI code assistants

I’ve been playing with Claude Code for a few months now, and have been very impressed – but sometimes hit the limits of even the Pro Max plan when I’m multitasking.

Which got me to thinking… I have a DGX Spark that sits idle when it isn’t training or fine-tuning LLMs for a side hustle, and it can handle pretty large models (albeit not at the thousands of tokens/s the commercial offerings generate). Maybe it would be fun to see how far I can get with self-hosted OSS solutions, so I could experiment with things like Ralph and larger projects with BMAD without breaking the bank.

So I experimented a bit this weekend with OSS GenAI code assistants and models using this stack:

While the setup was good for small snippets, it definitely couldn’t execute a detailed plan very well. It did a pretty decent job in “plan” mode iterating over a phased dev plan for a to-do list backend in FastAPI, but when I flipped it into “build” mode it started skipping phases and creating duplicate code, and at one point it deleted a chunk of the work it had already completed and had to rewrite it (tbh, I think it was trying to back out a phase it had performed out of order).

I also learned that Ollama, by default, only allows an 8k context via the API unless you override it, and a much larger window is necessary for these kinds of applications (GLM 4.7 has a 200k context; at 8k, it couldn’t even handle analyzing a uv starter codebase). The loaded model plus the full 200k context took ~80GB of memory.
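For reference, here’s roughly how to bump the context window per request. This is just a minimal sketch against Ollama’s REST API, assuming it’s running locally on the default port; the model tag and the num_ctx value are placeholders, not the exact settings I used.

```python
# Minimal sketch: raising Ollama's context window per-request via the REST API.
# The model tag and num_ctx value are placeholders -- use whatever tag you pulled
# and whatever context length your hardware can actually hold.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

payload = {
    "model": "glm-4.7",  # placeholder model tag
    "messages": [
        {"role": "user", "content": "Summarize the structure of this repo."}
    ],
    "stream": False,
    "options": {
        "num_ctx": 200_000,  # override the default context window
    },
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

If you’d rather not set it per request, the same parameter can be baked into a Modelfile (PARAMETER num_ctx 200000), and `ollama ps` will show how much memory the loaded model is actually consuming.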

Next, I plan to tinker with Allen AI’s SERA model, announced in this post. Maybe I’ll even try out the biggest DeepSeek, Qwen, and Devstral models I can squeeze onto the Spark…