Self-hosting GenAI code assistants

I’ve been playing with Claude Code for a few months now, and have been very impressed – but sometimes hit the limits of even the Pro Max plan when I’m multitasking.

Which got me to thinking… I have a DGX Spark that sits idle when it isn’t training or fine-tuning LLMs for a side hustle, and it can handle pretty large models (albeit not at the thousands of tokens/s the commercial offerings generate). Maybe it would be fun to see how far I can get with self-hosted OSS solutions, so I could experiment with things like Ralph and larger projects with BMAD without breaking the bank.

So I experimented a bit this weekend with OSS GenAI code assistants and models using this stack:

While the setup was good for small snippets, it definitely couldn’t execute a detailed plan very well. It did a pretty decent job in “plan” mode iterating over a phased dev plan for a to-do list backend in FastAPI, but when I flipped it into “build” mode it started skipping phases and creating duplicate code, and at one point it deleted a chunk of the work it had already completed and had to rewrite it (tbh, I think it was trying to back out a phase it had performed out of order).

I also learned that Ollama, by default, only allows an 8k context via the API unless you override it, and a much larger window is necessary for these kinds of applications (GLM 4.7 has a 200k context; at 8k, it couldn’t even handle analyzing a uv starter codebase). The loaded model plus the full 200k context took ~80GB of memory.
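For reference, here’s roughly how to bump the context window per request. This is just a minimal sketch against Ollama’s REST API, assuming it’s running locally on the default port; the model tag and the num_ctx value are placeholders, not the exact settings I used.

```python
# Minimal sketch: raising Ollama's context window per-request via the REST API.
# The model tag and num_ctx value are placeholders -- use whatever tag you pulled
# and whatever context length your hardware can actually hold.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

payload = {
    "model": "glm-4.7",  # placeholder model tag
    "messages": [
        {"role": "user", "content": "Summarize the structure of this repo."}
    ],
    "stream": False,
    "options": {
        "num_ctx": 200_000,  # override the default context window
    },
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

If you’d rather not set it per request, the same parameter can be baked into a Modelfile (PARAMETER num_ctx 200000), and `ollama ps` will show how much memory the loaded model is actually consuming.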

Next, I plan to tinker with Allen AI’s SERA model, announced in this post. Maybe I’ll even try out the biggest DeepSeek, Qwen, and Devstral models I can squeeze onto the Spark…