I’ve been playing with Claude Code for a few months now and have been very impressed, but I sometimes hit the limits of even the Max plan when I’m multitasking.
Which got me to thinking… I have a DGX Spark that sits idle when I’m not training or fine-tuning LLMs for a side hustle, and it can handle pretty large models (albeit not at the thousands of tokens/s that commercial offerings generate). Maybe it would be fun to see how far I could get with self-hosted OSS solutions, so I could experiment with things like Ralph and larger projects with BMAD without breaking the bank.
So I experimented a bit this weekend with OSS GenAI code assistants and models. The stack: Ollama serving GLM 4.7 on the Spark, driven by an OSS coding assistant with separate “plan” and “build” modes.
While good for small snippets, it definitely couldn’t execute a detailed plan very well. It was pretty decent in “plan” mode, iterating over a phased dev plan for a to-do-list backend in FastAPI, but when I flipped it into “build” mode, it skipped phases, created duplicate code, and at one point deleted a chunk of the work it had completed and had to rewrite it (tbh, I think it was trying to back out a phase it had performed out of order).
I also learned that Ollama, by default, only allows an 8k context via the API unless you override it, and overriding it is necessary for these kinds of agentic workloads (GLM 4.7 has a 200k context; at 8k, it couldn’t even handle analyzing a uv starter codebase). The loaded model plus the full 200k context took ~80GB of memory.
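If you hit the same wall, the override can be passed per request through the `options` field of Ollama’s REST API. Here’s a minimal sketch, assuming a local Ollama server on the default port; the model tag `glm-4.7` is a placeholder for whatever you actually pulled:

```python
# Minimal sketch: lifting Ollama's default context window per-request.
# Assumes a local Ollama server on its default port (11434); the model
# tag "glm-4.7" is hypothetical -- substitute your local model name.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "glm-4.7",
        "prompt": "Summarize the structure of this FastAPI project.",
        "stream": False,
        # num_ctx is the knob that raises the small default context;
        # without it, long prompts get silently truncated.
        "options": {"num_ctx": 200000},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```

You can also bake `PARAMETER num_ctx` into a Modelfile so every client gets the larger window, and newer Ollama builds accept an `OLLAMA_CONTEXT_LENGTH` environment variable on the server; the per-request option is just the quickest thing to test with.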
Next I plan to tinker with Allen AI’s SERA model, announced in this post. Maybe I’ll even try out the biggest DeepSeek, Qwen, and Devstral models I can squeeze onto the Spark…