Summary
Google's Tunix hackathon produced reproducible recipes for adding chain-of-thought reasoning to small models (1B–2B parameters) by combining supervised fine-tuning (SFT), preference optimization, and GRPO, all runnable in 9 hours on constrained hardware.
Why it matters
Developers can now train reasoning capabilities into small models without massive compute budgets. This shifts reasoning training from black-box frontier models to DIY post-training workflows with published techniques, enabling domain-specific reasoning (medical, legal, chemistry, robotics) on accessible infrastructure.
Implementation verdict
Replaces waiting on frontier-model reasoning with self-service post-training on Gemma 2B or Gemma 3 1B. Requires the open-source Tunix library, free Kaggle TPU access, curated reasoning datasets (~33k–70k samples), and custom reward functions (LLM-as-judge or TF-IDF). The winning techniques are competition-tested and ready to try now, with published code and Colab tutorials.
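To make the "custom reward functions" requirement concrete, here is a minimal sketch of a TF-IDF reward and of the group-relative advantage step that GRPO applies to a batch of sampled completions. It uses only the Python standard library and is not Tunix's API. The `<reasoning>`/`<answer>` tag format, the function names, and the toy data are all assumptions made for illustration; a real entry would plug a function like this into whatever reward hook the training library exposes.

```python
import math
import re
from collections import Counter

# Assumed (hypothetical) output format: <reasoning>...</reasoning><answer>...</answer>
ANSWER_RE = re.compile(
    r"<reasoning>.+?</reasoning>\s*<answer>(.+?)</answer>", re.DOTALL
)


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus):
    """Smoothed inverse document frequency over a list of reference texts."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; tokens missing from the IDF table get weight 0."""
    return {tok: cnt * idf.get(tok, 0.0) for tok, cnt in Counter(tokens).items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_reward(completion, reference, idf):
    """Return 0.0 for malformed output, else TF-IDF cosine similarity
    between the extracted answer and the reference answer."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    answer_vec = tfidf_vector(tokenize(match.group(1)), idf)
    reference_vec = tfidf_vector(tokenize(reference), idf)
    return cosine(answer_vec, reference_vec)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward by the mean and
    standard deviation of its own sampling group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy data, invented for this sketch.
references = [
    "the boiling point of water is 100 degrees celsius",
    "paris is the capital of france",
]
idf = build_idf(references)

# Three sampled completions for the same prompt (one GRPO group).
completions = [
    "<reasoning>Water boils at 100 C at sea level.</reasoning>"
    "<answer>The boiling point of water is 100 degrees Celsius</answer>",
    "<reasoning>Guessing.</reasoning><answer>Paris</answer>",
    "no tags here",
]
rewards = [tfidf_reward(c, references[0], idf) for c in completions]
advantages = group_relative_advantages(rewards)
```

In this toy group, the well-formed matching answer scores highest and the off-topic and malformed completions score zero, so only the first completion receives a positive advantage. An LLM-as-judge reward would replace `tfidf_reward` with a call to a grading model while leaving the advantage step unchanged.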
Sources
Dev Signal