eval-suite lm-studio gpt-oss local-inference benchmark

Run gpt-oss evals locally with LM Studio uv

Execute OpenAI's AIME 2025 eval suite against gpt-oss-20b running locally via LM Studio using uv for dependency management, yielding detailed HTML/JSON results with 45.4% accuracy on 240 prompts.

Summary

Developers can now benchmark reasoning models offline without API calls, capturing full prompt/response traces for debugging. Local eval iteration replaces cloud-dependent testing workflows.

Why it matters

Developers can now benchmark reasoning models offline without API calls, capturing full prompt/response traces for debugging. Local eval iteration replaces cloud-dependent testing workflows.

Implementation verdict

Replaces manual OpenAI API eval runs with self-hosted benchmarking. Requires LM Studio, Python 3.13, uv, and 4+ hour runtime for full 240-prompt suite. Worth trying now if you need local model introspection; increase context length from default 4096 to avoid mid-run failures.

Sources

1.uv run for the benchmark. This means I get all of the dependencies installed automatically without having to worry about setting up a virtual environment myself
2.the eval suite needs an OpenAI-compatible API to talk to. LM Studio runs one on port 1234
3.the above command runs 240 prompts and can take several hours
4.score is the most important number - the eval suite assigns a 1 for each correct answer and a 0 for incorrect answers and then displays the average
5.Reached context length of 4096 tokens with model (arch: gpt-oss) that does not currently support mid-generation context overflow

Dev Signal

Get briefs like this in your inbox — free, every weekday.

100+ sources compressed into one 4-minute read. Ranked, cited, implementation-ready.

Read the full issue →All briefs