An inference performance lab

Frontier open models, served at speed.

Numinous Inference runs the best open-weight models on dedicated Blackwell capacity, engineered for very high throughput, steady uptime, and a fair price per token. Frontier models, your own checkpoints, and the performance work to make them fast.

The lineup
GLM 5.2

Frontier coding and long-horizon agentic workflows. Project-scale reasoning that holds context across a full build.

744B MoE · 40B active · 1M context · FP8
Kimi K2.6

Long-horizon coding, visual interface generation, and agent swarms that run a task end to end.

~1T MoE · 32B active · 256K context · FP8
Throughput
Engineered for the highest tokens per second, with limited quality degradation.

Blackwell silicon, quality-preserving quantization, speculative decoding, and a tuned serving stack. Two lanes per model: a low-latency fast lane for the highest tokens per second, and a high-concurrency standard lane for throughput at scale.

GLM 5.2
~400 tps · Fast lane · low latency
~200 tps · Standard · high concurrency

Kimi K2.6
~450 tps · Fast lane · low latency
~220 tps · Standard · high concurrency
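To put the lane figures in context, here is a rough sketch of how long a response takes to stream in each lane. It assumes the quoted tps values are per-request decode rates and ignores time to first token and queueing. The `LANE_TPS` table and the `decode_seconds` helper are illustrative names, not part of any Numinous API.

```python
# Approximate decode rates quoted above, in tokens per second.
# Assumption: these are per-request rates; the page does not say so explicitly.
LANE_TPS = {
    ("GLM 5.2", "fast"): 400,
    ("GLM 5.2", "standard"): 200,
    ("Kimi K2.6", "fast"): 450,
    ("Kimi K2.6", "standard"): 220,
}


def decode_seconds(model: str, lane: str, output_tokens: int) -> float:
    """Estimate streaming time, ignoring time to first token and queueing."""
    return output_tokens / LANE_TPS[(model, lane)]


if __name__ == "__main__":
    for (model, lane) in LANE_TPS:
        secs = decode_seconds(model, lane, 2000)
        print(f"{model} / {lane}: about {secs:.1f} s for 2,000 output tokens")
```

On these assumptions, a 2,000 token answer streams in about half the time on the fast lane, which is the trade the standard lane makes in exchange for higher concurrency.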
Why Numinous
01

Speed

Fast and standard lanes on Blackwell. Speculative decoding and continuous batching push the highest tokens per second in production.

02

Reliability

Dedicated capacity, graceful failover, and a base that never cold starts. Uptime you can route to.

03

Price

Prefix caching makes repeated context nearly free to serve. The savings land on your bill, not in our margin.
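For a sense of what prefix caching reuses, the toy sketch below splits an incoming prompt into the part that matches an earlier request and the part that still needs prefill. It is a simplified illustration, not the Numinous implementation: production serving stacks typically match in fixed-size token blocks rather than token by token, and `shared_prefix_len` and `prefill_split` are hypothetical names.

```python
def shared_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Count the leading token IDs that two prompts have in common."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def prefill_split(cached: list[int], incoming: list[int]) -> tuple[int, int]:
    """Return (tokens reusable from cache, tokens that still need prefill)."""
    hit = shared_prefix_len(cached, incoming)
    return hit, len(incoming) - hit


if __name__ == "__main__":
    system_prompt = list(range(1000))          # stand-in for a long shared context
    first = system_prompt + [5000, 5001]       # earlier request
    second = system_prompt + [7000, 7001, 7002]  # new request, same context
    reused, fresh = prefill_split(first, second)
    print(f"reused {reused} tokens, prefilled {fresh}")
```

The longer the shared system prompt or document context, the larger the share of each request that can be served from cache.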

Beyond the lineup
Custom models

Your weights, our stack

Bring your own fine-tunes and private checkpoints, or any open-weight architecture. We stand them up on the same high-throughput stack.

Model building

Checkpoint to production

We take a model from raw weights to a fast, reliable endpoint: quantization, serving optimization, and throughput tuning on Blackwell.

Evaluation

Measured, not assumed

Rigorous benchmarking and quality parity testing, with latency distributions and methodology you can trust.
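As an illustration of what a latency distribution looks like in practice, the sketch below reduces a list of per-request latencies to p50, p95, and p99 using Python's standard library. The `summarize` helper and the sample data are assumptions for the example. This is not the Numinous benchmarking methodology.

```python
import statistics


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Reduce per-request latencies to p50, p95, and p99 (needs 2+ samples)."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    # Made-up sample: mostly fast requests with a slow tail.
    sample = [120.0] * 90 + [300.0] * 8 + [900.0] * 2
    for name, value in summarize(sample).items():
        print(f"{name}: {value:.0f} ms")
```

Reporting tail percentiles alongside the median matters because an average alone can hide the slow requests that users actually notice.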

8x B200 nodes · quality-preserving quants · 1M context · fast + standard lanes · prefix cache on

Ready when you are.

Point your traffic at the best open models, served fast.