The problem we kept running into: every inference provider is either fast-but-expensive (Together, Fireworks — you pay for always-on GPUs) or cheap-but-DIY (Modal, RunPod — you configure vLLM yourself and deal with slow cold starts). Neither felt right for teams that just want to ship.
Suryaa spent years building GPU orchestration infrastructure at TensorDock and production systems at Palantir. I led ML infrastructure and Linux kernel development for Space Force and NASA contracts where the stack had to actually work under pressure. When we started building AI products ourselves, we kept hitting the same wall: GPU infrastructure was either too expensive or too much work.
So we built IonAttention — a C++ inference runtime designed specifically around the GH200's memory architecture. Most inference stacks treat GH200 as a compatibility target (make sure vLLM runs, use CPU memory as overflow). We took a different approach and built around what makes the hardware actually interesting: a 900 GB/s coherent CPU-GPU link, 452GB of LPDDR5X sitting right next to the accelerator, and 72 ARM cores you can actually use.
Three things came out of that work that we think are novel: (1) using hardware cache coherence to make CUDA graphs behave as if they have dynamic parameters at zero per-step cost — something that only works on GH200-class hardware; (2) eager KV block writeback driven by immutability rather than memory pressure, which drops eviction stalls from 10ms+ to under 0.25ms; (3) phantom-tile attention scheduling at small batch sizes that cuts attention time by over 60% in the worst-affected regimes. We wrote up the details at cumulus.blog/ionattention.
On multimodal pipelines we get better performance than big players (588 tok/s vs. Together AI's 298 on the same VLM workload). We're honest that p50 latency is currently worse (~1.46s vs. 0.74s) — that's the tradeoff we're actively working on.
Pricing is per million tokens, with no idle costs: GPT-OSS-120B is $0.02 in / $0.095 out, Qwen3.5-122B is $0.20 in / $1.60 out. Full model list and pricing at https://ionrouter.io.
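For anyone sanity-checking a bill: the listed rates are per million tokens (per their site), so a quick cost sketch looks like this. The request sizes are made up for illustration; the rates are the GPT-OSS-120B numbers above.

```python
# Cost estimate for per-1M-token pricing, using the GPT-OSS-120B rates
# from the post. Request sizes below are illustrative.
GPT_OSS_120B = {"in": 0.02, "out": 0.095}  # USD per 1M tokens

def request_cost(prices, input_tokens, output_tokens):
    """USD cost of one request at the given per-1M-token rates."""
    return (input_tokens * prices["in"] + output_tokens * prices["out"]) / 1_000_000

# e.g. a 4k-token prompt with a 1k-token completion:
cost = request_cost(GPT_OSS_120B, 4_000, 1_000)
print(f"${cost:.6f}")  # $0.000175
```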
You can try the playground at https://ionrouter.io/playground right now, no signup required, or drop your API key in and swap the base URL — it's one line. We built this so teams can see the power of our engine firsthand and, when they're ready, bring their finetuned models to the same stack.
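For the "swap the base URL" part, here's a minimal stdlib sketch of what an OpenAI-compatible request looks like pointed at a different provider. The `/v1/chat/completions` path and the model ID are assumptions following the OpenAI API convention; check ionrouter.io for the real values.

```python
import json
import urllib.request

# Swapping providers on an OpenAI-compatible API is just a base-URL change.
# The /v1/chat/completions path and model ID below are assumptions; check
# the provider's docs for the exact values.
BASE_URL = "https://ionrouter.io/v1"

def build_chat_request(prompt, model="gpt-oss-120b", api_key="YOUR_KEY"):
    """Build (but don't send) an OpenAI-style chat-completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("Hello")
print(req.full_url)  # https://ionrouter.io/v1/chat/completions
# urllib.request.urlopen(req) would actually send it (needs a funded key)
```

With the official OpenAI Python SDK the same swap really is one line: `OpenAI(base_url="https://ionrouter.io/v1", api_key=...)` and the rest of your code is unchanged.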
We're curious what you think, especially if you're running finetuned or custom models — that's the use case we've invested the most in. What's broken, what would make this actually useful for you?
1. The models/pricing page should probably be linked from the top, as that's the most interesting part for most users. You mention some impressive numbers (e.g. GLM5 ~220 tok/s, $1.20 in / $3.50 out), but they're way down the page and many will miss them.
2. When looking for inference, I always look at three things: which models are supported, at which quantization, and what the cached-input pricing is (that's way more important than headline pricing for agentic loops). The site covers the first but not the second or third. Would definitely like to know!
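The cached-input point is worth making concrete: an agent loop re-sends its whole growing prefix every turn, so input tokens grow quadratically with turns and the cached rate ends up dominating the bill. A rough sketch, with entirely hypothetical prices:

```python
# Why cached-input pricing dominates agentic loops: each turn re-sends the
# full conversation history, so total input tokens grow quadratically.
# Both prices below are hypothetical, just to show the shape of the math.
PRICE_IN = 1.00      # USD per 1M uncached input tokens (hypothetical)
PRICE_CACHED = 0.10  # USD per 1M cached input tokens (hypothetical)

def loop_input_cost(turns, tokens_per_turn, cached_rate=None):
    """Total input cost of an agent loop that re-sends its full history."""
    total = 0.0
    history = 0
    for _ in range(turns):
        new = tokens_per_turn
        # With prefix caching, the already-seen history bills at the cached rate.
        cached = history if cached_rate is not None else 0
        uncached = history + new - cached
        total += (uncached * PRICE_IN + cached * (cached_rate or 0)) / 1e6
        history += new
    return total

no_cache = loop_input_cost(50, 2_000)
with_cache = loop_input_cost(50, 2_000, cached_rate=PRICE_CACHED)
print(f"no cache: ${no_cache:.2f}, with cache: ${with_cache:.2f}")
```

At 50 turns of 2k tokens each, the uncached loop pays for 2.55M input tokens while the cached one pays the full rate on only 100k of them, which is why the cached-input price, not the headline price, decides what an agent costs to run.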
Just curious how close we are to a world where I can fine-tune for my (low-volume) domain and then get it hosted. Right now that isn't practical anywhere I've seen at the volumes I'd be doing (which are really hobby level).
If so, it would be great to provide more models through OpenRouter. This looks interesting but not enough to make me go through the trouble of setting up a separate account, funding it, etc.
For smaller startups, it's easier to go through one provider (OpenRouter) than to deal with the hassle of managing different endpoints and accounts. You might get access to many more users that way.
Mid-to-large companies might want to go directly to the source (you) if they want to really optimize the last mile, but even that is debatable for many.
The purpose of IonRouter is to let people publicly see the speed of our engine firsthand. It makes the sales pipeline a lot easier when a prospect can just go try it themselves before committing. Signup is low friction ($10 minimum to load, and we preload $0.10) so you can test right away.
That said, we do plan to offer this as a usage-based service within our own cloud. We own every layer of the stack: inference engine, GPU orchestration, scheduling, routing, billing, all of it. No third-party inference runtime, no off-the-shelf serving framework. So there's no reason for us to go through a middleman.
No plans to be an OpenRouter provider right now.
One thing I don’t get is why anyone would use a direct service that does the same thing as others when there are services like OpenRouter where you can use the same model from different providers. I’d understand if your landing page mentioned only fine-tuning and custom models, but just listing the same open-source models, tok/s, and pricing doesn’t tell me how you’re different from other providers.
I remember using banana.dev a few years ago, and it was a very clear proposition at the time (serverless GPU with fast cold starts).
I suppose positioning will take multiple iterations before you land on the right one. Good luck!
I do think we'll lean harder into hosting fine-tuned models, though; this is a good insight.
I'm wondering: how can one host a fine-tuned model on your platform?
I wasn't able to find any information on how to do that.
> When you use the Service, we collect and store:
> Input prompts and parameters submitted to the API
For how long, and for what purpose, apart from the transient compliance/safety checks?
What I want from an LLM is smart, super cheap, fast, and private. I wonder if we will ever get there. Like having a cheap Cerebras machine at home with oss 400B models on it.
Man you had me panicking there for a second. Per token?!? Turns out, it’s per million according to their site.
Cool concept. I used to run a Fortune 500's cloud, and hot-and-ready GPU instances were the biggest ask. We weren't ready for that, cost-wise, so we'd only spin them up when absolutely necessary.
Compared to providers like Fireworks, even with the OpenRouter 5% charge added, it's not competitive.
(Also, technically Qwen3 8B with Novita is in first place, but barely.)
A privacy policy that's at least as good as Vertex AI at Google.
Otherwise it's a non-starter at any price.
Keeping chat content around for 30 days might as well mean "forever." Anyone at the company can steal your customers' chats.
My agreements with customers would prevent me from using any service that did that.
Is this a result of renting more expensive GPUs?
Also, a piece of feedback: it kind of sucks to have GLM/MiniMax/Kimi on separate API endpoints. I assume it's a game you play to get lower routing latency on popular models, but from a consumer perspective it's not great.