I’m excited to share what we’ve been working on at nCompass Technologies: an AI inference* platform that gives you a scalable, reliable API for any open-source AI model, with no rate limits. We can drop rate limits because the optimizations we’ve made to our model-serving software let us support a high number of concurrent requests without degrading your quality of service.
If you’re thinking, "well, aren’t there a bunch of these already?", so were we when we started nCompass. When using other APIs, we found they weren’t reliable enough to run open-source models in production environments. To fix this, we're building an AI inference engine that enables you, as an end user, to reliably use open source models in production.
Underlying this API, we’re building optimizations at the hosting, scheduling, and kernel levels, with a single goal: minimize the number of GPUs required while maximizing the number of concurrent requests you can serve, without degrading quality of service.
We’re still building out many of our optimizations, but we’ve released what we have so far via our API. At equivalent concurrent request rates, we currently keep time-to-first-token (TTFT) 2-4x lower than vLLM. You can check out a demo of our API here:
https://www.loom.com/share/c92f825ac0af4ab18296a16546a75be3
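If you want to sanity-check numbers like these yourself, here is a minimal sketch of how TTFT and mean inter-token latency (ITL) can be measured over any token stream. The `fake_stream` generator below is a stand-in for a real streaming API response, with made-up delays; swap in tokens from an actual streaming client to measure a live endpoint.

```python
import time

def measure_ttft_and_itl(token_stream):
    """Return (TTFT, mean ITL) for an iterable that yields tokens as they arrive."""
    start = time.perf_counter()
    arrivals = [time.perf_counter() for _ in token_stream]
    ttft = arrivals[0] - start                    # time-to-first-token
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0  # mean inter-token latency
    return ttft, itl

def fake_stream(first_delay=0.2, gap=0.05, n_tokens=5):
    """Stand-in for a streaming completion: a prefill-like delay, then steady decode."""
    time.sleep(first_delay)
    yield "tok0"
    for i in range(1, n_tokens):
        time.sleep(gap)
        yield f"tok{i}"

ttft, itl = measure_ttft_and_itl(fake_stream())
print(f"TTFT: {ttft:.3f}s, mean ITL: {itl:.3f}s")
```

The same harness works for comparing two providers: point it at each provider's stream under the same concurrent load and compare the TTFT distributions.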
As a result of the optimizations we’ve rolled out so far, we’re releasing a few unique features on our API:
1. Rate-Limits: we don’t have any
Most other APIs out there have strict rate limits and can be rather unreliable. We don’t want APIs for open source models to remain a prototype-only solution. We want people to use these APIs like they do OpenAI’s or Anthropic’s, and actually build production-grade products on top of open source models.
2. Underserved models: we have them
There are a ton of models out there, but not all of them are readily available to people without access to GPUs. We envision our API becoming a system where anyone can launch any custom model of their choice with minimal cold starts and run it with a simple API call. Our cold start for any 8B or 70B model is only 40s, and we’ll keep improving this.
Towards this goal, we already have models like `ai4bharat/hercule-hi` hosted on our API to support non-English language use cases, and models like `Qwen/QwQ-32B-Preview` to support reasoning-based use cases. You can find the other models that we host here: https://console.ncompass.tech/public-models for public ones, and https://console.ncompass.tech/models for private ones that work once you've created an account.
We’d love for you to try out our API by following the steps here: https://www.ncompass.tech/docs/llm_inference/quickstart. We provide $100 of free credit on sign-up to run models, and, like we said, go crazy with your requests; we’d love to see if you can break our system :)
We’re still actively building out features and optimizations and your input can help shape the future of nCompass. If you have thoughts on our platform or want us to host a specific model, let us know at hello@ncompass.tech.
Happy Hacking!
* It's called "inference" because the process of taking a query, running it through the model, and producing a result is referred to as "inference" in the AI / machine learning world. This is as opposed to "training" or "fine-tuning", the processes used to actually develop the AI models that you then run "inference" on.
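To make the footnote's distinction concrete, here is a toy example. It uses a one-parameter linear model, nothing like an LLM, but the training/inference split is the same: training fits the weights from data, inference just applies the already-fitted weights to a new query.

```python
def train(xs, ys):
    """'Training': fit the single weight w minimizing squared error for y ≈ w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def infer(w, x):
    """'Inference': run a new input through the already-trained model."""
    return w * x

w = train([1, 2, 3], [2, 4, 6])  # training happens once, offline
print(infer(w, 10))              # inference serves each incoming query
```

An inference platform like the one described here handles only the second step, at scale; the weights arrive pre-trained.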
This would:
- let you boast about your cool proprietary optimizations
- naturally get better over time just from applying public algorithmic improvements
- show up hosts that refuse to do the same
- give you a good incentive to keep on top of your own efficiency and competitiveness over time
- be a good response to users who vaguely know that AI takes "a lot" of energy: it's actually gotten a lot better, but how much better?
Happy to chat if it would help to have a neutral academic voice involved.
We're currently working on providing a more extensive interface to show users a variety of performance metrics of the models they're running. Having efficiency metrics would be a great addition.
I think an important additional facet of these tests would be providing clarity on their details so they're reproducible. I find that reported stats sometimes don't quite translate to real-world experience; it can feel like results are presented using the workloads that look best on a system, so a standardized, reproducible approach would be best.
We're always keen to chat to as many users/experts/academics/enthusiasts as possible. Please feel free to reach me at diederik.vink@ncompass.tech and we can set up a time to meet!
So this means you end up either having many decodes wait for prefills to complete, or scheduling decodes alongside prefills. Both scenarios result in slower decodes, which is why we're seeing an increase in the ITL. This is the main tradeoff we've made.
But across all users on our system, throughput is better, because running more prefills, or a large group of decodes together, makes better use of the GPU.
The idea is that this works for someone building a product that needs a consistent initial response across users but can trade off some end-to-end latency. It ensures no one waits a long time before getting the first response.
When looking at a variety of workloads, we realized that prioritizing finishing a query (prioritizing decodes) led to underutilization of the GPU. There tended not to be enough concurrently running requests (because prefill wasn't prioritized) to meaningfully utilize the memory bandwidth with the available decodes. This led to a system that was unfortunately neither compute- nor memory-bound.
By running mixed batches that prioritize prefills, we still compute some decode tokens in our spare capacity while keeping compute as saturated as possible. This also builds up a backlog of decodes, so that when we are primarily computing decodes, we push our memory bandwidth as hard as we can.
Of course, there are still plenty of improvements to be made on this front. From a scheduling perspective, the goal is to find a dynamic balance between prefill and decode that pushes both memory bandwidth and compute to their limits. A whole host of factors, such as model architecture, input-token:output-token ratio, underlying hardware, and KV-cache allocation (among many more), all play into the pressure placed on memory and compute, so there's definitely still exploration to be done!
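To make the prefill-prioritized mixed batching described above concrete, here is a toy sketch of a single scheduling step. It is an illustrative simplification with a made-up token-based compute budget, not the actual nCompass scheduler: prefills consume the budget first (keeping compute saturated), and any leftover capacity is packed with decode tokens.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_tokens: int  # cost of the prefill pass, in tokens

def schedule_step(prefill_queue, decode_queue, compute_budget):
    """Build one mixed batch: prefills first, then decodes in spare capacity."""
    batch, budget = [], compute_budget
    # Admit prefills first: they are compute-heavy and keep the GPU busy.
    while prefill_queue and prefill_queue[0].prompt_tokens <= budget:
        req = prefill_queue.popleft()
        budget -= req.prompt_tokens
        batch.append(("prefill", req.rid))
        decode_queue.append(req)  # once prefilled, the request starts decoding
    # Fill leftover budget with decodes (one token per request per step).
    # Decodes that don't fit build up, so later decode-heavy steps run as one
    # large group that pushes memory bandwidth.
    n_decodes = min(budget, len(decode_queue))
    for req in list(decode_queue)[:n_decodes]:
        batch.append(("decode", req.rid))
    return batch

prefills = deque([Request(1, 600), Request(2, 300)])
decodes = deque([Request(0, 0)])  # request 0 has already been prefilled
batch = schedule_step(prefills, decodes, compute_budget=1024)
print(batch)
```

In this step both prefills fit in the budget, and the remaining capacity still serves every waiting decode, so new requests get a fast first token without starving in-flight ones.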
2. I don't see the 50% cheaper option. According to your pricing page, 16B+ models will cost $0.90, which is the same price as Together.ai and fireworks.ai
If a fully self-serve system is something you would like to see, we would love to hear more!
2. Could you please elaborate on the 50% cheaper option? If you're referring to the line on our website, that is due to our efficiency at scale. This efficiency benefit allows us to provide the models at the price that we do without implementing rate limits to manage our costs. Additionally, this 50% more efficient GPU utilization also benefits anyone looking to use our infrastructure for on-prem solutions.
Ok so how does #2 help me do this?
Also, regarding the no rate limits: we agree this is a real challenge, and it's part of why we're interested in building this as well. The clever GPU-utilization tricks are exactly what we're building out, and we're looking forward to seeing what issues we run into at that scale.
Compared to the Replicate/Modal solutions, our big focus is ensuring you don't experience rate limits. We want to ensure you get a good quality of service no matter what.
When it comes to requesting and running specific models, we won't ask you to pay extra just because there's lower demand for that specific model (which it sounds like other providers are doing). We manage scaling instances up and down on your behalf to make sure you get good performance at a fair price, so you don't have to worry about making the costs work.
We had a wrong hyperlink on the website, but we've fixed it now; thanks for letting us know.
Our ability to reliably host models, versus setting up a website, largely comes down to our technical background. We're all hardware engineers, so front-end work is not our strong suit :). But that same experience makes us confident in hosting the models themselves. Both 8B and 70B models, if cached, do actually load in exactly 40s, but please feel free to try out the system and see for yourself!
Are you planning to support any image or video generation models, or focusing on text for now?
Although we currently only support text models, image and video generation models are definitely on our roadmap: they're very compute-intensive, so they would benefit greatly from our optimizations. We'd love to hear more about any specific models you're hoping to run! Please feel free to message us with further details (diederik.vink@ncompass.tech).
If I had to hazard a guess, it would be that their system architecture (# of chips and chip architecture itself) might not be designed for a high concurrency situation.