Agents fail all the time, especially when you try to use them for something actually useful. Current solution approaches suck: prompting has intrinsic limits and supervised fine-tuning requires big explicit datasets that are hard to collect.
Two months ago, the DeepSeek R1 paper outlined a way to post-train LLMs with (almost) pure reinforcement learning. We built on their research and created a fine-tuning platform around it.
You let us intercept your agent's data flow, and we deliver a fine-tuned open-source model trained on the agent's specific task. Instead of providing a big dataset of explicit fine-tuning samples, you provide a reward function that judges the model's outputs.
Here are examples of what this can be used for:
Coding Agent: We fine-tuned a coding agent that was constantly making syntax errors and failed to handle semantic edge cases properly. With a reward function that evaluated the code against the compiler, the agent learned not to produce these errors. The fine-tuned model reduced critical bugs by 40% with just 20 training samples. (A minimal sketch of such a reward function follows right after these examples.)
MCP Tool Specialization: Imagine you have a custom set of internal tools using the MCP protocol, but your agent keeps selecting the wrong tool or passing incompatible parameters. You could fine-tune with a reward function that scores tool selection and parameter matching.
Browser Agent Navigation: If you're building a browser agent that struggles with complex web UIs or specific sites, you could fine-tune it to better understand UI elements and navigation patterns. With a reward function that scores successful task completion (like "find the best price for this product" or "complete this multi-step form"), you could train an agent that better identifies clickable elements, understands form validation errors, and navigates through complex SPAs without getting stuck.
VLA Robot Control: If you're using vision-language models to control robotic arms or other hardware, you could fine-tune for your specific actuator setup. With a reward function based on high-level task completion, you could train a Vision-Language-Action (VLA) model that translates natural language commands like "move the red block behind the blue cylinder" into actuator controls for your specific hardware.
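To make the coding-agent example concrete, here is a minimal sketch of what a compiler-based reward function could look like. The function signature and the use of Python's built-in compile() as the "compiler" are illustrative assumptions, not Augento's actual interface.

```python
# Illustrative sketch of a syntax-based reward function (not Augento's actual API).
# It scores one model completion by asking the compiler whether the code parses:
# 1.0 for syntactically valid code, 0.0 otherwise.

def reward(prompt: str, completion: str) -> float:
    """Return a scalar reward for a single model completion."""
    try:
        # Python's built-in compiler as a stand-in; a real setup might shell out
        # to tsc, rustc, javac, ... depending on the agent's target language.
        compile(completion, "<completion>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0
```

This is the whole contract in the verifiable-domain setting: the reward function never has to produce the right answer, it only has to grade one.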
As you see from these examples, the current paradigm is best suited for "verifiable domains", where it is possible to give an explicit function judging the model's outputs. However, up next, we will also support an "alignment mode", where you don't have to provide a reward function but provide high-level feedback on past failure runs of your agent. Just tag where things went wrong, and we'll handle the rest. This makes it even easier to improve your agents without needing to write formal reward functions.
Our platform is not itself open source, but it fine-tunes open-source language models. In other words, it is an alternative to the reinforcement fine-tuning API from OpenAI, but with Qwen, Llama, DeepSeek, etc., and more customizability on the reward model. We charge users for the training and for their inference/interaction with the model later on ($0 monthly flat fee + training cost + inference cost).
The platform is self-serve and open to use at https://augento.ai/dashboard. We’ll give you $20 in training credits, which should be enough for connecting your agent and delivering some observable improvement on your use case.
We’d love to hear your thoughts and feedback!
If I have an application that uses OpenAI models, then this service can act as a proxy between my application and the actual OpenAI service. It logs all of the requests that get sent to the OpenAI API. At some later time, I can go through and choose a subset of the API calls and mark them (I'm guessing as good or bad) and these get converted into a training set. I then have to create a value function as its own API that I run on my own servers somewhere (like fly.io). Then I start a training run, which I assume will use some open source AI model to regenerate responses to the training set derived from my initial OpenAI API calls. It then takes the generated responses from that open source model, sends them to my value function API which scores them, and then uses that score to apply some RL magic to the base open source model. At the end of this process I have an open source model that has been RL trained based on the captured API calls as well as the scoring from the value function.
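If that reading is right, the integration step would look roughly like pointing an existing OpenAI client at the platform's intercepting proxy. The base URL and key handling below are made-up placeholders for illustration, not documented endpoints.

```python
# Hypothetical sketch: routing existing OpenAI calls through an intercepting proxy
# so requests/responses get logged and can later be turned into training data.
# The base_url is a placeholder, not a documented Augento endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.example-finetuning-platform.com/v1",  # placeholder proxy URL
    api_key="YOUR_PLATFORM_KEY",  # issued by the platform, which forwards to OpenAI
)

# Application code stays unchanged; the proxy records this call for later training.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this bug report ..."}],
)
print(response.choices[0].message.content)
```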
I suppose the argument here is that an RL-trained open source model will perform your task better than the base OpenAI model. So your target market is people already using the OpenAI API who have the desire and funds to experiment with RL, have the capability of defining a value function, are able to sift through their API calls to identify the ones that aren't performing well and isolate them, and are willing to swap out their OpenAI model for an open source model that is RL-trained if it can be shown to be more accurate.
I would guess this market exists and the need is real. Defining a value function is much easier than building the infrastructure to RL a variety of open source models. So someone who wants to do this may appreciate paying for someone else who has already set up the infrastructure. And they don't want to host their own model (they're already paying for OpenAI model hosting), so maybe they have no problem paying you for inference as well.
Whether or not this succeeds as a business really depends on how effective RL is for the clients you find. There are two paths here: either RL is wildly successful and therefore so are you, or RL fine-tuning is unable to keep up with foundation model advancements and clients will learn it is better to wait it out on the big fellas rather than go through the time-consuming and costly process.
For the folks who are already technical in this vertical, especially ones that leverage a low cardinality architecture (one or two models, small subset of tasks, etc), this type of thing is quite easy to build yourself first as a working prototype and then only slightly more difficult to productionize & automate.
I have some in-house infra that does similar work: monitors inputs and outputs from models, puts them in a UI for a human to score/rank, preps a DPO dataset for training, kicks off training run. The total amount of calendar time I spent from prototype to production was roughly two person weeks. Changing the human intervention mechanism to an automated reward function would be an hour or two worth of work. If I had to make this work for all types of users, tasks, and models — no shot I'd have the time personally to pull that off with any reasonable velocity.
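As a rough illustration of the "preps a DPO dataset" step (my sketch, not the commenter's actual code): given human scores per completion, you can pair the best and worst completion for each prompt into the prompt/chosen/rejected records that DPO trainers such as TRL's DPOTrainer expect.

```python
# Rough sketch of turning human-scored completions into a DPO preference dataset.
# Assumes `scored` holds (prompt, completion, score) rows exported from the scoring UI.
from collections import defaultdict

scored = [
    ("Write a SQL query for ...", "SELECT ...", 0.9),
    ("Write a SQL query for ...", "DROP TABLE ...", 0.1),
    # ...
]

by_prompt = defaultdict(list)
for prompt, completion, score in scored:
    by_prompt[prompt].append((score, completion))

# One preference pair per prompt: highest-scored as "chosen", lowest as "rejected".
dpo_rows = []
for prompt, candidates in by_prompt.items():
    if len(candidates) < 2:
        continue  # need at least two completions to form a preference pair
    candidates.sort()
    dpo_rows.append({
        "prompt": prompt,
        "chosen": candidates[-1][1],
        "rejected": candidates[0][1],
    })
# `dpo_rows` is in the prompt/chosen/rejected format a DPO trainer can consume.
```

Swapping the human score for an automated reward function, as the commenter notes, only changes where the third column of `scored` comes from.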
With that said, having a nice UI with great observability into the whole process is a pretty big value-add to get out of the box as well.
(EDIT: for clarity, not affiliated all with the OP project/org)
Providing finetuning as a service works because the friction with finetuning is operational (getting the GPUs, preparing the training...), so the vendor can take care of that and give you an API. The work becomes straightforward and doesn't require much preparation - give us some examples and we'll provide you a model that works well with these and hopefully generalizes.
RL as a service is much trickier in my opinion. The friction is not only operational. Getting RL to work (at least from my probably deprecated 10-year-old knowledge) is much harder because the real friction is in building the right reward function. I've skimmed your docs, and you don't say much about reward functions other than the obvious.
I think to get this to work, you need to improve your docs and examples a lot, and maybe focus on some recurring use cases (e.g., customer support agent) with clear reward functions. Perhaps provide some building-block reward functions and some UI/tools to help create them. Basically, find a way to remove the real friction on how to use RL in my agent: the reward function part.
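To illustrate what "building-block reward functions" could mean (my sketch, not something the platform currently ships): small, reusable checks that get combined into one scalar, so users compose rather than write rewards from scratch.

```python
# Sketch of composable reward building blocks (illustrative, not a shipped feature).
import json

def is_valid_json(completion: str) -> float:
    """1.0 if the completion parses as JSON, else 0.0."""
    try:
        json.loads(completion)
        return 1.0
    except ValueError:
        return 0.0

def within_length(completion: str, max_chars: int = 2000) -> float:
    """1.0 if the completion stays under a length budget, else 0.0."""
    return 1.0 if len(completion) <= max_chars else 0.0

def reward(prompt: str, completion: str) -> float:
    # Weighted sum of independent checks; the weights are arbitrary for illustration.
    return 0.7 * is_valid_json(completion) + 0.3 * within_length(completion)
```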
In any case, congrats again on the launch. We're building an LLMOps platform (see my profile), there might be collaboration/integration potential, write me if you think that's interesting.
Seems like the most powerful agents will make use of some form of RL or advanced learning.
I'm not from an ML/DL background but these ideas are fascinating and I've begun self-teaching myself some RL.
I'm curious as to how long this took to build and any advice for someone wanting to learn more about RL in this context?
Thanks!
I’ll jump in this weekend.
Part of me wishes I did CS instead of learning SWE. There’s so much to uncover in RL and jumping straight in at the top feels like the wrong strategy to learn effectively.
I love the idea, love the platform. I’ll be keeping a close eye on how you guys go.
If you need a Technical Product Manager, let me know! I’m currently an Artificial Intelligence Lead at a hardware-enabled SaaS company but genuinely believe RL and agents will be the next step towards AGI.
I have a few questions. 1. I'm assuming by the pricing it's "serverless" inference, what's the cold-start time like? 2. Any idea on inference costs?
Also just to reiterate what others say but the option of exporting weights would definitely make it more appealing (although it sounds like that's in the roadmap).
> I'm assuming by the pricing it's "serverless" inference, what's the cold-start time like?
Yeah, you could probably call it serverless inference. However, because all fine-tuned models are trained on the same base model(s), there are some interesting optimizations we can apply over standard "serverless" model deployment. The biggest is that we can keep the base model loaded in VRAM and only swap the trained weight deltas per request. This gives us sub-second cold-start times for inference in the average case.
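My guess at what that mechanism could look like, sketched with LoRA-style adapters and Hugging Face PEFT; this is not Augento's actual serving stack, and the model/adapter names are placeholders.

```python
# Illustrative sketch of "keep the base model resident, swap per-user deltas"
# using LoRA adapters with Hugging Face PEFT. Paths and names are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", device_map="auto")

# Load two customers' fine-tuned deltas on top of the same resident base weights.
model = PeftModel.from_pretrained(base, "adapters/customer_a", adapter_name="customer_a")
model.load_adapter("adapters/customer_b", adapter_name="customer_b")

# Per request, activate the adapter for the calling customer. The multi-GB base
# model never leaves VRAM, so the "cold start" is just the small delta switch.
model.set_adapter("customer_a")
```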
> Any idea on inference costs?
Right now, we’re pricing inference at $0.5/M input tokens, $2.5/M output tokens. That’s in a similar price range but a bit lower than gpt-4o/Claude 3.5, which we consider the main models we’re "competing" with. As it’s our goal to democratize access to models/agents in the long run, we hope that we can drop the prices for inference further, which should be enabled by some other optimizations we’re currently planning.
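For a rough sense of scale at those rates (the per-call token counts below are a hypothetical workload, not measured figures):

```python
# Back-of-the-envelope cost at the quoted rates ($0.5/M input, $2.5/M output).
INPUT_RATE = 0.5 / 1_000_000   # $ per input token
OUTPUT_RATE = 2.5 / 1_000_000  # $ per output token

input_tokens, output_tokens = 8_000, 1_000  # one hypothetical agent call
cost_per_call = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
print(f"${cost_per_call:.4f} per call")           # -> $0.0065 per call
print(f"${cost_per_call * 100_000:.2f} per 100k calls")  # -> $650.00 per 100k calls
```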
And the other aspect, as someone already mentioned, is that it seems to only work with single-agent workflows.
(EDIT: Would you use DPO? Do you have experience with it, or a need for it?)
GPT-4o is quite bad at this, as there are not too many JSONata snippets on the internet. We collected 20 coding problems; the reward function then just assigned a scalar value based on whether the code output of the model was syntactically correct or not. (Most interestingly, we found that by optimizing for syntax, it also got better at getting the semantics correct.)
I think the discrepancy between our result with direct RL and your experience with RLHF comes from the fact that RLHF is built around non-verifiable/subjective domains, where intrinsically, the reward signal obtained by the HF-proxy is weak(er), i.e. for the same training scenario/prompt you need more samples to get to the same gradient.
I think the demo could be more exciting, the voice of the person talking sounds like he's bored haha
"What works well for HN is raw and direct, with zero production values. Skip any introductions and jump straight into showing your product doing what it does best. Voiceover is good, but no marketing slickness—no fancy logos or background music!"
I guess there's zero production values and zero production values...
IMO, the most promising approach to this is something along the lines of MA-RLHF (https://arxiv.org/abs/2410.02743) but adapted to the real world, i.e., splitting up the reward model to grade individual actions inside the trajectory to reduce the “attention distance” between the reward and the decision.
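A toy sketch of the idea (my illustration, not the MA-RLHF implementation): instead of one scalar for the whole trajectory, each action gets its own grade, so the credit-assignment gap between the reward and the decision that earned it shrinks.

```python
# Toy illustration of trajectory-level vs. per-action rewards (not the MA-RLHF code).
# A trajectory is a list of (observation, action) steps taken by the agent.
trajectory = [
    ("search page", "click('search box')"),
    ("results page", "click('ad banner')"),  # the actual mistake
    ("ad landing page", "click('back')"),
]

# Trajectory-level: one scalar at the end, credit is smeared over every step.
trajectory_reward = 0.0  # task failed

# Per-action: a grader scores each step, so the bad decision is isolated.
def grade_action(observation: str, action: str) -> float:
    # Placeholder grader; in practice this could be a reward model or a heuristic.
    return 0.0 if "ad banner" in action else 1.0

per_action_rewards = [grade_action(obs, act) for obs, act in trajectory]
# -> [1.0, 0.0, 1.0]: the reward now sits right next to the decision that caused it.
```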
Noob question - from my understanding, SoTA proprietary models already provide APIs for fine tuning, I'd say it's only a matter of time before they provide RL based APIs, no?
My long-term take is that the agent economy will be built around a few labs providing (partially open-source) foundation models, a competition you don't want to be part of, as it will be the AI equivalent of the high-frequency trading arms race. Above that will sit an infrastructure layer, specializing these very models to the users' domains. OpenAI/Anthropic/… RL fine-tuning will be part of that infrastructure layer, but so will open-source-model alternatives like ours.
I was working at a startup doing end-to-end training for modified BERT architectures, and everything was a struggle. Buying a GPU: basically impossible right now; we ended up looking at sourcing franken cards _from_ China.
Power and heat removal: you need a large factory's worth of power in the space of a small flat.
Pre-training something that's not been pre-trained before: say hello to throwing out more than 80% of pretraining runs because of a novel architecture.
The whole thing was designed to burn money as fast as possible.
Without hugely deep pockets, a contract from NVidia, and a datacenter right next to a nuclear power plant, you can't compete at the model level.
https://aws.amazon.com/blogs/machine-learning/customize-deep...
And charge for it?
The blog article you are referring to uses another method to fine-tune models that many other big platforms like Together AI (and even OpenAI themselves) already support: Supervised Fine-Tuning (SFT). We are doing Reinforcement Learning using GRPO instead. SFT has the big caveat that it requires good prompt-completion datasets to work, which are rare/hard to curate for many use cases. With GRPO, you (the programmer) don't even need to know what the correct answer is, as long as you can decide whether an answer is good; it's essentially P vs. NP at its heart.
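For readers unfamiliar with GRPO, a minimal sketch of its core trick: sample a group of completions per prompt, score each with the reward function, and use the group-normalized score as the advantage, so no learned value model and no reference answer is needed.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative, not a full trainer).
import numpy as np

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """Advantage of each completion relative to its sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# The reward function only has to judge answers, not produce them:
# e.g. 4 sampled completions for one prompt, scored 1.0 if the code compiled.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
# Completions above the group mean get a positive advantage and are reinforced;
# the below-average one is pushed down. These advantages then weight the usual
# clipped policy-gradient update, with no separate value network required.
print(advantages)
```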