You can try an initial run (two-minute setup) to get a feel for the product for free here: app.propolis.tech/#/launch. Or watch our demo video: https://www.tella.tv/video/autonomous-qa-system-walkthrough-...
The Problem
Matt and I have both been thinking about software quality for the last 10 years. While at Airtable, Matt worked on the infrastructure team responsible for deploys and thought a lot about how to catch bugs before users did. Deterministic tests are incredibly effective at ensuring pre-defined behavior continues to function, but it's hard to get meaningful coverage, and it's easy to stub/mock so much that the suite is no longer representative of real usage.
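As a toy illustration of that over-mocking failure mode (a hypothetical Vitest example; the module names are made up):

    import { test, expect, vi } from 'vitest';
    import { loadDashboard } from './dashboard'; // hypothetical module under test

    // Stub the whole API layer: from here on, the test exercises none of the
    // real network, auth, or response-schema behavior.
    vi.mock('./api', () => ({
      fetchWidgets: async () => [{ id: 1, name: 'stub' }],
    }));

    test('dashboard loads', async () => {
      // Passes forever, even if the real /widgets endpoint changed shape last week.
      expect(await loadDashboard()).toHaveLength(1);
    });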
I like to pitch what we're building now as a set of “users” you can treat like a canary group without worrying about impacting real users.
What we do: Propolis runs "swarms" of browser agents that collaborate to come up with user journeys, flag points of friction, and propose e2e tests that can then be run more cheaply on any trigger you'd like. Our customers, from public companies to startups, run swarms regularly to massively increase the breadth of their automated testing, and run the produced tests in their CI pipelines to keep specific flows working without having to maintain Playwright/Selenium tests by hand.
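To make that concrete, here's a sketch of what one of the proposed tests might look like as plain Playwright, runnable on any CI trigger (the site, selectors, and flow here are illustrative, not actual Propolis output):

    import { test, expect } from '@playwright/test';

    test('user can search for a product and open its page', async ({ page }) => {
      await page.goto('https://staging.example.com');
      await page.getByPlaceholder('Search products').fill('desk lamp');
      await page.keyboard.press('Enter');

      // The journey the swarm proposed: search, open the first result,
      // and confirm the purchase path is still reachable.
      const result = page.getByRole('link', { name: /desk lamp/i }).first();
      await expect(result).toBeVisible();
      await result.click();
      await expect(page.getByRole('button', { name: 'Add to cart' })).toBeVisible();
    });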
One thing that really excites me about this approach is how flexible "checks" can be, since they're evaluated partially via LLM. For example, we've caught bugs in the quality of non-deterministic output (think a shopping assistant recommending a product that the user then searches for and can't find).
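As a minimal sketch of the idea (my own illustration assuming an OpenAI-style judge, not our actual internals), the shopping-assistant check boils down to something like:

    import OpenAI from 'openai';

    const client = new OpenAI(); // assumes OPENAI_API_KEY in the environment

    // LLM-as-judge: pass/fail on output quality instead of an exact-match assertion.
    async function judgeRecommendation(recommendation: string, searchResults: string[]) {
      const res = await client.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{
          role: 'user',
          content:
            `A shopping assistant recommended: "${recommendation}".\n` +
            `Search results for that product: ${JSON.stringify(searchResults)}.\n` +
            `Answer PASS if the recommended product appears in the results, otherwise FAIL.`,
        }],
      });
      return res.choices[0].message.content?.trim().startsWith('PASS') ?? false;
    }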
Pricing and Availability
It's production-ready today at $1,000/month for unlimited use plus active support, for early users willing to give feedback and request features. We're also happy to work with you on capped-use / hobby plans at lower prices if you'd like to use it for smaller or personal projects.
We'd love to hear from the HN community - especially curious if folks have thoughts on what else autonomous agents could validate beyond bugs and functional correctness. Try it out and let us know what you think!
Are your agents good at testing other agents? e.g. I want your agent to ask our agent a few questions and complete a few UI interactions with the results.
How do you handle testing onboarding flows? e.g. I want your agent to create a new account in our app (https://www.definite.app/) and go through the onboarding flow (e.g. add Stripe and HubSpot as integrations).
I'd say this is one of our strong suits. The UIs tend to be easy for browser agents to navigate, and LLM-as-judge gives pretty good feedback on chat quality that can inform later actions. (I'd be remiss not to mention, though, that a good LLM eval framework like Braintrust is probably the best first line.)
> How do you handle testing onboarding flows?
We can step through most onboarding flows if you start from a logged-out state and give us the context the agent will need (e.g. a Stripe test card). That said, setting up integrations that require multi-page hops is still a pain point in our system and leaves a lot to be desired.
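For a flow like yours, the context would look roughly like this (field names are illustrative, not our actual schema; the card number is Stripe's standard test card):

    // Hypothetical sketch of the context you'd hand the swarm for an onboarding run.
    const swarmContext = {
      startUrl: 'https://www.definite.app/',
      startLoggedOut: true,
      credentials: { email: 'qa+swarm@example.com', password: process.env.QA_PASSWORD },
      // Stripe's public test card: safe to embed, never charges anything.
      stripeTestCard: { number: '4242 4242 4242 4242', exp: '12/34', cvc: '123' },
      objectives: [
        'Create a new account',
        'Complete the onboarding flow',
        'Connect Stripe and HubSpot as integrations',
      ],
    };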
Would love to talk more about your specific case and see if we can help! founders@propolis.tech
We are also building a Web QA agent at https://kodefreeze.com. We are focused on small and medium-sized companies and are offering free usage during our trial period!
    Error loading video
    Please try refreshing the page
Maybe you need more QA?
When I open my browser console, I see this:
    Capturing error: Error: WHEP request failed: 500 - {"message":"\"message\" is required!","error":"Server Error"}

The thing that really got me was catching bugs in non-deterministic output. We've been struggling with this on LLM features where traditional assertions just don't work. Having agents actually judge quality instead of looking for exact matches is such an obvious solution in hindsight.
Quick question though - how do you handle auth flows with MFA or OAuth redirects?
Humans can find and report broken UI easily by using common sense. But even though it's simple for a human, a computer has no common sense, and I'm a machine learning expert: I tried and mostly failed to build a broken-UI detector at my previous company. They had an automated plugin-upgrade process that periodically broke the UI.

I tried to detect it by taking a long screenshot, letting you select an image as the known-working version, and then later finding the diff between the two images. It kind of worked, but not satisfactorily.
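Roughly the approach, sketched with pngjs + pixelmatch (illustrative libraries, not necessarily my exact stack; assumes same-size screenshots):

    import fs from 'node:fs';
    import { PNG } from 'pngjs';
    import pixelmatch from 'pixelmatch';

    const baseline = PNG.sync.read(fs.readFileSync('baseline.png'));
    const current = PNG.sync.read(fs.readFileSync('current.png'));
    const { width, height } = baseline;
    const diff = new PNG({ width, height });

    const changedPixels = pixelmatch(
      baseline.data, current.data, diff.data, width, height,
      { threshold: 0.1 }, // per-pixel color tolerance
    );

    // The unsatisfying part: any benign styling tweak also trips this,
    // so a fixed pixel budget produces lots of false positives.
    if (changedPixels / (width * height) > 0.01) {
      fs.writeFileSync('diff.png', PNG.sync.write(diff));
      console.error(`UI changed: ${changedPixels} pixels differ, see diff.png`);
      process.exit(1);
    }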
To elaborate a little bit on the "canary" comment --
For a while at Airtable I was on the infra team that managed the deploy (basically: click run, then sit and triage issues for a day). One of my first contributions on the team was adding a new canary analysis framework that made it easier to catch and roll back bugs automatically. Two things always bothered me about the standard canary release process:
1) It necessarily treats some users as lower value, and thus more acceptable to risk exposing bugs to. This makes sense for things like a free tier, but the more you segment out, the less representative, and thus less effective, your canary is. When every customer interaction matters (as is the case for so many types of business), this approach is harder to justify.
2) Low-frequency / high-impact bugs are really difficult to catch in canary analysis. It's easy to write checks that catch glaring drops or spikes in metrics, but more subtle high-impact regressions are much harder and often surface only through user reports (which we did not factor into our canary). Example: how do you write a canary metric that auto-rolls-back when an enterprise account owner (a small % of overall users) logs in and a broken modal prevents them from interacting with your site?
I view what we're building at Propolis as an answer to both of these. I envision a deploy process (very soon) that lets us roll out to simulated traffic and canary on THAT before we actually hit real users (and then do a traditional staged release, etc.)
Canaries are lightweight and shallow once they exist. Building a canary from the ground up is still beyond us, but if you don't want to kill an actual bird, that is pretty much the only way to go.
The pricing sounds quite enterprisey; the risk there is that people will tend towards building their own.
Testing for abuse stuff I've always found quite difficult, since to work well you need to both create some real resources (so you can delete/clean them up) and create a new test identity, since your abuse-detection system should be deny-listing any bad actors it finds. The difficulty is that those sessions probably want to stay open for something like a week, so they can process both payments and refunds.
can the agents check their email? other notification methods?
> can the agents check their email? other notification methods?
Yes to email (for paying customers, agents spin up with unique addresses); no to other notification methods, but as soon as a paying customer has a use case for SMS etc., we'll build it.
I’m curious whether you’d also move into API testing using the same discovery/attempt approach.
They're also smart enough not to be frazzled when things have changed; they still have their objectives and will work to understand whether the functionality is there or not. The beauty of non-determinism!
let's chat - founders@propolis.tech
Once again, great product.
On the off chance it misses specific tests, we have tools that let you build them directly with AI support, either by giving the agents objectives or by dropping in a video of the actions you're taking!