We've set up a special no-signup version for the HN community at https://hn.webhound.ai - just click "Continue as Guest" to try it without signing up.
Here's a demo: https://youtu.be/fGaRfPdK1Sk
We started building it after getting tired of doing this kind of research manually. Open 50 tabs, copy everything into a spreadsheet, realize it's inconsistent, start over. It felt like something an LLM should be able to handle.
Some examples of how people have used it in the past month:
Competitor analysis: "Create a comparison table of internal tooling platforms (Retool, Appsmith, Superblocks, UI Bakery, BudiBase, etc) with their free plan limits, pricing tiers, onboarding experience, integrations, and how they position themselves on their landing pages." (https://www.webhound.ai/dataset/c67c96a6-9d17-4c91-b9a0-ff69...)
Lead generation: "Find Shopify stores launched recently that sell skincare products. I want the store URLs, founder names, emails, Instagram handles, and product categories." (https://www.webhound.ai/dataset/b63d148a-8895-4aab-ac34-455e...)
Pricing tracking: "Track how the free and paid plans of note-taking apps have changed over the past 6 months using official sites and changelogs. List each app with a timeline of changes and the source for each." (https://www.webhound.ai/dataset/c17e6033-5d00-4e54-baf6-8dea...)
Investor mapping: "Find VCs who led or participated in pre-seed or seed rounds for browser-based devtools startups in the past year. Include the VC name, relevant partners, contact info, and portfolio links for context." (https://www.webhound.ai/dataset/1480c053-d86b-40ce-a620-37fd...)
Research collection: "Get a list of recent arXiv papers on weak supervision in NLP. For each, include the abstract, citation count, publication date, and a GitHub repo if available." (https://www.webhound.ai/dataset/e274ca26-0513-4296-85a5-2b7b...)
Hypothesis testing: "Check if user complaints about Figma's performance on large files have increased in the last 3 months. Search forums like Hacker News, Reddit, and Figma's community site and show the most relevant posts with timestamps and engagement metrics." (https://www.webhound.ai/dataset/42b2de49-acbf-4851-bbb7-080b...)
The first version of Webhound was a single agent running on Claude 4 Sonnet. It worked, but sessions routinely cost over $1100 and it would often get lost in infinite loops. We knew that wasn't sustainable, so we started building around smaller models.
That meant adding more structure. We introduced a multi-agent system to keep it reliable and accurate. There's a main agent, a set of search agents that run subtasks in parallel, a critic agent that keeps things on track, and a validator that double-checks extracted data before saving it. We also gave it a notepad for long-term memory, which helps avoid duplicates and keeps track of what it's already seen.
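To make that concrete, here's a minimal sketch of how a main/search/critic/validator split with a shared notepad could be wired together. The role names mirror the description above, but the function bodies, model calls, and planning step are placeholders, not Webhound's actual implementation:

```python
# Minimal sketch of a main/search/critic/validator orchestration with a shared
# notepad. All model calls are stubbed out; this only illustrates the control flow.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Notepad:
    """Long-term memory: remembers what has been seen to avoid duplicates."""
    seen: set = field(default_factory=set)

    def is_duplicate(self, key: str) -> bool:
        return key in self.seen

    def remember(self, key: str) -> None:
        self.seen.add(key)

def search_agent(subtask: str) -> list[dict]:
    # Placeholder: run searches / read pages for one subtask, return candidate rows.
    return [{"name": f"result for {subtask}", "source": "https://example.com"}]

def critic(subtasks: list[str], progress: list[dict]) -> list[str]:
    # Placeholder: prune or redirect subtasks that have drifted off-task.
    return subtasks

def validator(row: dict) -> bool:
    # Placeholder: double-check an extracted row against its source before saving.
    return bool(row.get("source"))

def main_agent(task: str) -> list[dict]:
    notepad = Notepad()
    dataset: list[dict] = []
    subtasks = [f"{task}: subtask {i}" for i in range(3)]   # stubbed planning step

    subtasks = critic(subtasks, dataset)                    # keep things on track
    with ThreadPoolExecutor() as pool:                      # search agents run in parallel
        batches = pool.map(search_agent, subtasks)
    for batch in batches:
        for row in batch:
            key = row["name"]
            if notepad.is_duplicate(key) or not validator(row):
                continue                                    # skip dupes / failed checks
            notepad.remember(key)
            dataset.append(row)
    return dataset

if __name__ == "__main__":
    print(main_agent("note-taking app pricing"))
```

The notepad is the piece that carries state across subtasks, which is what lets the system skip duplicates without re-reading sources.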
After switching to Gemini 2.5 Flash and layering in the agent system, we were able to cut costs by more than 30x while also improving speed and output quality.
The system runs in two phases. First is planning, where it decides the schema, how to search, what sources to use, and how to know when it's done. Then comes extraction, where it executes the plan and gathers the data.
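As a rough illustration (the field names here are ours, not Webhound's internal format), the planning phase's output might boil down to something like:

```python
# Hypothetical shape of a plan handed from the planning phase to extraction.
plan = {
    "schema": {
        "app": "string",
        "plan_name": "string",
        "price_usd_per_month": "number",
        "change_date": "date",
        "source_url": "url",
    },
    "search_strategy": [
        "site:obsidian.md pricing",
        "Notion pricing changelog 2024",
    ],
    "allowed_sources": ["official pricing pages", "official changelogs"],
    "done_when": "every app in scope has a dated timeline entry with a source",
}
```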
It uses a text-based browser we built that renders pages as markdown and extracts content directly. We tried full browser use but it was slower and less reliable. Plain text still works better for this kind of task.
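The general idea of a text-based browser can be sketched with off-the-shelf libraries; Webhound's is custom-built, so treat this as the gist rather than their implementation, with `requests` and `html2text` standing in:

```python
# Gist of a text-based "browser": fetch HTML, strip it down to markdown,
# and hand the text to the model instead of a screenshot or raw DOM.
import requests
import html2text

def render_as_markdown(url: str) -> str:
    html = requests.get(url, timeout=30,
                        headers={"User-Agent": "example-bot/0.1"}).text
    converter = html2text.HTML2Text()
    converter.ignore_images = True      # images add tokens, rarely add data
    converter.body_width = 0            # don't hard-wrap lines
    return converter.handle(html)

if __name__ == "__main__":
    print(render_as_markdown("https://example.com")[:500])
```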
We also built scheduled refreshes to keep datasets up to date and an API so you can integrate the data directly into your workflows.
Right now, everything stays in the agent's context during a run. It starts to break down around 1000-5000 rows depending on the number of attributes. We're working on a better architecture for scaling past that.
We'd love feedback, especially from anyone who's tried solving this problem or built similar tools. Happy to answer anything in the thread.
Thanks! Moe
It does say that extraction can take hours, but I was expecting it would be more of an 80/20 kind of thing, with a lot of data found quickly, then a long tail of searching to fill in gaps. Is my expectation wrong?
I worry for two related reasons. First, inefficient gathering of data is going to churn and burn more resources than necessary, both on your systems and on the sites being hit. Second, although this free opportunity is an amazing way to show off your tool, I fear the pricing of an actual run is going to be high.
We use Gemini 2.5 Flash which is already pretty cheap, so inference costs are actually not as high as they would seem given the number of steps. Our architecture allows for small models like that to operate well enough, and we think those kinds of models will only get cheaper.
Having said all that, we are working on improving latency and allowing for more parallelization wherever possible, and we hope to include that in future versions, especially for enrichment. We do think one of the product's weaknesses is mass collection: it's better at finding medium-sized datasets from siloed sources and less good at getting large, comprehensive datasets. We're also considering approaches that incorporate more traditional scraping tactics for building those large datasets.
1. List building – finding targeted job titles with their contact information.
2. List research – finding contact and company details of given people.
3. List verification – manually checking if the data is correct, sometimes even calling the contact person to confirm.
Apollo is a big competitor for their B2B leads (i.e., dataset) business because it is much cheaper. A tool like this could have a huge impact on their business.
Curious: Have you compared it with manual research? How accurate is it?
Interestingly, we're working with B2B clients right now where we use Webhound to curate and then act as the "validation" layer ourselves. The agent lets us offer these datasets way cheaper with live updates, but still with human oversight.
First, you're using Firecrawl as your crawling infrastructure, but Firecrawl explicitly blocks Reddit. Yet one of your examples mentions "Check if user complaints about Figma's performance on large files have increased in the last 3 months. Search forums like Hacker News, Reddit, and Figma's community site..."
How are you accomplishing this? The comment about whether it's legal to crawl Reddit remains unanswered in this thread.
Second, you're accepting credit cards without providing any Terms of Service. This seems like a significant oversight for a YC company.
Third, as another commenter mentioned, GPT-5 can already do this faster and more effectively, and Claude has similar capabilities. I'm struggling to see the value proposition here beyond a thin wrapper around existing LLM capabilities with some agent orchestration. We're beyond assuming prompts are useful IP nowadays, or am I wrong?
Perhaps most concerning is the lack of basic account management features - there's no way to delete an account after creation. I'd say I'd like clarification, but I could just code this up with Codex to run locally and do it myself (with a local crawler that can actually crawl Reddit, even).
Regarding Reddit, we have our own custom handler for Reddit URLs which uses the Reddit API, which we are billed for when we exceed free limits.
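For anyone curious what a Reddit-specific handler looks like in broad strokes, the official API (here via the PRAW client) can resolve a reddit.com URL directly; this is an illustration of the approach, not Webhound's code:

```python
# Illustrative Reddit handler: resolve a reddit.com URL through the official API
# (via the PRAW client) instead of crawling the HTML page. Credentials come from
# your own Reddit app registration.
import praw

def fetch_thread(url: str) -> dict:
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # from https://www.reddit.com/prefs/apps
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="example-research-bot/0.1",
    )
    submission = reddit.submission(url=url)
    submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
    return {
        "title": submission.title,
        "score": submission.score,
        "created_utc": submission.created_utc,
        "comments": [c.body for c in submission.comments.list()[:50]],
    }
```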
For Terms of Service, you're right, that is definitely an oversight on our part. We just published both our Terms of Service and Privacy Policy on the website.
When it comes to comparing with GPT-5 and Claude, we do believe that our prompting, agent orchestration, and other core parts of the product, such as parallel search-result analysis and parallel agents, are improvements over plain GPT-5 and Claude, while also letting Webhound run at much lower cost on significantly smaller models. Our v1, which we built months ago, was essentially the same as what GPT-5 Thinking with web search currently does, and we've since made the explicit choice to prioritize data quality, user controllability, and cost efficiency over latency. So while GPT-5 might give faster results and work better for smaller datasets, both we and our users have found Webhound to work better for siloed sources and larger datasets.
Regarding account deletion, that is also a fair point. So far we've had people email us when they want their account deleted, but we will add account deletion ASAP.
Criticism like this helps us continue to hold ourselves to a high standard, so thanks for taking the time to write it up.
Great decision to make it without a login so people can test.
Here is what I liked:
- The agent told me exactly what's happening, which sources it is checking, and the schema.
- The agent correctly identified where to look and how to obtain the data.
- Managing expectations: "Webhound is extracting data. Extraction can take multiple hours. We'll send you an email when it's complete."
Minor point:
- There is no pricing on the main domain, just on the HN one: https://hn.webhound.ai/pricing
Good luck!
We were heavily inspired by tools like Cursor - basically tried to prioritize user control and visibility above everything else.
What we discovered during iteration was that our users are usually domain experts who know exactly what they want. The more we showed them what was happening under the hood and gave them control over the process, the better their results got.
As an aside, we are about to launch something similar at rtrvr.ai, but with AI Web Agents that navigate pages, fill forms, and retrieve data. We are able to get our costs down to negligible by using headless, serverless browsers and our own ground-up DOM construction/actuation (so no Firecrawl costs). https://www.youtube.com/watch?v=gIU3K4E8pyw
It's probably the best research agent that uses live search. You're using Firecrawl, I assume?
We're soon launching a similar tool (CatchALL by NewsCatcher) that does the same thing but on a much larger scale, because we already index and pre-process millions of pages daily (news, corporate, government files). We're seeing much better results compared to parallel.ai for queries like "find all new funding announcements for any kind of public transit in California State, US that took place in the past two weeks"
However, our tool will not perform live searches, so I think we're complementary.
I'd love to chat.
We’re optimising for large enterprises and government customers that we serve, not consumers.
Even the most motivated people, such as OSINT or KYC analysts, can only skim through tens, maybe hundreds of web pages. Our tool goes through 10,000+ pages per minute.
An LLM that has to open each web page to process the context isn’t much better than a human.
A perfect web search experience for LLM would be to get just the answer, aka the valid tokens that can be fully loaded into context with citations.
Many enterprises should leverage AI workflows, not AI agents.
Nice-to-have vs. must-have: existing AI implementations are failing because it's hard to rely on their results; therefore, they're used for nice-to-haves.
Most business departments know precisely what real-world events can impact their operations. Therefore, search is unnecessary; businesses would love to get notifications.
The best search is no search at all. We're building monitors – a solution that transforms your CatchALL query into a real-time updating feed.
I'll give a few examples of how they use the tool.
Example 1 -- a real estate PE firm that invests in multi-family residential buildings. Let's say they operate in Texas and want to get notifications about many different events. For example, they need to know about any new public transport infrastructure that will make a specific area more accessible -> prices will go up.
There are hundreds of valid records each month. However, to derive those records, we usually have to sift through tens of thousands of hyper-local news articles.
Example 2 -- Logistics & Supply Chain at an F100: tracking all the third-party providers, any kind of instability in the main regions, disruptions at air and marine ports, political discussions around regulation that might affect them, etc. There are like 20-50 events, and all of them are multi-lingual at a global scale.
Thousands of valid records each week, millions of web pages to derive them from.
I am concerned about your pricing, as "unlimited" anything seems to be fading away from most LLM providers. Also, I don't think it makes sense for B2B clients who have no problem paying per usage. You are going to find customers that want to use this to poll for updates daily, for example.
Are you using proxies for your text-based browser? I am curious how you are circumventing web crawling blocking.
We've been having similar thoughts about pricing and offering unlimited, but since it is feasible for us in the short term due to credits, we enjoy offering that option to early users, even if it may be a bit naive.
Having said that, we are currently working on a pilot with a company to whom we are offering live updates; they are paying per usage since they don't want to have to set it up themselves, so we can definitely see the demand there. We also offer an API for companies that want to reliably query the same thing at a preset cadence, which is also usage-based.
For crawling we use Firecrawl. They handle most of the blocking issues and proxies.
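For reference, a single Firecrawl scrape that returns a page as markdown looks roughly like the sketch below via the REST endpoint; exact parameters depend on the API version, so check their current docs:

```python
# Rough shape of a Firecrawl scrape request that returns markdown.
# Parameter names may differ across API versions; consult Firecrawl's docs.
import os
import requests

def firecrawl_scrape(url: str) -> str:
    resp = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["data"]["markdown"]
```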
current experience: https://imgur.com/a/2BB1mAA
I asked for the school board website for every public school district in the Bay Area (like BoardDocs, etc.), and it mostly returned useless links to the page listing the board members.
I asked ChatGPT5-thinking to do the same and it completed the request correctly, and outputted a CSV with a better schema in a couple of minutes.
We're working on better query interpretation, but in the meantime you could try being more specific, like "find BoardDocs or meeting document websites for each district", to guide it better. Also, you can usually figure out how it interpreted your request by looking at the entity criteria; those are all the criteria a piece of data needs to meet to make it into the set.
> It uses a text-based browser we built
Can you tell us more about this. How does it work?
A few design decisions we made that turned out pretty interesting:
1. We gave it an analyze results function. When the agent is on a search results page, instead of visiting each page one by one, it can just ask "What are the pricing models?" and get answers from all search results in parallel.
2. Long web pages get broken into chunks with navigation hints so the agent always knows where it is and can jump around without overloading its context ("continue reading", "jump to middle", etc.) - see the sketch after this list.
3. For sites that are commonly visited but have messy layouts or spread-out information, we built custom tool calls that let the agent request specific info that might be scattered across different pages and get it all consolidated into one clean text response.
4. We're adding DOM interaction via text in the next couple of days, so the agent can click buttons, fill forms, and send keystrokes, but everything still comes back as structured text instead of screenshots.
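Here's a bare-bones version of the chunking described in point 2; the chunk size and hint wording are made up for illustration:

```python
# Bare-bones page chunking with navigation hints, so the agent can read a long
# page piece by piece instead of loading it all into context at once.
def chunk_page(markdown: str, chunk_chars: int = 4000) -> list[str]:
    chunks = [markdown[i:i + chunk_chars]
              for i in range(0, len(markdown), chunk_chars)]
    total = len(chunks)
    labeled = []
    for i, chunk in enumerate(chunks, start=1):
        hints = [f"[chunk {i}/{total}]"]
        if i < total:
            hints.append('say "continue reading" for the next chunk')
        if total > 2:
            hints.append(f'say "jump to chunk N" (1-{total}) to skip around')
        labeled.append(chunk + "\n\n" + " | ".join(hints))
    return labeled
```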
My original interpretation was that you had built a full-blown browser, something akin to a Chromium/Firefox fork.
On the Data tab it says "no schema defined yet."
The Schema tab doesn’t seem to have a way to create a schema.
Most of the other tabs (except for Sources) looked blank.
I did see the chat on the right and the "51 items" counter at the top, but I couldn’t find any obvious way to view the results in a grid or table.
That's really strange; it sounds like Webhound for some reason deleted the schema after extraction ended, so although your data should still be tied to the session, it just isn't being displayed. Definitely not the expected behavior.
Quickly hit your limits, but on a complex dataset requiring looking at a lot of unstructured data across a lot of different web pages, it seems to do really well! https://hn.webhound.ai/dataset/c6ca527e-1754-4171-9326-11cc8...
Working on better task classification upfront to route simple requests more directly.
This comes with negative side effects for website owners (costs, downtime, etc.), as repeatedly reported here on HN (and experienced myself).
Does Webhound respect robots.txt directives and do you disclose the identity of your crawlers via user-agent header?
This is definitely something we need to address on our end. Site owners should have clear ways to opt out, and crawlers should be identifiable. We're looking into either working with Firecrawl to improve this or potentially switching to a solution that gives us more control over respecting these standards.
Appreciate you bringing this up.
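For reference, the standard-library way to honor robots.txt and disclose a crawler identity looks like the sketch below; it's a generic illustration, not a description of Firecrawl's or Webhound's current behavior:

```python
# Generic example of honoring robots.txt and disclosing a crawler identity.
import urllib.robotparser
from urllib.parse import urlparse
import requests

USER_AGENT = "ExampleResearchBot/0.1 (+https://example.com/bot)"

def polite_get(url: str) -> str | None:
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None                     # site has opted out; skip it
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    resp.raise_for_status()
    return resp.text
```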
If it isn't doing that in your session, you can usually just step in and tell it to and it will follow your instructions.
> I noticed you mentioned that "MCP stands for model context protocol." My current understanding, based on the initial problem description and the articles I've been reviewing, is that MCP refers to "Managed Care Plan." This is important because the entire schema and extraction plan are built around "Managed Care Plans."
Session ID: fcd1edb8-7b3c-480e-a352-ed6528556a63
I have to ask, how's that going? Genuinely curious to know!
Seems like y'all are doing well with it!
How’s it different from Parallel Web Systems?
I wanted to upgrade!
But your "upgrade to Pro" button on the Account page gets stuck on "Processing..."
Instead of just search query → final result (though you can do that too), you can step in and guide it. Tell it exactly where to look, what sources to check, how to dig deeper, how to use its notepad.
We've found this gets you way better results that actually match what you're looking for, as well as being a more satisfying user experience for people who already know how they would do the job themselves. Plus it lets you tap into niche datasets that wouldn't show up with just generic search queries.
I was actually building a version of this using NonBioS.ai, but this is already pretty well done, so will just use this instead.