When you're asking AI chatbots for answers, they're data-mining you
145 points
by rntn
7 hours ago
| 19 comments
| theregister.com
roscas
7 hours ago
[-]
Always good to remember people of this.

But not just AI bots or interfaces. Everything is saved and never deleted.

Remember Facebook? "We will never delete anything" is their business model.

So anything that you put on those "services" is out of your hands. But we still have an option: stop using these ad companies and let them die.

Back to AI: there are loads of offline models we can use, and tools like Ollama will even download them for you. Install Ollama, find a model name on the Ollama site, run "ollama run model-name", and you can use it.

OK, it is not GPT-5, but it can help you so much that you might not even need ChatGPT.
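
A minimal sketch of that flow on Linux (the curl one-liner is Ollama's documented install script; "llama3.2" is just an example model name):

    # install Ollama (Linux convenience script)
    curl -fsSL https://ollama.com/install.sh | sh
    # download and chat with a small local model
    ollama run llama3.2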

reply
Phemist
6 hours ago
[-]
Indeed, and asking Facebook to delete the data or to not use it for AI training is just another data point indicating you care about it. Your preferences will eventually be stripped through redesigns, refactors, careless usage, or Facebook's crooked idea of consent. The data will remain and be used again.
reply
lowwave
5 hours ago
[-]
It is better NOT to delete Facebook, but to spam your profile with other data and just leave it.
reply
everybodyknows
1 hour ago
[-]
This, BTW, is the only way (last I checked) to obfuscate Zillow's listing photos of the inside of a house that you have since bought; there is no multi-delete.
reply
Phemist
5 hours ago
[-]
Maybe, but that depends on Facebook's ability to filter that data. The filtering should be easy for my inactive-for-10-years FB account that suddenly uploads a bunch of garbage data. Mixing in genuine data seems self-defeating, especially considering the garbage may simply be filtered out.
reply
kibwen
2 hours ago
[-]
Ironically, this is a completely uncontroversial use case where AI excels.
reply
actionfromafar
5 hours ago
[-]
And/or change friends to random spam accounts first, then unfriend your real friends.
reply
Sophira
4 hours ago
[-]
There are also things like Oobabooga's text-generation-webui[0] which can present a similar interface to ChatGPT for local models.

I've had great success running Qwen3-8B-GGUF[1] on my RTX 2070 SUPER (8GB VRAM) using Oobabooga (everyone just calls it by the author's name; it's much catchier), so this is definitely doable on consumer hardware. Specifically, I run the Q4_K_M quant, as Oobabooga loads all of its layers into the GPU by default, making it nice and snappy. (Testing has shown that I can actually go up to the Q6_K quant and still fit every layer in the GPU, but then I have to manually specify that all layers should be loaded into the GPU rather than leaving it auto-determined.)
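
For anyone trying the same, a rough sketch of that manual offload (the flag comes from the llama.cpp loader; exact names may differ between text-generation-webui versions):

    # force all model layers onto the GPU instead of auto-splitting
    python server.py --model Qwen3-8B-Q6_K.gguf --n-gpu-layers 99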

It does obviously hallucinate more often than ChatGPT does, so care should be taken. That said, it's really nice to have something local.

There's a subreddit for running text gen models locally that people might be interested in: https://www.reddit.com/r/LocalLlama

[0] https://github.com/oobabooga/text-generation-webui

[1] https://huggingface.co/Qwen/Qwen3-8B-GGUF

reply
dylan604
4 hours ago
[-]
Facebook doesn't just get data from direct user input, though. So if people stop using FB, that's a good first step, but it does not stop the firehose of data.
reply
2d520075
3 hours ago
[-]
It would be more apt if this were a "Concerned Citizens of <city-name>" Facebook group, not Y Combinator's Hacker News.

If you are here and you require this reminder I would like to think that you are very lost.

reply
throwaway29246
4 hours ago
[-]
> Back to AI: there are loads of offline models we can use, and tools like Ollama will even download them for you. Install Ollama, find a model name on the Ollama site, run "ollama run model-name", and you can use it.

A privilege that is limited to the top 1%. It may come as a surprise, but most people don't have 32GB of VRAM [0]. The rest of us with normal people hardware are stuck with AI cloud providers or good old searching, which is a lot harder now that those same AI providers have ruined search results.

[0] There are some lightweight models you can run on normal people hardware, but they are just too unreliable even for casual usage and are likely to waste more of your time than they save.

reply
lm28469
6 hours ago
[-]
That's why you should use multiple accounts and bullshit about 30% of what you post. LLMs are a godsend for that; they poison their own well.
reply
SoftTalker
6 hours ago
[-]
I assume that companies like Facebook know pretty well which accounts are really the same person. Even if you are careful about keeping cookies in separate browser profiles, your machine can be fingerprinted, your posting habits and writing style can be fingerprinted, and Facebook/Google have the resources to do it.
reply
mgh2
5 hours ago
[-]
The risk is the externalities to actual users who don't know the difference and get affected by your 30% bs.
reply
BolexNOLA
5 hours ago
[-]
I recently set up LM Studio and have run OpenAI's 20B model locally using an AMD 9070 + 9800X3D. I honestly assumed it would be way more work to set up than it was. It has limitations, but given that it took me all of 5 minutes and that I can easily attach docs for it to reference as it all runs locally... it's fantastic. I've got a Claude model I've been messing with too.
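
For the curious, once LM Studio's local server is turned on it speaks an OpenAI-compatible API on localhost, so nothing has to leave the machine (port 1234 is the default; the model identifier depends on what you have loaded and is illustrative here):

    # query the local LM Studio server
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "openai/gpt-oss-20b",
           "messages": [{"role": "user", "content": "hello"}]}'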
reply
notpushkin
6 hours ago
[-]
> Always good to remember people of this.

You mean “remind”?

reply
glitchc
5 hours ago
[-]
Everyone knows this. Every layperson I talk to is aware that these companies are siphoning their information. When free email was introduced over two decades ago, the behaviour was the same. Everyone knew Microsoft and Google could read your emails. Then, like now, people think it's worth it. It is too useful a tool to have and the price is palatable.

What people don't want to do is sign up for yet another subscription. There's immense subscription fatigue among the general population, especially in tough economic times such as now.

reply
rafark
4 hours ago
[-]
Agreed. Not only do I think it's worth it, I actually like that I can contribute. I'm getting so much good value for free that I think it's fair. It's a win-win situation: the AIs get better and I get better answers.
reply
random3
3 hours ago
[-]
This is a funny take. I love your optimism, but it's so extremely naive, it should have a name.
reply
rafark
1 hour ago
[-]
It's not naive. The value these AI chatbots provide to me is extremely high.

I've been writing code for many years, but one of the areas I wanted to improve was debugging. I had always just printed variables, so last month I decided to start using a debugger instead of logging to the console. For the past few weeks I had only been using breakpoints and the resume-program function, because the step-into, step-over, and step-out functions had always confused me. An hour ago I sent Gemini images of my debugger and explained my problem; it explained what the step-* functions do and walked me through what to do step by step (I sent it a new screenshot after each step and asked it to explain what was going on).

I now have a much better understanding of how debuggers work thanks to Gemini.

I'm fine with Google getting my data; the value I just got was immense.

reply
smjburton
6 hours ago
[-]
> The more data you give any of the AI services, the more that information can potentially be used against you.

It may seem obvious, but Sam Altman also recently emphasized that the information you share with ChatGPT is not confidential, and could potentially be used against you in court.

[1] https://www.pcmag.com/news/altman-your-chatgpt-conversations...

[2] https://techcrunch.com/2025/07/25/sam-altman-warns-theres-no...

reply
djeastm
3 hours ago
[-]
Hasn't that always been the case? Phone companies providing records of calls and text messages, etc? Anything stored on someone else's servers is going to be something they have a duty to provide to police/courts, assuming they fall under that jurisdiction.
reply
Jalad
5 hours ago
[-]
This is always true, though. Any data a cloud company has on you can be subpoenaed.

It would be weird for him not to be transparent about that

reply
ceroxylon
5 hours ago
[-]
What about the people who did not opt to share or index their chats, and the companies that claim to not train on user chats?

https://privacy.anthropic.com/en/articles/10023555-how-do-yo...

> We do not actively set out to collect personal data to train our models

The 'snarky tech guy' tone of the article is a bit like nails on a chalkboard.

reply
hazKu4
5 hours ago
[-]
(At least to me) that language doesn’t feel particularly reassuring… especially given the duplicitous nature of data collection - i.e. “we don’t sell your data” translates to “we create a sophisticated advertising profile about you, and monetize that”
reply
boesboes
5 hours ago
[-]
That line is about data they find on the internet, soooo completely not relevant.
reply
Kim_Bruning
6 hours ago
[-]
Earlier discussion on the "ChatGPT chats in google" angle:

https://news.ycombinator.com/item?id=44778764

Interesting how much traction

     "[x] Make this chat discoverable (allows it to be shown in web searches)" 
gets in news articles.

People don't seem to have the same intuition for the web that they used to!

reply
falcor84
6 hours ago
[-]
> So, kids, let's not be asking any AI chatbot whether you should divorce your husband, how to cheat on your taxes, or if you should try to get your boss fired. That information will be kept, it may be revealed in a security breach, and, if so, it will come back to bite you in the buns.

Just as a PSA: there's nothing unique to AIs here. Whenever you ask a question of anyone, in any way, they then have the memory of you having asked it. A lot of sitcoms and comedic plays build their plot premise on such a question eventually reaching (either accurately or inaccurately) the person it was being hidden from.

And as someone who's into spy stories, I know that a big part of tradecraft is formulating your questions in a way that divulges the least about your actual intentions and current information.

If anything, LLM-driven AIs are the first technology that in principle lets you ask a complex question that is immediately forgotten. The catch is that you need to be running the AI yourself; if you ask an AI controlled by another entity, then you're trusting that entity with your question, regardless of whether there's an AI along the way.

reply
frakt0x90
6 hours ago
[-]
Books are also a technology that lets you answer complex questions without recording the question.
reply
Jalad
5 hours ago
[-]
Not necessarily, though; it depends on where you got the book (Amazon? the library?) and what your question is.
reply
shadowgovt
3 hours ago
[-]
In general, libraries actually do go out of their way to minimize the ways circulation history can be used against card-holders.

This isn't airtight, but it's a point of principle for most libraries and librarians, and they've gone to the mat over it. https://www.newtactics.org/tactics/protecting-right-privacy-...

reply
Theodores
3 hours ago
[-]
This was a surprisingly big thing back in the early 2000s with The War Against Terror. I think it was mostly for reasons of 'chilling effect', but the media made everyone aware that the Department of Homeland Security was paying attention to what books people took out of the library.

What was curious is that, at the time, there were few dangerous books in libraries. Catcher in the Rye and 1984 were about it. You wouldn't find a large-print copy of Che Guevara's Guerrilla Warfare, for instance.

I disagree about how well libraries minimise the risk of anyone knowing who is reading what. On the web, where so much is tracked by low-intelligence marketing people, there is more data than anyone can deal with; in effect, nobody can follow you that easily, only machines, with data that humans can't make sense of.

Meanwhile, libraries have had really good IT systems for decades, with everything tracked in a meaningful form with easy lookups. These systems are state owned, therefore it is no problem for a three letter agency to get the information they want from a library.

Meanwhile, libraries have had really good IT systems for decades, with everything tracked in a meaningful form with easy lookups. These systems are state owned, therefore it is no problem for a three letter agency to get the information they want from a library.

reply
shadowgovt
1 hour ago
[-]
Libraries don't tend to have consolidated, centralized IT. As a result, TLAs have to actually subpoena the databanks maintained by individual, regional library groups, and the ALA offers guidelines on how to respond to those (https://www.ala.org/advocacy/privacy/lawenforcement/guidelin...).

This, of course, doesn't mean your information is irretrievable by TLAs. But the premise of "tap every library to bypass the legal protections against data harvesting" is much trickier when applied to libraries than when applied to, say, Google. They also aren't meaningfully "state-owned" any more than the local Elks Club is state-owned; the vast majority of libraries are, at most, a county organ, and it is the particular and peculiar style of governance in the United States that when the Feds come knocking on a county's door, the county can tell them to come back with a warrant. That's if the library is government-affiliated at all; many are in fact private organizations created by wealthy donors at some point in the past (the New York Public Library and the Carnegie Library System are two such examples).

Many libraries also purposefully discard circulation data to minimize the surface area of what can be subpoenaed. The New York Public Library, for example, as a matter of policy purges the circulation data tied to a person's account soon after each loaned item is returned (https://www.nypl.org/help/about-nypl/legal-notices/privacy-p...).

reply
y0eswddl
5 hours ago
[-]
The questions and info you ask friends don't end up in a massive data profile on you, stored in somebody's cloud to be used for future manipulation/marketing/profiling...
reply
3-cheese-sundae
3 hours ago
[-]
They do, if they're asked over one of the many popular non-secure chat platforms.

I feel like most people don't wait until their friends are in the room to ask them questions or exchange info.

reply
makeworld
3 hours ago
[-]
Notably, Anthropic does not do this with Claude.

https://docs.anthropic.com/en/docs/claude-code/data-usage

reply
avmich
3 hours ago
[-]
I have an issue with the "stupidity" suggestion. Clicking "Agree" without full analysis is a tried and true Internet tradition; it's sad that somebody assumes it's serious and attempts to use it. We should have legal protections against wringing quasi-agreements from customers and then using those agreements against them.
reply
jdthedisciple
26 minutes ago
[-]
From what I know, only people who DELIBERATELY SHARED their chats and IGNORED THE WARNING that it makes them public had their chats appear in search engine results.

Which makes this article quite misleading.

reply
nachox999
4 hours ago
[-]
We need a tool that creates random fake data for the data-mining web apps.
reply
Qem
2 hours ago
[-]
I have never interacted with the AI Meta bundled into WhatsApp, fearing this.
reply
tietjens
1 hour ago
[-]
I’m pretty certain just using WhatsApp is enough.
reply
nottorp
5 hours ago
[-]
> "How to Use a Microwave Without Summoning Satan,"

Oh, nice idea. We should all ask that.

reply
mystraline
5 hours ago
[-]
Wait, you can summon Satan with a microwave?!

Lemee ask ShatGPT how to do that!

reply
unethical_ban
5 hours ago
[-]
Duck.ai claims to anonymize AI chats and says its conversations are not used for training. It is my go-to for casual usage.

Otherwise, I use local models for complex or potentially controversial questions.

reply
thisisit
4 hours ago
[-]
If you ask a layperson, the answer is "Yes, and?". If it's free, very few people care. Sure, you can run a local instance, and yes, it might be as simple as downloading Ollama, but not many will do it or even have a powerful enough computer to run it.

Worse yet, you might individually make that choice, but others might not care. They might feed emails/chats with you to a chatbot to parse them or to "make it think like them", and then the chatbot has info about you. So, as much as I understand the sentiment, this seems like a losing battle.

reply
dialup_sounds
4 hours ago
[-]
Why should they care?
reply
shadowgovt
3 hours ago
[-]
This is also true of search engines, social media, and various other interactive systems. Google's initial search-algorithm breakthrough was the realization that they had a massive source of data for search result correctness in the form of the behavior of users querying their site.

In general, it's wise to assume that all web interactions are a two-way street between the user and the service provider.

reply
akomtu
1 hour ago
[-]
Unlike previous technologies, chatbots know what users think at the most intimate level. Chatbots know, but currently cannot make sense of this knowledge. The near-term goal, I believe, is to build simple but accurate models of a user's psyche to serve them ads better. Instead of crude labels like "user 456 loves cars", corpos will have a compact psyche model of that user that predicts his reactions with 95% accuracy. This model will know the user better than he knows himself. And for a brief moment in history, while AI is good enough to predict us but not replace us, the adtech corpos will make bank.
reply
andrepd
6 hours ago
[-]
What can you do online these days without being data mined? Browsing gemini?
reply
em3rgent0rdr
6 hours ago
[-]
Download stuff in bulk (for instance the entire Wikipedia torrent) and then peruse it on your own computer.
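
One way to do that, as a sketch: grab a ZIM dump of Wikipedia and serve it locally with Kiwix (the file name here is illustrative):

    # browse an offline Wikipedia dump at http://localhost:8080
    kiwix-serve --port=8080 wikipedia_en_all_maxi.zim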
reply
Squeeeez
5 hours ago
[-]
Only if you are not using an OS that has something like Windows Recall enabled, or that weird StarDict setup with automatic online lookup on select, which came up recently.

I wonder how far back this has been going on. Did ICQ, IRC server hosts, or BBSes do similar things?

reply
reactordev
5 hours ago
[-]
No, back then storage was at a premium, so everything aside from config, accounts, and billing was ephemeral. It really wasn't until the cloud came along that storage was cheap enough to keep everything, around the time of the social media boom.

It wasn’t until around 2014 that I stopped building routes that did:

    DELETE FROM <table> WHERE id = ?;  -- FKs declared with ON DELETE CASCADE
reply
timeon
3 hours ago
[-]
> Windows Recall enabled

Just curious what other OS has something similar? MacOS maybe?

reply
boesboes
5 hours ago
[-]
What a terrible, utter bullshit article. Full of half truths and fear mongering. smh.
reply
AlexandrB
4 hours ago
[-]
> fear mongering

The last 10 years of tech "innovation" are basically what the article describes happening to other tech products [1]. So why is this fear mongering? It's basically inevitable unless:

a. There's legislation. But I would bet on legislation for the opposite (storing chats forever) instead.

b. AI moves on-device, where users have control of their own data. Also unlikely, considering how much tech loves web technologies and recurring revenue streams.

[1] https://www.cam.ac.uk/research/news/menstrual-tracking-app-d...

reply
actionfromafar
4 hours ago
[-]
All hail centralized cloud services?
reply
panny
6 hours ago
[-]
I would expect this, but it doesn't seem to be the case.

If I ask search.brave.com to give me a list of Gini coefficients for the top ten countries by GDP, it can't do it. If I tell it the data is available in the CIA World Factbook, it can then spit that info out promptly. But if I close the context and ask again, it hasn't learned this information and is once again unable to provide the list.

It didn't data-mine me. It had no better idea where to find this information the second time I asked. Others have reported the same experience with other AIs; it does not seem special to Brave.

reply
Etheryte
6 hours ago
[-]
Data mining doesn't mean the model is instantly updated; that would be prohibitively expensive at scale. It's way easier to batch your data together with a bunch of other data and use it later on. Even then, the model may never know where to find the information, since models are not one-to-one with their inputs; again, size and cost.
reply
panny
6 hours ago
[-]
>Data mining doesn't mean the model is instantly updated

I'm not expecting instant. Even next week it won't be there. It's like how AI never learned to count how many times the letter r appears in strawberry. Sure, now if you ask Brave, it will tell you three, but that is only because that question went viral. It didn't "learn" anything; it was just hard-coded for that particular answer. Ask it how many times the letter l appears in smallville and it will get it wrong again.

reply
simgt
6 hours ago
[-]
I didn't think for a second you could be right, so I tried with Claude. "l" in smallville was correct; then it suggested it would have gotten "l" in parallel wrong by answering 3 instead of 2 (but gets it right in a new chat). Then it suggested it would get "n" in millennium wrong by giving the right answer, and got it wrong in a new chat. https://claude.ai/share/93b46c3b-23a7-40ad-8a2b-ec2ed6c34a19

Thanks, that was enlightening.

reply
t0md4n
6 hours ago
[-]
It wouldn't be instant, next week, or even next month. Pre-training doesn't happen that frequently, and the cadence varies between model providers. As for the strawberry test, this is a tokenization issue fundamental to LLMs; however, most models can now solve this type of question by using thinking/code/tools to count the letters.
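
That is, instead of "reading" the word through its tokens, the model can emit and execute a snippet like:

    python3 -c 'print("strawberry".count("r"))'  # prints 3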

https://imgur.com/a/NqIJEx6

reply
Etheryte
5 hours ago
[-]
Both OpenAI and Anthropic average roughly one flagship release a year, and these are some of the best-funded companies in the space. The bigger your model, the more expensive it is to train, so you want to do it as rarely as reasonably possible. Every other company will either work with smaller models and/or train even more rarely, aside from the fine-tunes and customizations they put on top.
reply
ordersofmag
6 hours ago
[-]
LLMs aren't retrained and released on a weekly time scale. The data mining may only be reflected in the training of the next generation of the model.
reply
qwertytyyuu
6 hours ago
[-]
Every week is still way too expensive at scale; at best they'll update the training data with each model iteration.
reply
add-sub-mul-div
6 hours ago
[-]
Brave isn't data mining you for your benefit, they're doing it for their benefit.
reply
panny
4 minutes ago
[-]
Likewise, I'm not teaching their AI where to find Gini coefficients for their benefit, but for mine. I'd like their AI to learn something, if only to make my experience better. But there's no learning happening.
reply
hluska
3 hours ago
[-]
You're expecting models to constantly retrain themselves based on riddles. That's not reasonable, nor is it even economically feasible right now. At massive scale, I question whether it's even technically feasible.
reply