When you're asking AI chatbots for answers, they're data-mining you
145 points
by rntn
7 hours ago
| 19 comments
| theregister.com
roscas
7 hours ago
[-]
Always good to remember people of this.

But not just AI bots or interfaces. Everything is saved and never deleted.

Remember Facebook? "We will never delete anything" is their business model.

So anything that you put on those "services" is out of your hands. But we still have an option: stop using these ad companies and let them die.

Back to AI: there are loads of offline models we can use, and tools like Ollama will even download them for you. Install Ollama, find a model name on the Ollama site, run "ollama run model-name", and you can use it.

OK, it is not GPT-5, but it can help you so much that you might not even need ChatGPT.
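
A minimal sketch of that flow on Linux (the curl one-liner is Ollama's documented install script; "llama3.2" is just an example model name):

    # install Ollama (Linux convenience script)
    curl -fsSL https://ollama.com/install.sh | sh
    # download and chat with a small local model
    ollama run llama3.2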

reply
Phemist
6 hours ago
[-]
Indeed, and asking Facebook to delete the data or to not use it for AI training is just another data point indicating you care about it. Your preferences will eventually be stripped through redesigns, refactors, careless usage, or Facebook's crooked idea of consent. The data will remain and be used again.
reply
lowwave
5 hours ago
[-]
It is better NOT to delete Facebook, but to spam your profile with other data and just leave it.
reply
everybodyknows
1 hour ago
[-]
This, BTW, is the only way (last I checked) to obfuscate Zillow's listing photos of the inside of a house that you have since bought; there is no multi-delete.
reply
Phemist
5 hours ago
[-]
Maybe, but that depends on Facebook's ability to filter that data. The filtering should be easy for my inactive-for-10-years FB account that suddenly uploads a bunch of garbage data. Mixing in genuine data seems self-defeating, especially considering the garbage may simply be filtered out.
reply
kibwen
2 hours ago
[-]
Ironically, this is a completely uncontroversial use case where AI excels.
reply
actionfromafar
5 hours ago
[-]
And/or change friends to random spam accounts first, then unfriend your real friends.
reply
Sophira
4 hours ago
[-]
There are also things like Oobabooga's text-generation-webui[0] which can present a similar interface to ChatGPT for local models.

I've had great success running Qwen3-8B-GGUF[1] on my RTX 2070 SUPER (8GB VRAM) using Oobabooga (everyone just calls it by the author's name; it's much catchier), so this is definitely doable on consumer hardware. Specifically, I run the Q4_K_M quant, as Oobabooga loads all of its layers into the GPU by default, making it nice and snappy. (Testing has shown that I can actually go up to the Q6_K quant and still fit every layer in the GPU, but then I have to manually specify that all layers should be loaded into the GPU rather than leaving it auto-determined.)
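
For anyone trying the same, a rough sketch of that manual offload (the flag comes from the llama.cpp loader; exact names may differ between text-generation-webui versions):

    # force all model layers onto the GPU instead of auto-splitting
    python server.py --model Qwen3-8B-Q6_K.gguf --n-gpu-layers 99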

It does obviously hallucinate more often than ChatGPT does, so care should be taken. That said, it's really nice to have something local.

There's a subreddit for running text gen models locally that people might be interested in: https://www.reddit.com/r/LocalLlama

[0] https://github.com/oobabooga/text-generation-webui

[1] https://huggingface.co/Qwen/Qwen3-8B-GGUF

reply
dylan604
4 hours ago
[-]
Facebook doesn't just get data from direct user input, though. So if people stop using FB, that's a good first step, but it does not stop the firehose of data.
reply
2d520075
3 hours ago
[-]
It would be more apt if this were a "Concerned Citizens of <city-name>" Facebook group, not Y Combinator's Hacker News.

If you are here and you require this reminder I would like to think that you are very lost.

reply
throwaway29246
4 hours ago
[-]
> Back to AI: there are loads of offline models we can use, and tools like Ollama will even download them for you. Install Ollama, find a model name on the Ollama site, run "ollama run model-name", and you can use it.

A privilege that is limited to the top 1%. It may come as a surprise, but most people don't have 32GB of VRAM [0]. The rest of us with normal people hardware are stuck with AI cloud providers or good old searching, which is a lot harder now that those same AI providers have ruined search results.

[0] There are some lightweight models you can run on normal people hardware, but they are just too unreliable even for casual usage and are likely to waste more of your time than they save.

reply
lm28469
6 hours ago
[-]
That's why you should use multiple accounts and bullshit about 30% of what you post. LLMs are a godsend for that; they poison their own well.
reply
SoftTalker
6 hours ago
[-]
I assume that companies like Facebook know pretty well which accounts are really the same person. Even if you are careful about keeping cookies in separate browser profiles, your machine can be fingerprinted, your posting habits and writing style can be fingerprinted, and Facebook/Google have the resources to do it.
reply
mgh2
5 hours ago
[-]
The risk is the externalities to actual users who don't know the difference and get affected by your 30% bs.
reply
BolexNOLA
5 hours ago
[-]
I recently set up LM Studio and have run OpenAI's 20B model locally using an AMD 9070 + 9800X3D. I honestly assumed it would be way more work to set up than it was. It has limitations, but given that it took me all of 5 minutes and that I can easily attach docs for it to reference as it all runs locally... it's fantastic. I've got a Claude model I've been messing with too.
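
For the curious, once LM Studio's local server is turned on it speaks an OpenAI-compatible API on localhost, so nothing has to leave the machine (port 1234 is the default; the model identifier depends on what you have loaded and is illustrative here):

    # query the local LM Studio server
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "openai/gpt-oss-20b",
           "messages": [{"role": "user", "content": "hello"}]}'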
reply
notpushkin
6 hours ago
[-]
> Always good to remember people of this.

You mean “remind”?

reply
glitchc
5 hours ago
[-]
Everyone knows this. Every layperson I talk to is aware that these companies are siphoning their information. When free email was introduced over two decades ago, the behaviour was the same. Everyone knew Microsoft and Google could read your emails. Then, like now, people think it's worth it. It is too useful a tool to have and the price is palatable.

What people don't want to do is sign up for yet another subscription. There's immense subscription fatigue among the general population, especially in tough economic times such as now.

reply
rafark
4 hours ago
[-]
Agreed. Not only do I think it's worth it, I actually like that I can contribute. I'm getting so much good value for free that I think it's fair. It's a win-win situation: the AIs get better and I get better answers.
reply
random3
3 hours ago
[-]
This is a funny take. I love your optimism, but it's so extremely naive, it should have a name.
reply
rafark
1 hour ago
[-]
It's not naive. The value these AI chatbots provide to me is extremely high.

I've been writing code for many years, but one of the areas I wanted to improve was debugging. I had always just printed variables, so last month I decided to start using a debugger instead of logging to the console. For the past few weeks I had only been using breakpoints and the resume-program function, because the step-into, step-over, and step-out functions had always confused me. An hour ago I sent Gemini images of my debugger and explained my problem; it explained what the step-* functions do and walked me through what to do step by step (I sent it a new screenshot after each step and asked it to explain what was going on).

I now have a much better understanding of how debuggers work thanks to Gemini.

I'm fine with Google getting my data; the value I just got was immense.

reply
smjburton
6 hours ago
[-]
> The more data you give any of the AI services, the more that information can potentially be used against you.

It may seem obvious, but Sam Altman also recently emphasized that the information you share with ChatGPT is not confidential, and could potentially be used against you in court.

[1] https://www.pcmag.com/news/altman-your-chatgpt-conversations...

[2] https://techcrunch.com/2025/07/25/sam-altman-warns-theres-no...

reply
djeastm
3 hours ago
[-]
Hasn't that always been the case? Phone companies providing records of calls and text messages, etc? Anything stored on someone else's servers is going to be something they have a duty to provide to police/courts, assuming they fall under that jurisdiction.
reply
Jalad
5 hours ago
[-]
This is always true, though. Any data a cloud company has on you can be subpoenaed.

It would be weird for him not to be transparent about that

reply
ceroxylon
5 hours ago
[-]
What about the people who did not opt to share or index their chats, and the companies that claim to not train on user chats?

https://privacy.anthropic.com/en/articles/10023555-how-do-yo...

> We do not actively set out to collect personal data to train our models

The 'snarky tech guy' tone of the article is a bit like nails on a chalkboard.

reply
hazKu4
5 hours ago
[-]
(At least to me) that language doesn’t feel particularly reassuring… especially given the duplicitous nature of data collection - i.e. “we don’t sell your data” translates to “we create a sophisticated advertising profile about you, and monetize that”
reply
boesboes
5 hours ago
[-]
That line is about data they find on the internet, soooo completely not relevant.
reply
Kim_Bruning
6 hours ago
[-]
Earlier discussion on the "ChatGPT chats in google" angle:

https://news.ycombinator.com/item?id=44778764

Interesting how much traction

     "[x] Make this chat discoverable (allows it to be shown in web searches)" 
gets in news articles.

People don't seem to have the same intuition for the web that they used to!

reply
falcor84
6 hours ago
[-]
> So, kids, let's not be asking any AI chatbot whether you should divorce your husband, how to cheat on your taxes, or if you should try to get your boss fired. That information will be kept, it may be revealed in a security breach, and, if so, it will come back to bite you in the buns.

Just as a PSA: there's nothing unique to AIs here. Whenever you ask a question of anyone, in any way, they then have the memory of you having asked it. A lot of sitcoms and comedic plays build their plot premise on such a question eventually reaching (either accurately or inaccurately) the person it was being hidden from.

And as someone who's into spy stories, I know that a big part of tradecraft is formulating your questions in a way that divulges the least about your actual intentions and current information.

If anything, LLM-driven AIs are the first technology that in principle lets you ask a complex question that is immediately forgotten. The catch is that you need to be running the AI yourself; if you ask an AI controlled by another entity, then you're trusting that entity with your question, regardless of whether there's an AI along the way.

reply
frakt0x90
6 hours ago
[-]
Books are also a technology that lets you answer complex questions without recording the question.
reply
Jalad
5 hours ago
[-]
Not necessarily, though; it depends on where you got the book (Amazon? the library?) and what your question is.
reply
shadowgovt
3 hours ago
[-]
In general, libraries actually do go out of their way to minimize the ways circulation history can be used against card-holders.

This isn't airtight, but it's a point of principle for most libraries and librarians, and they've gone to the mat over it. https://www.newtactics.org/tactics/protecting-right-privacy-...

reply
Theodores
3 hours ago
[-]
This was a surprisingly big thing back in the early 2000s with The War Against Terror. I think it was mostly for reasons of 'chilling effect', but the media made everyone aware that the Department of Homeland Security was paying attention to what books people took out of the library.

What was curious is that, at the time, there were few dangerous books in libraries. Catcher in the Rye and 1984 were about it. You wouldn't find a large-print copy of Che Guevara's Guerrilla Warfare, for instance.

I disagree about how well libraries minimise the risk of anyone knowing who is reading what. On the web, where so much is tracked by low-intelligence marketing people, there is more data than anyone can deal with; in effect, nobody can follow you that easily, only machines, with data that humans can't make sense of.

Meanwhile, libraries have had really good IT systems for decades, with everything tracked in a meaningful form with easy lookups. These systems are state owned, therefore it is no problem for a three letter agency to get the information they want from a library.

Meanwhile, libraries have had really good IT systems for decades, with everything tracked in a meaningful form with easy lookups. These systems are state owned, therefore it is no problem for a three letter agency to get the information they want from a library.

reply
shadowgovt
1 hour ago
[-]
Libraries don't tend to have consolidated, centralized IT. As a result, TLAs have to actually subpoena the databanks maintained by individual, regional library groups, and the ALA offers guidelines on how to respond to those (https://www.ala.org/advocacy/privacy/lawenforcement/guidelin...).

This, of course, doesn't mean your information is irretrievable by TLAs. But the premise of "tap every library to bypass the legal protections against data harvesting" is much trickier when applied to libraries than when applied to, say, Google. They also aren't meaningfully "state-owned" any more than the local Elks Club is state-owned; the vast majority of libraries are, at most, a county organ, and it is the particular and peculiar style of governance in the United States that when the Feds come knocking on a county's door, the county can tell them to come back with a warrant. That's if the library is government-affiliated at all; many are in fact private organizations created by wealthy donors at some point in the past (the New York Public Library and the Carnegie Library System are two such examples).

Many libraries also purposefully discard circulation data to minimize the surface area of what can be subpoenaed. The New York Public Library, for example, as a matter of policy purges the circulation data tied to a person's account soon after each loaned item is returned (https://www.nypl.org/help/about-nypl/legal-notices/privacy-p...).

reply
y0eswddl
5 hours ago
[-]
The questions and info you ask friends don't end up in a massive data profile on you, stored in somebody's cloud to be used for future manipulation/marketing/profiling...
reply
3-cheese-sundae
3 hours ago
[-]
They do, if they're asked over one of the many popular non-secure chat platforms.

I feel like most people don't wait until their friends are in the room to ask them questions or exchange info.

reply
makeworld
3 hours ago
[-]
Notably, Anthropic does not do this with Claude.

https://docs.anthropic.com/en/docs/claude-code/data-usage

reply
avmich
3 hours ago
[-]
I have an issue with the "stupidity" suggestion. Clicking "Agree" without full analysis is a tried and true Internet tradition; it's sad that somebody assumes it's serious and attempts to use it. We should have legal protections against wringing quasi-agreements from customers and then using those agreements against them.
reply
jdthedisciple
26 minutes ago
[-]
From what I know, only people who DELIBERATELY SHARED their chats and IGNORED THE WARNING that it makes them public had their chats appear in search engine results.

Which makes this article quite misleading.

reply
nachox999
4 hours ago
[-]
We need a tool that creates random fake data for the data-mining web apps.
reply
Qem
2 hours ago
[-]
I have never interacted with the AI Meta bundled into WhatsApp, fearing this.
reply
tietjens
1 hour ago
[-]
I’m pretty certain just using WhatsApp is enough.
reply
nottorp
5 hours ago
[-]
> "How to Use a Microwave Without Summoning Satan,"

Oh, nice idea. We should all ask that.

reply
mystraline
5 hours ago
[-]
Wait, you can summon Satan with a microwave?!

Lemee ask ShatGPT how to do that!

reply
unethical_ban
5 hours ago
[-]
Duck.ai claims to anonymize AI chats and says its conversations are not used for training. It is my go-to for casual usage.

Otherwise, I use local models for complex or potentially controversial questions.

reply
thisisit
4 hours ago
[-]
If you ask a layperson, the answer is "Yes, and?". If it's free, very few people care. Sure, you can run a local instance, and yes, it might be as simple as downloading Ollama, but not many will do it or even have a powerful enough computer to run it.

Worse yet, you might individually make that choice, but others might not care. They might feed emails/chats with you to a chatbot to parse them or to "make it think like them", and then the chatbot has info about you. So, as much as I understand the sentiment, this seems like a losing battle.

reply
dialup_sounds
4 hours ago
[-]
Why should they care?
reply
shadowgovt
3 hours ago
[-]
This is also true of search engines, social media, and various other interactive systems. Google's initial search-algorithm breakthrough was the realization that they had a massive source of data for search result correctness in the form of the behavior of users querying their site.

In general, it's wise to assume that all web interactions are a two-way street between the user and the service provider.

reply
akomtu
1 hour ago
[-]
Unlike previous technologies, chatbots know what users think at the most intimate level. Chatbots know, but currently cannot make sense of this knowledge. The near-term goal, I believe, is to build simple but accurate models of a user's psyche to serve them ads better. Instead of crude labels like "user 456 loves cars", corpos will have a compact psyche model of that user that predicts his reactions with 95% accuracy. This model will know the user better than he knows himself. And for a brief moment in history, while AI is good enough to predict us but not replace us, the adtech corpos will make bank.
reply
andrepd
6 hours ago
[-]
What can you do online these days without being data mined? Browsing gemini?
reply
em3rgent0rdr
6 hours ago
[-]
Download stuff in bulk (for instance the entire Wikipedia torrent) and then peruse it on your own computer.
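
One way to do that, as a sketch: grab a ZIM dump of Wikipedia and serve it locally with Kiwix (the file name here is illustrative):

    # browse an offline Wikipedia dump at http://localhost:8080
    kiwix-serve --port=8080 wikipedia_en_all_maxi.zim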
reply
Squeeeez
5 hours ago
[-]
Only if you are not using an OS that has something like Windows Recall enabled, or that weird StarDict setup with automatic online lookup on select, which came up recently.

I wonder how far back this has been going on. Did ICQ, IRC server hosts, or BBSes do similar things?

reply
reactordev
5 hours ago
[-]
No, back then storage was at a premium, so everything aside from config, accounts, and billing was ephemeral. It really wasn't until the cloud came along that storage was cheap enough to keep everything, around the time of the social media boom.

It wasn’t until around 2014 that I stopped building routes that did:

    DELETE FROM <table> WHERE id = ?;  -- FKs declared with ON DELETE CASCADE
reply
timeon
3 hours ago
[-]
> Windows Recall enabled

Just curious what other OS has something similar? MacOS maybe?

reply
boesboes
5 hours ago
[-]
What a terrible, utter bullshit article. Full of half truths and fear mongering. smh.
reply
AlexandrB
4 hours ago
[-]
> fear mongering

The last 10 years of tech "innovation" are basically what the article describes happening to other tech products [1]. So why is this fear mongering? It's basically inevitable unless:

a. There's legislation. But I would bet on legislation for the opposite (storing chats forever) instead.

b. AI moves on-device, where users have control of their own data. Also unlikely, considering how much tech loves web technologies and recurring revenue streams.

[1] https://www.cam.ac.uk/research/news/menstrual-tracking-app-d...

reply
actionfromafar
4 hours ago
[-]
All hail centralized cloud services?
reply
panny
6 hours ago
[-]
I would expect this, but it doesn't seem to be the case.

If I ask search.brave.com to give me a list of Gini coefficients for the top ten countries by GDP, it can't do it. If I tell it the data is available in the CIA World Factbook, it can then spit that info out promptly. But if I close the context and ask again, it hasn't learned this information and is once again unable to provide the list.

It didn't data-mine me. It had no better idea where to find this information the second time I asked. Others have reported the same experience with other AIs; it does not seem special to Brave.

reply
Etheryte
6 hours ago
[-]
Data mining doesn't mean the model is instantly updated; that would be prohibitively expensive at scale. It's way easier to batch your data together with a bunch of other data and use it later on. Even then, the model may never know where to find the information, since models are not one-to-one with their inputs; again, size and cost.
reply
panny
6 hours ago
[-]
>Data mining doesn't mean the model is instantly updated

I'm not expecting instant. Even next week it won't be there. It's like how AI never learned to count how many times the letter r appears in strawberry. Sure, now if you ask Brave, it will tell you three, but that is only because that question went viral. It didn't "learn" anything; it was just hard-coded for that particular answer. Ask it how many times the letter l appears in smallville and it will get it wrong again.

reply
simgt
6 hours ago
[-]
I didn't think for a second you could be right, so I tried with Claude. "l" in smallville was correct; then it suggested it would have gotten "l" in parallel wrong by answering 3 instead of 2 (but gets it right in a new chat). Then it suggested it would get "n" in millennium wrong by giving the right answer, and got it wrong in a new chat. https://claude.ai/share/93b46c3b-23a7-40ad-8a2b-ec2ed6c34a19

Thanks, that was enlightening.

reply
t0md4n
6 hours ago
[-]
It wouldn't be instant, next week, or even next month. Pre-training doesn't happen that frequently, and the cadence varies between model providers. As for the strawberry test, this is a tokenization issue fundamental to LLMs; however, most models can now solve this type of question by using thinking/code/tools to count the letters.
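
That is, instead of "reading" the word through its tokens, the model can emit and execute a snippet like:

    python3 -c 'print("strawberry".count("r"))'  # prints 3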

https://imgur.com/a/NqIJEx6

reply
Etheryte
5 hours ago
[-]
Both OpenAI and Anthropic average roughly one flagship release a year, and these are some of the best-funded companies in the space. The bigger your model, the more expensive it is to train, so you want to do it as rarely as reasonably possible. Every other company will either work with smaller models and/or train even more rarely, aside from the fine-tunes and customizations they put on top.
reply
ordersofmag
6 hours ago
[-]
LLMs aren't retrained and released on a weekly time scale. The data mining may only be reflected in the training of the next generation of the model.
reply
qwertytyyuu
6 hours ago
[-]
Every week is still way too expensive at scale; at best they'll update the training data with each model iteration.
reply
add-sub-mul-div
6 hours ago
[-]
Brave isn't data mining you for your benefit, they're doing it for their benefit.
reply
panny
4 minutes ago
[-]
Likewise, I'm not teaching their AI where to find Gini coefficients for their benefit, but for mine. I'd like their AI to learn something, if only to make my experience better. But there's no learning happening.
reply
hluska
3 hours ago
[-]
You're expecting models to constantly retrain themselves based on riddles. That's not reasonable, nor is it even economically feasible right now. At massive scale, I question whether it's even technically feasible.
reply