Ask HN: How are you handling data retention across your stack?
4 points | 2 days ago | 3 comments
For people building SaaS with data across multiple systems (S3, DBs, caches, etc.), do you actually have a clean way to manage retention/deletion across all of them? (Especially when each customer has custom policies.)

Or is it more a mix of lifecycle rules, cron jobs, and manual cleanup?

How are you doing this today? I feel like this is a blocker in enterprise deals when selling to regulated industries.

sinansaka
1 day ago
S3 lifecycle policies and scheduled RDBMS jobs are the low-hanging fruit here.
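
For the S3 side it's a one-time API call rather than a recurring job. A rough sketch with boto3 (the bucket name, prefix, and 90-day window are just placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Expire everything under a prefix after 90 days; names are placeholders.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-analytics-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-raw-events-90d",
                    "Filter": {"Prefix": "raw/events/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 90},
                }
            ]
        },
    )

The RDBMS side is usually just a scheduled DELETE with a created_at < now() - interval cutoff, run from pg_cron or an external scheduler.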

I used to work on a data platform team and built a cleaning service that used tags and object hierarchy trees to find and clean old PII data. Not an easy thing to do, as our data analytics bucket had over 7 PiB of data.

Overall the architecture was based on three components: detector, enforcer, cleaner. The detector sifted through the data lake to find PII datasets (LLM-based), the enforcer tracked down the ETL for those datasets in our VCS to set the appropriate tags/metadata (custom coding agent), and finally the cleaner used search to find and clean the data based on that metadata.
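
The cleaner was roughly in this spirit (illustrative sketch only; the bucket, prefix, "pii" tag convention, and one-year cutoff here are invented, and the real thing went through a search index rather than listing the bucket directly):

    import boto3
    from datetime import datetime, timedelta, timezone

    s3 = boto3.client("s3")
    BUCKET = "example-datalake"  # placeholder
    CUTOFF = datetime.now(timezone.utc) - timedelta(days=365)

    # Walk a prefix and delete old objects the enforcer tagged as PII.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="curated/"):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= CUTOFF:
                continue
            tags = s3.get_object_tagging(Bucket=BUCKET, Key=obj["Key"])["TagSet"]
            if any(t["Key"] == "pii" and t["Value"] == "true" for t in tags):
                s3.delete_object(Bucket=BUCKET, Key=obj["Key"])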

preston-kwei
7 hours ago
I'm curious how the LLM-based detector worked. How often did you run into false positives? (I'm assuming you leaned toward the more sensitive side given the importance of the data.)
muzani
1 day ago
I feel like this should be a service in itself, similar to Heroku or Supabase. Just tick which laws you want to adhere to and upload files to their buckets. Tick another box for audit logs and such, and it'll ask you where you need a human in the loop and which buttons those humans need to press. So a bit like Carta or Deel in that sense.

I've had some big enterprise deals fall through because of something like this - military, insurance, fintech, etc.

preston-kwei
7 hours ago
I agree. I'm trying to sell to enterprise right now and this is a blocker. Basically, my product stores emails in S3, metadata in Supabase, and caches in Redis. It's all fragmented and super difficult to keep track of. Plus Supabase isn't WORM-compliant, so it's hard to guarantee data doesn't get deleted.
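
The best I've managed so far is a single per-customer sweep that touches every store in one place, roughly like this (sketch only; the bucket, table, and key pattern are specific to my setup, and the connection details are elided):

    import boto3
    import psycopg2
    import redis
    from datetime import datetime, timedelta, timezone

    def purge_customer(customer_id: str, retention_days: int) -> None:
        cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)

        # S3: raw emails live under a per-customer prefix.
        s3 = boto3.client("s3")
        pages = s3.get_paginator("list_objects_v2").paginate(
            Bucket="example-emails", Prefix=f"{customer_id}/")
        for page in pages:
            for obj in page.get("Contents", []):
                if obj["LastModified"] < cutoff:
                    s3.delete_object(Bucket="example-emails", Key=obj["Key"])

        # Supabase is Postgres underneath, so delete the metadata rows directly.
        conn = psycopg2.connect("postgresql://...")  # connection string elided
        with conn, conn.cursor() as cur:
            cur.execute(
                "DELETE FROM email_metadata"
                " WHERE customer_id = %s AND created_at < %s",
                (customer_id, cutoff),
            )

        # Redis: cache keys are namespaced per customer.
        r = redis.Redis()
        for key in r.scan_iter(f"cache:{customer_id}:*"):
            r.delete(key)

It centralizes the deletes and lets retention_days vary per customer, but it doesn't solve the WORM/audit side at all.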
crawlwright
2 days ago
Mostly cron jobs and lifecycle rules in my experience; it's rarely clean. S3 lifecycle policies handle the easy stuff, but anything touching multiple systems usually ends up as a scheduled job that someone wrote once and nobody fully trusts.
preston-kwei
7 hours ago
Do you think customers trust it either? I'm trying to differentiate between "good enough" and when enterprise customers actually care. I mean, it doesn't matter until a customer asks for their data.