FilterHN

The challenges of soft delete

91 points

by buchanae

5 hours ago

| 20 comments

| atlas9.dev

▲

MaxGabriel

3 hours ago

[-]

This might stem from the domain I work in (banking), but I have the opposite take. Soft delete pros to me:

* It's obvious from the schema: If there's a `deleted_at` column, I know how to query the table correctly (vs thinking rows aren't DELETEd, or knowing where to look in another table)

* One way to do things: Analytics queries, admin pages, it all can look at the same set of data, vs having separate handling for historical data.

* DELETEs are likely fairly rare by volume for many use cases

* I haven't found soft-deleted rows to be a big performance issue. Intuitively this should be true, since queries should be O(log N)

* Undoing is really easy, because all the relationships stay in place, vs data already being moved elsewhere (In practice, I haven't found much need for this kind of undo).

In most cases, I've really enjoyed going even further and making rows fully immutable, using a new row to handle updates. This makes it really easy to reference historical data.

If I was doing the logging approach described in the article, I'd use database triggers that keep a copy of every INSERT/UPDATE/DELETEd row in a duplicate table. This way it all stays in the same database—easy to query and replicate elsewhere.
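A minimal sketch of the trigger idea above, using Python's built-in `sqlite3` so it runs anywhere. The `accounts` and `accounts_audit` tables and the trigger names are hypothetical; a production database would have its own trigger syntax and would likely record more context (who made the change, for instance).

```python
import sqlite3

# Hypothetical schema: `accounts` is the live table, `accounts_audit`
# receives a copy of every inserted, updated, or deleted row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER);
CREATE TABLE accounts_audit (
    audit_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    op         TEXT NOT NULL,
    id         INTEGER,
    owner      TEXT,
    balance    INTEGER,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER accounts_ai AFTER INSERT ON accounts BEGIN
    INSERT INTO accounts_audit (op, id, owner, balance)
    VALUES ('INSERT', NEW.id, NEW.owner, NEW.balance);
END;
CREATE TRIGGER accounts_au AFTER UPDATE ON accounts BEGIN
    INSERT INTO accounts_audit (op, id, owner, balance)
    VALUES ('UPDATE', NEW.id, NEW.owner, NEW.balance);
END;
CREATE TRIGGER accounts_ad AFTER DELETE ON accounts BEGIN
    INSERT INTO accounts_audit (op, id, owner, balance)
    VALUES ('DELETE', OLD.id, OLD.owner, OLD.balance);
END;
""")

conn.execute("INSERT INTO accounts (id, owner, balance) VALUES (1, 'alice', 100)")
conn.execute("UPDATE accounts SET balance = 150 WHERE id = 1")
conn.execute("DELETE FROM accounts WHERE id = 1")  # a real, hard DELETE

# The live table is empty, but the audit table kept every version.
audit_rows = conn.execute(
    "SELECT op, id, owner, balance FROM accounts_audit ORDER BY audit_id"
).fetchall()
print(audit_rows)
```

The audit table lives in the same database as the live table, so it can be queried and replicated with the same tooling.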

▲

gleenn

2 hours ago

[-]

If you're implementing immutable DB semantics maybe you should consider Datomic or alternatives because then you get that for free, for everything, and you also get time travel which is an amazing feature on top. It lets you see the full, coherent state of the DB at any moment!

▲

ozim

33 minutes ago

[-]

> DELETEs are likely fairly rare by volume for many use cases

I think one of our problems is getting users to delete stuff they don’t need anymore.

▲

nine_k

3 hours ago

[-]

> DELETEs are likely fairly rare by volume for many use cases

All your other points make sense, given this assumption.

I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.

> Undoing is really easy

Depends on whether undoing even happens, and whether the act of deletion and undeletion require audit records anyway.

In short, there are cases when soft-deletion works well, and is a good approach. In other cases it does not, and is not. Analysis is needed before adopting it.
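One mitigation sometimes used when a large share of rows is soft-deleted is a partial index that covers only live rows. The sketch below uses SQLite through Python's `sqlite3`; the `documents` table and index name are made up for illustration, and whether this actually helps depends on the engine and the workload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id         INTEGER PRIMARY KEY,
    owner_id   INTEGER NOT NULL,
    deleted_at TEXT               -- NULL means the row is live
);
-- Partial index: soft-deleted rows are not indexed at all.
CREATE INDEX documents_live_owner
    ON documents (owner_id)
    WHERE deleted_at IS NULL;
""")

# Ten rows for owner 7, of which seven (ids 0..6) are soft-deleted.
rows = [(i, 7, "2024-01-01" if i < 7 else None) for i in range(10)]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?)", rows)

# The query repeats the index predicate so the planner may use the index.
live_ids = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM documents "
        "WHERE owner_id = ? AND deleted_at IS NULL ORDER BY id",
        (7,),
    )
]
print(live_ids)
```

The index stays proportional to the live data, not to the total row count, which is the part that grows when deletes are soft.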

▲

tharkun__

2 hours ago

[-]

Agreed. And if deletes are soft, you likely really just wanted a complete audit history of all updates too (at least in the cases I've been part of). And then performance _definitely_ would suffer if you don't have a separate audit/archive table for all of those.

▲

pixl97

1 hour ago

[-]

I mean, yes, growth forever doesn't tend to work.

I've seen a number of apps that require audit histories work on a basis where they are archived at a particular time, and that's when the deletes occurred and indexes fully rebuilt. This is typically scheduled during the least busy time of the year as it's rather IO intensive.

▲

da_chicken

2 hours ago

[-]

> I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.

At that point you should probably investigate partitioning or data warehousing.

▲

eddd-ddde

2 hours ago

[-]

I never got to test this, but I always wanted to explore in postgres using table partitions to store soft deleted items in a different drive as a kind of archived storage.

I'm pretty sure it is possible, and it might even yield some performance improvements.

That way you wouldn't have to worry about deleted items impacting performance too much.

▲

gleenn

2 hours ago

[-]

It's definitely an interesting approach but the problem is now you have to change all your queries and undeleting gets more complicated. There are strong trade-offs with almost all the approaches I've heard of.

▲

snuxoll

1 hour ago

[-]

With partitioning? No you don't. It gets a bit messy if you also want to partition a table by other values (like tenant id or something), since then you probably need to get into using table inheritance instead of the easier declarative partitioning - but either technique just gives you a single effective table to query.

▲

edmundsauto

14 minutes ago

[-]

Pg moves the data between positions on update?

▲

rawgabbit

2 hours ago

[-]

I have worked with databases my entire career. I hate triggers with a passion. The issue is no one “owns” or has the authority to keep triggers clean. Eventually triggers become a dumping ground for all sorts of nasty slow code.

I usually tell people to stop treating databases like firebase and wax on/wax off records and fields willy nilly. You need to treat the database as the store of your business process. And your business processes demand retention of all requests. You need to keep the request to soft delete a record. You need to keep a request to undelete a record.

Too much crap in the database, you need to create a field saying this record will be archived off by this date. On that date, you move that record off into another table or file that is only accessible to admins. And yes, you need to keep a record of that archival as well. Too much gunk in your request logs? Well then you need to create an archive process for that as well.

These principles are nothing new. They are in line with “Generally Accepted Record Keeping Principles” which are US oriented. Other countries have similar standards.

▲

rorylaitila

3 hours ago

[-]

Databases store facts. Creating a record = new fact. "Deleting" a record = new fact. But destroying rows from tables = disappeared fact. That is not great for most cases. In rare cases the volume of records may be a technical hurdle; in which case, move facts to another database. The number of times I've wanted to destroy a large volume of facts is approximately zero.

▲

pixl97

1 hour ago

[-]

When you start thinking of data as a potentially toxic asset with a maintenance cost to ensure it doesn't leak and cause an environmental disaster, it becomes more likely that you'd want to get rid of large volumes of facts.

▲

dpark

1 hour ago

[-]

Unless your database is immutable, every change to a record causes a “disappeared fact”.

There are many legitimate reasons to delete data. The decision to retain data forever should not be taken lightly.

▲

3rodents

3 hours ago

[-]

Soft deletes are an example of where engineers unintentionally lead product instead of product leading engineering. Soft delete isn’t language used by users so it should not be used by engineers when making product facing decisions.

“Delete” “archive” “hide” are the type of actions a user typically wants, each with their own semantics specific to the product. A flag on the row, a separate table, deleting a row, these are all implementation options that should be led by the product.

▲

dpark

1 hour ago

[-]

> Soft delete isn’t language used by users so it should not be used by engineers when making product facing decisions.

Users generally don’t even know what a database record is. There is no reason that engineers should limit their discussions of implementation details to terms a user might use.

> “Delete” “archive” “hide” are the type of actions a user typically wants, each with their own semantics specific to the product.

Users might say they want “delete”, but then also “undo”, and suddenly we’re talking about soft delete semantics.

> A flag on the row, a separate table, deleting a row, these are all implementation options that should be led by the product.

None of these are terms an end user would use.

▲

monkpit

2 hours ago

[-]

Why would implementation details be led by product? “Undo” is an action that the user may want, which would be led by product. Not the implementation in the db.

▲

strken

2 hours ago

[-]

I believe that was the point. Soft delete isn't a product requirement, it's an implementation detail, so product teams should talk about the user experience using language like "delete" or "archive" or "undo" or "customer support retrieves deleted data".

▲

Terr_

2 hours ago

[-]

Yeah: You don't "delete" a bank account, you close it, and you don't "undo", you reopen it, etc. The processes have conditions, audit rules, attached information, side-effects, etc. In some cases the same entity can't be restored, and you have to instead create a successor.

"Undo" may work as shorthand for "whatever the best reversing actions happen to be", but as any system grows it stops being simple.

▲

dpark

1 hour ago

[-]

Sure. Did someone say that the behavior should be described to customers as soft delete, though?

I read a blog about a technical topic aimed at engineers, not customers.

▲

antonvs

2 hours ago

[-]

It depends on the product. Google Cloud Storage has a soft delete feature in its product, for example: https://docs.cloud.google.com/storage/docs/soft-delete

▲

tracker1

3 hours ago

[-]

I like having archive/history tables. I often do something similar with job queues when persisting to a database; this way the pending table stays small and queries avoid full scans that have to skip over deleted records...

Aside, another idea that I've kicked forward for event driven databases is to just use a database like sqlite and copy/wipe the whole thing as necessary after an event or the work that's related to that database. For example, all validation/chain of custody info for ballot signatures... there's not much point in having it all online or active, or even mixed in with other ballot initiatives and the schema can change with the app as needed for new events. Just copy that file, and you have that archive. Compress the file even and just have it hard archived and backed up if needed.

▲

talesmm14

4 hours ago

[-]

I've worked at companies where soft delete was implemented everywhere, even in irrelevant internal systems... I think it's a cultural thing! I still remember a college professor scolding me on an extension project because I hadn't implemented soft delete... in his words, "In the business world, data is never deleted!!"

▲

salomonk_mur

2 hours ago

[-]

But... It's true. Deleting data completely is an easy way to gimp and lobotomize your future analysis.

Storage is cheap. Never delete data.

▲

ziml77

2 hours ago

[-]

I prefer audit tables. Soft deletes don't capture updates, audit tables do (you could make every update a delete and insert in a soft delete table, but that adds a lot of bloat to the table)

▲

mrkeen

3 hours ago

[-]

No comment from the professor on modifications though?

▲

jamilbk

3 hours ago

[-]

At Firezone we started with soft-deletes thinking it might be useful for an audit / compliance log and quickly ran into each of the problems described in this article. The real issue for us was migrations - having to maintain structure of deleted data alongside live data just didn't make sense, and undermined the point of an immutable audit trail.

We've switched to CDC using Postgres which emits into another (non-replicated) write-optimized table. The replication connection maintains a 'subject' variable to provide audit context for each INSERT/UPDATE/DELETE. So far, CDC has worked very well for us in this manner (Elixir / Postgrex).

I do think soft-deletes have their place in this world, maybe for user-facing "restore deleted" features. I don't think compliance or audit trails are the right place for them however.

▲

maxchehab

4 hours ago

[-]

How do you handle schema drift?

The data archive serializes the deleted object using the schema as it was at that point in time.

But fast-forward some schema changes, now your system has to migrate the archived objects to the current schema?

▲

buchanae

4 hours ago

[-]

In my experience, archived objects are almost never accessed, and if they are, it's within a few hours or days of deletion, which leaves a fairly small chance that schema changes will have a significant impact on restoring any archived object. If you pair that with "best-effort" tooling that restores objects by calling standard "create" APIs, perhaps it's fairly safe to _not_ deal with schema changes.

Of course, as always, it depends on the system and how the archive is used. That's just my experience. I can imagine that if there are more tools or features built around the archive, the situation might be different.

I think maintaining schema changes and migrations on archived objects can be tricky in its own ways, even kept in the live tables with an 'archived_at' column, especially when objects span multiple tables with relationships. I've worked on migrations where really old archived objects just didn't make sense anymore in the new data model, and figuring out a safe migration became a difficult, error-prone project.

▲

hnthrow0287345

2 hours ago

[-]

Maybe I'm shooting for the moon, but I'd like soft delete to be some kind of built-in database feature. It would be nice to enable it on a table then choose some built-in strategies on how it's handled.

Soft-delete is a common enough ask that it's probably worth putting the best CS/database minds to developing some OOTB feature.

▲

clickety_clack

3 hours ago

[-]

We have soft delete, with hard delete running on deletions over 45 days old. Sometimes people delete things by accident and this is the only way to practically recover that.
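A rough sketch of that retention scheme, assuming a hypothetical `notes` table with a `deleted_at` column and a purge job that runs periodically. The 45-day window comes from the comment above; everything else is illustrative.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=45)

def purge_expired(conn, now):
    """Hard-delete rows that were soft-deleted more than RETENTION ago.

    Returns the number of rows removed. Timestamps are stored as ISO 8601
    strings in UTC, so string comparison matches chronological order.
    """
    cutoff = (now - RETENTION).isoformat()
    cur = conn.execute(
        "DELETE FROM notes WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT, deleted_at TEXT)"
)
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO notes VALUES (?, ?, ?)",
    [
        (1, "live", None),
        (2, "deleted 10 days ago", (now - timedelta(days=10)).isoformat()),
        (3, "deleted 60 days ago", (now - timedelta(days=60)).isoformat()),
    ],
)

purged = purge_expired(conn, now)
remaining = [r[0] for r in conn.execute("SELECT id FROM notes ORDER BY id")]
print(purged, remaining)
```

Row 2 is still recoverable by clearing `deleted_at`; row 3 is gone for good.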

▲

4 hours ago

[-]

We deal with soft delete in a Mongo app with hundreds of millions of records by simply moving the objects to a collection (table) separate from the “not deleted” data.

This works well especially in cases where you don’t want to waste CPU/memory scanning soft deleted records every time you do a lookup.

And avoids situations where app/backend logic forgets to apply the “deleted: false” filter.
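The same move-on-delete idea can be sketched in SQL terms. This is not the commenter's Mongo setup; it assumes a hypothetical `items` table with no foreign keys pointing at it, and uses SQLite via Python's `sqlite3` so that the copy and the delete happen in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE items_archive (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    archived_at TEXT NOT NULL
);
INSERT INTO items VALUES (1, 'keep me'), (2, 'delete me');
""")

def archive_item(conn, item_id, archived_at):
    """Copy the row into the archive table, then remove it from the live table."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO items_archive (id, name, archived_at) "
            "SELECT id, name, ? FROM items WHERE id = ?",
            (archived_at, item_id),
        )
        conn.execute("DELETE FROM items WHERE id = ?", (item_id,))

archive_item(conn, 2, "2024-06-01T00:00:00Z")

live = conn.execute("SELECT id, name FROM items").fetchall()
archived = conn.execute("SELECT id, name FROM items_archive").fetchall()
print(live, archived)
```

Queries against `items` never see archived rows, so there is no "deleted" filter to forget.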

▲

vjvjvjvjghv

4 hours ago

[-]

I guess that works well with NoSQL. In a relational database it gets harder to move records out if they have relationships with other tables.

▲

tempest_

4 hours ago

[-]

Eh you could implement this pretty simply with postgres table partitions

▲

buchanae

4 hours ago

[-]

Ah, that's an interesting idea! I had never considered using partitions. I might write a followup post with these new ideas.

▲

tempest_

4 hours ago

[-]

There are a bunch of caveats around primary keys and uniqueness but I suspect it could be made to work depending on your data model.

▲

theLiminator

3 hours ago

[-]

Privacy regulations make soft delete unviable in many of the cases where it's useful.

▲

wavemode

3 hours ago

[-]

Soft deletion and privacy deletion serve different purposes.

If you leave a comment on a forum, and then delete it, it may be marked as soft-deleted so that it doesn't appear publicly in the thread anymore, but admins can still read what you wrote for moderation/auditing purposes.

On the other hand, if you send a privacy deletion request to the forum, they would be required to actually fully delete or anonymize your data, so even admins can no longer tie comments that you wrote back to you.

Most social media sites probably have to implement both of these processes/systems.

▲

SchemaLoad

2 hours ago

[-]

Imo there should be some retention period for moderation but then hard deletion after that. Why would a moderator need to look up a deleted post a year after it was deleted?

▲

strken

2 hours ago

[-]

"Hi SchemaLoad, I'm Officer John from the Department of Not Letting Children Be Abused. I'm following up on something one of your users posted three years ago. Can you tell me the IP address(es) associated with the following deleted posts: A B C D"

▲

SchemaLoad

1 hour ago

[-]

You'd be required to show what you have but you aren't required to store everything forever just in case someone years later asks for it. Would be like showing up to fingerprint the scene 3 years after and being surprised it's too late.

▲

antonvs

2 hours ago

[-]

“Hi Officer John, that data is deleted and is no longer possible to access.”

Unless there’s a regulatory requirement (which there currently isn’t in any jurisdiction I’ve heard of), that’s a perfectly acceptable response.

▲

sedatk

3 hours ago

[-]

The opposite is true in countries where there are data retention laws. Soft-delete is mandatory in those cases.

▲

nemothekid

4 hours ago

[-]

The trigger architecture is actually quite interesting, especially because cleanup is relatively cheap. As far as compliance goes, it's also simple to declare that "after 45 days, deletions are permanent" as a catch all, and then you get to keep restores. For example, I think (IANAL), the CCPA gives you a 45 day buffer for right to erasure requests.

Now instead of chasing down different systems and backups, you can simply ensure your archival process runs regularly and you should be good.

▲

iterateoften

2 hours ago

[-]

I used to be pretty adamant about implementing soft delete for core business objects.

However after 15 years I prefer to just back up regularly, have point in time restores and then just delete normally.

The times I have “undeleted” something are few and far between.

▲

LorenPechtel

3 hours ago

[-]

The % of records that are deleted is a huge factor.

You keep 99%, soft delete 1%, use some sort of deleted flag. While I have not tried it whalesalad's suggestion of a view sounds excellent. You delete 99%, keep 1%, move it!

▲

da_chicken

1 hour ago

[-]

A view only makes sense if your RDBMS supports indexed views or the query engine is otherwise smart enough to pierce the view definition. Not all of them can do those things.

▲

whalesalad

4 hours ago

[-]

A good solution here (can be) to utilize a view. The underlying table has soft-delete field and the view will hide rows that have been soft deleted. Then the application doesn't need to worry about this concern all over the place.
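A minimal sketch of the view approach, assuming a hypothetical base table `users_all` and a view named `users` that the application reads from. It uses SQLite via Python's `sqlite3`; other engines differ in whether and how such a view can be written through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_all (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    deleted_at TEXT                 -- NULL means the row is live
);
-- The application queries `users`; soft-deleted rows never show up.
CREATE VIEW users AS
    SELECT id, name FROM users_all WHERE deleted_at IS NULL;
""")

conn.executemany(
    "INSERT INTO users_all VALUES (?, ?, ?)",
    [(1, "ann", None), (2, "bob", "2024-01-01")],
)

visible = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(visible)

# Soft-deleting is an UPDATE on the base table; the view reflects it at once.
conn.execute("UPDATE users_all SET deleted_at = '2024-02-01' WHERE id = 1")
visible_after = conn.execute("SELECT id, name FROM users").fetchall()
print(visible_after)
```

In this sketch writes still go to the base table, so only the read path is shielded from the `deleted_at` filter.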

▲

elyobo

4 hours ago

[-]

Postgres with RLS to hide soft-deleted records means that most of the app code doesn't need to know or care about them. It still issues reads, writes, and deletes to the same source table, and as far as the app knows it's working.

▲

Ronsenshi

56 minutes ago

[-]

I would also say that most modern ORMs and frameworks either come with a soft delete feature (with automatic filtering on all queries) as part of the package, or there are third-party libraries available for ORMs adding this functionality without the hassle of dealing with views (maybe it's me, but I've never had good experience with DB views).

▲

ntonozzi

3 hours ago

[-]

I've given up on soft delete -- the nail in the coffin for me was my customers' legal requirements that data is fully deleted, not archived. It never worked that well anyways. I never had a successful restore from a large set of soft-deleted rows.

▲

zahlman

3 hours ago

[-]

> customers' legal requirements that data is fully deleted

Strange. I've only ever heard of legal requirements preventing deletion of things you'd expect could be fully deleted (in case they're needed as evidence at trial or something).

▲

jandrewrogers

3 hours ago

[-]

While not common, regulations requiring a hard delete do exist in some fields even in the US. The ones I am familiar with are effectively "anti-retention" laws that mandate data must be removed from the system after some specified period of time e.g. all data in the system is deleted no more than 90 days after insertion. This allows compliance to be automated.

The data subject to the regulation had a high potential for abuse. Automated anti-retention limits the risk and potential damage.

▲

SchemaLoad

2 hours ago

[-]

I had an integration with a 3rd party where their legal contract required we hard delete any data from them after a year. Presumably so we couldn't build a competing product using their dataset with full history.

▲

pessimizer

48 minutes ago

[-]

You're thinking of "legal requirements" as requirements that the law insists upon rather than requirements that your legal department insists upon. You often want to delete records unrecoverably as soon as legally possible; it's likely why you wrote your data retention policy.

▲

ntonozzi

3 hours ago

[-]

Many privacy regulations enforce full deletion of data, including GDPR: https://gdpr-info.eu/.

▲

IgorPartola

2 hours ago

[-]

I have a love/hate relationship with soft deletes. There are cases where it’s not really a delete but rather a historical fact. For example, let’s say I have a table which stores an employee’s current hourly rate. They are hired at say $15/hour, then go to $17 six months later, then to $20/hour three months later. All of these three things are true and I want to be able to query which rate the employee had on a specific date even after their rate had changed. When I have a starts_on and an ends_on dates and the latter is nullable, with some data consistency logic I can create a linear history of compensation and can query historical and current data the same exact way. I also get

But this is such a huge PITA, because you constantly have to mind whether any given object has this setup or not. And what if related objects have different start/end dates? And something like a scheduled raise for next year to $22/hour can get funny if I then try to insert that, just for July, it will be $24/hour (this would take my single record for next year and split it into two, and then you gotta figure out which gets the original ID and which is the new row).
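The starts_on / ends_on scheme can be sketched roughly as follows. The rates ($15, $17, $20) are from the example above; the table name, the concrete dates, and the half-open interval convention (`starts_on` inclusive, `ends_on` exclusive, NULL meaning "still current") are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE compensation (
    id          INTEGER PRIMARY KEY,
    employee_id INTEGER NOT NULL,
    hourly_rate INTEGER NOT NULL,
    starts_on   TEXT NOT NULL,   -- inclusive, ISO date
    ends_on     TEXT             -- exclusive; NULL = still in effect
)
""")
conn.executemany(
    "INSERT INTO compensation VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 15, "2024-01-01", "2024-07-01"),
        (2, 1, 17, "2024-07-01", "2024-10-01"),
        (3, 1, 20, "2024-10-01", None),
    ],
)

def rate_on(conn, employee_id, day):
    """Return the hourly rate in effect on `day` (ISO date), or None."""
    row = conn.execute(
        "SELECT hourly_rate FROM compensation "
        "WHERE employee_id = ? AND starts_on <= ? "
        "AND (ends_on IS NULL OR ends_on > ?)",
        (employee_id, day, day),
    ).fetchone()
    return row[0] if row else None

# Historical and current lookups use exactly the same query.
print(rate_on(conn, 1, "2024-03-15"))
print(rate_on(conn, 1, "2024-07-01"))
print(rate_on(conn, 1, "2025-01-01"))
```

The sketch does not enforce that intervals are gap-free and non-overlapping; that is the "data consistency logic" part, and it is where most of the pain described above comes from.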

Another alternative to this is a pattern where you store the current state and separately you store mutations. So you have a compensation table and a compensation_mutations table which says how to evolve a specific row in a compensation table and when. The mutations for anything in the future can be deleted but the past ones cannot which lets you reconstruct who did what, when, and why. But this also has drawbacks. One of them is that you can’t query historical data the same way as current data. You also have to somehow apply these mutations (cron job? DB trigger?)
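A small in-memory sketch of the state-plus-mutations pattern. The `Mutation` fields and the `state_as_of` helper are hypothetical, and replaying from an initial snapshot is just one possible way to apply the mutations; it also shows the drawback mentioned above, since historical state has to be reconstructed instead of queried directly.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Mutation:
    effective_on: date   # when the change takes effect
    column: str          # which attribute of the row changes
    new_value: object
    author: str          # who requested it
    reason: str          # why

def state_as_of(initial, mutations, day):
    """Replay every mutation effective on or before `day` over `initial`."""
    state = dict(initial)
    for m in sorted(mutations, key=lambda m: m.effective_on):
        if m.effective_on <= day:
            state[m.column] = m.new_value
    return state

initial = {"employee_id": 1, "hourly_rate": 15}
mutations = [
    Mutation(date(2024, 7, 1), "hourly_rate", 17, "hr_admin", "six-month raise"),
    Mutation(date(2024, 10, 1), "hourly_rate", 20, "hr_admin", "promotion"),
    # A future mutation: still deletable until it takes effect.
    Mutation(date(2025, 1, 1), "hourly_rate", 22, "hr_admin", "scheduled raise"),
]

print(state_as_of(initial, mutations, date(2024, 8, 1)))
print(state_as_of(initial, mutations, date(2024, 12, 31)))
```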

And of course there are database extensions that allow soft deletes but I have never tried them for vague portability reasons (as if anyone ever moved off Postgres).

▲

nerdponx

3 hours ago

[-]

One thing that often gets forgotten in the discussions about whether to soft delete and how to do it is: what about analysis of your data? Even if you don't have a data science team, or even a dedicated business analyst, there's a good chance that somebody at some point will want to analyze something in the data. And there's a good chance that the analysis will either be explicitly "intertemporal" in that it looks at and compares data from various points in time, or implicitly in that the data spans a long time range and you need to know the states of various entities "as of" a particular time in history. If you didn't keep snapshots and you don't have soft edits/deletes you're kinda SoL. Don't forget the data people down the line... which might include you, trying to make a product decision or diagnose a slippery production bug.

▲

pjs_

3 hours ago

[-]

Tried implementing this crap once. Never again

▲

cyberax

3 hours ago

[-]

Soft deletes + GC for the win!

We have an offline-first infrastructure that replicates the state to possibly offline clients. Hard deletes were causing a lot of fun issues with conflicts, where a client could "resurrect" a deleted object. Or deletion might succeed locally but fail later because somebody added a dependent object. There are ways around that, of course, but why bother?

Soft deletes can be handled just like any regular update. Then we just periodically run a garbage collector to hard-delete objects after some time.
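To illustrate the "resurrection" problem, here is a toy sketch. It is not the system described above: it assumes a simple last-writer-wins sync keyed by object id with integer version counters, and the record layout, `merge`, and `collect_garbage` are all made up for the example.

```python
def merge(local, remote):
    """Last-writer-wins merge: for each id keep the record with the higher version."""
    merged = dict(local)
    for key, rec in remote.items():
        if key not in merged or rec["version"] > merged[key]["version"]:
            merged[key] = rec
    return merged

def collect_garbage(store, now, ttl):
    """Hard-delete tombstones whose deletion is at least `ttl` time units old."""
    return {
        key: rec
        for key, rec in store.items()
        if rec["deleted_at"] is None or rec["deleted_at"] + ttl > now
    }

# A client that was offline still holds version 1 of object "a".
stale_client = {"a": {"version": 1, "deleted_at": None, "data": "hello"}}

# Hard delete: the server simply forgot "a", so the stale copy comes back.
server_hard = {}
resurrected = merge(server_hard, stale_client)

# Soft delete: the tombstone is a normal update with a higher version.
server_soft = {"a": {"version": 2, "deleted_at": 100, "data": None}}
merged = merge(server_soft, stale_client)

print("a" in resurrected)          # the deleted object reappears
print(merged["a"]["deleted_at"])   # the tombstone wins the merge

# Later, a periodic GC pass removes tombstones older than the TTL.
after_gc = collect_garbage(merged, now=200, ttl=50)
print(after_gc)
```

The TTL has to be longer than the longest time a client can plausibly stay offline, otherwise the tombstone is collected before the stale client syncs and the resurrection problem returns.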