* It's obvious from the schema: If there's a `deleted_at` column, I know how to query the table correctly (vs thinking rows aren't DELETEd, or knowing where to look in another table)
* One way to do things: Analytics queries, admin pages, it all can look at the same set of data, vs having separate handling for historical data.
* DELETEs are likely fairly rare by volume for many use cases
* I haven't found soft-deleted rows to be a big performance issue. Intuitively this should be true, since queries should be O log(N)
* Undoing is really easy, because all the relationships stay in place, vs data already being moved elsewhere (In practice, I haven't found much need for this kind of undo).
In most cases, I've really enjoyed going even further and making rows fully immutable, using a new row to handle updates. This makes it really easy to reference historical data.
If I was doing the logging approach described in the article, I'd use database triggers that keep a copy of every INSERT/UPDATE/DELETEd row in a duplicate table. This way it all stays in the same database—easy to query and replicate elsewhere.
I think one of our problems is getting users to delete stuff they don’t need anymore.
All your other points make sense, given this assumption.
I've seen tables where 50%-70% were soft-deleted, and it did affect the performance noticeably.
> Undoing is really easy
Depends on whether undoing even happens, and whether the act of deletion and undeletion require audit records anyway.
In short, there are cases when soft-deletion works well, and is a good approach. In other cases it does not, and is not. Analysis is needed before adopting it.
I've seen a number of apps that require audit histories work on a basis where they are archived at a particular time, and that's when the deletes occurred and indexes fully rebuilt. This is typically scheduled during the least busy time of the year as it's rather IO intensive.
At that point you should probably investigate partitioning or data warehousing.
I'm pretty sure it is possible, and it might even yield some performance improvements.
That way you wouldn't have to worry about deleted items impacting performance too much.
I usually tell people to stop treating databases like firebase and wax on/wax off records and fields willy nilly. You need to treat the database as the store of your business process. And your business processes demand retention of all requests. You need to keep the request to soft delete a record. You need to keep a request to undelete a record.
Too much crap in the database, you need to create a field saying this record will be archived off by this date. On that date, you move that record off into another table or file that is only accessible to admins. And yes, you need to keep a record of that archival as well. Too much gunk in your request logs? Well then you need to create an archive process for that as well.
These principles are nothing new. They are in line with “Generally Accepted Record Keeping Principles” which are US oriented. Other countries have similar standards.
There are many legitimate reasons to delete data. The decision to retain data forever should not be taken lightly.
“Delete” “archive” “hide” are the type of actions a user typically wants, each with their own semantics specific to the product. A flag on the row, a separate table, deleting a row, these are all implementation options that should be led by the product.
Users generally don’t even know what a database record is. There is no reason that engineers should limit their discussions of implementation details to terms a user might use.
> “Delete” “archive” “hide” are the type of actions a user typically wants, each with their own semantics specific to the product.
Users might say they want “delete”, but then also “undo”, and suddenly we’re talking about soft delete semantics.
> A flag on the row, a separate table, deleting a row, these are all implementation options that should be led by the product.
None of these are terms an end user would use.
"Undo" may work as shorthand for "whatever the best reversing actions happen to be", but as any system grows it stops being simple.
I read a blog about a technical topic aimed at engineers, not customers.
Aside, another idea that I've kicked forward for event driven databases is to just use a database like sqlite and copy/wipe the whole thing as necessary after an event or the work that's related to that database. For example, all validation/chain of custody info for ballot signatures... there's not much point in having it all online or active, or even mixed in with other ballot initiatives and the schema can change with the app as needed for new events. Just copy that file, and you have that archive. Compress the file even and just have it hard archived and backed up if needed.
Storage is cheap. Never delete data.
We've switched to CDC using Postgres which emits into another (non-replicated) write-optimized table. The replication connection maintains a 'subject' variable to provide audit context for each INSERT/UPDATE/DELETE. So far, CDC has worked very well for us in this manner (Elixir / Postgrex).
I do think soft-deletes have their place in this world, maybe for user-facing "restore deleted" features. I don't think compliance or audit trails are the right place for them however.
The data archive serialized the schema of the deleted object representative the schema in that point in time.
But fast-forward some schema changes, now your system has to migrate the archived objects to the current schema?
Of course, as always, it depends on the system and how the archive is used. That's just my experience. I can imagine that if there are more tools or features built around the archive, the situation might be different.
I think maintaining schema changes and migrations on archived objects can be tricky in its own ways, even kept in the live tables with an 'archived_at' column, especially when objects span multiple tables with relationships. I've worked on migrations where really old archived objects just didn't make sense anymore in the new data model, and figuring out a safe migration became a difficult, error-prone project.
Soft-delete is a common enough ask that it's probably worth putting the best CS/database minds to developing some OOTB feature.
This works well especially in cases where you don’t want to waste CPU/memory scanning soft deleted records every time you do a lookup.
And avoids situations where app/backend logic forgets to apply the “deleted: false” filter.
If you leave a comment on a forum, and then delete it, it may be marked as soft-deleted so that it doesn't appear publicly in the thread anymore, but admins can still read what you wrote for moderation/auditing purposes.
On the other hand, if you send a privacy deletion request to the forum, they would be required to actually fully delete or anonymize your data, so even admins can no longer tie comments that you wrote back to you.
Most social media sites probably have to implement both of these processes/systems.
Unless there’s a regulatory requirement (which there currently isn’t in any jurisdiction I’ve heard of), that’s a perfectly acceptable response.
Now instead of chasing down different systems and backups, you can simply set ensure your archival process runs regularly and you should be good.
However after 15 years I prefer to just back up regularly, have point in time restores and then just delete normally.
The amount of times I have “undeleted” something are few and far between.
You keep 99%, soft delete 1%, use some sort of deleted flag. While I have not tried it whalesalad's suggestion of a view sounds excellent. You delete 99%, keep 1%, move it!
Strange. I've only ever heard of legal requirements preventing deletion of things you'd expect could be fully deleted (in case they're needed as evidence at trial or something).
The data subject to the regulation had a high potential for abuse. Automated anti-retention limits the risk and potential damage.
But this is such a huge PITA because you constantly have to mind if any given object has this setup or not and what if related objects have different start/end dates? And something like a scheduled raise for next year to $22/hour can get funny if I then try to insert that just for July it will be $24/hour (this would take my single record for next year and split it into two and then you gotta figure out which gets the original ID and which is the new row.
Another alternative to this is a pattern where you store the current state and separately you store mutations. So you have a compensation table and a compensation_mutations table which says how to evolve a specific row in a compensation table and when. The mutations for anything in the future can be deleted but the past ones cannot which lets you reconstruct who did what, when, and why. But this also has drawbacks. One of them is that you can’t query historical data the same way as current data. You also have to somehow apply these mutations (cron job? DB trigger?)
And of course there are database extensions that allow soft deletes but I have never tried them for vague portability reasons (as if anyone ever moved off Postgres).
We have an offline-first infrastructure that replicates the state to possibly offline clients. Hard deletes were causing a lot of fun issues with conflicts, where a client could "resurrect" a deleted object. Or deletion might succeed locally but fail later because somebody added a dependent object. There are ways around that, of course, but why bother?
Soft deletes can be handled just like any regular update. Then we just periodically run a garbage collector to hard-delete objects after some time.