Data-at-Rest Encryption in DuckDB
222 points | 4 months ago | 6 comments | duckdb.org | HN
jasonthorsness
4 months ago
AES-GCM sensitivity to nonce reuse is a tricky implementation detail. Here they acknowledge it but then don’t share their solution. In fact, the header contains 16 bytes for the nonce instead of the expected 12 bytes, and they do not say which bytes are random. Did I miss something? Does anyone know?
jedisct1
4 months ago
Static key, random 12 byte nonces, no per-session key for temp buffers.
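For a sense of what random 12-byte nonces under a static key imply, here is a rough birthday-bound estimate in Python. It is a back-of-the-envelope sketch, not anything taken from DuckDB's code: the function name `nonce_collision_probability` is made up for illustration, and it assumes every encryption draws a fresh, uniformly random 96-bit nonce.

```python
import math

NONCE_BITS = 96  # a 12-byte AES-GCM nonce


def nonce_collision_probability(num_encryptions: int, nonce_bits: int = NONCE_BITS) -> float:
    """Approximate chance that at least two uniformly random nonces collide.

    Uses the birthday approximation p ~= 1 - exp(-n * (n - 1) / 2^(bits + 1)).
    """
    exponent = -(num_encryptions * (num_encryptions - 1)) / 2 ** (nonce_bits + 1)
    return -math.expm1(exponent)  # expm1 keeps precision for tiny probabilities


if __name__ == "__main__":
    for log2_n in (20, 32, 40, 48):
        p = nonce_collision_probability(2 ** log2_n)
        print(f"2^{log2_n} encryptions under one key: p ~= {p:.3e}")
```

At 2^32 encryptions the estimate is on the order of 1e-10, and it climbs to roughly 0.39 at 2^48, which is why the number of writes under a single static key matters. NIST SP 800-38D caps randomly generated IVs at 2^32 invocations per key for this reason.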
kianN
4 months ago
I’m just continually amazed by the DuckDB team. We had built out a naive solution with OpenSSL to encrypt DuckDB files, but that led to a 2x runtime cost for first-time queries and used up a lot of RAM because we were encrypting/decrypting the entire file all at once. It seems that because DuckDB encrypts at the page level and leverages modern processors’ native AES instructions, they are able to perform reads/writes at practically no cost.
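To illustrate why page-level encryption avoids the whole-file cost, the sketch below computes which pages a byte-range read touches. The 256 KiB page size, the 10 GiB file, and the function names are assumptions for illustration, not details confirmed by the article.

```python
PAGE_SIZE = 256 * 1024  # assumed page size in bytes, for illustration only


def pages_for_read(offset: int, length: int, page_size: int = PAGE_SIZE) -> range:
    """Return the indices of the pages overlapping [offset, offset + length)."""
    if offset < 0 or length < 0 or page_size <= 0:
        raise ValueError("offset and length must be non-negative, page_size positive")
    if length == 0:
        return range(0)
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return range(first, last + 1)


def bytes_to_decrypt(offset: int, length: int, page_size: int = PAGE_SIZE) -> int:
    """Bytes that must pass through the cipher to serve the read."""
    return len(pages_for_read(offset, length, page_size)) * page_size


if __name__ == "__main__":
    file_size = 10 * 1024 ** 3  # a hypothetical 10 GiB database file
    touched = bytes_to_decrypt(offset=5_000_000, length=1_000_000)
    print(f"page-level: decrypt {touched:,} bytes")
    print(f"whole-file: decrypt {file_size:,} bytes")
```

Under these assumptions a 1 MB read touches four pages, so only about 1 MiB goes through the cipher instead of the full file.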
PunchyHamster
4 months ago
Why not just LUKS? Kernel level, leverages acceleration, transparent to anything you run on top of it.

DB encryption is useful if you have multiple things that need separate ACLs and encryption keys, but if it is one app, one DB, there is no need for it.

beala
4 months ago
From the article:

> This allows for some interesting new deployment models for DuckDB, for example, we could now put an encrypted DuckDB database file on a Content Delivery Network (CDN). A fleet of DuckDB instances could attach to this file read-only using the decryption key. This elegantly allows efficient distribution of private background data in a similar way like encrypted Parquet files, but of course with many more features like multi-table storage. When using DuckDB with encrypted storage, we can also simplify threat modeling when – for example – using DuckDB on cloud providers. While in the past access to DuckDB storage would have been enough to leak data, we can now relax paranoia regarding storage a little, especially since temporary files and WAL are also encrypted.

kianN
4 months ago
We are in the separate ACL/encryption key bucket. We provide a Bayesian data analytics platform/API for other companies. Each company can have hundreds to thousands of datasets ("indices"), each of which has a separate encryption key, and those keys are also stored encrypted with an organization-level key that is rotated daily.
letmetweakit
4 months ago
I believe it's also to protect against the occasionally "lost" DB file.
notorious_pgb
4 months ago
With respect, none of this sounds like "amazing" work on DuckDB's part. It's not bad work, either! It's competent work.

Comparing it to a naive approach (encrypting an entire database file in a single shot and loading it all into memory at once) is always going to make competent work seem "amazing".

I say this not to shit on DuckDB (I see no reason to shit on them); rather, I think it's important that we as professionals have realistic standards that we expect _ourselves_ to hit. Work we view as "amazing" is work we allow ourselves not to be able to replicate. But this is not in that category, and therefore, you should hold yourself to the same standard.

kianN
4 months ago
I'm more amazed that they released this as part of their open-source offering (not clear from my above comment). Encryption is a standard lever for open-source projects to monetize.

I run a small company and needed to budget a solid chunk of time next year to dig into improving this component of our system. I respect your perspective on holding high standards, but I do think it's worth getting excited about and celebrating reliable, performant software that demonstrates consistent competence.

vjerancrnjak
4 months ago
It’s just pipelining. Encryption is free compared to reads or writes to storage.
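The pipelining argument can be made concrete with a toy throughput model: when decryption overlaps with I/O, the slower stage sets the pace, whereas running the stages back to back adds their costs. The throughput figures and function names below are made-up placeholders, not measurements.

```python
def serial_throughput(io_gb_s: float, crypto_gb_s: float) -> float:
    """Throughput when each block is read and then decrypted, with no overlap."""
    return 1.0 / (1.0 / io_gb_s + 1.0 / crypto_gb_s)


def pipelined_throughput(io_gb_s: float, crypto_gb_s: float) -> float:
    """Steady-state throughput when I/O and decryption fully overlap."""
    return min(io_gb_s, crypto_gb_s)


if __name__ == "__main__":
    io_gb_s, crypto_gb_s = 2.0, 8.0  # placeholder numbers, not benchmarks
    serial = serial_throughput(io_gb_s, crypto_gb_s)
    pipelined = pipelined_throughput(io_gb_s, crypto_gb_s)
    print(f"serial:    {serial:.2f} GB/s ({1 - serial / io_gb_s:.0%} slower than raw I/O)")
    print(f"pipelined: {pipelined:.2f} GB/s ({1 - pipelined / io_gb_s:.0%} slower than raw I/O)")
```

In this model the encryption cost disappears only as long as the cipher is faster than storage; if the cipher were the slower stage, it would become the bottleneck instead.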
jedisct1
4 months ago
"Sqlite [...] encryption extension is a $2000 add-on".

SQLite3MultipleCiphers has been around for ages and is free: https://utelle.github.io/SQLite3MultipleCiphers/

And Turso Database supports encryption out of the box: https://docs.turso.tech/tursodb/encryption

michaelsbradley
4 months ago
There’s also SQLCipher; it’s been in development since 2009 and works quite well:

https://github.com/sqlcipher/sqlcipher

memset
4 months ago
How do you use these in practice? Neither Python nor Go makes it easy to link a different build of SQLite with one of these plugins compiled in. How do you make it work?
ncruces
4 months ago
I don't think SQLite3MultipleCiphers can be built into a runtime-loadable extension (and the Turso thing is just a copy of it).

I'm confident that a scheme based on tweakable block ciphers (like Adiantum or AES-XTS) could be made into a decent runtime-loadable extension.

I implemented such schemes for my Go driver, but Go code is not really ideal for building a runtime-loadable extension (it'd have to be ported to C/Rust/Zig).

https://news.ycombinator.com/item?id=40208800

glenjamin
4 months ago
Other than MotherDuck, is anyone aware of any good models for running multi-user, cloud-based DuckDB?

I.e., running it like a normal database and getting to take advantage of all of its goodies.

mritchie712
4 months ago
For pure DuckDB, you can put an Arrow Flight server in front of DuckDB[0] or use the httpserver extension[1].

Where you store the .duckdb file will make a big difference in performance (e.g. S3 vs. Elastic File System).

But I'd take a good look at DuckLake as a better multiplayer option. If you store `.parquet` files in blob storage, it will be slower than `.duckdb` on EFS, but if you have largish data, EFS gets expensive.

We[2] use DuckLake in our product and we've found a few ways to mitigate the performance hit. For example, we write all data into DuckLake in blob storage, then create analytics tables and store them on faster storage (e.g. GCP Filestore). You can have multiple storage methods in the same DuckLake catalog, so this works nicely.

0 - https://www.definite.app/blog/duck-takes-flight

1 - https://github.com/Query-farm/httpserver

2 - https://www.definite.app/

anentropic
4 months ago
I wonder if anyone has experimented with "Mountpoint for S3" + DuckDB yet

https://docs.aws.amazon.com/AmazonS3/latest/userguide/mountp...

sigwinch
4 months ago
The DuckDB httpfs extension reads S3-compatible storage.
glenjamin
4 months ago
That looks neat, but how do you handle failover/restarts?
mritchie712
4 months ago
In which one? Restarts are no problem on DuckLake (ACID transactions in the catalog).

For the others, I haven't tried handling it.

philbe77
4 months ago
GizmoSQL is definitely a good option. I work at GizmoData and maintain GizmoSQL. It is an Arrow Flight SQL server with DuckDB as a back-end SQL execution engine. It can support independent thread-safe concurrent sessions, has robust security, logging, token-based authentication, and more.

It also has a growing list of adapters - including: ODBC, JDBC, ADBC, dbt, SQLAlchemy, Metabase, Apache Superset and more.

We also just introduced a PySpark drop-in adapter - letting you run your Python Spark Dataframe workloads with GizmoSQL - for dramatic savings compared to Databricks for sub-5TB workloads.

Check it out at: https://gizmodata.com/gizmosql

Repo: https://github.com/gizmodata/gizmosql

philbe77
4 months ago
Oh, and GizmoData Cloud (SaaS option) is coming soon - to make it easier than ever to provision GizmoSQL instances...
tempest_
4 months ago
Feels like I keep seeing "DuckDB in your Postgres" posts here. Likely that is what you want.
derekhecksher
4 months ago
dismantle
4 months ago
Curious how the indexing of a key is handled. I'm not sure if the document already covers it (I don't remember coming across it), but I'm just a bit curious. Will the key being searched for be "encrypted" before a search, or will a decryption occur for each block during a search?
biophysboy
4 months ago
DuckDB has been more useful to me than all AI combined (and I like LLMs overall)