On paper it seemed like a great fit, but it turned out the WASM build doesn't have feature parity with the "normal" variant, so the things that made us pick it, like Parquet compression support and lazy loading, weren't actually available. It ended up not performing well while introducing a lot of complexity, and it was terrible for first-page-load time because of the large WASM blob. Build pipeline complexity was also inherently higher due to the extra dependency and the data packaging needed.
Just something to be aware of if you're thinking of using it. Our conclusion was that it wasn't worth it for most use cases, which is a shame because it seems like such a cool tech.
How large was your WASM build? I'm using the standard duckdb-wasm, along with JS functions to form the SQL queries, and not seeing onerous load times.
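For reference, my setup is basically the stock example from the duckdb-wasm docs - roughly this (TypeScript; the bundle/worker wiring may differ depending on your bundler):

    import * as duckdb from '@duckdb/duckdb-wasm';

    // Pick the WASM bundle (MVP vs. EH/threads) that the current browser supports.
    const bundles = duckdb.getJsDelivrBundles();
    const bundle = await duckdb.selectBundle(bundles);

    // Run the database in a web worker so queries don't block the UI thread.
    const workerUrl = URL.createObjectURL(
      new Blob([`importScripts("${bundle.mainWorker!}");`], { type: 'text/javascript' }),
    );
    const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
    await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

    // Queries are plain SQL strings; results come back as Arrow tables.
    const conn = await db.connect();
    const result = await conn.query(`SELECT 42 AS answer`);
    console.log(result.toArray());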
It's a good point, but the wasm docs state that feature-parity isn't there - yet. It could certainly be more detailed, but it seems strange that your company would do all this work without first checking the feature-coverage / specs.
> WebAssembly is basically an additional platform, and there might be platform-specific limitations that make some extensions not able to match their native capabilities or to perform them in a different way.
It was a project that exploited a new opportunity, so time-to-market was the most important thing; I'm not surprised these things were missed. Replacing the data loading mechanism was maybe a week of work for one person, so it wasn't that impactful a change later.
Put all of that together, and you get a website that queries S3 with no backend at all. Amazing.
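The whole thing really does boil down to a query like this straight from the browser (a sketch - the bucket URL and column names are made up, and the bucket needs CORS configured so the range requests work):

    import type { AsyncDuckDBConnection } from '@duckdb/duckdb-wasm';

    // duckdb-wasm fetches only the byte ranges it needs from the Parquet file,
    // so the "backend" is just a publicly readable, CORS-enabled bucket.
    async function eventsPerDay(conn: AsyncDuckDBConnection) {
      const result = await conn.query(`
        SELECT day, count(*) AS events
        FROM read_parquet('https://my-bucket.s3.amazonaws.com/events.parquet')
        GROUP BY day
        ORDER BY day
      `);
      return result.toArray();
    }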
You have to store the data somehow anyway, and you have to retrieve some of it to service a query. If egress ends up costing too much, you can always move the browser-side code onto a server later. It should also be possible to quantify the trade-off between processing the data client-side and on the server.
Cloudflare actually has built in iceberg support for R2 buckets. It's quite nice.
Combine that with their Pipelines and ingestion is a simple HTTP request; then just point DuckDB at the Iceberg-enabled R2 bucket to analyze.
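On the DuckDB side it looks roughly like this (a sketch using the Node bindings - the keys, account ID, bucket, and table path are placeholders, and the exact Iceberg wiring, REST catalog vs. pointing at the table path directly, may differ):

    import duckdb from 'duckdb';

    const db = new duckdb.Database(':memory:');

    // Credentials/endpoint for the R2 bucket; all values here are placeholders.
    const setup = `
      INSTALL httpfs;  LOAD httpfs;
      INSTALL iceberg; LOAD iceberg;
      CREATE SECRET r2 (
        TYPE S3,
        KEY_ID 'xxxx',
        SECRET 'xxxx',
        ENDPOINT '<account-id>.r2.cloudflarestorage.com',
        URL_STYLE 'path'
      );
    `;

    db.exec(setup, (err) => {
      if (err) throw err;
      // iceberg_scan reads the table metadata/manifests, then only the Parquet
      // row groups the query actually needs.
      db.all(
        `SELECT count(*) AS n FROM iceberg_scan('s3://analytics/events')`,
        (err, rows) => {
          if (err) throw err;
          console.log(rows);
        },
      );
    });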
There's no egress data transfer fees, but you still pay for the GET request operations. Lots of little range requests can add up quick.
It's times like this that make self-hosting a lot more attractive.
But yeah - this is pretty neat. It easily seems like where static datasets should wind up: just data, with some well-chosen indices.
I’m just bemused that we all refer to one of the larger, more sophisticated storage systems on the planet, composed of dozens of subsystems and thousands of servers, as “no backend at all.” Kind of a “draw the rest of the owl”.
Lack of server/dynamic code qualifies as no backend.
They let you easily abstract over storage.
https://2019.splashcon.org/details/splash-2019-Onward-papers...
We didn't find the frozen DuckLake setup useful for our use case, mostly because a frozen catalog doesn't really fit the DuckLake philosophy and the cost-benefit wasn't there over a regular DuckDB catalog. It also made updates cumbersome, because you need to pull the DuckLake catalog, commit the changes, and re-upload the catalog (instead of just directly updating the Parquet files). I get that we lose the time-travel part of DuckLake, but that's not critical for us, and if it becomes important we would just roll out a PostgreSQL database to manage the catalog.
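For what it's worth, the update round-trip looked roughly like this (a sketch - paths and table names are made up, and downloadCatalog/uploadCatalog are hypothetical stand-ins for whatever S3 client you already use; S3 write credentials omitted):

    import duckdb from 'duckdb';
    // Hypothetical helpers for moving the catalog file to/from the bucket.
    import { downloadCatalog, uploadCatalog } from './catalog-sync';

    async function updateFrozenLake() {
      // 1. Pull the frozen catalog down next to the process.
      await downloadCatalog('s3://my-bucket/catalog.ducklake', '/tmp/catalog.ducklake');

      // 2. Attach it with the ducklake extension and commit the changes.
      const db = new duckdb.Database(':memory:');
      await new Promise<void>((resolve, reject) =>
        db.exec(
          `
          INSTALL ducklake; LOAD ducklake;
          ATTACH 'ducklake:/tmp/catalog.ducklake' AS lake (DATA_PATH 's3://my-bucket/data/');
          INSERT INTO lake.events SELECT * FROM read_parquet('new_batch.parquet');
          DETACH lake;
          `,
          (err) => (err ? reject(err) : resolve()),
        ),
      );

      // 3. The catalog now references the new Parquet files, so push it back up.
      await uploadCatalog('/tmp/catalog.ducklake', 's3://my-bucket/catalog.ducklake');
    }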
We have been doing this for quite some time in our product to bring real-time system observability with eBPF to the browser, and have even found other techniques to really max it out beyond what you get off the shelf.
But, if you'd like to instead read the article, you'll see that they qualify the reasoning in the first section of the article, titled, "Rethinking the Old Trade-Off: Cost, Complexity, and Access".
(Does not seem like a realistic scenario to me for many uses, for RAM among other resource reasons.)
But the article is a little light on technical details. In some cases it might make sense to bring the entire file client-side.
I didn't use the in-browser WASM, but I did expose an API endpoint that passed data exploration queries directly to the backend, like a knock-off of what New Relic does. I also use that same endpoint for all the graphs and metrics in the UI.
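A minimal sketch of that kind of pass-through endpoint, assuming Express and the Node duckdb bindings (route, file name, and port are made up; in practice you'd want to restrict it to read-only queries):

    import express from 'express';
    import duckdb from 'duckdb';

    const app = express();
    app.use(express.json());

    // One long-lived handle over the analytics database file.
    const db = new duckdb.Database('analytics.duckdb');

    // The UI posts a SQL string and gets rows back as JSON; the same endpoint
    // drives the graphs and metrics in the dashboard.
    app.post('/api/query', (req, res) => {
      const sql = String(req.body.sql ?? '');
      db.all(sql, (err, rows) => {
        if (err) return res.status(400).json({ error: err.message });
        res.json(rows);
      });
    });

    app.listen(3000);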
DuckDB is phenomenal tech, and I love to use it with data ponds instead of data lakes, although it is very capable with large datasets as well.
What are data ponds? Never heard the term before
Maybe there's already a term that covers this but I like the imagery of the metaphor... "smaller, multiple data but same idea as the big one".
But I found it a real hassle to get it to use the right number of threads and the right amount of memory.
This led to lots of crashes. If you look at the project's GitHub issues you will see many OOM (out-of-memory) errors.
And then there was some index bug that caused crashes seemingly unrelated to memory.
Life is too short for crashy database software so I reluctantly dropped it. I was disappointed because it was exactly what I was looking for.
systemd-run uses cgroups to enforce resource limits.
For example, there’s a program I wrote myself which I run on one of my Raspberry Pis. I had a problem where my program would on rare occasions use up too much memory and I wouldn’t even be able to ssh into the Raspberry Pi.
I run it like this:
systemd-run --scope -p MemoryMax=5G --user env FOOBAR=baz ./target/release/myprog
The only difficulty I had was finding the right name to use in the MemoryMax=… part, because the name has changed between versions (older cgroup-v1 setups use MemoryLimit=, newer cgroup-v2 ones use MemoryMax=), so different Linux systems may not use the same name for the limit. To figure out whether I had the right name, I tested candidates with a tiny limit that I knew was less than the program needs even in normal conditions. When I had the right name, the program would be killed right off the bat as expected, and then I could set the limit to 5G (five gigabytes) and be confident that if it exceeds that, it will be killed instead of making my Raspberry Pi impossible to ssh into again.
Have you used this in conjunction with DuckDB?
Non-deterministic OOMs in particular are among the worst failure modes for the sort of tools I'd want to use DuckDB in, and as you say, I found them more common than I would like.
DuckDB introduced spilling to disk and some other memory-management tweaks a good year ago now: https://duckdb.org/2024/07/09/memory-management
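For anyone hitting the same OOMs, these are the knobs to try first (Node bindings here; the values are just examples):

    import duckdb from 'duckdb';

    const db = new duckdb.Database('analytics.duckdb');

    // Cap memory, reduce parallelism, and give DuckDB somewhere to spill.
    db.exec(
      `
      SET memory_limit = '4GB';
      SET threads = 4;
      SET temp_directory = '/tmp/duckdb_spill';
      -- Giving up insertion-order preservation also lowers memory pressure
      -- for large out-of-core queries.
      SET preserve_insertion_order = false;
      `,
      (err) => { if (err) throw err; },
    );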
The final straw was an index which generated fine on macOS and failed on Linux - exact same code.
Machine had plenty of RAM.
The thing is, it is really the responsibility of the application to regulate its behavior based on available memory. Crashing out just shouldn't be an option, but that's the way DuckDB is built.