Show HN: GizmoSQL – Run DuckDB as a Server with Arrow Flight SQL
17 points
1 month ago
| 4 comments
| github.com
Hi,

I'm Philip Moore - founder of GizmoData and creator of GizmoSQL, an Apache Arrow Flight SQL server with DuckDB (or SQLite) as the back-end execution engine.

GizmoSQL is a composable SQL server built on Arrow Flight SQL, DuckDB, and SQLite. The intent is to make it easy to run DuckDB (or SQLite) as a server that multiple people can use from remote client machines. It also adds authentication and TLS encryption of traffic.

To run GizmoSQL, see the steps in the README.md - it shows how easy it is to run the server, and how to connect via ADBC and JDBC from a remote client such as DBeaver or Python. The easiest way to run GizmoSQL is via Docker, but there are also downloads for Linux and macOS on both x86-64 and arm64 platforms (download links in the README).
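
For a quick feel, a minimal ADBC connection from Python might look roughly like this (the URI, port, and credentials below are placeholders - the README has the authoritative example):

    import adbc_driver_flightsql.dbapi as gizmosql_client
    from adbc_driver_manager import DatabaseOptions

    # placeholder endpoint and credentials - use the values from your own deployment
    conn = gizmosql_client.connect(
        uri="grpc+tls://localhost:31337",
        db_kwargs={
            DatabaseOptions.USERNAME.value: "gizmosql_username",
            DatabaseOptions.PASSWORD.value: "gizmosql_password",
            # only needed if the server runs with a self-signed TLS certificate
            "adbc.flight.sql.client_option.tls_skip_verify": "true",
        },
    )
    with conn.cursor() as cur:
        cur.execute("SELECT 42 AS answer")
        print(cur.fetch_arrow_table())
    conn.close()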

Why? As you may know, DuckDB and SQLite are embedded databases - they don't offer client connectivity, and they aren't really designed for concurrent use.

I've built GizmoSQL to work around that - because I believe the DuckDB engine is very powerful, and I feel like a lot of customers overpay for distributed compute (e.g. Spark) when they don't really need it. Easy remote connectivity to DuckDB makes it simpler to migrate SQL workloads off Spark or other expensive commercial platforms and onto this engine, with a much simpler architecture and infrastructure.

It is my intention to make GizmoSQL a commercial product - licensed for production use by organizations, but free for developers to code with, evaluate, and test.

A little bit of backstory:

* I built the initial version of this while working for a former employer - it wasn't their core focus, so they open-sourced that early version. After I left, I forked the project and have improved it substantially - adding support for concurrent reads and writes, improving security, and keeping it up to date with the latest versions of Apache Arrow and DuckDB.

* This project evolved from a prototype created by the brilliant Tom Drabas.

* It feels a little weird trying to make a commercial product based upon DuckDB, but MotherDuck started it :P - and I've contributed (albeit very little) to the DuckDB and Apache Arrow projects in the form of a couple of PRs.

I'm really excited about this project. I have run benchmarks against commercial platforms such as Snowflake and Databricks SQL - and it holds its own on the 22-query TPC-H SF1TB benchmark, especially on cost. See the graph at: https://gizmodata.com/gizmosql

Getting started:

GitHub README: https://github.com/gizmodata/gizmosql-public/blob/main/READM...

DockerHub: https://hub.docker.com/r/gizmodata/gizmosql

GizmoSQL homepage: https://gizmodata.com/gizmosql

Phil's GitHub profile: https://github.com/prmoore77

Thanks for your time and feedback in advance.

chrisjc
1 month ago
Any thoughts about a pass-through server for various non-local databases?

It should be extremely simple for databases that support ADBC (for example Snowflake, PostgreSQL).

For others it might just be a matter of mapping DDL, DML, DQL, etc. to a supported database protocol driver (JDBC, ODBC, etc.). Of course, this is where things may get challenging, as it would become the responsibility of your server to convert the results to Arrow (tables/streams/etc.). But that could potentially be delegated to "worker" Flight servers (not Flight SQL servers), and then your server could return/forward their Arrow results (Flight results).

Of course, some of this is already possible through DuckDB's MySQL/Postgres extensions.
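
For example, something roughly like this already works in plain DuckDB (the connection string and table name are placeholders):

    -- placeholder connection string and table name
    INSTALL postgres;
    LOAD postgres;
    ATTACH 'dbname=mydb host=localhost user=me' AS pg (TYPE postgres);
    SELECT * FROM pg.public.some_table LIMIT 10;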

I imagine this could also be useful for developing/testing locally?

It might also provide a way to swap databases, potentially easing vendor-to-vendor migrations when ADBC isn't supported by the vendor.

Another potential value-add could be SQL dialect management via Substrait conversions (or sqlglot - but it looks like the server is Java, so I'm unsure if that's possible; maybe Graal?).

philbe77
1 month ago
Hi chrisjc, these are interesting thoughts. I was mainly interested in providing a server interface for DuckDB b/c of the performance I observed when running large OLAP workloads, and especially the cost savings vs. cloud SaaS offerings. I think the big benefit here is that you can run the same workload at roughly equivalent performance - but for about 80% less cost, from my tests...

I like your ideas, though. The server is written in C++.

I hope this helps...

Thanks!

memhole
1 month ago
Nice work! I guess my only thought would be: if you’re trying to turn this into a product, isn't this basically a client-server RDBMS? What advantages does using embedded DBs in this way provide compared to something like Postgres or MySQL? Or the growing popularity of DuckDB on S3?
philbe77
1 month ago
Thanks for the kind words!

I think the main advantage this has over other RDBMSs such as Postgres and MySQL is performance - due to the columnar / vectorized processing in DuckDB. I've run this in AWS on an i8g.16xlarge, and in Azure on a Standard_E64pds_v6 - and get amazing performance thanks to the NVMe storage, lots of CPUs (64), and memory (512GB) - for less than $4/hr in cloud VM cost.

This lets users treat the large resources available on bigger cloud VMs as an enterprise-grade server - for data notebooks, analytics dashboards, and more.

You can get performance that meets or exceeds that of many distributed systems on a much simpler architecture - reading from Parquet datasets, assuming you've copied (cached) them onto the local NVMe SSD storage. Of course, this has disadvantages - such as keeping the local copy in sync with the cloud storage (S3) - but it can be well worth it if you have a mostly read-only dataset (write-once, read-many). There is a small startup penalty for hydrating the local storage, but I've seen throughput as high as 4GB/s for that initial copy...
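
As a rough sketch, once the Parquet files are cached locally, a query is just ordinary DuckDB SQL over the local path (the path below is hypothetical):

    -- hypothetical local path; the dataset was copied from S3 to NVMe beforehand
    SELECT l_returnflag,
           l_linestatus,
           SUM(l_extendedprice * (1 - l_discount)) AS revenue
    FROM read_parquet('/mnt/nvme/tpch/lineitem/*.parquet')
    GROUP BY l_returnflag, l_linestatus;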

I hope this helps. Thanks again for the question!

mousematrix
1 month ago
Looks v. cool - how does Gizmo handle concurrency for multiple writes? Are there any limits that you have observed?
philbe77
1 month ago
Hey mousematrix, I've tested with 3 concurrent sessions all writing to the same table - and with "BEGIN TRANSACTION" for each.

I opened a TPC-H SF1GB database with GizmoSQL, started 3 concurrent client sessions, and had each session run:

    BEGIN TRANSACTION;
    INSERT INTO lineitem SELECT * FROM lineitem;
    COMMIT;

None of the 3 sessions was blocked - all were able to insert into the table and read the results of their own inserts before committing.

It was also very cool to see that writes do NOT block reads from other sessions - the other sessions just couldn't see the uncommitted data.

After committing in one session, the other sessions could see the data that was inserted.

I haven't stress-tested the system to find its limits, but I thought this little test would provide useful information for your question.

I'll work to do some benchmarking on concurrent inserts, etc.

Updates to the same data by multiple sessions behave a little differently - see the details at: https://duckdb.org/docs/stable/connect/concurrency.html

Thanks for your question!

emthrowaway123
1 month ago
How does this compare to something like ClickHouse?
philbe77
1 month ago
Hi emthrowaway123 - I tried comparing to ClickHouse by signing up for a free trial and attempting to run the 22-query TPC-H benchmark at Scale Factor 1TB against data stored in S3.

I repeatedly ran out of memory during benchmark query execution when attempting to use a 3-node cluster - with the biggest nodes my trial would allow.

I suspect that ClickHouse may not support "out-of-core" processing, b/c of the memory issues I experienced - but I'm not a ClickHouse expert by any means. I may have configured something incorrectly, but it was definitely not a turn-key experience to get up and running with a 1TB benchmark.

I had tried a few months ago, so I should likely try again sometime with a newer version of ClickHouse.
