We leverage DuckDB as the stream processing engine, which gives SQLFlow the ability to process 10's of thousands of messages a second using ~250MiB of memory!
DuckDB also supports a rich ecosystem of sinks and connectors!
https://sql-flow.com/docs/category/tutorials/
https://github.com/turbolytics/sql-flow
We were tired of running JVM's for simple stream processing, and also of bespoke one off stream processors
I would love your feedback, criticisms and/or experiences!
Thank you
The stream is achieved by the continuous flow of data from Kafka.
SQLFlow exposes a variable for batch size. Setting the batch size to 1 will make it so SQLFlow reads a kafka message, applies the processor SQL logic and then ensures it successfully commits the SQL results to the sink, one after another.
SQLFlow provides at least once delivery guarantees. It will only commit the source message once it successfully writes to the pipeline output (sink).
https://sql-flow.com/docs/operations/handling-errors
The batch table is just a convention which allows for seamless batch size configuration. If your throughput is low, or if you require message by message processing, SQLFlow can be toggled to a batch of 1. If you need higher throughput and can tolerate the latency, then the batch can be toggled higher.
But really you should get excited for DuckDB Labs to build out materialized views. Materialized views where you can ingest more streaming data to update aggregates. This way you could just keep pushing rows through aggregates from Kafka.
It is going to be a POWER HOUSE for streaming analytics.
Contact DuckDB Labs if you want to sponsor the work on materialized views: https://duckdb.org/roadmap
DuckDB has everything that streaming engines such as Flink have; it just needs to support managing intermediate aggregate states and scheduling the materialized views itself.
Based on the tributary documentation, I understand that tributary embeds kafka consumers into duckdb. This makes duckdb the main process that you run to perform consumption. I think that this makes creating stream processing POCs very accessible. It looks like it is quite easy to start streaming data into duckdb. What I don't see is a full story around Devops, operations, testing, configuration as code etc.
SQLFlow is a service that embeds DuckDB as the storage and processing brains. Because of this, we're able to offer metrics, testing utilities, pipelines as code, and all the other DevOps utilities that are necessary to run a huge number of streaming instances 24x7. SQLFlow was created as a tool that I wish I had to for simple stream processing in production in high availability contexts :)
The DLQ and Prometheus integration out of the box are nice.
SQLFlow uses duckdb internally for windowing and stream state storage :), and I'll look at extending it to support stream / stream joins.
Could you describe a bit more about your use case? I'd really appreciate it if you could create an issue in the repo describing your use case and desired functionality a bit!
https://github.com/turbolytics/sql-flow/issues
We were looking at solving some of the simplier use cases first before branching out into these more complicated ones :)