Yet today I feel it was 2016 dataders who was the crazy one lol
Really, I prefer DuckDB SQL these days for anything that needs to perform well, and I feel like SQL is easier to grok than Python code most of the time.
I switched to this as well, and it's mainly because explorations would need to be translated to SQL for production anyway. If I start with pandas I just end up doing all the work twice.
The harder thing to overcome is that pandas has historically had a pretty "say yes to things" culture. That's probably a huge part of its success, but it means there are now about 5 ways to add a column to a dataframe.
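A toy illustration of that point (the frame is made up): each line below is a different, coexisting way to add a column in pandas.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

df["b"] = df["a"] * 2                 # plain indexing assignment
df = df.assign(c=df["a"] + 1)         # assign(), which returns a new frame
df.loc[:, "d"] = 0                    # .loc indexing
df.insert(0, "e", df["a"])            # insert at a specific position
# (plus attribute assignment, eval(), concat(), ... the API says yes a lot)

print(df.columns.tolist())
```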
Adding support for Arrow is a really big achievement, but shrinking an oversized API is even more ambitious.
(if you have already done so and it wasn't resolved, feel free to ping me on it)
I thought a colleague of mine had filed an issue but I didn’t find it. I filed it myself just now: https://github.com/apache/arrow/issues/49310
Feather is optimized for fast reading.
Feather might be a better fit for some use cases, but Parquet has fantastic support and is still a pretty good choice for the things Feather does.
Unless they're really focused on eking out every bit of read performance, people often opt for the well-supported path instead.
I think by now a lot of people know you can write to Avro and compact to Parquet, and that is a key area of development. I'm not sure of a great solution yet.
Apache Iceberg tables can sit on top of Avro files as one of the storage engines/formats, in addition to Parquet or even the old ORC format.
Apache Hudi[2] was looking into HTAP capabilities - writing in row store, and compacting or merge on read into column store in the background so you can get the best of both worlds. I don't know where they've ended up.
I wish there was an industry standard format, schema-compatible with Parquet, that was actually optimized for this use case.
I actually wrote a row storage format reusing Arrow data types (not Feather), just laying them out row-wise rather than columnar. Validity bits of the different columns are collected into a shared per-row bitmap, and fixed offsets within a record allow extracting any field in a zero-copy fashion. I store those in RocksDB, for now.
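A toy Python sketch of that layout idea (not the actual Rust implementation; the schema and helpers are invented for illustration): fixed-width fields at known offsets, with a shared per-row validity bitmap up front, so any field can be sliced out without parsing the rest of the record.

```python
import struct

# Schema: (name, struct format), all fixed width, like Arrow's primitive types.
SCHEMA = [("id", "<q"), ("score", "<d")]   # int64, float64
OFFSETS, ROW_SIZE = [], 1                  # byte 0 holds the validity bitmap
for _, fmt in SCHEMA:
    OFFSETS.append(ROW_SIZE)
    ROW_SIZE += struct.calcsize(fmt)

def encode_row(values):
    """values: one entry per column; None marks a null."""
    buf = bytearray(ROW_SIZE)
    for i, ((_, fmt), v) in enumerate(zip(SCHEMA, values)):
        if v is not None:
            buf[0] |= 1 << i               # set the column's validity bit
            struct.pack_into(fmt, buf, OFFSETS[i], v)
    return bytes(buf)

def read_field(row, col):
    """Extract one field at its fixed offset; no scan over other columns."""
    if not row[0] & (1 << col):
        return None
    return struct.unpack_from(SCHEMA[col][1], row, OFFSETS[col])[0]

row = encode_row([42, None])
print(read_field(row, 0), read_field(row, 1))   # 42 None
```

The real thing would also handle variable-width types, but the fixed-offset-plus-bitmap core is the part that makes single-field extraction cheap.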
https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...
https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...
https://git.kantodb.com/kantodb/kantodb/src/branch/main/crat...
Sure, except insofar as I didn’t want to pretend to be columnar. There just doesn’t seem to be something out there that met my (experimental) needs better. I wanted to stream out rows, event sourcing style, and snarf them up in batches in a separate process into Parquet. Using Feather like it’s a row store can do this.
> kantodb
Neat project. I would seriously consider using that in a project of mine, especially now that LLMs can help out with the exceedingly tedious parts. (The current stack is regrettable, but a prompt like “keep exactly the same queries but change the API from X to Y” is well within current capabilities.)
Speaking as a Rustafarian, there are some libraries out there that "just" implement a WAL, which is all you need, but they're nowhere near as battle-tested as the above.
Also, if KantoDB is not compatible with Postgres in some way that isn't utterly stupid, it's automatically considered a bug or a missing feature (and I have plenty of those!). I refuse to be bug-for-bug compatible, and some things are just better left unimplemented this millennium, but the intent is to make it I Can't Believe It's Not Postgres, and to run integration tests against actual everyday software.
Also, definitely don't use KantoDB for anything real yet. It's very early days.
I have a WAL that works nicely. It surely has some issues on a crash if blocks are written out of order, but this doesn’t matter for my use case.
But none of those other choices actually do what I wanted without quite a bit of pain. First, unless I wire up some kind of CDC system or add extra schema complexity, I can stream in but I can’t stream out. But a byte or record stream streams natively. Second, I kind of like the Parquet schema system, and I wanted something compatible. (This was all an experiment. The production version is just a plain database. Insert is INSERT and queries go straight to the database. Performance and disk space management are not amazing, but it works.)
P.S. The KantoDB website says “I’ve wanted to … have meaningful tests that don’t have multi-gigabyte dependencies and runtime assumptions“. I have a very nice system using a ~100 line Python script that fires up a MySQL database using the distro mysqld, backed by a Unix socket, requiring zero setup or other complication. It’s mildly offensive that it takes mysqld multiple seconds to do this, but it works. I can run a whole bunch of copies in parallel, in the same Python process even, for a nice, parallelized reproducible testing environment. Every now and then I get in a small fight with AppArmor, but I invariably win the fight quickly without requiring any changes that need any privileges. This all predates Docker, too :). I’m sure I could rig up some snapshot system to get startup time down, but that would defeat some of the simplicity of the scheme.
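A hypothetical miniature of such a harness (the flag choices are mine, not the parent's script; all flags are standard mysqld options):

```python
import os
import subprocess
import tempfile

def mysqld_command(datadir, socket):
    """Build the mysqld invocation: Unix socket only, no TCP, no root."""
    return [
        "mysqld",
        f"--datadir={datadir}",
        f"--socket={socket}",
        "--skip-networking",       # listen on the Unix socket only
        "--skip-grant-tables",     # throwaway instance, skip auth setup
    ]

def start_test_mysqld():
    """Initialize a fresh throwaway datadir and launch the distro mysqld."""
    datadir = tempfile.mkdtemp(prefix="mysqltest-")
    subprocess.run(
        ["mysqld", "--initialize-insecure", f"--datadir={datadir}"],
        check=True,
    )
    socket = os.path.join(datadir, "mysql.sock")
    proc = subprocess.Popen(mysqld_command(datadir, socket))
    return proc, socket
```

Each call gets its own datadir and socket, so many instances can run in parallel from one test process.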
There is room still for an open source HTAP storage format to be designed and built. :-)
Arrow is also directly usable as the application memory model. It’s pretty common to read Parquet into Arrow for transport.
> It’s pretty common to read Parquet into Arrow for transport.
I'm confused by this. Are you referring to Arrow Flight RPC? Or are you saying distributed analytic engines use Arrow to transport Parquet between queries?
Recently we have started documenting this to better inform choices: https://parquet.apache.org/docs/file-format/implementationst...
It's very neat for some types of data to have columns contiguous in memory.
That's not really the purpose; it's really a language-independent format, so that you don't need a different representation for, say, a Python dataframe versus R. It's columnar because for analytics (where you do lots of aggregations and filtering) this is way more performant; the data is intentionally stored so the target columns are contiguous. You probably already know, but the analytics equivalent of SQLite is DuckDB. Arrow can also eliminate the need to serialize/deserialize data when sharing (e.g. a high-performance data pipeline), because different consumers / tools / operations can use the same memory representation as-is.
Not sure if I misunderstood, but what are the chances those different consumers / tools / operations are running in your memory space?
You still have to transfer the data, but you remove the need for a transformation before writing to the wire, and a transformation when reading from the wire.
The key phrase, though, would seem to be “memory representation”, not “same memory”. You can spit the in-memory representation out to an Arrow file or an Arrow stream, take it in elsewhere, and it’s in the same memory layout in the other program. That’s kind of the point of Arrow. It’s a standard memory layout available across applications and even across languages, which can be really convenient.
You can also store Arrow on disk, but it is mainly used as an in-memory representation.
it's actually many things: an IPC protocol, a wire protocol, a database connectivity spec, etc. etc.
in reality it's about an in-memory tabular (columnar) representation that enables zero-copy operations between languages and engines.
and, imho, it all really comes down to standard data types for columns!
Also, a good proportion of web APIs are sending pretty small payloads. En masse there might be an improvement if everything were more efficiently represented, but evaluated on a case-by-case basis, the data size often isn't the bottleneck.