I am quite strongly of the opinion that one should essentially never use these for anything that needs to work well at any scale. If you need an industrial strength on-disk format, start with a tool for defining on-disk formats, and map back to your language. This gives you far better safety, portability across languages, and often performance as well.
Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto or even JSON or XML or ASN.1. Note that there are zero programming languages in that list. The right choice is probably not C structs or pickles or some other language’s idea of pickles or even a really cool library that makes Rust do this.
(OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust? Hint: Java is also memory-safe, and that has not saved users of Java deserialization from all the extremely high severity security holes that have shown up over the years. You can shoot yourself in the foot just fine when you point a cannon at your foot, even if the cannon has no undefined behavior.)
Not hating on PHP, to be clear. It has its warts, but it has served me well.
"However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot."
At a first glance, it might sound like rkyv is better, after all, it has less restrictions and external schemas are annoying, but it doesn't actually solve the schema issue by having a self describing format like JSON or CBOR. You won't be able to use the data outside of Rust and you're probably tied to a specific Rust version.
This seems false after reading the book, the doc, and a cursory reading of the source code.
It is definitely independent of rust version. The code make use of repr(C) on struct (field order follows the source code) and every field gets its own alignment (making it independent from the C ABI alignment). The format is indeed portable. It is also versioned.
The schema of the user structs is in Rust code. You can make this work across languages, but that's a lot of work and code to support. And this project appears to be in Rust for Rust.
On a side note, I find the code really easy to understand and follow. In my not so humble opinion, it is carefully crafted for performance while being elegant.
I think parquet and arrow are great formats, but ultimately they have to solve a similar problem that rkyv solves: for any given type that they support, what does the bit pattern look like in serialized form and in deserialized form (and how do I convert between the two).
However, it is useful to point out that parquet/arrow on top of that solve many more problems needed to store data 'at scale' than rkyv (which is just a serialization framework after all): well defined data and file format, backward compatibility, bloom filters, run length encoding, compression, indexes, interoperability between languages, etc. etc.
Trusting possibly malicious inputs is an universal problem.
Here is a simple example:
echo "rm -rf /" > cmd
sh cmd
And this problem is no different in rkyv than rkvy_dyn or any other serialization format on the planet. The issue is trusting inputs. This is also called a man in the middle attack.The solution is to add a cryptographic signature to detect tempering.
But tools like Pickle or Java deserialization or, most likely, rkyv_dyn will happily give you outputs that contain callables and that contain behavior, and the result is not safe to access. (In Python, it’s wildly unsafe to access, as merely reading a field of a Python object calls functions encoded by the class, and the class may be quite dynamic.)
[0] The world is full of infamously dangerous XML parsers. Don’t use them, especially if they’re written in C or C++ or they don’t promise that they will not access the network.
> The solution is to add a cryptographic signature to detect tempering.
If you don’t have a deserializer that works on untrusted input, how do you verify signatures. Also, do you really thing it’s okay to do “sh $cmd” just because you happen to have verified a signature.
> This is also called a man in the middle attack.
I suggest looking up what a man in the middle attack is.
I was a bit confused when you compared it to Python pickle and assumed you were talking about general input validation somehow.
I agree that pickle and similar are profoundly surprising and error prone. I struggle to find any reasonable reason one would want that.
As for the man in middle attack, I meant that if somebody intercepts the serialized form, they can mutate it. And without a cryptographic signature, you wouldn't know.
Java is compiled to bytecode, and Obj-C is compiled to machine code. Yet both Android and iOS have had repeated severe vulnerabilities related to deserializing an object that contains a subobject of an unexpected type that pulls code along with it. It seems to be that rkyv_dyn has exactly the same underlying issue.
Sure, Rust is “safe”, and if all the unsafe code is sufficiently careful, it ought to be impossible to get the type of corruption that results in direct code execution, memory writes, etc. But systems can be fully compromised by semantic errors, too.
If I’m designing a system that takes untrusted input and produces an object of type Thing, I want Thing to be pure data. Once you start allowing an open set of methods on Thing or its subobjects, you have lost control of your own control flow. So doing:
thing.a.func()
may call a function that wasn’t even written at the time you wrote that line of code or even a function that is only present in some but not all programs that execute that line of code.Exploiting this is considerably harder than exploiting pickle, but considerably harder is not the same as impossible.
Ultimately you should read the code of rkyv_dyn to understand what it does instead of making random claims.
It will be faster for you to read the code than for me to attempt explaining how it works. Especially since you will most likely choose the least charitable interpretation of everything I say. There is very little code, it won't take long.
I really don't. I think you mean that Rust compiles to machine code and neither loads executable code at runtime nor contains a JIT, so you can't possibly open a file and deserialize it and end up with code or particularly code-like things from that file being executed in your process.
My point is that there's an open-ended global registry of objects that implement a given trait, and it's possible (I think) to deserialize and get an unexpected type out, and calling its methods may run code that was not expected by whoever wrote the calling code. And the set of impls and thus the set of actual methods may expand by the mere fact of linking something else into the project.
This probably won't blow up quite as badly as NSCoding does in ObjC because Rust is (except when unsafe is used) memory-safe, so use-after-free just from deserializing is pretty unlikely. But I would still never use a mechanism like this if there was any chance of it consuming potentially malicious input.
The first library that comes to mind when I think of this is `serde` with `#[derive(Serialize, Deserialize)]`, but that gives persistence-format output as you describe is preferable to the former case. I usually use it with JSON.
So, this seems like it may be a false dichotomy.
rkyv is not like this.
It’s like you listed a bunch of serialization technologies without grokking the problem outlined in the post doesn’t have much to do with rkyv itself.
The only problem in the blog post is efficient coding of optional fields and all they was introduce a bitmap. From that perspective, JSON and XML solve the optional fields problem to perfection, since an absent field costs exactly nothing.
Capnproto doesn’t support transform on serialize - the optional fields still take up disk space unless you use the packed representation which has some performance drawbacks. Also the generated capnproto rust code is quite heavy on compile times which is probably some consideration that’s important for compiling queries.
I'm not sure you are serious. What open problem do you have in mind? Support for persisting and deserializing optional fields? Mapping across data types? I mean, some JSON deserializers support deserializing sparse objects even to dictionaries. In .NET you can even deserialize random JSON objects to a dynamic type.
Can you be a little more specific about your assertion?
Your deserializer is probably running on a CPU, and that CPU probably has a very fast L1 cache and might be targeted by a compiler that can do scalar replacement of aggregates and such. A non-zero-overhead deserializer can run very quickly and result in the output being streamed efficiently from its source and ending up hot in L1 in a useful format. A zero-overhead deserializer might do messy reads in a bad order without streaming hints and run much slower.
And then to get very very large records, as in the OP, where getting a good on-disk layout may require thought. And, frequently, the right layout isn’t even array-of-structs, which is why there are so many tools designed to query column stores like Parquet efficiently.
Unless the reason you care about space is because it's some sort of wire protocol for a slow network (like LoRaWAN or Iridium packets or a binary UART protocol), where compression probably doesn't make sense because the compression overhead is too large. But even here, just defining the data layout makes sense, I think.
Tihs could take the form of a C struct with __attribute__((packed)) but that is fragile if you care about more platforms than one. (I generally don't, so that works for me!).
BS. Nothing can be faster than a read()/write() (or even mmap()) into a struct, because everything else would need to do more work.
This is basically Rob Pike's Rule 5: If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.(https://users.ece.utexas.edu/~adnan/pike.html)
If anything it's the other way round, if you're not talking about business domain modeling (where data structures first is a valid approach).
And even there, the data models usually come about to make specific business processes easier (or even possible). An Order Summary is structured a specific way to allow both the Fulfilment and Invoicing processes possible, which feed down into Payment and Collections processes (and related artefacts).
It's not enough for a data structure to represent the "fundamental" degrees of freedom needed to model the situation; the algorithmic needs (vis-a-vis the available resources) most definitely matter a lot.
I'm saying that if you care about performance, data structures should be designed with approach specific tradeoffs in mind. And like I've said above, in typical business apps, it's ok to start with data structures because (a) performance is usually not a problem, (b) staying close to the domain is cleaner.
But the whole discussion involves knowing how you will use it; the advocacy is for careful consideration of data structures (based on how you will use them) resulting in less pain when designing/choosing algorithms.
> If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.
This is what I was responding to.
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."
I (deep, deep in embedded systems) have seen this too often, that code is incredibly complex and impossible to reason around because it needs to reach into some data structure multiple times from different angles to answer what should be rather simple questions about next step to take.
Fix that structure, and the code simplifies automagically.
Hard disagree. That database table was a waving red flag. I don't know enough/any rust so don't really understand the rest of the article but I have never in my life worked with a database table that had 700 columns. Or even 100.
I remember a phrase from one of C. J. Date's books: every record is a logical statement. It really stood out for me and I keep returning to it. Such an understanding implies a rather small number of fields or the logical complexity will go through the roof.
I used to work in a company that had all the tags in their SCADA system feed into SQL tables. They had multiple tables (as in tens of tables), because they ran out of columns ...
I think this company was ahead of the curve.
Take this passage:
> The app relied on a SOAP service, not to do any servicey things. No, the service was a pure function. It was the client that did all the side effects. In that client, I discovered a massive class hierarchy. 120 classes each with various methods, inheritance going 10 levels deep. The only problem? ALL THE METHODS WERE EMPTY. I do not exaggerate here. Not mostly empty. Empty.
> That one stumped me for a while. Eventually, I learned this was in service of building a structure he could then use reflection on. That reflection would let him create a pipe-delimited string (whose structure was completely database-driven, but entirely static) that he would send over a socket.
Classes with empty methods? Used reflection to create a pipe-delimited string? The string was sent over the wire?
Why congratulations, you just rediscovered data transfer objects, specifically API models.
As to your hard disagree, I guess it depends... While this particular user is on the higher end (in terms of columns), it's not our only user where column counts are huge. We see tables with 100+ columns on a fairly regular basis especially when dealing with larger enterprises.
If it's not obvious, I agree with the hard disagree. Every time I see a table with that many columns, I have a hard time believing there isn't some normalization possible.
Schemas that stubbornly stick to high-level concepts and refuse to dig into the subfeatures of the data are often seen from inexperienced devs or dysfunctional/disorganized places too inflexible to care much. This isn't really negotiable. There will be issues with such a schema if it's meant to scale up or be migrated or maintained long term.
Also, normalization solves a problem that’s present in OLTP applications: OLAP/Big Data applications generally have problems that are solved by denormalization.
We have many large enterprises from wildly different domains use feldera and from what I can tell there is no correlation between the domain and the amount of columns. As fiddlerwoaroof says, it seems to be more a function of how mature/big the company is and how much time it had to 'accumulate things' in their data model. And there might be very good reasons to design things the way they did, it's very hard to question it without being a domain expert in their field, I wouldn't dare :).
This is unbelievable. In purely architectural terms that would require your database design to be an amorphous big ball of everything, with no discernible design or modelling involved. This is completely unrealistic. Are queries done at random?
In practical terms, your assertion is irrelevant. Look at the sparse columns. Figure out those with sparse rows. Then move half of the columns to a new table and keep the other half in the original table. Congratulations, you just cut down your column count by half, and sped up your queries.
Even better: discover how your data is being used. Look at queries and check what fields are used in each case. Odds are, that's your table right there.
Let's face it. There is absolutely no technical or architectural reason to reach this point. This problem is really not about structs.
Feldera provides a service. They did not design these schemas. Their customers did, and probably over such long time periods that those schemas cannot be referred to as designed anymore -- they just happened.
IIUC Feldera works in OLAP primarily, where I have no trouble believing these schemas are common. At my $JOB they are, because it works well for the type of data we process. Some OLAP DBs might not even support JOINs.
Feldera folks are simply reporting on their experience, and people are saying they're... wrong?
I remember the first time I encountered this thing called TPC-H back when I was a student. I thought "wow surely SQL can't get more complicated than that".
Turns out I was very wrong about that. So it's all about perspective.
We wrote another blog post about this topic a while ago; I find it much more impressive because this is about the actual queries some people are running: https://www.feldera.com/blog/can-your-incremental-compute-en...
Strong disagree. I'll explain.
Your argument would support the idea of adding a few columns to a table to get to a short time to market. That's ok.
Your comment does not come close to justify why you would keep the columns in. Not the slightest.
Tables with many columns create all sorts of problems and inefficiencies. Over fetching is a problem all on itself. Even the code gets brittle, where each and every single tweak risks beijg a major regression.
Creating a new table is not hard. Add a foreign key, add the columns, do a standard parallel write migration. Done. How on earth is this not practical?
I don't agree. Let me walk you through the process.
- create the new table - follow a basic parallel writes strategy -- update your database consumers to write to the new table without reading from it -- run a batch job to populate the new table with data from the old table -- update your database consumer to read from the new table while writing to both old and new tables
From this point onward, just pick a convenient moment to stop writing to the old database and call the migration done. Do post-migrarion cleanup tasks.
> Additionally, it’s easier to get work scheduled to ship a feature than it is to convince the relevant players to complete the swing.
The ease of piling up technical debt is not a justification to keep broken systems and designs. It's only ok to make a messs to deliver things because you're expected to clean after yourself afterwards.
Mistakes made early on and not corrected can snowball and lead to this kind of mess, which is very hard to back out of.
Fine, but you still need to read in those 100+ fields. So now you gotta contend with 20+ joins just to pull in one record. Not more practical than a single SELECT in my opinion.
Performance generally isn't an issue for an arbitrary number of joins as long as your indices are set up correctly.
If you really do need a bulk read like that I think you want json columns, or to just go all in with a nosql database. Even then, the above regarding indexing is still true.
So it sounds like helping customers with databases full of red flags is their bread and butter
Yes that captures it well. Feldera is an incremental query engine. Loosely speaking: it computes answers to any of your SQL queries by doing work proportional to the incoming changes for your data (rather than the entire state of your database tables).
If you have queries that take hours to compute in a traditional database like Spark/PostgreSQL/Snowflake (because of their complexity, or data size) and you want to always have the most up-to-date answer for your queries, feldera will give you that answer 'instantly' whenever your data changes (after you've back-filled your existing dataset into it).
There is some more information about how it works under the hood here: https://docs.feldera.com/literature/papers
I've worked on multiple products that have had a concept of "custom fields" who did it this way too.
771 columns (and I've read the definitions for them all, plus about 50 more that have been retired). In the database, these are split across at least 3 tables (registry, patient, tumor). But when working with the records, it's common to use one joined table. Luckily, even that usually fits in RAM.
Main table at work is about 600, though I suspect only 300-400 are actively used these days. A lot come from name and address fields, we have about 10 sets of those in the main table, and around 14 fields per.
Back when this was created some 20+ years ago it was faster and easier to have it all in one row rather than to do 20+ joins.
We probably would segment it a bit more if we did it from scratch, but only some.
But OLAP tables (data lake/warehouse stuff), for speed purposes, are intentionally denormalized and yes, you can have 100+ columns of nullable stuff.
Exactly this.
This article is not about structs or Rust. This article is about poor design of the whole persistence layer. I mean, hundreds of columns? Almost all of them optional? This is the kind of design that gets candidates to junior engineer positions kicked off a hiring round.
Nobody gets fired for using a struct? If it's an organization that tolerates database tables with nearly 1k optional rows then that comes at no surprise.
They don’t have the option to clean up the data.
100s is not unusual. Thousands happens before you realise.
Some folks pointed out that no one should design a SQL schema like this and I agree. We deal with large enterprise customers, and don't control the schemas that come our way. Trust me, we often ask customers if they have any leeway with changing their SQL and their hands are often tied. We're a query engine, so have to be able to ingest data from existing data sources (warehouse, lakehouse, kafka, etc.), so we have to be able to work with existing schemas.
So what then follows is a big part of the value we add: which is, take your hideous SQL schema and queries, warts and all, run it on Feldera, and you'll get fully incremental execution at low latency and low cost.
700 isn't even the worst number that's come our way. A hyperscale prospect asked about supporting 4000 column schemas. I don't know what's in that table either. :)
Which brings me to the question, why a rowstore? Are Z-sets hard to manage otherwise?
Another aspect of wide tables is that they tend to have a lot of dependencies, ie different columns come from different aggregations, and the whole table gets held up if one of them is late. IVM seems like a good solution for that problem.
Feldera tries to be row- and column-oriented in the respective parts that matter. E.g. our LSM trees only store the set of columns that are needed, and we need to be able to pick up individual rows from within those columns for the different operators.
I don't think we've converged on the best design yet here though. We're constantly experimenting with different layouts to see what performs best based on customer workloads.
The bitmap trick is elegant and I've seen similar patterns in other contexts. The core insight resonates beyond Rust and SQL: the data structure that's "obvious" at design time can become a bottleneck when the real-world usage pattern diverges from your assumptions. Most fields exist vs most fields might not exist is a subtil but critical distinction.
The fix being a simple layout change rather than a clever algorithm is also a good reminder. I've spent 20 years building apps and the most impactful optimizations were almost always about changing the shape of data, not adding complexity.
I'm not saying this is better than your solution, just curious :^)
The length is first. The pointer second. The inline string is terminated with 0xFF. The length is 62 bits out of 64 bits such that a specific pattern is placed in the first byte that utf8 doesn't collide with.
There may be good reasons (I don't know any) why it wasn't done like this, but from a high-level it looks possible to me too yes.
https://www.postgresql.org/docs/current/storage-page-layout....
I wasn't sure about writing the article in the first place because of that, but I figured it may be interesting anyways because I was kind of happy with how simple it was to write this optimization when it was all done (when I started out with the task I wasn't sure if it would be hard because of how our code is structured, the libraries we use etc.). I originally posted this in the rust community, and it seems people enjoyed the post.
It comes off as being a novel solution rather than connecting it to a long tradition of DB design. I believe PG for instance has used a null bitmap since the beginning 40 years ago.
Using bitmaps to indicate whether things are in-use or not is very common in systems programming. Like you said PG does it, but most other systems do this too. It's not specific to databases: in an operating system, one of the first thing it needs is an allocator, the allocator most likely will use some bitmap trick somewhere to indicate what's free vs. what's available etc.
>The fix is simple: we stop storing Option<T> and instead we store a bitmap that records which fields are None.
Right here would have been the opportunity to let those not familiar with database internals know. Something like "This technique has been widely used in many RDBMS systems over the years, you can see the PG version of this in their documentation for page layout".
Instead you go into detail on what a null bitmap is and how to implement it, calling it a "trick". Which is strange if you think your target audience is assumed to already know this common technique.
I mean not one mention of the origin of the trick or even calling it a common solution to this problem...
And who or what would you say is the origin? The "trick" is so old I'm afraid it is lost to time to say who invented the bitmap. It was used in MULTICS or THE long before PostgreSQL was invented.
Your version comes off as self promotion, going into detail how to implement a the technique with out once calling it a common technique or even mentioning its widely used in the industry to solve the problem you are solving.
It would be like saying finding rows in storage is slow so we implemented a b-tree in a separate structure for certain columns and go into detail how to create a b-tree, calling it a "trick" while never once acknowledging the is the most common form of indexing in RDBMS's, would you understand how this would seem strange?
Feel free to try it out, it's open source: https://github.com/feldera/feldera/
Did I miss something? They didn't mention why the new use case was slower. I was expecting some callback to that new usecase in the article somewhere.
Always enjoy reading about performance debugging, thanks for writing this.
Edit: they didn't talk about profiling either. It was an enjoyable read of rust serialisation for non rusty people though.
https://www.hopsworks.ai/post/rolling-aggregations-for-real-...
The most common programming language where "struct" and "class" are both kinds of user defined type is C++. In C++ they only "work differently" in that the default accessibility is different, a "struct" is the same as a "class" if you change the accessibility to public at the top with an access specifier.
If you think you saw a bigger difference you're wrong.
The overhead of screwing up NUMA concerns vastly outstrips any kind of class vs struct differences. It's really one of the very last things you should be worrying about.
Allocating an array of a class vs an array of struct might seem like you're getting a wildly different memory arrangement, but from the perspective of space & time this distinction is mostly pointless. Where the information resides at any given moment is the most important thing (L1/L2/L3/DRAM/SSD/GPU/AWS). Its shape is largely irrelevant.
I would assume because then the shape of the data would be too different? SOAs is super effective when it suits the shape of the data. Here, the difference would be the difference between an OLTP and OLAP DB. And you wouldn't use an OLAP for an OLTP workload?
Really? I've never had to do any serious db work in my career, but this is a surprise to me.
So if you want "what C does" you can just repr(C) and that's what you get. For most people that's not a good trade unless they're doing FFI with a language that shares this representational choice.