Current implementation has the following limitations:
      Maximum object size: 65534 keys
      The order of object keys is not preserved
      ...
    These limitations may be lifted by using more bytes to store offset pointers and counts at the binary level, though it's hard to imagine a real application that would need that.
On the other hand, most database decisions are about finding the sweet-spot compromise tailored toward the common use case they are aiming for, but your comment sounds like you are expecting a magic trick.
Sticking data into the keys is definitely a thing I've seen.
One I've done personally is dump large portions of a Redis DB into a JSON object. I could guarantee for my use case it would fit into the relevant memory and resource constraints but I would also have been able to guarantee it would exceed 64K keys by over an order of magnitude. "Best practices" didn't matter to me because this wasn't an API call result or something.
There are other things like this you'll find in the wild. Certainly some sort of "keyed by user" dump value is not unheard of and you can easily have more than 64K users, and there's nothing a priori wrong with that. It may be a bad solution for some specific reason, and I think it often is, but it is not automatically a priori wrong. I've written streaming support for both directions, so while JSON may not be optimal it is not necessarily a guarantee of badness. Plus with the computers we have nowadays sometimes "just deserialize the 1GB of JSON into RAM" is a perfectly valid solution for some case. You don't want to do that a thousand times per second, but not every problem is a "thousand times per second" problem.
FoundationDB makes extensive use of this pattern, sometimes with no data outside the key at all.
SICK: Streams of Independent Constant Keys
And "maps" seems to be a use case it is deliberately not aiming at.
You could store this as two columnar arrays but that is annoying and hardly anyone does that.
You would do a query like "give me all users with age over 18" or something and return a `{ [id: string]: User }`
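For illustration, a small sketch of the two shapes discussed above (the types and values are made up, nothing here comes from SICK):

```scala
// Shape 1 is the "keyed by id" object that commonly blows past 64K keys;
// shape 2 is the columnar alternative that avoids a huge object but is
// more awkward to consume.
final case class User(name: String, age: Int)

// Shape 1: { [id: string]: User }, one object key per user id.
val keyedById: Map[String, User] =
  Map("u1" -> User("Ada", 36), "u2" -> User("Linus", 19))

// Shape 2: two parallel (columnar) arrays, joined by position.
val ids: Vector[String] = Vector("u1", "u2")
val users: Vector[User] = Vector(User("Ada", 36), User("Linus", 19))
```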
If that optimization isn't for you, choose a different library.
If that optimization works for your use-case, it can make a huge difference.
The whole point of this project is to handle efficiently parsing "huge" JSON documents. If 65K keys is considered outrageously large, surely you can make do with a regular JSON parser.
You can split it yourself. If you can't, replace Shorts with Ints in the implementation and it would just work, but I would be very happy to know your usecase.
Just bumping the pointer size to cover relatively rare usecases is wasteful. It can be partially mitigated with more tags and tricks, but it still would be wasteful. A tiny chunking layer is easy to implement and I don't see any downsides in that.
Presumably 4 bytes dedicated to the keys would be dwarfed by any strings thrown into the dataset.
Regardless, other than complexity, would there be any reason not to support a dynamic key size? You could dedicate the first 2 bits of the key pointer to its length: 1 byte would work if there are only 64 keys, 2 bytes would give you 16K keys, and 3 bytes about 4M. And if you wanted to, you could use a frequency table to order the pointers so that more frequently used keys get the smaller values in the dictionary.
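Roughly what that proposal could look like, as a sketch (the layout and function names are made up, not SICK's actual format): the top two bits of the first byte say how many bytes the key pointer occupies, and the remaining bits carry the index, so 6 bits cover 64 keys, 14 bits 16K, 22 bits about 4M.

```scala
// Variable-width key pointer: width tag in the top 2 bits, index in the rest.
def encodeKey(index: Int): Array[Byte] = {
  val width =
    if (index < (1 << 6)) 1
    else if (index < (1 << 14)) 2
    else if (index < (1 << 22)) 3
    else 4 // 4 bytes leave 30 bits for the index
  val tagged = ((width - 1).toLong << (8 * width - 2)) | index
  ((width - 1) to 0 by -1).map(i => ((tagged >> (8 * i)) & 0xFF).toByte).toArray
}

def decodeKey(bytes: Array[Byte]): (Int, Int) = { // returns (index, bytes consumed)
  val width = ((bytes(0) & 0xC0) >>> 6) + 1
  val raw   = bytes.take(width).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xFF))
  ((raw & ((1L << (8 * width - 2)) - 1)).toInt, width)
}
```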
What is the difference?
The limitation comes with benefits.
I was just responding to the “X is an absurd way to do JSON”. Which seemed to single out objects vs arrays.
Like in this case maybe, but I don’t see a reason to make that general statement.
If you need support for larger structures, you may create your own implementation or extend ours (and I would really like to hear about your usecase).
SICK as a concept is simple. SICK as a library was created to cover some particular usecases and may not be suitable for everyone. We would welcome any contributions.
I have another binary encoding for different purposes (https://github.com/7mind/baboon) which relies on varints; in the case of SICK I decided to go with pointers of constant size to save some pennies on access efficiency.
If I were to add support for a larger number of keys, I would probably introduce two versions of the data structure, with 16-bit and 32-bit indexing. And, maybe, 8-bit indexing for tiny key counts. But that would definitely complicate the design, and it should only be done when there's a real need.
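A rough sketch of that idea (illustrative only, not the current wire format): a flag in the structure header would select the index width, so small documents keep the 2-byte pointers and only large ones pay for 4.

```scala
import java.nio.ByteBuffer

// Hypothetical dual-width indexing selected per structure.
sealed trait IndexWidth { def read(buf: ByteBuffer): Int }
case object Index16 extends IndexWidth { def read(buf: ByteBuffer): Int = buf.getShort() & 0xFFFF }
case object Index32 extends IndexWidth { def read(buf: ByteBuffer): Int = buf.getInt() }

def widthFor(maxEntries: Int): IndexWidth =
  if (maxEntries <= 65534) Index16 else Index32 // 65534 is the documented cap today
```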
Every such decision is a trade-off; I think yours is fine.
It needs to be the very first key in the object. I’ve been bitten by this because postgresql’s jsonb also does not preserve the key ordering.
I believe the latest .net release addresses this but key ordering does matter sometimes.
> An object is an unordered collection of zero or more name/value pairs, [...]
Further, since RFC 7159:
> JSON parsing libraries have been observed to differ as to whether or not they make the ordering of object members visible to calling software. Implementations whose behavior does not depend on member ordering will be interoperable in the sense that they will not be affected by these differences.
Both are in the current version (RFC 8259).
OTOH, I find the "but the order is not supposed to be guaranteed!" debate REALLY stupid when it comes to software where it's clear that at some point, a human will have to look at the content and correlate it with another system.
There's nothing more evil than re-shuffling JSON just for the fun of it and making everyone who has to look at the result miserable. Yes, I'm talking about you, ELK devs.
Edit: (And/or whoever wrote the underlying Java/Go libs they use for JSON that don't allow developers to patch ordering in. I remember reading GitHub issues about this.)
The underlying data structures between both are different. Ordered hash maps use more memory, are slower, and are more complicated.
Knowing CS fundamentals, using an ordered hash map should be a deliberate choice like renting a box truck when you need to move a lot of stuff. Don’t just drive a box truck everywhere because you might pick up a couch from a thrift store one day.
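To make the trade-off concrete, a tiny sketch using only standard library types:

```scala
import scala.collection.mutable

// An insertion-ordered map keeps extra linkage alongside the hash table,
// so preserving key order is an explicit (and slightly more expensive) choice.
val unordered = mutable.HashMap("b" -> 2, "a" -> 1, "c" -> 3)
val ordered   = mutable.LinkedHashMap("b" -> 2, "a" -> 1, "c" -> 3)

println(unordered.keys.mkString(",")) // iteration order is unspecified
println(ordered.keys.mkString(","))   // b,a,c: insertion order preserved
```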
And yet, as I said, if the same thinking gets applied to e.g. a store of JSON documents (like ELK), chances are good the thing will ruin the UX for countless people who have to deal with the result. Note that you need exactly no hash maps to store the JSON as it is text.
To expand your analogy: …and yet roads are built so that you can drive your regular car or a box car over them, depending on your use case. You make the choice. A JSON library that doesn't afford such choices (and isn't hyper focused on performance) isn't a good one in my book.
Edit: As a sidenote: Or do you mean a freight train wagon? Then replace "road" with "rails" and "car" with "draisine" :)
Essentially, SICK also maintains some strange order based on values of some homegrown trivial hash function but the only right approach to JSON objects is to treat their keys as an unordered set.
Secondly, I fail to see the advantages here, as the claim is that it allows streaming for partial processing, compared to JSON that has to be fully loaded in order to be parseable. Mainly because the values must be streamed first, before their locations/pointers, in order for the structure to make sense and be usable for processing; but that also means we need all the parent pointers as well in order to know where to place the children in the root. So all in all, I just do not see why this format is advantageous over JSON (as that is its main complaint here), since you can stream JSON just as easily: you can detect the { and } and [ and ] and " and , delimiters and know when your token is complete and can be processed, without having to wait for the whole structure to finish being streamed or waiting for the SICK pointers to arrive in full so you can build the structure.
Or, I am just not getting it at all...
It's a specific representation of JSON-like data structures, with an indexed deduplicated binary format and JSON encoders and decoders. Why "nothing"? It's all about JSON.
Mostly it's not about streaming. More efficient streaming is a byproduct of the representation.
> because you can detect { and } and [ and ] and " and ,
You need a pushdown automaton for that. In the case of SICK you don't need potentially unbounded accumulation for many (not all) usecases.
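To illustrate that point, a simplified sketch (not SICK code): deciding whether a streamed JSON value is complete means remembering every unclosed '{' / '[' and whether you are inside a string literal, i.e. keeping a stack that grows with nesting depth. Top-level scalars and malformed input are not handled here.

```scala
def isBalanced(text: Iterable[Char]): Boolean = {
  val stack    = scala.collection.mutable.Stack[Char]()
  var inString = false
  var escaped  = false
  text.foreach { c =>
    if (inString) {
      if (escaped) escaped = false
      else if (c == '\\') escaped = true
      else if (c == '"') inString = false
    } else c match {
      case '"'       => inString = true
      case '{' | '[' => stack.push(c)
      case '}' | ']' => if (stack.nonEmpty) stack.pop() // bracket kinds not validated
      case _         => ()
    }
  }
  stack.isEmpty && !inString
}
```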
> the values must be streamed first, before their location/pointers
Only to avoid accumulation. If you are fine with (some) accumulation, you can reorder. Also think about the updates.
But again, streaming is a byproduct. This tool is an indexed binary deduplicating storage which does not require parsing and provides amortized O(1) access time.
> There is an interesting observation: when a stream does not contain removal entries it can be safely reordered.
So if I'm understanding, the example in the readme could be sent in reverse, allowing the client to immediately use root:0 and then string:2 while the rest streams in.
I was looking for something like this, but my use case exceeds the 65k key limit for objects.
The limit comes from 2-byte element pointer size. That can be adjusted. We don't have an implementation with larger pointers but it can be done easily.
> while the rest streams in
Yes, there are many usecases where you can use some chunks of the data/rebuild some parts of the structures immediately without any accumulation. The problem is that we don't have a nice streaming abstraction which would suit anyone for any usecase.
SICK as a library is an efficient indexed binary storage for JSON with listed limitations.
SICK as a concept is much more but you might need your own implementation tailored to your usecase.
Overall, I think that mentioning JSON here at all is simply a mistake. It would be better to just introduce this as a streaming protocol/framework for data structures. But then we can do the same thing with literally any format and syntax.
It's literally a deduplicated indexed binary storage for JSON (plus an approach to JSON representation more suitable for streaming than serialized JSON).
> we can do the same thing
I would highly encourage doing the same things! For some reason people love to fully parse a 50 MiB JSON document when they need just 0.1% of the data in it.
However I found that in cases where you have the requirements of streaming/random access and every entry is the same... SQLite is a really great choice. It's way faster and more space efficient than it has any right to be, and it gets you proper random access (not just efficient streaming), and there are nice GUIs for it. And you can query it.
If we need more bindings for the projects we work on, we will implement and open-source them. E.g. recently we added rudimentary JS support (no cursors, just an encoder/decoder).
For many reasons, we avoid working on something we don't use ourselves and we are not paid for. But your contributions are very welcome. Also we would be happy to have you as a paying client.