In the past I've abused virtual memory systems to block off a bunch of pages after my array. This lets you use an array data structure, have guard pages to prevent out-of-bounds access, and keep stable pointers into the data structure.
And if you find you didn't reserve enough address space, Linux has mremap(), which can grow the reserved region. Or map the region to two places at once (the original place and a new, larger one).
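For anyone who hasn't done this, here's a minimal sketch of the reserve-then-commit pattern on Linux (my own constants and layout, not the parent's code): reserve a big PROT_NONE region, commit only what the array actually uses, and let anything past the end fault on the still-inaccessible pages.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t reserve = (size_t)1 << 30;   /* reserve 1 GiB of address space */
        size_t commit  = (size_t)1 << 20;   /* commit only the first 1 MiB    */

        /* Reserve address space with no access rights; nothing is backed yet. */
        char *base = mmap(NULL, reserve, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Commit just the part the array actually uses; everything past it
           stays PROT_NONE and acts as a guard region. */
        if (mprotect(base, commit, PROT_READ | PROT_WRITE) != 0) {
            perror("mprotect"); return 1;
        }

        memset(base, 0, commit);            /* fine */
        /* base[commit] = 1;                   would SIGSEGV on the guard pages */

        munmap(base, reserve);
        return 0;
    }

Growing is then just another mprotect() over more of the reservation; only once the reservation itself runs out do you need mremap().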
The space I needed was too large to be added to the heap, so I used mmap. Because of the nature of the processing (mmap, process, yeet mmap) I put the system under a lot of pressure. Maintaining a set of mapped blocks and reusing them fixed the issue.
https://en.wikipedia.org/wiki/Intel_5-level_paging introduced in Ice Lake 6 years ago.
But anyways, isn't this just a variant of std::deque? https://en.cppreference.com/w/cpp/container/deque.html
And it can't be fixed due to binary compatibility.
https://github.com/microsoft/STL/issues/147
By contrast, the GNU implementation has a block size of 512 bytes.
Fortunately, in high-performance systems the cases where you actually want an unbounded queue are limited.
(1) deque uses fixed-size blocks, not increasing-size blocks. (2) deque supports prepending, which adds another level of indirection internally.
Your indexing now has some real math to do, which can be annoying from an efficiency perspective, but I think you can still get O(1) with careful allocation in powers of 2.
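Sketch of what that careful power-of-two layout buys you (my own code, not the article's): if segment k holds 2^k elements, the index math is one add, one count-leading-zeros, and one xor.

    #include <stdint.h>

    /* My own sketch of power-of-two segment indexing, not the article's code.
       Segment k holds 2^k elements (1, 2, 4, 8, ...), so exactly 2^k - 1
       elements live in the segments before it. */
    static inline void seg_index(uint64_t i, uint32_t *segment, uint64_t *offset) {
        uint64_t n = i + 1;                       /* shift so boundaries land on powers of two */
        uint32_t k = 63u - (uint32_t)__builtin_clzll(n); /* floor(log2(i+1)); GCC/Clang builtin,
                                                            compiles to clz / lzcnt / bsr */
        *segment = k;
        *offset  = n ^ ((uint64_t)1 << k);        /* clear the top bit: i - (2^k - 1) */
    }

That's why the indexing function disassembles to only a handful of instructions.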
That said, IMO "stable pointers" is overrated; "minimize copying" is all that's useful.
With this power-of-twos approach you can't truly delete from the front of the array, but the number of pointers you store is constant and the memory fragmentation is better. (Though OP never claimed to want to support deque behavior, it shouldn't be that hard to modify, though indexing would have to go through more arithmetic again.)
I haven't used OP's array, but I have been bitten plenty of times by std::deque's memory allocation patterns and had to rewrite with raw arrays and pointer tracking.
As long as the number of slots in a bucket is a power of two, division reduces to a right shift (I have no idea what std::deque does, though). One of the advantages of not growing exponentially is that you can use a deque the way you would a ring buffer without it attempting to eat the entire address space.
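Concretely, with made-up constants (not what any particular std::deque implementation actually uses), the bucket/slot math with a power-of-two bucket size is just a shift and a mask:

    #include <stdint.h>

    /* Made-up constants: 512 slots per bucket. With a power-of-two bucket
       size, the divide and modulo in the index math become a shift and a mask. */
    enum { BUCKET_SHIFT = 9, BUCKET_MASK = (1u << BUCKET_SHIFT) - 1 };

    static inline void bucket_index(uint64_t i, uint64_t *bucket, uint32_t *slot) {
        *bucket = i >> BUCKET_SHIFT;            /* i / 512 */
        *slot   = (uint32_t)(i & BUCKET_MASK);  /* i % 512 */
    }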
I think VMS (or was it Tru64?) uses this mode, but many other OSes just use 43-bit or 40-bit addressing. Realistically though, I don't think many users had workloads that addressed more than 38 bits of contiguous VA space in 1998-1999.
MSVC uses a block size that is too small, making it worthless. libc++'s block size is 16 elements or 4096 bytes.
It is generally better to use a container whose implementation details you can actually understand and control.
I would not call it a variant of std::deque myself. Not wrong. But not a helpful observation imho.
Its main advantages are the O(log n) time complexity for all size changes at any index, meaning you can efficiently insert and delete anywhere, and it is easy to implement copy-on-write version control on top of it.
The only time I've ever wanted ropes is in text editing - either in an editor or in a CRDT library. They're a good choice for text editing because they let users type anywhere in a document. But that comes at a cost: rope implementations are very complex (skip lists have similar complexity to a B-tree) and they can be quite memory inefficient too, depending on how they're implemented. They're a bad choice for small strings, immutable strings, and append-only strings - which, as I said, are the most common string types.
Ropes are amazing when you need them. But they don't improve the performance of the average string, or the average program.
Only use them when the theoretical algorithmic properties make them the only tool for the job.
If you have a small dataset, yeah, memcpy will outperform a lot of indirect pointer lookups. But that doesn't stay true once you're memcpying around megabytes of data. The trick with indirect data structures on modern hardware is to tune the size of internal nodes and leaf nodes to make the cache misses worth it. For example, binary trees are insanely inefficient on modern hardware because the internal nodes have a fanout of 2. If you give them a fanout of 64 or something, they perform much better (i.e., make a B-tree). Likewise, a lot of bad tree implementations put just a single item in the leaf nodes. It's much better to have leaves store blocks of 128 items or something, and use memcpy to move data around within the block when needed.
This gets you the best of both worlds.
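As a rough sketch of the leaf-block idea (the sizes and names are mine, not from any particular library):

    #include <assert.h>
    #include <string.h>

    /* Illustrative leaf layout: a fixed block of items so one cache miss is
       amortised over many elements, with memmove doing the shuffling inside
       the block. */
    #define LEAF_CAP 128

    typedef struct {
        int items[LEAF_CAP];
        int len;
    } Leaf;

    /* Insert `value` at `pos`, shifting the tail right with a single memmove. */
    static void leaf_insert(Leaf *leaf, int pos, int value) {
        assert(leaf->len < LEAF_CAP && pos >= 0 && pos <= leaf->len);
        memmove(&leaf->items[pos + 1], &leaf->items[pos],
                (size_t)(leaf->len - pos) * sizeof leaf->items[0]);
        leaf->items[pos] = value;
        leaf->len++;
    }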
I spent about 18 months optimising a text-based CRDT library (diamond types). We published a paper on it. By default, we store the editing history on disk. When you open a document, we reconstruct the document from scratch from a series of edits. After a while, actually applying the stream of edits to a text document became the largest performance cost. Ropes were hugely useful. There's a stack of optimisations we made there to make that replay another 10x faster or so on top of most rope implementations. Using a linear data structure? Forget it. For nontrivial workloads, you 100% want indirect data structures. But you've gotta tune them for modern hardware.
But yeah. Contiguous dynamic arrays and hash tables, those are usually what you want.
Do any vector implementations & allocators actually do this though?
As far as "has anyone implemented this?" -- I don't know.
I've needed to add knobs to configure it, because even a handful of 4GB instances causes issues. I've defaulted to 256MB/instance, and for my own GitHub CI use 32MB/instance to reduce test flakiness.
This is to say: I found the idea that “just reserve address space, it's not an issue if you don't commit it” very flaky in practice, unless you're running on bare-metal Linux.
Virtual address space just isn't free.
It doesn't need to be free, but your address space use should be able to go a handful of bits above your physical memory without problems.
Even fully allocating all the page-table entries, with no large pages, on an architecture with 4KB pages, would only take 8MB to track a 4GB allocation (4GB / 4KB = ~1M entries at 8 bytes each).
The LeanStore and Umbra DB buffer pool management systems intrigued me. I even got paid to work on a system like this (buffer pool at an analytical database company). The KISS data structure (https://dl.acm.org/doi/10.1145/2236584.2236587) and the START adaptive radix tree (https://db.in.tum.de/~fent/papers/Self%20Tuning%20Art.pdf?la...) have also intrigued me. I've been very intrigued and almost always disappointed.
In practice performance gains have never really materialized for me ... I do think there are wins to be had here, but only for specialized cases.
And just watch how confused and annoyed your users are when they see an absolutely huge VSS. Sysadmins freaking out at your "insane memory usage" (which is really only on paper, not in reality). And the difficulty of easily profiling or tracking what's going on.
Ha, that is wishful thinking. If you do this in a tight loop in which everything is in the L1 cache, the instructions hurt!
"Memory bandwidth is the bottleneck" reasoning applies when you access bulk data without localized repetition.
Also, from a strictly prose point of view, isn't it strange that the `clz` instruction doesn't actually appear in the 10-instruction disassembly of the indexing function? It feels like it was optimized out by the compiler perhaps due to the index being compile-time known or something, but after the setup and explanation that was a bit jarring to me.
Edit: it's CLZ on Arm [1], probably what I was looking for.
[1]: https://developer.arm.com/documentation/100069/0610/A32-and-...
Is it though? (Ab)using C macros so you can write obviously-not-C stuff like (from the example):
SegmentArray(Entity) entities = {0};
Seeing that kind of thing in example C code just makes my hair stand on end, because you know it's someone who actually wants to write C++ but for whatever reason has decided to implement their thing in C and be clever about it. And I'm going to have to parse through multiple levels of macro indirection just to understand what the hell is going on.
Seems like a useful data structure, despite the shortcoming that it can't be accessed like a regular array. Normally, auto-expanding arrays involve realloc, which is tricky with arena allocation. But jeez, just pass void pointers plus a size and have it assert if there's a mismatch.
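Something along these lines (hypothetical names, not the library's actual API), where the element size passed at each call site is the runtime "type check":

    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical void-pointer interface, not the library's actual API.
       The container remembers the element size it was created with; every
       call site passes sizeof the element, and a mismatch trips the assert.
       Growth here is a plain realloc for brevity; a segmented array would
       append a new segment instead. */
    typedef struct {
        void  *data;
        size_t elem_size, len, cap;
    } AnyArray;

    static AnyArray any_array_make(size_t elem_size) {
        return (AnyArray){ .elem_size = elem_size };
    }

    static void any_array_push(AnyArray *a, const void *elem, size_t elem_size) {
        assert(elem_size == a->elem_size && "element size mismatch");
        if (a->len == a->cap) {
            a->cap  = a->cap ? a->cap * 2 : 16;
            a->data = realloc(a->data, a->cap * a->elem_size);
        }
        memcpy((char *)a->data + a->len * a->elem_size, elem, a->elem_size);
        a->len++;
    }

    /* Usage:
         typedef struct { int a; } Entity;
         AnyArray entities = any_array_make(sizeof(Entity));
         Entity e = { .a = 1 };
         any_array_push(&entities, &e, sizeof e);  */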
It's using the `bsr` instruction, which is similar (but worse). The `lzcnt` instruction on x86_64 is part of the BMI feature set introduced with Intel Haswell. The compiler does not generate these instructions by default, so the code runs on any x86_64.
If you add `-mbmi` or `-march=haswell` or newer to the compiler command line, you should get `clz`/`lzcnt` instead.
One frequent reason to use an array is to iterate the items. In those cases, non-contiguous memory layout is not ideal.
2. Iterating a 4-billion-item segmented array would have 26 cache misses. Not a big deal.
[1]: https://danielchasehooper.com/posts/typechecked-generic-c-da...
Mostly, the thing that feels strange is that with, say, n > 10 segments, the smallest segment will be less than a thousandth the size of the largest, and iterating over the first half of the array will access n-1 or n-2 segments (worse cache behaviour), while iterating over the second half will access only one or two.
Seems like, in most cases, you would want to be able to collapse those earlier segments together.
Desktop gcc main.c
Desktop ./a.out
entities[0].a = 1
entities[0].a = 1
entities[1].a = 2

https://gist.github.com/fpf3/71c72e224e1c82d9a5d37be621def42...
The errors make sense. You can't put a comma-separated initializer into a macro... What even is the symbol `entity`? It's not even clear to me what is meant by that, he doesn't define it anywhere.
edit: Looks like he updated his header since I first tried to compile. Now it works fine but the header looks significantly more complicated.