https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
[1]: https://github.com/git/git/graphs/contributors?from=6%2F3%2F...
I don’t write exotic algorithms, but it’s always astounding how small n needs to be to become observably problematic.
https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
For n of 10^9, where lg n takes 0.03 us and n takes 1 s, n lg n takes 29.9 s and n^2 takes 31.7 years.
Also, the wise statement that 'memory is fairly cheap compared to CPU for scaling'. It's insane how often folks would rather manually open and scan a 'static-on-deploy' 20-100MB JSON file for each request than just parse it once into structures in memory (where, in most cases, the in-memory representation is a fraction of the size of the JSON itself) and cache the parsed structure for the lifetime of the application.
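For illustration, a minimal Go sketch of the cached approach (the file name, struct fields, and endpoint are made up): parse the static JSON once at startup and hand every request the already-parsed structure.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
)

// Product mirrors one entry in a hypothetical static catalog.json that
// ships with the deploy and never changes at runtime.
type Product struct {
	ID    string  `json:"id"`
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

// catalog is parsed exactly once and then shared read-only by all requests.
var catalog map[string]Product

func loadCatalog(path string) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	// catalog.json is assumed to be a JSON object keyed by product ID.
	return json.Unmarshal(raw, &catalog)
}

func main() {
	if err := loadCatalog("catalog.json"); err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/product", func(w http.ResponseWriter, r *http.Request) {
		// No file I/O or re-parsing per request: just a map lookup.
		p, ok := catalog[r.URL.Query().Get("id")]
		if !ok {
			http.NotFound(w, r)
			return
		}
		json.NewEncoder(w).Encode(p)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```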
Less brittleness is worth paying a few percent. Especially if it unmuddies the waters enough for someone to spot other accidental (time) complexity.
But I also don't dabble in this area nearly enough to know whether there's years of tears and toil finding out repeatedly that O(n) is ~impossible to implement and verify :)
| n | n log n |
| 5 | 8.0472 |
| 10 | 23.0259 |
| 25 | 80.4719 |
| 50 | 195.6012 |
| 100 | 460.5170 |
If you expect that n < 100 will always hold, it may be better to implement the O(n) algorithm and log a warning if n > 250 or so (and, maybe, a fatal error if n > 1000 or so), instead of spending time writing both versions of the algorithm and finding the cutoff value for choosing between the two.
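A rough sketch of that guard in Go (the thresholds, `Item`, and `runAlgorithm` are placeholders, not anything from the article): one implementation, plus loud checks for when the size assumption stops holding.

```go
package sizecheck

import (
	"fmt"
	"log"
)

// Item is a stand-in for whatever the algorithm actually operates on.
type Item struct{ Key string }

// processAll assumes n < 100 in practice. Instead of maintaining two
// implementations plus a tuned cutoff between them, it runs the one
// implementation we chose to write and makes noise if the size
// assumption is ever violated.
func processAll(items []Item) error {
	n := len(items)
	switch {
	case n > 1000:
		return fmt.Errorf("processAll: n=%d far exceeds the assumed bound; refusing to continue", n)
	case n > 250:
		log.Printf("WARNING: processAll called with n=%d; the n<100 assumption no longer holds", n)
	}
	return runAlgorithm(items)
}

// runAlgorithm is the single implementation (hypothetical placeholder).
func runAlgorithm(items []Item) error { return nil }
```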
One of the simplest ways to guard against cyclic graphs, instead of building a lookup table or doing something non-concurrent like marking the nodes, is to count nodes and panic if the encountered set is more than an order of magnitude larger than you expected.
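Something like this, in Go (the types and the 10x budget are illustrative, and it returns an error rather than literally panicking):

```go
package graphwalk

import "fmt"

// Node is a minimal graph node for illustration.
type Node struct {
	Value    int
	Children []*Node
}

// walk visits nodes without any visited-set or marking. It simply counts
// visits and fails once the count exceeds an order of magnitude more nodes
// than a legitimate graph could contain, which catches cycles cheaply and
// without shared mutable state.
func walk(root *Node, expectedMax int, visit func(*Node)) error {
	budget := expectedMax * 10
	count := 0
	var rec func(*Node) error
	rec = func(n *Node) error {
		if n == nil {
			return nil
		}
		count++
		if count > budget {
			return fmt.Errorf("walk: visited %d nodes (expected at most %d); graph is probably cyclic", count, expectedMax)
		}
		visit(n)
		for _, c := range n.Children {
			if err := rec(c); err != nil {
				return err
			}
		}
		return nil
	}
	return rec(root)
}
```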
I came onto a project that had done that before and it blew up during my tenure. The worst case graph size was several times the expected case, and long term customers were growing their data sets vertically rather than horizontally (eg, ever notice how much friction there is to making new web pages versus cramming more data into the existing set?) and now instead of 10x never happening it was happening every Tuesday.
I was watching the same thing play out on another project recently but it got cancelled before we hit that threshold for anything other than incorrect queries.
https://bsky.app/profile/randomascii.bsky.social/post/3lr24s...
If the cost of doing something goes above quadratic, you shouldn't do it at all. Because essentially every customer interaction costs you more than the one before. You will never be able to come up with ways to cover that cost faster than it ramps. You are digging a hole, filling it with cash and lighting it on fire.
If you can't do something well you should consider not doing it at all. If you can only do it badly with no hope of ever correcting it, you should outsource it.
Software operates in a crazy number of different domains with wildly different constraints.
There's a bit of a "What Computational Complexity Taught Me About B2B SaaS" bias going on here.
Not every app is a B2C product intending to grow to billions of users. If the costs start out as near-zero and are going to grow to still be negligible at 100% market share, who cares that it's _technically_ suboptimal? Sure, you could spend expensive developer-hours trying to find a better way of doing it, but YAGNI.
The devs voted with their feet and the customers with their wallets.
That's a pretty bold assumption.
Almost every startup that has succeeded was utterly unscalable at first in tons of technical and business ways. Then they fixed it as they scaled. Over-optimizing early has probably killed far more projects and companies than the opposite.
That’s not a bold assumption; it’s the predicate for this entire sidebar. The commenter at the top said some things can’t be done in quadratic time and have to be done anyway, and I took exception.
>> unless a more optimal solution does not exist
Dropping into the middle of a conversation and ignoring the context so you can treat the participants like they are confused or stupid is very bad manners. I’m not grumpy at you; I’m grumpy that this is the eleventeenth time this has happened.
> Almost every startup
Almost every startup fails. Do you model your behavior on people who fail >90% of the time? Maybe you, and perhaps by extension we, need to reflect on that.
> Then we fixed it as we scaled
Yes, because you picked a problem that can be architected to run in reasonable time. You elected to do it later. You trusted that you could delay it and turned out to be right.
>> unless a more optimal solution does not exist
When the devs discover the entire premise is unsustainable or nobody knows how to make it sustainable after banging their heads against it, they quickly find someplace else to be and everyone wonders what went wrong. There was a table of ex employees who knew exactly what went wrong but it was impolitic to say. Don’t want the VCs to wake up.
n is only rarely related to "customers". As long as n doesn't grow, the asymptotic complexity doesn't actually matter.
I’m on the fence about cubic time. I was mostly thinking of exponential and factorial problems. I think some very clever people can make cubic work despite my warnings. But most of us shouldn’t. General advice is to be ignored by masters when appropriate. That’s also the story arc of about half of kung fu movies.
Did chess solvers really progress much before there was a cubic approximation?
Thank you for the courtesy.
> I think some very clever people can make cubic work despite my warnings.
I think you're selling yourself short. You don't need to be that clever to make these algorithms work; you have all the tools necessary. Asymptotic analysis is helpful not just because it tells us a growth rate, but also because it limits that growth to being in _n_. If you're doing matmul and n is proportional to the size of the input matrix, then you know that if your matrix is a constant size then the matmul will always take the same time. It does not matter to you what the asymptotic complexity is, because you have a fixed n. In your program, it's O(1). As long as that runtime is acceptable, you know it will never change for the lifetime of the program.
There's absolutely no reason to be scared of that kind of work, it's not hard.
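As a trivial illustration (Go, with an arbitrarily chosen fixed size): once n is pinned by the program's design, the naive cubic matmul is a constant-cost operation for the program's lifetime.

```go
package matmul

// N is fixed by the program's design; with n pinned like this, the naive
// O(n^3) multiply below always does exactly the same amount of work, so
// its asymptotic class never shows up as growth at runtime.
const N = 64

func mul(a, b *[N][N]float64) [N][N]float64 {
	var c [N][N]float64
	for i := 0; i < N; i++ {
		for k := 0; k < N; k++ {
			aik := a[i][k]
			for j := 0; j < N; j++ {
				c[i][j] += aik * b[k][j]
			}
		}
	}
	return c
}
```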
If SAT solvers suddenly got inordinately more expensive, you’d use a human, because humans used to do this work before the solver became better and cheaper.
Edit: checking my math, looks like in a 15 year period from around 2005 to 2020, AMD increased the number of cores by about 30x and the transistors per core by about 10x.
"Oh well my algorithm isn't really O(N^2) because I'm going to print N copies of the answer!"
Absurd!
I would argue that this is essentially part of why Intel is flagging now. They had a model of ever increasing design costs that was offset by a steady inflation of sales quarter after quarter offsetting those costs. They introduced the “tick tock” model of biting off a major design every second cycle and small refinements in between, to keep the slope of the cost line below the slope of the sales line. Then they stumbled on that and now it’s tick tick tock and clearly TSM, AMD and possibly Apple (with TSM’s help) can now produce a better product for a lower cost per gate.
Doesn’t TSM’s library of existing circuit layouts constitute a substantial decrease in the complexity of laying out an entire chip? As the library grows you introduce more precalculated components that are dropped in, bringing the slope of the line down.
Meanwhile NVIDIA has an even better model where they spam gpu units like mad. What’s the doubling interval for gpu units?
I've seen it several times before, and it's exactly what happened here.
Things like inserting the test data first and turning on constraints and possibly indexes afterward.
I spent a lot of time fixing n^2 in blink, but there were some fun surprises:
https://source.chromium.org/chromium/chromium/src/+/main:thi...
For large N without a cache, :nth-child matching would be very slow, doing n^2 scans of the siblings to compute the index. On the other hand, for small sibling counts it turned out the cache overhead was noticeably worse than just doing the n^2 scan. (See the end of the linked comment.)
This turns out to be true in a lot of surprising places, both where linear search beats constant time maps, and where n^2 is better than fancy algorithms to compensate.
Memory latency and instruction timing are the gotchas of many fun algorithms in the real world.
This is because performance is typically less important for the fast/small case, and it is generally acceptable for processing twice as much to be twice (or slightly more than twice) as slow. But users are far more likely to hit, and be really burned by, n^2 algorithms in things you thought would almost always be small, where you never tested large enough sizes to notice.
I wrote more on this topic here https://kevincox.ca/2023/05/09/less-than-quadratic/
If anyone had made clockless logic work you would see that adding 1 + 1 is in fact faster than adding 2^63 + 1.
If you put enough data into a hash table, the key length has to increase logarithmically with the table size in order to have distinct keys per record. Even Knuth points out that hash tables are really n log n - something I’m pretty sure my CS professors left out. In multiple classes. Man, did I get tired of hash tables, but I see now why they harped on them. Case in point, this article.
> where linear search beats constant time maps
Can you give an example? You said lots of good things in your post, but I'm struggling to believe this one. Also, it would help to see some wall clock times or real world impact.

Some example workloads include:
1. Tokenization (checking if a word is a keyword)
2. Interpretation (mapping an instruction name to its action)
3. ML (encoding low-cardinality string features in something like catboost)
4. JSON parsers (usually key count is low, so parse into a linear-scan hashmap rather than a general-purpose hashmap)
Details vary in the exact workload, the hardware you're using, what other instructions you're mixing in, etc. It's a well-known phenomenon though, and when you're doing a microoptimization pass it's definitely something to consider. 2x speedups are common. 10x or more happen from time to time.
It's similar to (but not _quite_ the same as) the reason real-world binary search uses linear scans for small element counts.
When you go to really optimize the system, you'll also find that the linear scan solution is often more amenable to performance improvements from batching.
As to how much it matters for your composite program? Even at a microoptimization level I think it's much more important to pay attention to memory access patterns. When we wrote our protobuf parser that's all we really had to care about to improve performance (33% less execution time for the entire workload, proto parsing being much better than that). You're much more likely to be able to write sane code that way (contrasted with focusing on instructions and whatnot first), and it's easier to layer CPU improvements on top of a good memory access pattern than to go the other way around.
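For intuition, a minimal Go benchmark sketch of the small-n case (keys and sizes are arbitrary); on many machines the linear scan wins at this scale because it avoids hashing and pointer chasing, though the crossover point is very hardware- and workload-dependent.

```go
package scanvsmap

import "testing"

// A handful of keys, as in the keyword-lookup case above.
var keywords = []string{"if", "else", "for", "while", "return", "break", "continue", "func"}

var keywordSet = func() map[string]bool {
	m := make(map[string]bool, len(keywords))
	for _, k := range keywords {
		m[k] = true
	}
	return m
}()

// Linear scan over a tiny slice: no hashing, contiguous memory.
func isKeywordScan(s string) bool {
	for _, k := range keywords {
		if k == s {
			return true
		}
	}
	return false
}

// Map lookup: O(1) asymptotically, but it pays for hashing on every call.
func isKeywordMap(s string) bool { return keywordSet[s] }

// sink keeps the compiler from discarding the benchmarked calls.
var sink bool

func BenchmarkScan(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = isKeywordScan("continue")
	}
}

func BenchmarkMap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = isKeywordMap("continue")
	}
}
```

Run it with `go test -bench .` to see where the crossover lands on your own hardware.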
Let's say you've got a linear array of bytes, and you want to see if it contains a specific value. What would a modern CPU need to execute? Well, we can actually compare 64 values at a time with _mm512_mask_cmp_epu8_mask! You still need a little bit of setup and a final "did any of them match" check, of course. Want to compare 512 values? You can probably do that in less than 10 clock cycles on modern machines.
Doing the same with a hash set? Better make sure that hash algorithm is fast. Sure it's O(1), but if calculating the hash takes 20 cycles it doesn't matter.
A string search algorithm that uses SIMD to quickly discard a majority of 16, 32 or 64 attempts in parallel, and then verifies the survivors quadratically (again 16, 32 or 64 bytes at a time), can go a very long way against a sublinear algorithm that understands needle structure but necessarily processes the haystack one byte at a time.
> That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
I am confused. There are plenty of open source, fast hash map impls in C.

That's the problem. A lot of these quadratic time algorithms don't set limits.
Even 'n!' is fine for small 'n'. Real production use cases don't have small 'n'.
Real production use cases absolutely have small n. You don't hear about them, because it's very hard for them to cause issues. Unless the use case changes and now the n is not small anymore and nobody noticed the trap.
It's been fine because "n" is "number of aircraft flying in this flight simulator" - and the simulator's engine starts to fail above around 2000 anyway. So even in the worst case it's still going to run within milliseconds.
As a self taught dev, when I encounter nested loops, I have to mentally go through them and try to see if they iterate through each item more than once. But that's not a very foolproof method.
It's a fairly simple thought experiment to ask yourself: what if there were 10x the items in this array? 100x? That is essentially what big-O notation is trying to quantify. You just don't need to do it that formally.
"We found that a large portion of the 48 hours taken to backup our rails respository was due to a function in `git bundle create` that checks for duplicate references entered as command line arguments. The check contained a nested for loop ( O(N2) ), which we replaced with a map data structure in an upstream fix to Git. The patch was accepted, but we also backported fix without waiting for the next Git version. With this change, backup time dropped to 41 minutes."
Generously, this writing style is supposed to show the business value of teams and individuals, for promotions or other recognition. But yeah, it can be frustrating to read this style.
ChatGPT has ruined bullet points for the rest of us…
No offense, but writing this blog post couldn’t have taken more than a few minutes, so why spoil it with an LLM? Shoot, use one to check grammar and recommend edits, even.
1. https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
Why aren't they just snapshotting and archiving the full git repo? Does `git bundle` add something over frequent ZFS backups?
https://git-scm.com/docs/gitfaq#_transfers
It doesn't tell you how to make a backup safely though.
On a personal scale, Syncthing and Btrfs snapshots work plenty good enough. It's as fast as the storage/network too.
Because it's at a personal scale, the only time I can corrupt a git repo is if I work on the same repo (and it's workdir) from more than one device in the time it takes for Syncthing to replicate the changes.
But even then it's not a big deal because git fsck is quick. And I have my snapshots, and the syncthing versioning, and git defaults to two weeks before pruning. And because of how git works, using hash to identify contents, files are not easily overwritten either.
In 10y I only had one git corruption (I ran a command on the same repo on a different machine via ssh, yielding a Syncthing conflict). Syncthing kept copies of the conflict file. One commit disappeared from the history but not from the database. It was easy to rebase the changes. I think I used git fsck to delete the Syncthing versioned files.
That said, there's another lesser-known feature that bundles help with when used with `git clone --bundle-uri`. The client can specify the location of a bundle, or the server can send the bundle location in the clone results; the client then fetches the bundle, unpacks it, and catches up on the delta via the git server. That makes cloning large repos a lot lighter weight on the server, and a ton faster for the client on initial clones.
EDIT: you could also rsync from a .zfs snapshot directory if you have them enabled.
... is this really the way people "back up" git repos? I mean, it is git, so isn't there some way to mirror changes to the repo in another repo and just use ZFS / snapshots / backup software / etc to do that? It's a distributed version control system. Just make sure the version control information is ... distributed?
Even with the quadratic complexity I suspect this problem has been brewing for a while. This commit has been biding its time for 16 years waiting to ruin someone’s day (let’s not confuse the utility of Make it Work early in a project with justifying retaining that code in perpetuity. Make it Right, Make it Fast.) Backups have probably been going over six hours for years and steadily climbing.
“We” feels a lot like one or two people jumping on a grenade.
If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would guess not.
Who the fuck do you think is the intended audience for an article about an algorithm in `git bundle create`? I spent approximately two minutes of my life trying to figure out where the O(n^2) algorithm was being invoked in such a way that it influenced an exponential.
Exponential was bolded in the same sentence as a big-O. 50/50 troll/author oversight.
Using exponential in this way in any context is a faux pas, because it's highly ambiguous, and requires context for clarification. But in this situation the context clearly resolved to the mathematically accurate definition, except it was used in the other way.
Production systems running and melting here...
If you can’t agree with this, then you shouldn’t be speaking or writing, IMO.
Those who argue that words that mean different things are actually equivalent have no business dealing with language.
No, not in the exact same sentence as a big-O. That's either author error, or an intentional troll. Either way it's egg on their faces.
The fix was replacing a nested loop with a map. Figuring out that this goes from O(n^2) to O(n) (modulo details like bucket count) is immediate if you know what the words mean and understand enough to identify the problem and make the fix in the first place.
I.e. you are saying an f(n) speedup means T(n)/f(n), but others would say it means f(T(n)) or some variation of that.
Also because I've often heard that the quantum Fourier transform is an exponential speedup over the discrete Fourier transform, and there the scaling goes n^2 -> n log n.
I don't think this is meaningless or non-constructive pedantry - we're a technical community and those are technical words.
Here they are actually using it to refer to growth functions (which is rare for this error) and being honest (which is also rare IMO) but it's still wrong. They should have written about quadratic or quadratic vs linear.
Regardless sloppy language leads to sloppy thought.
It’s a very specific subset of the one you’re describing.
“Tell me you don’t know what you’re talking about without telling me you don’t know what you’re talking about.”
Just be glad you have only one.
I wasted 2 minutes of my life looking for the exponential reduction. So did many others.
Now I'm wasting more of my life shit posting about it, but at least that's a conscious choice.
This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve. Customer configures the VM for snapshots every X minutes. Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path. But, I still think the overall strategy is much more viable and robust to weird edges within git.
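A minimal sketch of turning that on from Go, assuming the mattn/go-sqlite3 driver (any SQLite binding exposes the same pragma):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver; the pragma itself is standard SQLite
)

func main() {
	db, err := sql.Open("sqlite3", "app.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// WAL keeps the main database file consistent for readers (and for
	// filesystem snapshots) while writes land in a separate -wal file.
	// The pragma returns the resulting journal mode as a single row.
	var mode string
	if err := db.QueryRow(`PRAGMA journal_mode=WAL;`).Scan(&mode); err != nil {
		log.Fatal(err)
	}
	log.Printf("journal mode is now %s", mode)
}
```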
Bingo. One of the worst problems is helping a client piece back together a corrupted repo when they are using snapshots. Check my profile to see how I know. :)
It's usually an OMG down scenario, and then you are adding in the "oh no, now the restore is corrupted."
It's fixable but it's definitely annoying.
A few months back a better solution was provided by SQLite: sqlite3_rsync
There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
File systems can be weird. Sometimes the OS can be weird and fsync type calls may not do what you expect. At least at one point MacOS fsync didn't behave the same way as Linux (i.e. Linux should ensure the write is truly done and not just in cache so long as the drive isn't lying). [0]
> There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
Gitlab has a community edition. Not handling data well would be bad for their public image.
In this case Git already had a string set, but it's still not standard so there's a good chance the original author just didn't know about it.
The original commit was made in January 2009 (https://github.com/git/git/commit/b2a6d1c6868b6d5e7d2b4fa912...), strmap was added in November 2020 (https://github.com/git/git/commit/ae20bf1ad98bdc716879a8da99..., strset was added a few days later: https://github.com/git/git/commit/1201eb628ac753af5751258466...). It was first proposed in 2018 (https://lore.kernel.org/git/20180906191203.GA26184@sigill.in... the proposal specifically mentions it fixing possibly quadratic sites).
As noted in the comment, git did have a sorted string list with bisection search, and that's from 2008 (and it actually dates back to 2006 as the "path list" API, before it was renamed following the realisation that it was a generalised string list). Though as the hashmap proposal notes, it's a bit tricky because there's a single type with functions for sorted and functions for unsorted operations; you need to know whether your list is sorted or not independently of its type.
Ironically, this is much _faster_ for small sets. Sometimes the error is intentional, because the programmer believes that all inputs will be small. IME, those programmers were wrong, but that's the inverse of survival bias.
The most troublesome cases of inefficiency in production code are almost always O(n^2).
When I plotted the runtime, I got n^5 for the fitting curve. That’s the largest polynomial I’ve encountered in the wild. Second place has always been cubic.
Their response was that it had something to do with processing config files per module, and a suggestion to rearrange them as a workaround. They fixed the problem in the next patch and the load time went from 18 hours to a couple minutes.
I have yet to finish the article, but this means they improved the complexity to something like O(log N), right? I hate when people confuse quadratic improvements with exponential ones.
Quadratic complexity sits in an awkward sweet spot: fast enough at medium-sized n to pass QA at first, but doomed to fail eventually as n grows.
And then someone learns from this experience, gets the bright idea to set up an alert for such things, but the alert doesn’t factor in things like customer base growth or feature creep slowly pushing up the expected runtime. Eventually organic load gets close to the alarm and then the fucking thing goes off on a three day weekend (why is it always a long weekend or just before one?) and then we wage war on alarm overreach and the whole cycle repeats itself.
We like to think of ourselves as blazing trails in the wilderness but most of the time we are doing laps around the parking lot.
Also moved away from Gitlab because it's so damn slow.
I know their backend git proxy is written in golang, their runner agent is written in golang, it spawns CI jobs using containerd, written in golang, and they use postgresql and a redis-esque KV -- although for that part I do believe they're still using Sidekiq (Ruby) for doing job dispatch, so that one could very easily lolol back into not actioning tasks efficiently
GitHub Actions sure does enjoy just sitting there twiddling its thumbs when I push "run job," and is also both Ruby and their own crazypants ajax-y web framework, so I don't think it's exactly the shining star of performance itself
Realtalk: They should rewrite this post's headline to be in a positive tense instead of leading with a negative word. I'm glad I read the post, because it is a cool and good fix, but I saw “Decreasing […] repo backup” and my first thought was that it was an announcement of some service downgrade like some sort of cost-cutting measure.
It will spin up a localhost server after the trace ends, the profiler uses the localhost server and nothing is shared with Firefox servers unless you explicitly choose to upload the data and create a permalink.
You record performance data with `perf`, then use the scripts there to turn it into an SVG.
https://github.com/google/pprof
go install github.com/google/pprof@latest
pprof -http=: prof.out
I normally collect the profiles with gperftools (https://github.com/gperftools/gperftools) and then just LD_PRELOAD=/usr/lib/libtcmalloc_and_profiler.so CPUPROFILE=prof.out <your command>
I've been meaning to try Samply though. Not sure if it works with pprof.

I think there have been several attempts to use S3-ish blobstores as git backends, but of the two major git hosters only one of them is MIT licensed and they don't attempt that stunt, so it's safe to assume it's not a viable approach
Uh no you didn't. Not possible. At most a polynomial reduction is possible; otherwise complexity theory needs a rewrite.
(OK, yes, k could be doing some heavy lifting here, but I doubt it.)
If you are going to quote a maths formula then please don't use "exponentially" to mean "lots".
I stopped reading there: I don't want to have to read each word and wonder if they actually meant it, or it's just bad hype.
If my barista says that his new coffee is exponentially better, it’s ok to use it colloquially.
If my git hosting provider writes about an impactful performance improvement, it’s not.
(I don't think that anyone should use "exponentially" that way: it is a term of art with a specific and particular meaning, so find another word if you mean something else! Like misusing specific legal or sporting terms...)
Instead of saying "set". The code itself uses the type "strset".
https://addons.thunderbird.net/en-US/thunderbird/addon/remov...
I used a hash to begin with, applying a simplistic digest function to the message headers I was comparing, which got me a 4-byte hash key.
That worked, but was kind of slow.
Finally, the idea came to me to not apply _any_ digesting, and use the combined concatenated headers of the messages as hash keys. About 2k bytes per hash key!
The result: About 20x perf improvement if memory serves.
How is that possible? The reason is that the code all runs in a Javascript machine; and applying the digest was not a built-in function, it was looping over the headers and doing the arithmetic. Thousands upon thousands of JS abstract machine steps. The use of the large hash key may be inefficient, but - it's just one JS object / dictionary operation, and one of the most heavily-optimized in any implementation.
While perhaps implementing low level sorts is silly, knowing which data structures to use and when, and what underlying big-o performance they have, is critical, as demonstrated here.
Wild that a pattern like this would be part of git-core, but I guess we all overlook stuff on a regular basis