How We Found 7 TiB of Memory Just Sitting Around

201 points

by anurag

2 days ago

| 6 comments

| render.com

▲

Aeolun

1 day ago

[-]

I read this and I have to wonder, did anyone ever think it was reasonable that a cluster that apparently needed only 120GB of memory was consuming 1.2TB just for logging (or whatever Vector does)?

▲

devjab

1 day ago

[-]

We're a much smaller scale company and the cost we lose on these things is insignificant compared to what's in this story. Yesterday I was improving the process for creating databases in our Azure and I stumbled upon a subscription which was running 7 MSSQL servers for 12 databases. These weren't elastic and they were each paying a license that we don't have to pay because we qualify for the base cost through our contract with our Microsoft partner. This company has some of the tightest control over their cloud infrastructure out of any organisation I've worked with.

This is anecdotal, but if my experiences aren't unique then there is a real lack of reasonableness in DevOps.

▲

ffsm8

1 day ago

[-]

Isn't that mostly down to the fact the vast majority of devs explicitly don't want to do anything wrt Ops?

DevOps has - ever since its originally well-meaning inception (by Netflix iirc?) - been implemented across our industry as an effective cost cutting measure, forcing devs that didn't see it as their job to also handle it.

Which consequently means they're not interfacing with it whatsoever. They do as little as they can get away with, which inevitably means things are being done with borderline malicious compliance... Or just complete incompetence.

I'm not even sure I'd blame these devs in particular. The devs just saw it as a quick bonus generator for the MBA in charge of this rebranding while offloading more responsibilities onto their shoulders.

DevOps made total sense in the work culture where this concept was conceived - Netflix was well known at that point to only ever employ senior devs. However, in the context of the average 9-5 dev, who often knows a lot less than even some enthusiastic juniors... Let's just say that it's incredibly dicey whether it's successful in practice.

▲

mustyoshi

23 hours ago

[-]

I politely disagree. I spent maybe 8 hours over a week rightsizing a handful of heavy deployments from a previous team and reduced their peak resource usage by implementing better scaling policies. Before the new scaling policy the service would scale out and new pods would remain idle and ultimately get terminated without ever responding to a request quite frequently.

The service dashboards already existed, all I had to do was a bit of load testing and read the graphs.

It's not too much extra work to make sure you're scaling efficiently.

▲

ffsm8

22 hours ago

[-]

You disagree but then cite another example of low-hanging fruit that nobody took action on until you came along?

Did you accidentally respond to the wrong comment? Because if anything you're giving another example of "most devs not wanting to interface with ops, hence letting it slide until someone bothers to pick up their slack"...

▲

FroshKiller

16 hours ago

[-]

The first time my director asked me if I'd ever heard of DevOps, I said, "Sure, doing two jobs for one paycheck." I'm a software developer, buddy. I write the programs. Leave me out of running them.

▲

jiggawatts

9 hours ago

[-]

> Leave me out of running them.

This is how customers end up with too-expensive Rube Goldberg machines.

You have to take some interest in how your code will run in production, even if you don't personally "operate" it.

▲

bstack

1 day ago

[-]

Author here: You’d be surprised what you don’t notice given enough nodes and slow enough resource growth over time! Out of the total resource usage in these clusters even at the high water mark for this daemonset it was still a small overall portion of the total.

▲

Aeolun

1 day ago

[-]

I’m not sure if that makes it better or worse.

▲

antoniojtorres

13 hours ago

[-]

It seems realistic to me, commonplace even. Lots to do in a company like this one.

▲

embedding-shape

23 hours ago

[-]

I didn't know what Render was when I skimmed the article at first, but after reading these comments, I had to check out what they do.

And they're a "Cloud Application Platform" meaning they manage deploys and infrastructure for other people. Their website says "Click, click, done." which is cool and quick and all, but to me it's kind of crazy that an organization that should be really engineering-focused and mature doesn't immediately notice 1.2TB being used and try to figure out why, when 120GB ended up being sufficient.

It gives much more of a "We're a startup, we're learning as we're running" vibe which again, cool and all, but hardly what people should use for hosting their own stuff on.

▲

fock

1 day ago

[-]

how large are the clusters then?

▲

fock

1 day ago

[-]

we have on-prem with heavy spikes (our batch workload can utilize the 20TB of memory in the cluster easily) and we just don't care much and add 10% every year to the hardware requested. Compared to employing people or paying other vendors (relational databases with many TB-sized tables...) this is just irrelevant.

Sadly devs are incentivized by that and going towards the cloud might be a fun story. Given the environment I hope they scrap the effort sooner rather than later, buy some Oxide systems for the people who need to iterate faster than the usual process of getting a VM and replace/reuse the 10% of the company occupied with the cloud (mind you: no real workload runs there yet...) to actually improve local processes...

▲

g-mork

17 hours ago

[-]

Somewhat unrelated, but you just tied wasteful software design to high IT salaries, and also suggested a reason why Russian programmers might seem to be, on the whole, far more effective than we are

I wonder if msft simply cut dev salaries by 50% in the 90s, would it have had any measurable effect on windows quality by today

▲

formerly_proven

1 day ago

[-]

It probably doesn't help that the first line of treatment for any error is to blindly increase memory request/limit and claim it's fixed (preferably without looking at the logs once).

▲

nitinreddy88

2 days ago

[-]

The other way to look at it is to ask why adding the namespace label causes such a large memory footprint in Kubernetes. Wouldn't fixing that (which could be a much bigger design change) benefit the whole Kube community?

▲

bstack

1 day ago

[-]

Author here: yeah that's a good point. tbh I was mostly unfamiliar with Vector so I took the shortest path to the goal but that could be interesting followup. It does seem like there's a lot of bytes per namespace!

▲

stackskipton

1 day ago

[-]

You mentioned in the blog article that it's doing list-watch. A list-watch registers with the Kubernetes API to get a list of all objects AND a notification whenever any object you have registered for changes. A bunch of Vector pods saying "Hey, send me a notification when anything with namespaces changes" and poof goes your memory, keeping track of who needs to know what.
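To make the fan-out concrete, here is a minimal sketch of the list-then-watch pattern described above. It is a toy in-memory model, not the Kubernetes client API: `ToyApiServer`, `ToyAgent`, and `total_cached_entries` are hypothetical names invented for illustration. It only shows that when every agent watches every namespace, the number of cached entries grows as agents times namespaces.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Stand-in for one per-node log agent (hypothetical)."""
    cache: dict = field(default_factory=dict)


@dataclass
class ToyApiServer:
    """Toy stand-in for an API server; not the real Kubernetes API."""
    namespaces: dict = field(default_factory=dict)
    watchers: list = field(default_factory=list)

    def list_and_watch(self, agent: ToyAgent) -> None:
        # LIST: hand the agent a full snapshot of the current objects.
        agent.cache = {name: dict(labels) for name, labels in self.namespaces.items()}
        # WATCH: register the agent for every later change.
        self.watchers.append(agent)

    def upsert_namespace(self, name: str, labels: dict) -> None:
        self.namespaces[name] = labels
        # Every registered watcher receives its own copy of every event.
        for agent in self.watchers:
            agent.cache[name] = dict(labels)


def total_cached_entries(num_agents: int, num_namespaces: int) -> int:
    """Count namespace entries held across all agents after a full sync."""
    server = ToyApiServer()
    agents = [ToyAgent() for _ in range(num_agents)]
    for agent in agents:
        server.list_and_watch(agent)
    for i in range(num_namespaces):
        server.upsert_namespace(f"ns-{i}", {"team": "example"})
    return sum(len(agent.cache) for agent in agents)
```

In this model, doubling either the number of agents or the number of namespaces doubles the total cached state, which is the shape of growth the comment is pointing at.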

At this point, I wonder if instead of relying on daemonsets, you just gave every namespace a vector instance that was responsible for that namespace and pods within. ElasticSearch or whatever you pipe logging data to might not be happy with all those TCP connections.

Just my SRE brain thoughts.

▲

fells

1 day ago

[-]

>you just gave every namespace a vector instance that was responsible for that namespace and pods within.

Vector is a daemonset, because it needs to tail the log files on each node. A single vector per namespace might not reside on the nodes that each pod is on.

▲

stackskipton

20 hours ago

[-]

I think DaemonSet is to reduce network load so Vector is not pulling log files over the network.

We run Vector as a DaemonSet as well but we don't have a ton of namespaces. Render sounds like they have a ton of namespaces running maybe one or two pods since their customers are much smaller. This is probably a much more niche setup than that of many Kubernetes users.

▲

ahoka

19 hours ago

[-]

That's where the design is wrong.

▲

liampulles

22 hours ago

[-]

I'm a little surprised that it got to the point where pods which should consume a couple MB of RAM were consuming 4GB before action was taken. But I can also kind of understand it, because the way k8s operators (apps running in k8s that manipulate k8s resources) are meant to run is essentially a loop of listing resources, comparing to spec, and making moves to try and bring the state of the cluster closer to spec. This reconciliation loop is simple to understand (and I think this benefit has led to the creation of a wide array of excellent open source and proprietary operators that can be added to clusters). But it's also a recipe for cascading explosions in resource usage.

These kinds of resource explosions are something I see all the time in k8s clusters. The general advice is to always try and keep pressure off the k8s API, and the consequence is that one must be very minimal and tactical with the operators one installs, and then engage in many hours of work trying to fine tune each operator to run efficiently (e.g. Grafana, whose default helm settings do not use the recommended log indexing algorithm, and which needs to be tweaked to get an appropriate set of read vs. write pods for your situation).
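The reconciliation loop described above can be sketched in a few lines. This is a simplified illustration, not code from any real operator: `reconcile` is a hypothetical function that compares a desired spec against observed state and returns the actions needed to converge.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions that move `actual` toward `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# One pass of the loop: list resources, compare to spec, emit corrections.
desired_state = {"a": 1, "b": 2}
observed_state = {"b": 3, "c": 4}
plan = reconcile(desired_state, observed_state)
```

Each pass needs the full observed state as input, which is why an operator that re-lists everything on every pass puts pressure on the API server.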

Again, I recognize there is a tradeoff here - the simplicity and openness of the k8s API is what has led to a flourish of new operators, which really has allowed one to run "their own cloud". But there is definitely a cost. I don't know what the solution is, and I'm curious to hear from people who have other views of it, or use other solutions to k8s which offer a different set of tradeoffs.

▲

never_inline

18 hours ago

[-]

> are meant to run is essentially a loop of listing resources, comparing to spec, and making moves to try and bring the state of the cluster closer to spec.

Aren't they supposed to use watch/long polling?

▲

shanemhansen

2 days ago

[-]

The unreasonable effectiveness of profiling and digging deep strikes again.

▲

hinkley

1 day ago

[-]

The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried. Sadly I think flame graphs made profiling more accessible to the unmotivated but didn’t actually improve overall results.

▲

Negitivefrags

1 day ago

[-]

I think the biggest tool is higher expectations. Most programmers really haven't come to grips with the idea that computers are fast.

If you see a database query that takes 1 hour to run, and only touches a few gb of data, you should be thinking "Well nvme bandwidth is multiple gigabytes per second, why can't it run in 1 second or less?"
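That back-of-the-envelope check is easy to write down. A minimal sketch, assuming hypothetical figures (4 GB of data and a drive sustaining 4 GB/s); it gives a lower bound for one sequential read and ignores CPU work, indexing, and caching.

```python
def min_scan_seconds(data_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time to read `data_gb` once at a sustained bandwidth."""
    return data_gb / bandwidth_gb_per_s


# Hypothetical figures: 4 GB of data on a drive sustaining 4 GB/s.
floor_s = min_scan_seconds(4.0, 4.0)

# How far a 1-hour query sits above that floor.
slowdown = 3600.0 / floor_s
```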

The idea that anyone would accept a request to a website taking longer than 30ms, (the time it takes for a game to render its entire world including both the CPU and GPU parts at 60fps) is insane, and nobody should really accept it, but we commonly do.

▲

azornathogron

1 day ago

[-]

Pedantic nit: At 60 fps the per frame time is 16.66... ms, not 30 ms. Having said that a lot of games run at 30 fps, or run different parts of their logic at different frequencies, or do other tricks that mean there isn't exactly one FPS rate that the thing is running at.

▲

Negitivefrags

1 day ago

[-]

The CPU part happens on one frame, the GPU part happens on the next frame. If you want to talk about the total time for a game to render a frame, it needs to count two frames.
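The two figures in this exchange fit together once the pipeline depth is counted. A small sketch of the arithmetic, assuming one frame per stage as the comment describes; `frame_time_ms` and `pipelined_latency_ms` are hypothetical helper names.

```python
def frame_time_ms(fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


def pipelined_latency_ms(fps: float, stages: int = 2) -> float:
    """Total time when each stage (e.g. CPU, then GPU) takes one frame."""
    return stages * frame_time_ms(fps)


per_frame = frame_time_ms(60)             # roughly 16.7 ms
cpu_plus_gpu = pipelined_latency_ms(60)   # roughly 33.3 ms over two frames
```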

▲

azornathogron

1 day ago

[-]

If latency of input->visible effect is what you're talking about, then yes, that's a great point!

▲

wizzwizz4

1 day ago

[-]

Computers are fast. Why do you accept a frame of lag? The average game for a PC from the 1980s ran with less lag than that. Super Mario Bros had less than a frame between controller input and character movement on the screen. (Technically, it could be more than a frame, but only if there were enough objects in play that the processor couldn't handle all the physics updates in time and missed the v-blank interval.)

▲

Negitivefrags

1 day ago

[-]

If Vsync is on which was my assumption from my previous comment, then if your computer is fast enough, you might be able to run CPU and GPU work entirely in a single frame if you use Reflex to delay when simulation starts to lower latency, but regardless, you still have a total time budget of 1/30th of a second to do all your combined CPU and GPU work to get to 60fps.

▲

mjevans

1 day ago

[-]

30ms for a website is a tough bar to clear considering Speed of Light (or rather electrons in copper / light in fiber)

https://en.wikipedia.org/wiki/Speed_of_light

Just as an example, round trip delay from where I rent to the local backbone is about 14ms alone, and the average for a webserver is 53ms. Just as a simple echo reply. (I picked it because I'd hoped that was in Redmond or some nearby datacenter, but it looks more likely to be in a cheaper labor area.)

However it's only the bloated ECMAScript (javascript) trash web of today that makes a website take longer than ~1 second to load on a modern PC. Plain old HTML, images on a reasonable diet, and some script elements only for interactive things can scream.

    mtr -bzw microsoft.com
    6. AS7922        be-36131-cs03.seattle.wa.ibone.comcast.net (2001:558:3:942::1)         0.0%    10   12.9  13.9  11.5  18.7   2.6
    7. AS7922        be-2311-pe11.seattle.wa.ibone.comcast.net (2001:558:3:3a::2)           0.0%    10   11.8  13.3  10.6  17.2   2.4
    8. AS7922        2001:559:0:80::101e                                                    0.0%    10   15.2  20.7  10.7  60.0  17.3
    9. AS8075        ae25-0.icr02.mwh01.ntwk.msn.net (2a01:111:2000:2:8000::b9a)            0.0%    10   41.1  23.7  14.8  41.9  10.4
    10. AS8075        be140.ibr03.mwh01.ntwk.msn.net (2603:1060:0:12::f18e)                  0.0%    10   53.1  53.1  50.2  57.4   2.1
    11. AS8075        2603:1060:0:10::f536                                                   0.0%    10   82.1  55.7  50.5  82.1   9.7
    12. AS8075        2603:1060:0:10::f3b1                                                   0.0%    10   54.4  96.6  50.4 147.4  32.5
    13. AS8075        2603:1060:0:10::f51a                                                   0.0%    10   49.7  55.3  49.7  78.4   8.3
    14. AS8075        2a01:111:201:f200::d9d                                                 0.0%    10   52.7  53.2  50.2  58.1   2.7
    15. AS8075        2a01:111:2000:6::4a51                                                  0.0%    10   49.4  51.6  49.4  54.1   1.7
    20. AS8075        2603:1030:b:3::152                                                     0.0%    10   50.7  53.4  49.2  60.7   4.2

▲

hinkley

16 hours ago

[-]

In the cloud era this gets a bit better but at my last job I removed a single service that was adding 30ms to response time and replaced it with a consul lookup with a watch on it. It wasn’t even a big service. Same DC, very simple graph query with a very small response. You can burn through 30 ms without half trying.

▲

javier2

1 day ago

[-]

It's also about cost. My game computer has 8 cores + 1 expensive GPU + 32GB RAM for me alone. We don't have that per customer.

▲

oivey

1 day ago

[-]

This is again a problem understanding that computers are fast. A toaster can run an old 3D game like Quake at hundreds of FPS. A website primarily displaying text should be way faster. The reasons websites often aren’t have nothing to do with the user’s computer.

▲

paulryanrogers

1 day ago

[-]

That's a dedicated toaster serving only one client. Websites usually aren't backed by bare metal per visitor.

▲

oivey

1 day ago

[-]

Right. I’m replying to someone talking about their personal computer.

▲

Aeolun

1 day ago

[-]

If your websites take less than 16ms to serve, you can serve 60 customers per second with that. So you sorta do have it per customer?

▲

vlovich123

1 day ago

[-]

That’s per core assuming the 16ms is CPU bound activity (so 100 cores would serve 100 customers). If it’s I/O you can overlap a lot of customers since a single core could easily keep track of thousands of in flight requests.
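Both cases can be put into numbers. A rough sketch, not a capacity model: the CPU-bound case divides available core time by the per-request cost, and the I/O-bound case uses Little's law (average requests in flight equals arrival rate times time in system). The function names and the example rates are hypothetical.

```python
def max_requests_per_second(cores: int, cpu_ms_per_request: float) -> float:
    """Throughput ceiling when each request burns CPU for its whole duration."""
    return cores * 1000.0 / cpu_ms_per_request


def requests_in_flight(arrivals_per_second: float, latency_ms: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return arrivals_per_second * latency_ms / 1000.0


one_core = max_requests_per_second(1, 16.0)        # 62.5 requests/s
hundred_cores = max_requests_per_second(100, 16.0)  # 6250 requests/s

# Hypothetical I/O-bound service: 10,000 requests/s, each waiting 200 ms.
waiting = requests_in_flight(10_000, 200.0)         # 2000 requests in flight
```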

▲

OJFord

1 day ago

[-]

With a latency of up to 984ms

▲

javier2

12 hours ago

[-]

I'm just saying that we don't have gaming PC specs per customer to chug through that 7GB of data for every request in 30ms

▲

avidiax

1 day ago

[-]

It's also about revenue.

Uber could run the complete global rider/driver flow from a single server.

It doesn't, in part because all of those individual trips earn $1 or more each, so it's perfectly acceptable to the business to be more inefficient and use hundreds of servers for this task.

Similarly, a small website taking 150ms to render the page only matters if the lost productivity costs more than the engineering time to fix it, and even then, fixing it only makes sense if that engineering time isn't more productively used to add features or reliability.

▲

hinkley

17 hours ago

[-]

Practically, you have to parcel out points of contention to a larger and larger team to stop them from spending 30 hours a week just coordinating for changes to the servers. So the servers divide to follow Conway’s Law, or the company goes bankrupt (why not both?).

Microservices try to fix that. But then you need bin packing so microservices beget kubernetes.

▲

onethumb

1 day ago

[-]

Uber could not run the complete global rider/driver flow from a single server.

▲

avidiax

18 hours ago

[-]

I'm saying you can keep track of all the riders and drivers, matchmake, start/progress/complete trips, with a single server, for the entire world.

Billing, serving assets like map tiles, etc. not included.

Some key things to understand:

* The scale of Uber is not that high. A big city surely has < 10,000 drivers simultaneously, probably less than 1,000.

* The driver and rider phones participate in the state keeping. They send updates every 4 seconds, but they only have to be online to start a trip. Both mobiles cache a trip log that gets uploaded when network is available.

* Since driver/rider send updates every 4 seconds, and since you don't need to be online to continue or end a trip, you don't even need an active spare for the server. A hot spare can rebuild the world state in 4 seconds. State for a rider and driver is just a few bytes each for id, position and status.

* Since you'll have the rider and driver trip logs from their phones, you don't necessarily have to log the ride server side either. It's also OK to lose a little data on the server. You can use UDP.

Don't forget that in the olden times, all the taxis in a city like New York were dispatched by humans. All the police in the city were dispatched by humans. You can replace a building of dispatchers with a good server and mobile hardware working together.
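The sizing argument in the bullets above can be checked with quick arithmetic. A minimal sketch, assuming hypothetical inputs: the participant count and bytes per record below are illustrative and are not Uber's numbers. Only the 4-second update interval comes from the comment.

```python
def world_state_bytes(participants: int, bytes_per_record: int) -> int:
    """Memory needed to hold one small record per online rider or driver."""
    return participants * bytes_per_record


def updates_per_second(participants: int, interval_s: float) -> float:
    """Inbound update rate if every participant reports once per interval."""
    return participants / interval_s


# Hypothetical inputs, for illustration only.
PARTICIPANTS = 5_000_000   # riders + drivers online at once (assumed)
BYTES_PER_RECORD = 32      # id, position, status (assumed)
UPDATE_INTERVAL_S = 4.0    # update interval stated in the comment

state_mb = world_state_bytes(PARTICIPANTS, BYTES_PER_RECORD) / 1_000_000
rate = updates_per_second(PARTICIPANTS, UPDATE_INTERVAL_S)
```

Under these assumptions the state fits comfortably in memory, and the update rate, rather than the state size, is the number that would decide whether one machine is enough.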

▲

hinkley

17 hours ago

[-]

You could envision a system that used one server per county and that’s 3k servers. Combine rural counties to get that down to 1000, and that’s probably fewer servers than Uber runs.

What the internet will tell me is that Uber has 4500 distinct services, which is more services than there are counties in the US.

▲

exe34

1 day ago

[-]

I believe the argument was that somebody competent could do it.

▲

lazide

10 hours ago

[-]

The reality is that, no, that is not possible. If a single core can render and return a web page in 16ms, what do you do when you have a million requests/sec?

The reality is most of those requests (now) get mixed in with a firehose of traffic, and could be served much faster than 16ms if that is all that was going on. But it’s never all that is going on.

▲

hinkley

1 day ago

[-]

Lowered expectations come in part from people giving up on theirs. Accepting versus pushing back.

▲

antonymoose

1 day ago

[-]

I have high hopes and expectations, unfortunately my chain of command does not, and is often an immovable force.

▲

hinkley

1 day ago

[-]

This is a terrible time to tell someone to find a movable object in another part of the org or elsewhere. :/

I always liked Shaw’s “The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.”

▲

zahlman

1 day ago

[-]

> The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

The sympathy is also needed. Problems aren't found when people don't care, or consider the current performance acceptable.

> There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried.

It's hard for profilers to identify slowdowns that are due to the architecture. Making the function do less work to get its result feels different from determining that the function's result is unnecessary.

▲

hinkley

1 day ago

[-]

Architecture, cache eviction, memory bandwidth, thermal throttling.

All of which have gotten perhaps an order of magnitude worse in the time since I started on this theory.

▲

hinkley

1 day ago

[-]

And Amdahl’s Law. Perf charts will complain about how much CPU you’re burning in the parallel parts of code and ignore that the bottleneck is down in 8% of the code that can’t be made concurrent.
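For reference, Amdahl's Law with an 8% serial portion works out as below. A small sketch; `amdahl_speedup` and `amdahl_limit` are hypothetical helper names, and the 16-worker figure is just an example.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Speedup when only the non-serial part scales with worker count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound on speedup as the worker count grows without limit."""
    return 1.0 / serial_fraction


with_16_workers = amdahl_speedup(0.08, 16)  # about 7.3x
ceiling = amdahl_limit(0.08)                # 12.5x, however many cores are added
```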

▲

zahlman

19 hours ago

[-]

I meant architecture of the codebase, to be clear. (I'm sure that the increasing complexity of hardware architecture makes it harder to figure out how to write optimal code, but it isn't really degrading the performance of naive attempts, is it?)

▲

hinkley

17 hours ago

[-]

The problem Windows had during its time of fame is the developers always had the fastest machines money could buy. That shortened the code-build-test cycle for them, but it also made it difficult for the developers to visualize how their code would run on normal hardware. Add the general lack of empathy inspired by their toxic corporate culture of “we are the best in the world” and it’s small wonder that Windows 95 and 98 ran more and more like dogshit on older hardware.

My first job out of college, I got handed the slowest machine they had. The app was already half done and was dogshit slow even with small data sets. I was embarrassed to think my name would be associated with it. The UI painted so slowly I could watch the individual lines paint on my screen.

My friend and I in college had made homework into a game of seeing who could make their homework assignment run faster or use less memory. Such as calculating the Fibonacci of 100, or 1000. So I just started applying those skills and learning new ones.

For weeks I evaluated improvements to the code by saying “one Mississippi, two Mississippi”. Then how many syllables I got through. Then the stopwatch function on my watch. No profilers, no benchmarking tools, just code review.

And that’s how my first specialization became optimization.

▲

jesse__

1 day ago

[-]

Broadly agree.

I'm curious, what're the profilers you know of that tried to be better? I have a little homebrew game engine with an integrated profiler that I'm always looking for ideas to make more effective.

▲

hinkley

1 day ago

[-]

Clinic.js tried and lost steam. I have a recollection of a profiler called JProfiler that represented space and time as a graph, but also a recollection they went under. And there is a company selling a product of that name that has been around since that time, but doesn’t quite look how I recalled and so I don’t know if I was mistaken about their demise or I’ve swapped product names in my brain. It was 20 years ago which is a long time for mush to happen.

The common element between attempts is new visualizations. And like drawing a projection of an object in a mechanical engineering drawing, there is no one projection that contains the entire description of the problem. You need to present several and let the brain synthesize the data missing in each individual projection into an accurate model.

▲

never_inline

18 hours ago

[-]

what do you think about speedscope's sandwich view?

▲

hinkley

17 hours ago

[-]

More of the same. JetBrains has an equivalent, though it seems to be broken at present. The sandwich keeps dragging you back to the flame graph. Call stack depth has value but width is harder for people to judge and it’s the wrong yardstick for many of the concerns I’ve mentioned in the rest of this thread.

The sandwich view hides invocation count, which is one of the biggest things you need to look at for that remaining 3x.

Also you need to think about budgets. Which is something game designers do and the rest of us ignore. Do I want 10% of overall processing time to be spent accessing reloadable config? Reporting stats? If the answer is no then we need to look at that, even if data retrieval is currently 40% of overall response time and we are trying to get from 2 seconds to 200 ms.

That means config and stats have a budget of 20ms each and you will never hit 200ms if someone doesn’t look at them. So you can pretend like they don’t exist until you get all the other tent poles chopped and then surprise pikachu face when you’ve already painted them into a corner with your other changes.
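The budgeting exercise can be sketched as a simple check. The 200 ms target and the 10% shares come from the comment; the measured values and the function names (`budgets_ms`, `over_budget`) are hypothetical.

```python
def budgets_ms(target_ms: float, shares: dict) -> dict:
    """Translate each component's share of the target into a time budget."""
    return {name: target_ms * share for name, share in shares.items()}


def over_budget(measured_ms: dict, budget: dict) -> list:
    """Names of components whose measured time exceeds their budget."""
    return sorted(
        name for name, ms in measured_ms.items() if ms > budget.get(name, 0.0)
    )


# From the comment: a 200 ms target with 10% each for config and stats.
budget = budgets_ms(200.0, {"config": 0.10, "stats": 0.10})

# Hypothetical measurements taken while the response still takes 2 seconds.
measured = {"config": 35.0, "stats": 12.0}
offenders = over_budget(measured, budget)
```

At 2 seconds overall, 35 ms of config access looks like noise in a profile, but against the 20 ms budget it is already a problem, which is the point about not ignoring the small items until the end.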

When we have a lot of shit that all needs to get done, you want to get to transparency, look at the pile and figure out how to do it all effectively. Combine errands and spread the stressful bits out over time. None of the tools and none of the literature supports this exercise, and in fact most of the literature is actively hostile to this exercise. Which is why you should read a certain level of reproval or even contempt in my writing about optimization. It’s very much intended.

Most advice on writing fast code has not materially changed for a time period where the number of calculations we do has increased by 5 orders of magnitude. In every other domain, we re-evaluate our solutions at each order of magnitude. We have marched past ignorant and into insane at this point. We are broken and we have been broken for twenty years.

▲

never_inline

3 hours ago

[-]

I would like to know where I can read more in depth about profiling and performance analysis techniques.

▲

hinkley

1 day ago

[-]

Keys require O(logn) space per key or nlogn for the entire data set, simply to avoid key collisions. But human friendly key spaces grow much, much faster and I don’t think many people have looked too hard at that.
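The lower bound is easy to compute. A minimal sketch, assuming keys only need to be distinct; `min_key_bits` is a hypothetical helper, and the 40-byte average name length is an illustrative guess, not a measurement.

```python
def min_key_bits(n_keys: int) -> int:
    """Fewest bits per key that can give n_keys distinct values."""
    return max(1, (n_keys - 1).bit_length())


n = 1_000_000
bits_per_key = min_key_bits(n)       # 20 bits
total_bits = n * bits_per_key        # n log n for the whole set

# Hypothetical human-friendly names averaging 40 bytes each.
friendly_bits_per_key = 40 * 8
overhead_factor = friendly_bits_per_key / bits_per_key
```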

There were recent changes to the NodeJS Prometheus client that eliminate tag names from the keys used for storing the tag cardinality for metrics. The memory savings weren’t reported but the CPU savings for recording data points were over 1/3. And about twice that when applied to the aggregation logic.

Lookups are rarely O(1), even in hash tables.

I wonder if there’s a general solution for keeping names concise without triggering transposition or reading comprehension errors. And what the space complexity is of such an algorithm.

▲

vlovich123

1 day ago

[-]

Why aren’t let’s just 128bit UUIDs? Those are guaranteed to be globally unique and don’t require so much space.

▲

hinkley

1 day ago

[-]

Why aren’t what 128bit UUIDs?

> keeping names concise without triggering transposition or reading comprehension errors.

Code that doesn’t work for developers first will soon cease to work for anyone. Plus how do you look up a uuid for a set of tags? What’s your perfect hash plan to make sure you don’t misattribute stats to the wrong place?

UUIDs are entirely opaque and difficult to tell apart consistently.

▲

timzaman

7 hours ago

[-]

7tib.. that's like 3 servers..