AMD Open-Source 1B OLMo Language Models
78 points
2 months ago
| 4 comments
| amd.com
| HN
duchenne
2 months ago
[-]
Training a 1B model on 1T tokens is cheaper than people might think. An H100 GPU can be rented for $2.50 per hour and can train at around 63k tokens per second for a 1B model. So you would need around 4,400 GPU-hours of training, costing only about $11k. And costs will keep going down.
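
A minimal sketch of the arithmetic in this estimate. The function name `training_cost` is made up for illustration; the inputs (1T tokens, 63k tokens/sec, $2.50 per GPU-hour) are the figures quoted in the comment, not measured values.

```python
def training_cost(tokens, tokens_per_sec, usd_per_gpu_hour):
    """Return (GPU-hours, USD) for a single-GPU-equivalent training run."""
    gpu_hours = tokens / tokens_per_sec / 3600  # seconds -> hours
    return gpu_hours, gpu_hours * usd_per_gpu_hour

# Figures from the comment: 1T tokens, 63k tokens/sec, $2.50 per H100-hour.
gpu_hours, cost = training_cost(1e12, 63_000, 2.50)
print(f"{gpu_hours:,.0f} GPU-hours, ${cost:,.0f}")  # 4,409 GPU-hours, $11,023
```

This ignores anything beyond raw GPU rental (storage, failed runs, evaluation), so it is probably best read as a lower bound.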
reply
lumost
2 months ago
[-]
Is there a handy table for this? My napkin math has either underestimated throughput by 2 orders of magnitude or the above estimate is high.
reply
YetAnotherNick
2 months ago
[-]
Training an LLM requires roughly 6 * parameters * tokens FLOPs[1]. That means throughput is (FLOP/s of H100 * MFU) / (6 * parameters) tokens per second. Assuming an MFU (model FLOPs utilization) of 40%, that is (1000 * 10^12 * 0.4) / (6 * 10^9) tokens/sec ≈ 67,000 tokens/sec.

This repo[2] by Meta achieves 48% MFU, or about 80k tokens/second.

[1]: https://arxiv.org/pdf/2001.08361

[2]: https://github.com/facebookresearch/lingua
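
A small sketch of the formula above, under the same assumptions as the comment: about 6 FLOPs per parameter per token, and a peak of 1000 * 10^12 FLOP/s for an H100. The helper name `tokens_per_second` is hypothetical.

```python
def tokens_per_second(peak_flops, mfu, n_params):
    """Throughput implied by the 6 * parameters * tokens FLOPs rule of thumb."""
    return peak_flops * mfu / (6 * n_params)

H100_PEAK_FLOPS = 1000e12  # the round figure used in the comment, an assumption

print(round(tokens_per_second(H100_PEAK_FLOPS, 0.40, 1e9)))  # 66667
print(round(tokens_per_second(H100_PEAK_FLOPS, 0.48, 1e9)))  # 80000
```

Both results bracket the 63k tokens/sec figure quoted at the top of the thread, which corresponds to an MFU a little under 40% on these assumptions.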

reply
codetrotter
2 months ago
[-]
(1,000,000,000,000/63,000)/(60*60)

(1T tokens / 63k tokens per second) / (60 seconds per minute * 60 minutes per hour)

Is approx 4400 hours

So I guess that’s how the calculation went.

Or did you mean a source for the number of tokens per second?

reply
lumost
2 months ago
[-]
Tokens per second ;) I can do the arithmetic on my own.
reply
throwaway888abc
2 months ago
[-]
"Furthermore, AMD OLMo models were also able to run inference on AMD Ryzen™ AI PCs that are equipped with Neural Processing Units (NPUs). Developers can easily run Generative AI models locally by utilizing the AMD Ryzen™ AI Software."

Hope these AI PCs will also run something better than a 1B model.

What is it useful for? Spellcheck?

reply
lumost
2 months ago
[-]
The point is that AMD is doing the legwork to ensure that AI models can run on their chips. While they could settle for inference workloads (port Llama to AMD), it is unlikely that many teams will widely adopt their silicon unless it can be used across the end-to-end ML stack. Many pure OSS efforts have tried and failed to make AMD work for this use case.

As a chip maker - they will also have some undersold, QA, or otherwise wasted parts available for these training efforts - so the capex is likely less severe for them compared to a random startup betting on AMD.

reply
cyberax
2 months ago
[-]
It's amazing how NVidia became worth $3T simply because they have better drivers and CUDA.

AMD has great hardware, but they never could be assed to do anything about their software.

reply
nmstoker
2 months ago
[-]
"utilizing the AMD Ryzen™ AI Software" sounds really unappealing! It's like when companies don't realise that you think their software for leveraging the hardware is bad, and that you'd prefer to use the features via something generic.
reply
anon291
2 months ago
[-]
It's not really. Anyone who's ever done any low-level assembly coding on modern chips knows that it is already a herculean engineering effort. The idea that your customers, who are experts in machine learning models (like transformers, activation functions, etc) are going to feel comfortable with memory hierarchies, synchronization, floating point precision, etc is just crazy.
reply
cyberax
2 months ago
[-]
Yes, that's what I mean. NVidia provided easy-to-use tooling (CUDA), and made sure it JustWorks everywhere.

AMD did approximately nothing with ROCm.

Investing $10-20m of developer time into making ROCm work reliably and easily would have paid for itself 100x.

reply
almostgotcaught
2 months ago
[-]
> Investing $10-20m of developer time into making ROCm work reliably and easily would have paid for itself 100x.

I love when outsiders throw around random-ass takes like this. Just curious: how'd you come up with this number? Is it backed by literally any thought/data/roadmap?

Let's do some rough back of the envelope calculations: 20MM is 100 engineers working for 1 year. Or maybe it's 5 years of work for 20 engineers? Which one of those perspectives (if any!) sounds to you like a reasonable assessment of the gap between AMD and NVIDIA?

A quick reminder before you answer: whatever you think is actually involved in improving ROCm, unless you work on ROCm, you're almost certainly not considering an entire iceberg of complexity (runtime/driver/firmware).

Let's put it another way: forget AMD investing, I'll invest in you since you're so confident. I'll give you 20MM as a high-interest, non-dischargeable loan (say 8%) and all the runtime/driver/firmware source for AMDGPU. Up for it? All you have to do is improve ROCm such that it's competitive with CUDA and you can take home a huge slice of the TAM and you'll be rich. Easy right?

Cutting to the chase: you're off by at least two orders of magnitude on your goofy estimate; the real numbers are probably closer to 200MM invested every year for 10 years. And you still wouldn't be caught up because in those 10 years NVIDIA wasn't resting on its laurels just waiting for you to catch up!

reply
cyberax
2 months ago
[-]
> I love when outsiders throw around random-ass takes like this. Just curious: how'd you come up with this number? Is it backed by literally any thought/data/roadmap?

It's a multiple of what the TinyGrad ( https://tinygrad.org/#tinybox ) startup raised in capital. So $10-20m is absolutely reasonable, especially if you add an established HR with a hiring pipeline, established IT dept, offices, etc.

The multiplier is also easy to justify, given the stock price of NVidia and AMD.

> A quick reminder before you answer: whatever you think is actually involved in improving ROCm, unless you work on ROCm, you're almost certainly not considering an entire iceberg of complexity (runtime/driver/firmware).

Oh, I do. I've been following open-source AMD driver development for the last two decades.

And I maintain that the total investment AMD needed to make to rival NVidia in market cap would have been around that number.

> Cutting to the chase: you're off by at least two orders of magnitude on your goofy estimate; the real numbers are probably closer to 200MM invested every year for 10 years.

For an entirely new company starting from scratch? Reasonable. But AMD is not a new company, and they already are doing most of the work needed.

reply
xtreme
2 months ago
[-]
200MM/year gets you roughly 1,000 engineers at a 200k salary. Is that not enough to make the ROCm experience equal to CUDA?
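
A quick sketch of the headcount arithmetic used in this part of the thread. It assumes a flat $200k per engineer per year, which is the figure implied by "20MM is 100 engineers working for 1 year"; fully loaded costs are usually higher than salary, so real headcount would likely be lower. The function name is made up for illustration.

```python
def engineers_funded(annual_budget_usd, cost_per_engineer_usd=200_000):
    """Whole engineers a yearly budget covers at a flat per-head cost."""
    return annual_budget_usd // cost_per_engineer_usd

print(engineers_funded(200_000_000))  # 1000
print(engineers_funded(20_000_000))   # 100
```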
reply
anon291
2 months ago
[-]
Considering that each kernel / kernel size is usually custom tuned on NVIDIA, I'd say no. Working in this field at several different companies, there are likely thousands of hand-tuned variations of a simple GEMM kernel. Each one required an engineer to look at specifically, even if they're all variations on a common theme.

As far as I know (and again, I work in the field of AI compilers), we're still a ways off from complete end-to-end generation of highly optimized kernels. If you want it to go fast, you need to write it by hand [1], and then test and validate.

Moreover, chip makers are constantly adding new features (Tensor Cores in NVIDIA for example), so the compiler is always playing catch up and at some point an engineer has to sit down (likely a team of them) and think 'what's the best way to exploit this hardware functionality for software performance?'. Then they have to test and validate that, and then either write a kernel, or attempt to put that know-how into a compiler.

Multiply this by the number of kernels in a typical suite, and... yeah.

And that was my point about herculean effort on modern chips. Assembly language isn't just the old 'Add register 1 and 2 and dump in R3' anymore. It's 'Use this instruction to access memory in this way, so that it's in a compatible format for the next instruction' and 'oh yeah, make sure your memory synchronization primitives are such that the whole thing is coherent'. Good luck!

Even going one step up into a higher-level language, you have to know how the kernel gets compiled to make it worthwhile. Again, it is trivial to write a correct OpenCL matrix multiply, but that's never going to be the highest performance. You have to know the hardware intimately. This is where having the software co-designed with hardware is very important. Basically, every AI chipmaker of any importance does this, including the startups, like Groq and Cerebras.

[1] A lot of kernels share basic patterns, so it's not as hard as it sounds, but it definitely requires engineering effort to get the design right.
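
To illustrate the "correct is trivial, fast is not" point, here is a sketch in plain Python rather than OpenCL (an assumption made only to keep the example self-contained). Both functions compute the same product; they differ only in loop order, which is the kind of memory-access decision that matters on real hardware. Nothing here is tuned for any particular chip.

```python
def matmul_naive(a, b):
    """Textbook i-j-p loop order: correct, but strides down columns of b."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c

def matmul_ikj(a, b):
    """Same result with i-p-j order, which walks rows of b and c contiguously."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for p in range(k):
            a_ip = a[i][p]
            row_b = b[p]
            for j in range(m):
                row_c[j] += a_ip * row_b[j]
    return c

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_naive(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
print(matmul_ikj(a, b))    # [[19.0, 22.0], [43.0, 50.0]]
```

On a GPU the equivalent choices involve tiling, shared memory, and the specific matrix instructions available, which is where the hardware knowledge described above comes in.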

reply
almostgotcaught
2 months ago
[-]
> Considering that each kernel / kernel size is usually custom tuned on NVIDIA, I'd say no. Working in this field at several different companies, there are likely thousands of hand-tuned variations of a simple GEMM kernel. Each one required an engineer to look at specifically, even if they're all variations on a common theme.

Lol that's absolutely not true. What you're describing is literally impossible for any company that has more than one product family on the market since each product has different scratch sizes, number of vector registers, data types supported/emulated etc.

Outside of trade show demos, kernels are codegened. What is true is there are recurring "themes/patterns" that are handled by engineers for a class of products. Lately this is flash attention...

> Again, it is trivial to write a correct opencl matrix multiply, but that's never going to be the highest performance.

I guess you work at AMD. The reason AMD ships a whole bunch of binary kernels is not because someone tuned/designed each one but because AMD doesn't have a PTX/SASS equivalent. So each kernel has to be compiled at build time for each device (it's also why they can't have LTS support for architectures).

reply
anon291
2 months ago
[-]
> outside of trade show demos, kernels are codegened. What is true is there are recurring "themes/patterns" that are handled by engineers for a class of products. Lately this is flash attention

I never said they weren't using code generation. I said that each one requires a manual tune. You will set various parameters, determine if the generated code does well enough and then if there's performance to squeeze out, you modify the code generator.

> I guess you work at AMD.

Close but not quite

reply
almostgotcaught
2 months ago
[-]
> that each one requires a manual tune

Ya definitely not - everyone uses grid search or whatever latest BPO tuning strategy.
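
For readers unfamiliar with the idea, a minimal sketch of grid-search tuning: time one parameterized kernel for each candidate setting and keep the fastest. The blocked matmul, the `tile` parameter, and the candidate list are all illustrative assumptions. This runs in pure Python on the CPU, so the timings say nothing about GPU behaviour; only the shape of the search loop carries over.

```python
import time

def matmul_blocked(a, b, tile):
    """Blocked matrix multiply; `tile` is the tunable block size."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c

def grid_search(a, b, tiles, repeats=3):
    """Time every candidate tile size; return the fastest and all timings."""
    timings = {}
    for tile in tiles:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            matmul_blocked(a, b, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best
    return min(timings, key=timings.get), timings

n = 32
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
best_tile, timings = grid_search(a, b, tiles=[4, 8, 16, 32])
print(best_tile, timings)
```

The search itself is cheap to write; the disagreement in this thread is about how much engineering goes into the kernel template and parameter space being searched.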

reply
anon291
2 months ago
[-]
Oh right. Those require no engineering effort because I said so.
reply
almostgotcaught
2 months ago
[-]
They require one person or team to engineer and then a whole bunch of people to use...? That doesn't resemble in the least what you were describing where each kernel is hand-tuned for each shape and device. But please do continue to insist you're still somehow right
reply
anon291
2 months ago
[-]
Sigh. I used to be an engineering manager for a kernels team. I think I know what I'm talking about. Yes, each kernel is paid individual attention, even if many are basically the same and require little rework. It's a lot of work. Now I work as an IC in the same field. I don't need to insist I'm right because it's what I do all day.
reply
almostgotcaught
2 months ago
[-]
> I think I know what I'm talking about

I've refuted your claims point by point and all you can say is "trust me bro"? Cool but I think you do not in fact know what you're talking about.

reply
lumost
1 month ago
[-]
I work in DB kernels. Everything gets hand-tuned, as there is an economic reason for hand-tuning it. The expectation in many of these systems is that there are no wasted cycles. You can codegen a decent kernel, but then someone will find a better approach - do you want the slower version of the product, or the faster one?

You can see this in action with matrix math libraries, folks have been hand tuning those for decades at this point.

reply
anon291
2 months ago
[-]
Your claim is that there are automated methods (which I mentioned in my original post) to manage the complexity. My claim is that it requires a large team of engineers working on it. I'm not really sure what you think you've refuted.
reply
binary132
2 months ago
[-]
1000 engineers don’t automatically crank out 50x more code than 20 engineers. But GP is just saying there are a lot of subcomponents involved that each need major engineering effort dedicated to them.

I see it less as an engineering problem and more as a market problem. AMD stuff has existed, it’s the market that doesn’t see a point in it, and at this point, even feature parity or CUDA compatibility for that matter won’t make a huge dent. People will just keep using what they know and are recommended.

It’s more amazing to me that NVDA is so intensely inflated by this LLM hype wave. I find it genuinely scary to think about what’s going to happen when 95+% of AI slopware startups fold. Nvidia won’t be the only company financially impacted. Our entire economy runs on fads.

reply
razodactyl
2 months ago
[-]
I appreciate this comment keeping us in line.
reply
almostgotcaught
2 months ago
[-]
The vast majority of comments/responses on HN are worthless hot takes, chest thumping, and speculation like what I responded to. It's impossible to keep it in check because you have to be "open-minded" and "steel man" and whatever. Any minute now dang will show up and reprimand me for breaking one of the scriptural rules.
reply
anon291
2 months ago
[-]
Oh, I guess I was responding to the "It's amazing" part. AMD sells a car without a steering wheel. NVIDIA sells one with a steering wheel, and it's not really amazing that people prefer that one (in my opinion at least).
reply
teleforce
2 months ago
[-]
Never underestimate the developer ecosystem. Ballmer famously shouted "developers" over and over at one of the Microsoft Windows conferences, and now he's one of the richest people in the world. Microsoft also went out of its way to introduce WSL for running Linux alongside Windows when it realized that the majority of the OSes running on its Azure cloud are Linux.
reply
princearthur
2 months ago
[-]
Some use cases require a small memory footprint, e.g. parallel inferences. I suppose there are also dark patterns like tracking, where you don't want the load to stand out.
reply
Havoc
2 months ago
[-]
It’s less the size of the model and more the memory throughput and NPU TOPS that are the limiting factors for this class of device.

Which means you can run larger models, but they’ll get ever slower.

reply
sireat
2 months ago
[-]
Baby steps, but how useful is a 1B model these days?

It seems actual domain specific usefulness (say specific programming language, translation, etc) starts at 3B models.

reply
adt
2 months ago
[-]
reply