SEQUOIA: Exact Llama2-70B on an RTX4090 with half-second per-token latency
131 points
13 days ago
| 7 comments
| infini-ai-lab.github.io
| HN
spxneo
13 days ago
[-]
This is quite worrying for OpenAI: the rate at which token prices have been plummeting thanks to Meta means it's going to have to keep cutting its prices while capex remains flat. Whatever Sam says in interviews, just think the opposite and the whole picture comes together.

It's almost a mathematical certainty that people who invested in OpenAI will need to reincarnate in multiple universes to ever see that money again, but no bother: many are probably NVIDIA stockholders, which evens out the damage.

reply
jiggawatts
13 days ago
[-]
There’s a Pareto frontier where Meta is pushing out the boundaries along the “private” and “cheap” axes.

Open AI can release GPT 4.5 or 5 and push out the boundary in the direction of “correctness” and “multimodality”.

Either way, we win as customers while the level of competition remains this hot.

I personally want a smart AI much more than a cheap or fast one. Your mileage may vary.

reply
mft_
13 days ago
[-]
Well, Pareto is about optimisation, not either/or. I want a model that’s smart enough, while also being locally-executable.

I don’t know whether/when we’ll get there, and whether it will be improvements in models, or underlying model technology, or GPU/TPUs with larger memory at a consumer price point, or something else, that will deliver it.

reply
jiggawatts
13 days ago
[-]
That's just the middle of the Pareto frontier. Some people want one corner, others the other corner. It's like compression. You can have light compression and high speed, vice-versa, or a balance. You want balance. I want one of the corners.
reply
michelsedgh
13 days ago
[-]
I agree with you somewhat. You are correct unless they have a much better GPT model that they have not released for whatever reason. They are a year ahead of competitors and GPT-4 is pretty old now. I find it hard to believe they don't have much more capable models by now. We will see though.
reply
j45
13 days ago
[-]
The polish of OpenAI's releases has been quite mature since GPT-4 or even 3.5.

They are no doubt sitting on ultra-polished stuff. When you are the tip of the arrow and the cutting edge itself, it might not be as efficient, but it does show you things you can't unsee.

When OpenAI can launch a video product the day after because it's ready to go, I am less and less skeptical every time they ship, because the quality of the first version isn't sliding backwards, even in new areas like video.

Maybe releasing it is strategic, or releasing it also requires supporting it infrastructure-wise and then some. That might be a challenge.

My feeling is the next model, or anything in between, may have massive efficiency and performance improvements without having to brute-force it.

Meanwhile others who are following what OpenAI has done seem to be able to optimize it and make it more efficient whether it’s open source or otherwise.

Both are doing important work and I'm not sure I want to see it as a one winner take all game.

The way AI vendors respond so quickly to one another's launches makes it feel like they are always sitting on something ready to launch, and keep adding functionality to it that could also ship.

It reminds me of when Google stayed quiet while Microsoft spent a billion dollars advertising that Bing had a billion pages indexed. Then, once the money was spent, Google simply added a zero or two to their search page, back when they used to list how many pages they had indexed. They were just sitting on it, already done, announcing it when it was to their benefit.

reply
mark_l_watson
13 days ago
[-]
Also, what will the effect of open models be on the LLM provider industry? What effect will Meta’s scorched earth policy of killing markets by releasing very good open models have?

I use LLMs constantly, but no longer in a commercial environment (I am retired except for writing books, performing personal research projects, and small consulting tasks). I now usually turn first to local models for most things: ellama+Emacs is good enough for me to mostly stop using GPT-4+Emacs or GitHub Copilot, the latest open 7B, 8B, 30B models running on my Mac seem sufficient for most of the NLP and data manipulating things I do.

However, it is also fantastic to have long context Gemini, OpenAI APIs, Claude, etc. available when needed or just to experiment with.

reply
Plankaluel
13 days ago
[-]
GPT-4 is not a single model. The GPT-4 that was released initially a year ago is way worse in benchmarks than the newest versions of it, and the original version has been beaten by quite a lot of other models by this point.

The newest version of GPT-4 is probably still overall the best model currently, but it is only a few months old, and the picture depends a lot on what benchmarks you are looking at.

E.g. for what we are doing at our company (document processing, etc.) Claude-3 Opus and Gemini-1.5 Pro are currently the better models. The newest GPT-4 even performed worse than a previous version.

So to me it def. seems like the gap is getting smaller. Of course, OpenAI could be coming out with GPT-5 next week and it could be vastly better than all other current models.

reply
easygenes
13 days ago
[-]
There's wide speculation that what will be branded as either GPT-4.5 or GPT-5 has finished pretraining now and is undergoing internal testing for a fairly near-term release.
reply
michelsedgh
13 days ago
[-]
My speculation is that internally they have much stronger models like Q* but they won’t be able to release them to public even if they want to for lack of compute and safety and other reasons they see probably…
reply
kaliqt
13 days ago
[-]
They don't actually care about safety; that's a lie. So compute and business strategy are the only things stopping them.

Sora is the same. It's not ready and it's too slow.

reply
whimsicalism
13 days ago
[-]
I am curious whether this is true - OAI at least has the reputation in the industry of caring the least about safety among the major labs.
reply
hhh
13 days ago
[-]
If they don’t care about safety (or perceived safety), why do they spend so much time lobotomizing models for safety reasons?
reply
Terretta
13 days ago
[-]
market reach e.g. ability to have chat app on iOS (the API is less limited)

public relations, limit the edge case nonsense 'journalists' hype so corporate execs aren't terrorized into avoiding buying

doesn't have to be as smart as it could be, it just has to be smarter than other models, so might as well file down some sharp edges for sake of above

reply
whimsicalism
13 days ago
[-]
I didn’t say they don’t care about safety, merely that of the big labs they care the least or close to the least
reply
realusername
13 days ago
[-]
Because of PR reasons. They want to avoid government legislation, and pretending that they care helps.
reply
RcouF1uZ4gsC
13 days ago
[-]
> My speculation is that internally they have much stronger models like Q*

People used to speculate the same about Google. Everyone hypes up their “secret, too powerful to release” models. Remember the dude who was convinced that there was a sentient AI in the machine? The light of actual public release tends to expose a lot of the hype.

reply
HeatrayEnjoyer
13 days ago
[-]
That would be a reasonable assumption if OpenAI did not already have an established track record of repeatedly re-defining our fundamental expectations of what technology can do.

GPT-4 was already completed and secretly being tested on Bing users in India in mid-2022 (there were even Microsoft forum posts asking about the funny chatbot). Even after heavy quantization and the alignment tax GPT-4 is still the bar to beat. It's been two years and their funding has increased over 10x since then.

Short of a fundamental Hard Problem that they cannot overcome, their internal bleeding edge models can reasonably be assumed to possess significantly greater capabilities.

reply
torginus
13 days ago
[-]
Honestly, I'm pretty puzzled by this mystical fog that hangs over OpenAI's skunkworks projects - don't people leave for other jobs, go to conferences, etc.?

I'm surprised that nobody can tell what they in fact do or do not have.

reply
lightbritefight
13 days ago
[-]
Truth tends to take the wind out of hype's sails.

With hundreds of billions on the line for the founders and a whole lot of likely unvested stock options for the employees, it doesn't seem like anyone wants to open up about what's actually going on day to day.

reply
imtringued
13 days ago
[-]
I'm not saying Claude 3 and Gemini are better than GPT-4 in every aspect, but those two models can at least perform addition on arbitrarily long numbers, while GPT-4 struggles.
reply
j-bos
13 days ago
[-]
Isn't that why he's making the rounds to lock down the biggest AIs?
reply
hiddencost
13 days ago
[-]
I suspect that when it costs 0.5c per 100 million generated tokens, and you can generate 1000 tokens per second, they'll be very happy.
reply
moralestapia
13 days ago
[-]
Disclaimer: not a fan of "Open"AI

Everyone can say anything about open-source models, but they're comparing themselves to what OpenAI released a year ago. They haven't shown all of their cards yet and they have a decent moat already in place; some say they have no moat, but I disagree: they have one of the best moats possible, which is brand awareness.

Sora on its own could bring in billions in revenue; an open-source Sora will take at least another year, if not two, to come out, then more time until it can run on commodity hardware. An open-source model that only runs on a dedicated H100 is actually less useful than a closed model behind an API call. Not to detract from open source, which I think is the way to go, but I'm just being pragmatic and realistic. There's a reason MS Office is still the top productivity app in the world, even though dozens of open-source alternatives exist.

reply
Hendrikto
13 days ago
[-]
> they have one of the best moats possible which is brand awareness.

Do they though?

If you talk to "regular people", everybody knows ChatGPT, but nobody knows or cares about OpenAI. And most of them don't even really know that name. They call it ChatUuuuhm, ChatThingy, Chad Gippity, or similar.

I think they will just switch when something better comes along.

reply
CapeTheory
13 days ago
[-]
Good old Chatty Jeeps
reply
moralestapia
13 days ago
[-]
I don't really follow your logic ...

ChatGPT is a brand that belongs to OpenAI, that's ... not really hard to understand.

reply
poslathian
13 days ago
[-]
MS had yet to fully stabilize that lead a full decade after they had won the OS platform standard for IBM-compatible PCs. A platform-standard moat goes way, way beyond a brand advantage.

Azure, while significant, has no similar monopoly to support OpenAI. Do you really see a structural advantage to OpenAI beyond the Microsoft products integrating it?

reply
moralestapia
13 days ago
[-]
Can you clarify what is meant by "structural advantage"?
reply
jstummbillig
13 days ago
[-]
I disagree.

a) A year after GPT-4 set the bar, it's still the best model, despite everyone else having the easier job: they didn't have to do it first, just copy, and it's just software. And that's not for lack of trying by every other viable prime player on the planet, with unprecedented acceleration.

Imagine any other piece of software where the incumbent has a mere 2-3 year head start, in which they had to work out the entire product, and everyone else, despite only having to copy and pressing the pedal through the floor, is struggling just to catch up.

b) The current models, including GPT-4, are still so bad. Billions can be made just by continuing to play this game of improvements for a few years, getting better each year. I think people are wildly confused about how big this market is going to be when that happens. They are not squeezing hosting or compute; they are squeezing intelligence, and intelligence is the entire economy. The notion that there would ever not be room for multiple players here, whether through size or specialisation or cost (as with all other intelligence), and that a few billion dollars are a big deal, is so strange to me.

c) The game will, at some point, be mostly about infra and optimization. People conclude that's a problem for the incumbents, when our entire industry is mostly about infra and optimization. AWS is infra and optimization. I think even the average HN tinkerer understands that therein lies a proposition that's not exactly equivalent to "just rent a few servers and do it yourself".

reply
anon373839
13 days ago
[-]
> A year after GPT-4 set the bar, it's still the best model

Debatable. Many people find Claude Opus superior, and I know I've found it consistently better for challenging coding questions. More importantly, the delta between GPT-4 and everything else is getting smaller and smaller. Llama 3 is basically interchangeable with GPT-4 for a huge number of tasks, despite its smaller size.

reply
jstummbillig
13 days ago
[-]
> Many people find Claude Opus superior

Many more do not, according to the LMSYS leaderboard.

> Llama 3 is basically interchangeable with GPT-4 for a huge number of tasks

Sure. I am sure the number approaches infinity, if you are willing to let the model inform the task. That's usually not what most people are looking for in a tool.

reply
threeseed
13 days ago
[-]
GPT-4 was released in March 2023.

Which means the research that went into it would've been finalised quite some time prior.

Meaning that you're getting close to a 2 year head start.

reply
acheong08
13 days ago
[-]
While they still call it GPT-4, the ones topping the rankings are newer iterations of it, despite retaining the same name. The latest is from 2024-04-09. Sure, that one probably finished training a few months ago, but it is by no means a 2-year head start.
reply
ashu1461
13 days ago
[-]
Agreed, the delta is getting smaller. And for the majority of tasks you can use Claude Sonnet, which is better than 3.5 and also fast.

But at the same time, when you actually want to solve a complicated problem, deep down you know that only GPT-4 can crack it.

reply
jstummbillig
13 days ago
[-]
Even more important, you know that GPT-4 will probably also not crack it, which is why the SOTA is not terribly interesting. The delta between GPT-4 and the competition has been closing, but why anyone would assume this is a trend that will continue from GPT-4.5 or GPT-5 onwards, rather than the other way around, is a mystery to me.

I am not saying it could not be true. But extrapolating from differences between current bad models to a future with better models is weird, especially when everyone seems to agree that scale is the difference between the two, and scale is hard and exclusive.

reply
anon373839
13 days ago
[-]
There’s a scatterplot that’s been circulating on Twitter. The trend lines show that since the time of GPT-2, open weights models have improved at a steeper rate than proprietary models, with the two on a path to intersect.
reply
jstummbillig
13 days ago
[-]
I would argue that's to be expected after the first generally accepted POC (GPT-3.5) was released, an entire industry was created with it, and other companies actually started copying/competing in a big way.

It seems a stretch to read this as a continuing trend, when (from what I gather everyone agrees on) the way to better models seems to be ever more efficient handling of ever larger amounts of money, compute and data, with no reasonable limits in sight on any of the three.

reply
anon373839
13 days ago
[-]
Scaling up LLMs is only going to go so far, and it will yield diminishing marginal returns on all of that money, compute, and data. It’s a regime of exponential increases in inputs for linear gains in the outputs - barring some technological breakthroughs which could come from anywhere, not just from OpenAI.
reply
hackerlight
13 days ago
[-]
Depends how good their next model is, and if they prevent leaks and departures so they can prolong the lead for an undetermined amount of time.
reply
14u2c
13 days ago
[-]
Most of the big "investments" in OpenAI are in the form of compute credits. I fail to see the downside of that.
reply
modeless
13 days ago
[-]
I don't need exact results. FP8 quantization is almost lossless and even 6-bit quantization is usually acceptable. Can this be combined with quantization?
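(For intuition, a toy numpy sketch of symmetric int8 round-trip quantization, my own illustration rather than anything from this paper, showing why the error is small relative to the largest weight: rounding costs at most half a quantization step, about 0.5/127 ≈ 0.4%.)

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in for one weight tensor

scale = np.abs(w).max() / 127.0                   # one scale per tensor
q = np.round(w / scale).clip(-127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale              # dequantize

# Worst-case error is half a step: 0.5 * scale, i.e. ~0.4% of the max weight.
rel_err = np.abs(w - w_hat).max() / np.abs(w).max()
print(f"max error relative to largest weight: {rel_err:.4f}")
```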
reply
mmoskal
13 days ago
[-]
Yes. It's speculative decoding, but instead of generating just a few sequential tokens with the draft model, they generate a whole tree of candidates with an optimized shape, covering hundreds of possible sequences.

It ends up being somewhat faster than regular speculative decoding in the normal setting (GPU only). If you are doing CPU offloading, it's massively faster.
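For intuition, here's a minimal sketch of the plain chain version, using my own toy stand-in "models" rather than anything from Sequoia; Sequoia's contribution is replacing the single draft chain with a tree of candidates, all verified by the target model in one batch:

```python
# Toy greedy speculative decoding over a vocab of digits 0-9.
# The draft model is cheap and usually right; the target model is ground truth.
def draft_model(ctx):
    return (ctx[-1] + 1) % 10            # always guesses "next digit"

def target_model(ctx):
    # Agrees with the draft except right after a 3, where it emits 7.
    return 7 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    draft, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_model(tmp)
        draft.append(t)
        tmp.append(t)
    # 2) Verify all k positions with the target model (one batch in practice).
    accepted, tmp = [], list(ctx)
    for t in draft:
        correct = target_model(tmp)
        if t == correct:                 # draft token accepted for free
            accepted.append(t)
            tmp.append(t)
        else:                            # first mismatch: take target's token, stop
            accepted.append(correct)
            break
    return accepted

print(speculative_step([0], k=4))        # draft proposes 1,2,3,4; target rejects 4
```

One target-model pass here yields up to k+1 tokens instead of one, which is the whole speedup; with CPU offloading each target pass is extremely expensive, so amortizing it over many candidates pays off even more.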

Edit typo

reply
dimask
13 days ago
[-]
> Can this be combined with quantization?

It is in the TODO section of https://github.com/Infini-AI-Lab/Sequoia/tree/main

reply
alecco
13 days ago
[-]
INT8, not FP8
reply
freeqaz
13 days ago
[-]
So this is 8x faster for serving these models than before? Or is this about it being more deterministic? I can't quite tell from reading it.
reply
maccam912
13 days ago
[-]
The idea is to serve models that would normally be considered too large for GPU memory (70 billion parameters at 2 bytes each in fp16, for 140 GB of memory required). Some people figured out you can offload the model and keep only parts of it loaded, so a 24 GB GPU like the 4090 can still serve the model, but it runs a lot slower. They have a new way to serve the same model on the same GPU with 8x better throughput: decoding tokens with a smaller draft model, then checking multiple tokens on the larger model in a single batch. Magic, but ultimately it's the same model, same GPU, same output as before, with much better throughput.
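The arithmetic behind those figures, as a quick sanity check (fp16 is 2 bytes per parameter):

```python
# Back-of-envelope VRAM math for Llama2-70B on an RTX 4090.
params = 70e9                 # 70 billion parameters
fp16_bytes = params * 2      # 2 bytes per parameter in fp16

weights_gb = fp16_bytes / 1e9
resident = 24 / weights_gb   # fraction of the model the 4090's 24 GB can hold

print(f"fp16 weights: {weights_gb:.0f} GB")          # 140 GB
print(f"fits on GPU: {resident:.0%} of the model")   # ~17%
```

So roughly five sixths of the weights have to live in host memory, which is why naive offloading is so slow and why batching the verification step helps so much.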
reply
aussieguy1234
13 days ago
[-]
I'm looking at buying 2x RTX 3060s to run Llama 70B in the new PC I just purchased.

Will this work, or do I need a Tesla P40 or two?

reply
tarruda
13 days ago
[-]
Note that 2x RTX 3060 will probably be significantly slower than an RTX 4090.

Even with an RTX 4090, 2 tokens per second is very slow and likely not ideal for most tasks. It is impressive (much faster than previous solutions), but still very slow for real-time use.

If you want to run Llama 3 70B, it might be better to purchase a Mac Studio with 64GB RAM (more for longer contexts) and run with 4-bit quantization.

My 2 cents: for most common tasks, Llama 3 8B will be more than enough, and you can run that at full precision on a single RTX 3090. At a much lower cost, you can also run Llama 3 8B with 8-bit quantization on a single RTX 3060, if it's the 12GB variant.
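A quick fit check behind those recommendations; this is a rough sketch that counts weights only, ignoring the KV cache and activations, which add a few more GB depending on context length:

```python
# Weight-memory needed at each quantization level vs. available memory.
configs = [
    # (label, parameters, bytes per parameter, available memory in GB)
    ("Llama 3 70B @ 4-bit, 64GB Mac Studio", 70e9, 0.5, 64),
    ("Llama 3 8B  @ fp16,  24GB RTX 3090",    8e9, 2.0, 24),
    ("Llama 3 8B  @ int8,  12GB RTX 3060",    8e9, 1.0, 12),
]
for label, params, bytes_per_param, mem_gb in configs:
    need_gb = params * bytes_per_param / 1e9
    print(f"{label}: needs {need_gb:.0f} GB -> fits: {need_gb < mem_gb}")
```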

reply
dannyw
13 days ago
[-]
Theoretically there's no reason this shouldn't work, but you'll likely find the software isn't designed for multi-GPU use and you'll have to reimplement/fix things yourself.

You will also be getting about 720GB/s of memory bandwidth with 2x 3060, instead of ~1TB/s with the 4090, so expect lower performance.
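A crude way to see why bandwidth dominates: in the memory-bound decode regime, every generated token streams the full weight set through the GPU once, so tokens/s is bounded by bandwidth divided by model size. The sketch below ignores compute, activations, and any offload traffic, and assumes an 8B model in fp16:

```python
weights_gb = 8e9 * 2 / 1e9    # e.g. Llama 3 8B in fp16: ~16 GB of weights

# Approximate aggregate memory bandwidths in GB/s.
for name, bw_gb_s in [("2x RTX 3060", 720), ("RTX 4090", 1008)]:
    ceiling = bw_gb_s / weights_gb
    print(f"{name}: ~{ceiling:.0f} tokens/s upper bound")
```

Real throughput lands well below these ceilings, but the ratio between the two setups is about what you'd expect to see in practice.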

reply
34679
13 days ago
[-]
I picked up a couple of RTX 4060 Ti cards in the 16GB version for $450 each a couple of days ago from Best Buy. Had been looking at the 3060 like yourself. Installed LM Studio and have been trying out a bunch of models at varying levels of quantization, completely pain-free.
reply
thelittleone
13 days ago
[-]
Other than portability and privacy, are there any benefits to running a local model with a 4090, versus running the same model on-demand on a cloud service with the same or more powerful card?
reply
razodactyl
13 days ago
[-]
There are always going to be pros and cons; that's why solutions like managed databases exist. From an expert's perspective it seems like there's more to lose, but from the perspective of a company with employee turnover, possible data loss, security concerns, etc., the benefits start to far outweigh the costs.

This reasoning mostly applies here. If you want to learn about the LLM and pull it apart, perhaps fine-tune and tinker, then 100% go ahead and run locally. You won't, however, be able to scale this up easily for a consumer base, and the electricity use and heat output start to become a problem.

At some point it's more beneficial to pay a provider for inference; this includes upkeep, the latest models, faster generation, stability, hosting, etc.

Pros and cons! Choice is important, and Meta is doing the right thing by the AI community and the tech community in general by being realistic with these programs. The ecosystem gives back through being able to access these high-quality models.

reply
j45
13 days ago
[-]
What Meta is doing is very nice and differentiates them.

I also hope that doesn't change if it becomes more palatable not to be open.

reply
razodactyl
10 days ago
[-]
Meta seem to be thinking 10 years ahead where anyone can run these models at the edge.

Perhaps it's not about where the model is hosted but what can be built on it.

Meta have added Llama 3 across the board in all their apps.

Chat is fun, but in the wrong context it's useless. However, the training-data return from millions of users is something interesting to pay attention to... Llama 4 might be a significant jump!

Increased model intelligence and innovative applications of language technology will be where the real value appears. Open-sourcing and allowing public amplification of abilities and enhancements is a very smart move.

The marketing department is also commendable. What happened to Grok? LLMs are everywhere; we're running them on home computers, and that's where we should be pondering the next moves.

reply
kaliqt
13 days ago
[-]
Guaranteed uptime.

You are the guarantor, but that's good enough.

reply
choppaface
13 days ago
[-]
Eventually these models will need to run on mobile devices, so commodity desktop GPUs are a good stepping stone. Alexnet / Caffe got traction because they could be run on commodity desktop machines. Then a few years later phones could run object detection etc.
reply
imtringued
13 days ago
[-]
If you have a robot or self driving car, you're going to want on device inference for your vision language models.

For video games, being locked to a cloud service means the feature will disappear when the servers are shut down.

reply
zwaps
13 days ago
[-]
Is it just me, or is this paper basically missing all technical information?

I get that there's proprietary technology, but if so, can we please not put this on arXiv and pretend it's a scientific contribution?

reply
qrios
13 days ago
[-]
The linked GitHub repo [1] seems to have the code available and well documented.

[1] https://github.com/Infini-AI-Lab/Sequoia/tree/main/Engine

reply
halyconWays
13 days ago
[-]
Someone get this into koboldcpp
reply