And in 10-20 years it’ll be capable of some crazy stuff
I might be ignorant of the field but why do we assume this?
How do we know it won’t just plateau in performance at some point?
Or that say the compute requirements become impractically high
No one has hit a model/dataset size where the curves break down, and they're fairly smooth. Simple models that fit existing performance usually predict well near the scales we've already measured, so I expect trillion or 10-trillion parameter models to be on the same curve.
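As a rough illustration of what that extrapolation looks like (a toy sketch with invented numbers, not anyone's actual scaling data), you fit a Chinchilla-style power law to the sizes you've measured and read off the prediction for the sizes you haven't:

    # Hypothetical sketch: fit a power law L(N) = a*N^-b + c to observed
    # (parameter count, loss) pairs, then extrapolate to larger models.
    # The data points below are invented for illustration, not real measurements.
    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(n, a, b, c):
        return a * n ** (-b) + c

    sizes = np.array([1e8, 1e9, 1e10, 1e11])   # model sizes we "measured"
    losses = np.array([3.9, 3.2, 2.7, 2.35])   # made-up eval losses

    (a, b, c), _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 1.5), maxfev=10000)

    for n in (1e12, 1e13):                     # trillion / 10-trillion parameters
        print(f"{n:.0e} params -> predicted loss {power_law(n, a, b, c):.2f}")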
What we haven't seen yet (that I'm aware of) is whether the specializations to existing models (LoRA, RLHF, different attention methods, etc.) follow similar scaling laws, since most of the effort has been focused on achieving similar performance with smaller/sparser models rather than investing large amounts of money into huge experiments. It will be interesting to see what DeepMind's Gemini reveals.
The same was true of transistors, until it wasn't: sometime around the late NetBurst era they started diverging from the predictions about how they would behave when very small. (The Pentium 4/NetBurst architecture was sunk by this problem: its designers assumed it would scale to 8-10GHz on a sane power budget, and it simply didn't, as the "improvement per transistor shrink" became less and less.)
Data
Compute
Algorithms
All three are just scratching the surface of what is possible.
Data: What has been scraped off the internet is <0.001% of human knowledge, since most platforms can't be scraped easily, and much of the rest is in formats that aren't text: video, audio, or just plain old undigitized paper. Finally, there are probably techniques to increase data through synthetic means, which is purportedly OpenAI's secret sauce behind GPT-4's quality.
Compute: While 3nm processes are approaching an atomic limit (0.21nm for Si), there is still room to explore more densely packed transistors, other materials like gallium nitride, or optical computing. Beyond that, there is a lot of room in hardware architecture for more parallelism and 3D-stacked transistors.
Algorithms: The Transformer and other attention mechanisms have several sub-optimal components, from how arbitrary many of the Transformer's design decisions are to the quadratic time complexity of attention. There also seems to be a large space of LLM augmentations, like RLHF for instruction following, improvements in factuality, and other mechanisms.
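To make the quadratic point concrete, here's a minimal single-head attention sketch (my own toy illustration, not any particular library's implementation); the n x n score matrix is where the cost blows up with sequence length:

    # Minimal single-head self-attention sketch (illustrative only): the n x n
    # score matrix is where the quadratic time and memory cost comes from.
    import numpy as np

    def attention(q, k, v):
        # q, k, v: (n, d) arrays for a sequence of n tokens
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)                   # shape (n, n): quadratic in n
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ v                              # shape (n, d)

    n, d = 1024, 64
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    print(attention(q, k, v).shape)  # (1024, 64); the intermediate score matrix was 1024 x 1024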
And these ideas are just from my own limited experience. So I think it's fair to say that LLMs have plenty of room to improve.
> Data
> Compute
> Algorithms
Not to be facetious, but so is all other software. LLMs appear to scale in correlation with the first two, but it's not clear what that correlation is, and that's the basis of the question being asked.
Wouldn't that be the wall we'll hit? Think of how shitted up Google Search is with generated garbage. I'm imagining we're already in the 'golden age', where we were able to train on good datasets before they got 'polluted' with LLM-generated data that may not be accurate, and models just continue to become less accurate over time.
I think feeding the internet into a LLM will be seen as the mainframe days of AI.
Is this actually true? My gut check says yes, but I'm also unaware of any meaningful way to actually quantify the volume of sensor data processed by a baby (or anyone else for that matter), and it wouldn't shock me to discover if we could we'd find it to be a huge volume.
That doesn't mean there isn't possibly a plateau somewhere but it's somewhere way off in the distance.
The problem with fusion and quantum computing is that advances are being made, but because those advances aren't consumer facing, you don't see them. E.g. in December 2022, they managed to get more energy out of a fusion experiment than they put in. That's huge! I'm not going to see an effect on my power bill for another couple decades, if ever, but it's real, actual, solid progress. For quantum computing, they're moving past the singular q-bit tech demonstration level and into actual practical applications like making chips that can talk to each other **. Again, it doesn't remotely affect me or my laptop today, but we've moved past the 1998 Stanford/IBM 2 q-bit computer.
Meanwhile, I can adopt a new model getting dropped with an afternoon of work, and see the results in milliseconds, in the case of StableDiffusion-turbo.
* https://www.technologyreview.com/2023/11/16/1083491/whats-co... ** https://www.technologyreview.com/2023/01/06/1066317/whats-ne...
Cryptocurrency’s need to be fully decentralised is the thorn in its side. "Be your own bank" is a bit too much for most people used to cash, or to a bank account they can call up if there is a problem. It has fundamental social problems that there may be solutions to, but probably not.
Fusion and quantum are massive physics and engineering challenges. With ML we are already building the chips to scale, so we know it is scalable and doable.
It is a 50 to 100 problem not 0 to 1.
I also was not arguing that AI won't improve by quite a bit more. I actually think AI will make a few more big steps forward, but the guarantee for this is not anchored in the inflowing capital/talent but instead in the relatively clear path forward for the technology and the partly known inefficiencies of the current architecture.
But that's just my opinion and no one knows the future. If you read papers on arxiv.org, progress is being made. Papers are being written, low-hanging fruit consumed. So we're going to try because PhDs are there for the taking on the academic side, and generational wealth is there for the taking on the business side.
E. F. Codd invented the relational database and won the Turing Award. Larry Ellison founded Oracle to sell relational databases and that worked out well for him, too.
There's plenty of motivation to go around.
Digital computer architecture evolved the way it did because there was no other practical way to get the job done besides enforcing a strict separation of powers between the ALU, memory, mass storage, and I/O. We are no longer held to those constraints, technically, but they still constitute a big comfort zone. Maybe someone tinkering with a bunch of FPGAs duct-taped together in their basement will be the first to break out of it in a meaningful way.
Good LLMs like ChatGPT are a relatively new technology so I think it's hard to say either way. There might be big unrealized gains by just adding more compute, or adding/improving training data. There might be other gains in implementation, like some kind of self-improvement training, a better training algorithm, a different kind of neural net, etc. I think it's not unreasonable to believe there are unrealized improvements given the newness of the technology.
On the other hand, there might be limitations to the approach. We might never be able to solve for frequent hallucinations, and we might not find much more good training data as things get polluted by LLM output. Data could even end up being further restricted by new laws meaning this is about the best version we will have and future versions will have worse input data. LLMs might not have as many "emergent" behaviors as we thought and may be more reliant on past training data than previously understood, meaning they struggle to synthesize new ideas (but do well at existing problems they've trained on). I think it's also not unreasonable to believe LLMs can't just improve infinitely to AGI without more significant developments.
Speculation is always just speculation, not a guarantee. We can sometimes extrapolate from what we've seen, but sometimes we haven't seen enough to know the long term trend.
I think I have a corollary-type idea: why aren't LLMs perhaps like "Linux," something that never really needs to be REWRITTEN from scratch, merely added to or improved on? In other words, isn't it fair to think that LoRAs are the really important thing to pay attention to?
(And perhaps, like Google Fuchsia or whatever, new LLMs might just be mostly a waste of time from an innovator's POV?)
It gets murkier trying to map that to actual capabilities, but so far, lower loss has led to much stronger capabilities.
It's not unfeasible that in the future you'll have a box at home that you can ask a fairly complicated question, like "how do I build a flying car", and it will have the ability to
- give you step-by-step instructions for what you need to order
- write and run code to simulate certain things
- analyze your work from video streams and provide feedback
- possibly even have a robotic arm with attachments that can do some work.
From a software perspective, I've wondered for a while whether, as LLM usage matures, there will be an effort to optimize hotspots, like what happened with VMs, or auto-indexing like in relational DBs. I'm sure there are common data paths that get more usage and could somehow be prioritized, either through pre-processing or dynamically, to help speed up inference.
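As a purely hypothetical sketch of what prioritizing a hot path could look like, you might memoize the expensive work keyed by a shared prompt prefix so repeated paths skip recomputation (run_model here is a stand-in stub, not a real API):

    # Hypothetical sketch of "hot path" reuse: memoize work keyed by a common
    # prompt prefix so frequently hit paths skip recomputation. run_model() is
    # a placeholder stub, not a real library call.
    from functools import lru_cache

    def run_model(text, state=None):
        return f"<output for {len(text)} chars>"      # stub so the sketch runs

    @lru_cache(maxsize=1024)
    def encode_prefix(prefix: str):
        # Pretend this is the expensive part, e.g. building the KV cache
        # for a shared system prompt or few-shot examples.
        return run_model(prefix)

    def answer(prefix: str, question: str):
        cached = encode_prefix(prefix)                # cache hit on hot prefixes
        return run_model(question, state=cached)      # only the new part is fresh work

    print(answer("shared system prompt...", "what's the capital of France?"))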
Also, GPT-4 seems to include multiple LLMs working in concert. There's bound to be way more fruit to be picked along that route as well. In short, there are tons of areas where improvements large and small can be made.
As always in computer science, the maxim, "Make it work, make it work well, then make it work fast," applies here as well. We're collectively still at step one.
Great video to talk about this: https://www.youtube.com/watch?v=ARf0WyFau0A
In threads on LLMs, this point doesn't get brought up as much as I'd expect, so I'm curious if I'm missing talks on this or maybe it's wrong. But I see this as the way forward: models generating tons of answers, other models picking out the correct ones, the combination going beyond human ability, and humans doing their own verification afterward.
Edit:
Think of it this way. Trying to create something isn't easy. If I was to write a short story, it'd be very difficult, even if I spent years reading what others have written to learn their patterns. If I then tried to write and publish a single one myself, no chance it'd be any good.
But _judging_ short stories is much easier to do. So if I said screw it, read a couple of stories to get the initial framework, then wrote 100 stories in the same amount of time I'd have spent reading and learning more about short stories, I could then go through the 100, pick out the one I think is the best, and publish that.
That's where I see LLMs going and what the video and papers mentioned in the video say.
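A minimal sketch of that generate-then-judge loop, with generate() and judge_score() as stand-ins for real models rather than any actual API:

    # Hypothetical best-of-N sketch: one model drafts many candidates, a judge
    # scores them, and we keep the highest-scoring one. generate() and
    # judge_score() are placeholders, not real APIs.
    import random

    def generate(prompt: str) -> str:
        return f"draft #{random.randint(0, 9999)} for: {prompt}"   # stand-in generator

    def judge_score(prompt: str, candidate: str) -> float:
        return random.random()                                     # stand-in judge

    def best_of_n(prompt: str, n: int = 100) -> str:
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: judge_score(prompt, c))

    print(best_of_n("write a short story about a lighthouse keeper"))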
I'm not an expert here either, but I wonder if there will be the same "leap" we saw from GPT-3 to GPT-4, or if there's a diminishing curve to performance, i.e. adding another trillion parameters has less of a noticeable effect than the first few hundred billion.
[0] https://fortune.com/2023/09/09/ai-chatgpt-usage-fuels-spike-... -- I am fairly certain they paid for that water, it was not a commensurate price given the circumstances, and if they had to ask to use it first the answer would have been, no, by a reasonable environmental stewardship organization.
I, of course, already know how to do all this for a mere $80B.
If you've made chips with latches and LUTs, any performance data you can share, no matter how old, would be helpful.
It's an idea that's been bouncing around in my head since reading George Gilder's call to waste transistors. Imagine the worst possible FPGA, no routing hardware, and slow it down even more with a latch on every single LUT. Optimize it slightly by making cells with 4 bits in, 4 bits out (64 bits of programming per cell), with the cells clocked in 2 phases, like the colors of a chess board. This means that each white cell has static inputs from the black cells.... and is thus fully deterministic, and easy to reason about. The complement happens on the other phase. Together, it becomes Turing complete.
The thing is, it does computing with NO delays between compute and memory. All the memory is effectively transferred to compute every clock cycle. The latency sucks because you'll take n/2 cycles to get data across an N*N grid. However, you'll get an answer every clock cycle after that.
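As a toy illustration of the scheme (just a sketch, not the real simulator), stepping such a two-phase grid might look like this:

    # Toy simulation of the two-phase checkerboard grid described above: each
    # cell is a 4-bit-in / 4-bit-out LUT (64 bits of programming), and only
    # cells of one color update per phase, so their inputs are always static.
    # Illustrative sketch only, not the actual bitgrid simulator.
    import random

    N = 8                                             # N x N grid of cells
    random.seed(0)
    luts = [[{i: random.randrange(16) for i in range(16)}   # per-cell 4-bit -> 4-bit LUT
             for _ in range(N)] for _ in range(N)]
    state = [[0] * N for _ in range(N)]               # current 4-bit output of each cell

    def neighbor_nibble(r, c):
        # Pack one bit from each of the four neighbors into a 4-bit input.
        bits = 0
        for i, (dr, dc) in enumerate(((-1, 0), (1, 0), (0, -1), (0, 1))):
            bits |= (state[(r + dr) % N][(c + dc) % N] & 1) << i   # wrap at edges
        return bits

    def step(phase):
        # Only cells whose checkerboard color matches `phase` update; their
        # neighbors (the other color) hold still for this half-cycle.
        updates = {(r, c): luts[r][c][neighbor_nibble(r, c)]
                   for r in range(N) for c in range(N) if (r + c) % 2 == phase}
        for (r, c), v in updates.items():
            state[r][c] = v

    state[0][0] = 0b1010                              # inject some data
    for _ in range(4):
        step(0); step(1)                              # white phase, then black phase
    print(state)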
Imagine a million GPT-4 tokens/second.... not related to each other, of course, but parallel streams, interleaved as the data streams across the chips.
Imagine a bad cell.... you can detect it, and route around it. Yields don't have to be 100%.
The extreme downside is that tools for programming this thing don't exist. VHDL, etc... aren't appropriate. I'm going to have to build them. I've been stuck in analysis paralysis, but I've decided to try to get Advent of Code done using my bitgrid simulator. I hope to be done before it starts again next December. ;-)
I wish I shared your optimism. However I've seen no evidence that society is prepared to deal with a large swath of jobs being obsoleted by AI. I have no doubt that the "haves" will call it a technological utopia, but I strongly suspect the "have nots" will be larger than ever.
We're the only AI company that can offer HorseSense (TM)
Per usual, I can build this technological panopticon/utopia for a bargain price of $80B. Some people think it can be done for cheaper but they haven't spent as much time as I have on this problem. I have the architecture ready to go, all I need is the GPUs, cameras, microphones, speakers, and wireless data network. The software is the easy part but the panoptic infrastructure is what requires the most capital. The software/brain can be done for maybe $2B but it needs eyes, ears, and a mouth to actually be useful.
The second stage is building up the actuators to bypass people but once the panopticon is ready it won't be hard to build up the robot factories to enact the will of AGI directly via robots acting on the environment.
Anything that has seen continual growth will be assumed to have further continual growth at a similar rate.
Or, how I mentally model it even if it's a bit incorrect: People see sigmoidal growth as exponential.
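A quick toy illustration of why that confusion is so easy to make: early on, a logistic curve is numerically almost indistinguishable from an exponential, so extrapolating from the early data can't tell them apart.

    # Toy numbers: early in a logistic (sigmoid) curve, growth looks exponential,
    # which is why extrapolating the early part is so misleading. L, k, t0 are made up.
    import math

    L, k, t0 = 100.0, 1.0, 10.0                       # ceiling, growth rate, midpoint

    def logistic(t):
        return L / (1 + math.exp(-k * (t - t0)))

    def naive_exponential(t):
        return L * math.exp(k * (t - t0))             # the small-t approximation

    for t in (0, 2, 4, 8, 12, 16):
        print(t, round(logistic(t), 3), round(naive_exponential(t), 3))
    # The two columns track closely while t is well below t0, then diverge wildly.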
I suspect that we've already seen the shape of the curve: a 1B parameter model can index a book; a 4B model can converse, but a 14B model can be a little more eloquent. Beyond that no real gains will be seen.
The "technology advancement" phase has already happened mostly, but the greater understanding of theory, that would discourage foolish investments hasn't propagated yet. So there's probably at least another full year of hype cycle before the next buzzword is brought out to start hoovering up excess investment funds.
So if we have that much compute power already why can't we just configure it in the right way to match a human brain?
I'm not sure I totally buy that logic though, since I would think the architecture/efficiency of a brain is way different from a computer
But even if you’re looking just at the LLM it seems like there’s a lot of ways it can be improved still.
We don't.
But that's also the sort of thing you can't say when seeking huge amounts of funding for your LLM company.