FilterHN

The human genome contains around 1.5GB of information and DeepSeek v3 weighs in at around 800GB, so it's a bit apples-to-oranges. As you say, what's been evolved over hundreds of millions of years is the learning apparatus and architecture, but we largely learn online from there (with some built-in behaviours like reflexes). It's a testament to the robustness of our brains that the overwhelming majority of humans learn pretty effectively. I suspect LLM training runs are substantially more volatile (as well as suffering from the obvious data efficiency issues).

If you'd like an unsolicited recommendation, 'A Brief History of Intelligence' by Max Bennett is a good, accessible book on this topic. It explicitly draws parallels between the brain's evolution and modern AI.

▲

sdpmas

1 hour ago

[-]

i think evolution meta-learns the architecture, hyperparams. some domain knowledge too (for ex, we all perceive the world as 3d) but not much. if you compare the text consumed by human vs AI (and i think this is fair b/c even with evolution text is a pretty recent invention for humans), the gap is many orders of magnitude.

▲

throwaway894345

16 minutes ago

[-]

Tangentially, some scientists think humans may have hardwiring for detecting snakes https://en.wikipedia.org/wiki/Snake_detection_theory

▲

naasking

8 minutes ago

[-]

Great project. On the matter of data efficiency and regularization, I'd love to see someone try scaling GrokAlign!

▲

nsnzjznzbx

4 hours ago

[-]

We will get to the point where you can quickly bootstrap i.e. an LLM can train a better LLM in a loop, leave it and it can really learn. Like learn learn.

"Train yourself to solve this problem see OBJECTIVE.md"

▲

nine_k

3 hours ago

[-]

This is the kind of runaway self-improving development that proponents of the singularity keep talking about.

The problem is that training appears to be really slow and expensive. Some quality thinking is required to improve the training approach and the architecture before committing resources to training a new large model. And even the largest models are by now not nearly as good at quality thinking as the best humans.

▲

abeppu

2 hours ago

[-]

In their little algorithm box on Chain Distillation, they have at step 2b some expression that involves multiplying and dividing by `T`, and then they say "where α = 0.5, T = 1.0".

I think someone during the copy-editing process told them this needed to look more complicated?

▲

sdpmas

2 hours ago

[-]

the T stands for tea :)

▲

naruhodo

2 hours ago

[-]

Ah, so it's a source of randomness! Presumably 1.0 corresponds to a really hot cup of fresh tea.

▲

littlestymaar

6 hours ago

[-]

> Data efficiency matters because compute grows much faster than data [2] (referencing a paper from 2022)

I'm not convinced this is particularly true in today's world, if you have more compute, you can simply generate more, and higher quality, artificial data. That's what all labs have been doing since at least 2023.

Also, the post references the Chinchilla-optimal training as a comparison baseline, but everyone has moved far beyond Chinchilla scaling, small models are routinely trained on 10-400 times more data than (1-40T tokens) than the Chinchilla-optimal number, so the entire industry went the complete opposite of what they are proposing.

That doesn't mean the techniques presented here are useless or anything (I'm not qualified to judge) but you should take the introduction with a grain of salt.

▲

ACCount37

4 hours ago

[-]

There's "cheap" bulk data - simple synthetics, unfiltered scrapes. Used for pre-training, especially early pre-training. And then there's "expensive" data. Human domain expert solutions, made by people you hire for $100 an hour. Used for SFT.

For "expensive" data, it makes a lot of sense to use every trick in the book to squeeze that data for all its worth.

▲

akshayvegesna

5 hours ago

[-]

You seem to be making two points: - synthetic data is a valuable direction to pursue when you have compute - chinchilla scaling laws have some flaws for small models Both of these are side points to the core purpose of the Slowrun.

The main point is the 100M tokens we train on push people to come up with novel ideas to improve pretraining, outside of facile synthetic data generation. I think we should continue to push on synthetic data, but why not come up with some new ideas too? You cannot use synthetic data for everything (see sdpmas's point)

▲

sdpmas

6 hours ago

[-]

> you can simply generate more, and higher quality, artificial data

this is simply not true. and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentives to spend 1000x compute if that led to much better results, but we just don't know how to do that.

good point on chinchilla, but our models are still absurdly large no matter what standards you compare them to.

▲

littlestymaar

5 hours ago

[-]

> this is simply not true. and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentives to spend 1000x compute if that led to much better results, but we just don't know how to do that

I'm (and so is the post itself) talking about LLMs in particular, and this is indeed true for LLM.

▲

sdpmas

5 hours ago

[-]

continual learning is LLMs :) ultimately everything will be/already is data bottlenecked.

▲

yorwba

6 hours ago

[-]

Related: Discussion on the initial NanoGPT Slowrun announcement: https://news.ycombinator.com/item?id=47251259 (185 points 15 days ago, 39 comments)

▲

sdpmas

6 hours ago

[-]

thanks!