What you get is an iterator over the dataset that samples according to how far along you are in training.
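A minimal sketch of that idea (the names and the linear schedule are illustrative, not from any particular library): rank examples by some difficulty score and widen the sampling window as training progresses.

    import random

    def curriculum_iterator(dataset, difficulty, total_steps):
        """Yield samples, widening the difficulty window as training progresses.

        dataset: list of examples; difficulty: fn(example) -> float.
        """
        ranked = sorted(dataset, key=difficulty)
        for step in range(total_steps):
            progress = (step + 1) / total_steps           # 0 -> 1 over training
            cutoff = max(1, int(len(ranked) * progress))  # only the easiest slice early on
            yield random.choice(ranked[:cutoff])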
This seems really similar to the motivations around masked language modeling. By providing increasingly masked targets over time, you get a smooth difficulty curve. Randomly masking X% of the tokens/bytes is trivial to implement, and because every pass draws fresh masks, MLM can take a small corpus and turn it into an astronomically large one.
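A toy version of that, assuming a whitespace-tokenized corpus and a linear masking schedule (both choices are mine, just to make the point concrete):

    import random

    def mask_tokens(tokens, progress, max_rate=0.5, mask_token="[MASK]"):
        """Mask a growing fraction of tokens as training progresses (progress in [0, 1])."""
        rate = max_rate * progress  # 0% masked at the start, up to max_rate at the end
        return [mask_token if random.random() < rate else t for t in tokens]

    # Each call draws different masks, so one sentence yields many training targets.
    sentence = "the cat sat on the mat".split()
    print(mask_tokens(sentence, progress=0.2))
    print(mask_tokens(sentence, progress=0.9))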
e.g. the DeepCubeA paper from 2019 (!) that solves the Rubik's Cube.
Start with the solved state and train the network on successively harder states. This is so "obvious" and "unhelpful in real domains" that perhaps they haven't heard of this paper.
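Roughly the trick, as a toy sketch (I'm swapping the cube for a simple permutation puzzle so it runs standalone; DeepCubeA's actual setup differs in detail). A state reached by k random moves from solved is at most k moves from the solution, so k is a natural difficulty knob:

    import random

    def scramble(solved_state, moves, k):
        """Take k random moves backward from the solved state."""
        state = solved_state
        for _ in range(k):
            state = random.choice(moves)(state)
        return state

    def reverse_curriculum(solved_state, moves, max_k, n_per_level):
        """Yield (state, k) pairs with k growing over time, easiest first."""
        for k in range(1, max_k + 1):
            for _ in range(n_per_level):
                yield scramble(solved_state, moves, k), k

    # Toy stand-in for a cube: a permutation scrambled by adjacent swaps.
    def swap(i):
        return lambda s: s[:i] + (s[i + 1], s[i]) + s[i + 2:]

    solved = tuple(range(6))
    moves = [swap(i) for i in range(5)]
    for state, k in reverse_curriculum(solved, moves, max_k=3, n_per_level=2):
        print(k, state)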
If you sat down to solve a problem you’ve never seen before, you wouldn’t even know what a valid “later state” looks like.
The happy Tetris bug is also a neat example of how “bad” inputs can act like curriculum or data augmentation. Corrupted observations forced the policy to be robust to chaos early, which then paid off when the game actually got hard. That feels very similar to tricks in other domains where we deliberately randomize or mask parts of the input. It makes me wonder how many surprisingly strong RL systems in the wild are really powered by accidental curricula that nobody has fully noticed or formalized yet.
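If you wanted that effect on purpose rather than by accident, it could be as simple as corrupting observations before the policy sees them. A sketch with made-up knobs (drop_rate and noise are things you'd tune, not values from the Tetris story):

    import random

    def corrupt_observation(obs, drop_rate=0.1, noise=0.05):
        """Randomly zero out and jitter entries of an observation vector."""
        out = []
        for x in obs:
            if random.random() < drop_rate:
                out.append(0.0)                         # simulate a dropped/garbled input
            else:
                out.append(x + random.gauss(0, noise))  # small jitter elsewhere
        return out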