Without reading the PDF, I tried the first game it gave me, at https://arcprize.org/tasks/ls20, and I couldn't begin to guess what I was supposed to do. I'm not sure what this benchmark is supposed to prove.
> Only environments that could be fully solved by at least two human participants (independently) were considered for inclusion in the public, semi-private and fully-private sets.