What they did is closer to fine-tuning, so the comparison isn't very useful. The article is at least transparent about this, but quoting cost and performance numbers seems disingenuous when they're mostly piggybacking on an existing model. Until they train an equivalently sized model from scratch and demonstrate a notable benefit, this looks like, at best, a sidegrade to transformers.