They used eight A100s but don't say how long it took to train their LLM; it would be interesting to know the wall-clock time they spent. Their dataset is, relatively speaking, tiny, which means it should take fewer resources to replicate from scratch.
What's interesting, though, is that the smaller model performed better, and they don't speculate about why that is.
I think this stuff will become a lot more fascinating once transformers have bottomed out on their hype curve and become just another tool for building specific types of models.
Do you have any good pointers (literature, code, etc.) on the mechanics of this?
(like TinyLlama or smaller, or just use whatever karpathy repo is most fun at the moment and train some GPT-2 equivalent)
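For concreteness, here's roughly what I mean by "train some GPT-2 equivalent" as a minimal sketch, using HuggingFace's GPT2Config/GPT2LMHeadModel instead of one of the karpathy repos; the layer/embedding sizes and corpus.txt are placeholders I made up, not anything from the paper:

    # Minimal sketch: train a tiny GPT-2-style model from scratch.
    # Sizes and the corpus file are illustrative assumptions.
    import torch
    from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    config = GPT2Config(
        vocab_size=tokenizer.vocab_size,
        n_positions=256,   # short context keeps memory tiny
        n_embd=256,
        n_layer=4,
        n_head=4,
    )
    model = GPT2LMHeadModel(config)  # random init, trained from scratch

    # toy corpus stands in for whatever structured dataset you care about
    text = open("corpus.txt").read()
    ids = tokenizer(text, return_tensors="pt").input_ids[0]

    optim = torch.optim.AdamW(model.parameters(), lr=3e-4)
    block = config.n_positions
    for step in range(1000):
        # sample a random window and train with the standard LM loss
        i = torch.randint(0, len(ids) - block - 1, (1,)).item()
        x = ids[i : i + block].unsqueeze(0)
        out = model(x, labels=x)  # labels=x -> shifted next-token loss
        out.loss.backward()
        optim.step()
        optim.zero_grad()

Swap the random-crop loop for a proper DataLoader and a held-out split once it's past the toy stage, but this is basically all a "GPT-2 equivalent" needs.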
Really, always happy to chat about this stuff with anybody. Would love to explore ideas here; it's a fun hobby, and we're living in a golden age of open-source structured datasets. I haven't actually found a community interested specifically in static knowledge injection. Email in profile (ebg_13 encoded).
It's been used in non-generative language models like BERT, but it should help with generative models as well.
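To make "static knowledge injection" concrete, one common recipe (and only my guess at what's meant here) is to verbalize structured facts into sentences and continue masked-LM pretraining on them; the triples, the template, and the bert-base-uncased choice below are purely illustrative:

    # Sketch of one flavor of knowledge injection: turn structured facts
    # into sentences and continue masked-LM pretraining on them.
    # Triple format and template are assumptions for illustration.
    import torch
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    model.train()

    triples = [("Paris", "capital_of", "France"),
               ("Haskell", "paradigm", "functional programming")]
    sentences = [f"{s} {r.replace('_', ' ')} {o}." for s, r, o in triples]

    # randomly mask 15% of tokens, same objective as BERT pretraining
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    batch = collator([tokenizer(s) for s in sentences])

    optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
    loss = model(**batch).loss  # MLM loss over the masked tokens
    loss.backward()
    optim.step()

For a generative model the same idea applies, just with a causal-LM head and plain next-token loss on the verbalized facts instead of the masking collator.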