It was fascinating digging into this and finding their dataset weights defined in a declarative YAML file [2]. 70% is from FineWeb/Common Crawl, but filtered using a classifier trained on Llama-70B's 0-5 ratings of the educational content of the text [3]. We know small models like Phi-3 have been doing this for a while, but it's great to see a fully open reproduction of it that beats their benchmarks. It definitely supports the idea that you can get even better reasoning at smaller model sizes by carefully filtering and curating your training data (and by generating good synthetic data from, or distilling, bigger models).
You can see the 450k Llama educational value scores here: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-ll... Interestingly, I think the text scored 3 is really good, while the 5s tend to pick content that isn't very reasoning- or information-heavy but just mentions education or a worksheet. For SmolLM they just took the documents with scores >= 3, so it doesn't matter a ton.
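For what it's worth, the >= 3 cutoff amounts to a one-line filter. Here is a minimal sketch, assuming each document is a dict with `text` and `score` fields; those field names, the helper `keep_educational`, and the sample documents are all made up for illustration and are not taken from the SmolLM code:

```python
from typing import Dict, List


def keep_educational(docs: List[Dict], min_score: int = 3) -> List[Dict]:
    """Keep documents whose educational score is at least min_score.

    `score` is assumed to be the 0-5 educational rating; the field name
    is hypothetical, not necessarily what the real dataset uses.
    """
    return [doc for doc in docs if doc["score"] >= min_score]


# Invented sample documents, only to show the threshold in action.
sample = [
    {"text": "Worksheet: label the parts of a plant.", "score": 5},
    {"text": "A step-by-step derivation of the quadratic formula.", "score": 3},
    {"text": "Buy cheap sneakers online today!", "score": 0},
]

kept = keep_educational(sample)
for doc in kept:
    print(doc["score"], doc["text"])
```

Since everything at 3 and above is kept, the odd-looking 5s and the good 3s end up in the same bucket, which is why the quirk in the top scores doesn't matter much.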
[2] https://github.com/huggingface/smollm/blob/9efce803bc7e37727...
[3] https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier
On this page, SmolLM2-1.7B does a bit better than Qwen2.5-1.5B, which is ahead of Llama3.2-1B. At the next size up, other comparisons I've seen show that e.g. Phi-3.5 (which is ~3.8B params) does a bit better than Llama 3.2 3B. Gemma 2 has a 9B size, Llama 3.1 has an 8B size, and I think when that came out Mistral had a 7B model. So whenever a new "small" model does "better" than its peers, which are rarely the same size, we can't easily tell whether any of the many small choices the authors made were actually better, or whether the extra parameters did the work.