I think it’s important to come up with other ways of generating synthetic data that don’t come from distilling other models. Translating documents, OCRing old documents, and using digital twins to train visual models come to mind. I’ve never successfully trained a text model myself, but I think the quality of the original text should be critical to how the resulting model performs.
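To make the translation idea concrete, here’s a rough sketch of round-trip translation (English to German and back) as a way to paraphrase existing high-quality text into synthetic variants. The Hugging Face transformers pipeline and the Helsinki-NLP/opus-mt models here are just my assumptions for illustration, not a specific recipe:

```python
from transformers import pipeline

# Round-trip translation (EN -> DE -> EN) to paraphrase existing text into
# synthetic variants. Model choices are illustrative assumptions only.
en_to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def round_trip(text: str) -> str:
    """Translate English to German and back to get a paraphrased copy."""
    german = en_to_de(text)[0]["translation_text"]
    return de_to_en(german)[0]["translation_text"]

seed_docs = ["The quality of the original text matters more than sheer volume."]
synthetic_docs = [round_trip(doc) for doc in seed_docs]
print(synthetic_docs)
```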
You mentioned smaller models achieving better results than ChatGPT, but those models struggle to extend their knowledge across a wide variety of topics, which shows up in their subpar performance on knowledge benchmarks like GPQA (Graduate-Level Google-Proof Q&A).