I appreciate the honesty, but now there's no journey, and that's what I'm interested in. I can ask an LLM myself.
There's a lot of pre-processing, experimentation and validation that went into this project. The training data collection and sanitization alone is a big undertaking.
As for the blog post itself, from the article:
> Note: This blog post is 100% written by me. No AI has been used whatsoever.
Put another way: you can ask the LLM to do this project yourself? Please do, and share your prompt. I'd like to see it.
I'm pretty sure it's a real text in Welsh. There might be typos from OCR, but yeah, that's what the language really looks like. I don't speak it, but it's easy to recognize.
"It will be easy for the knowledgeable to fix the few errors that remain [in the text]." ("Bydd yn rwydd iawn i'r cyfarwydd ddiwygio'r ychydig.")
Which is exactly what the OP is doing.
And anyway, I think the most important thing is dataset quality. Dumping in whatever dataset you find on Huggingface is a recipe for mediocrity, so I'm also spending a lot of time on that.
Thanks for the writeup. A more granular followup would be cool too.
Do you mind expanding on this question? More granular in what way? What would you like to know that is missing from the post?