The Problem: I found that learning resources for modern data engineering are often fragmented, scattered across hundreds of Medium articles and disjointed tutorials. It's hard to piece everything together into a coherent system.
The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers fast-track their learning curve.
Key Features:
LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.
Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").
Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.
This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!
Check it out:
Online: https://datascale-ai.github.io/data_engineering_book/
GitHub: https://github.com/datascale-ai/data_engineering_book
I am a complete novice in training LLMs, and have been trying to train a novel architecture for Python code generation, using Apple Silicon.
I've been a bit frustrated, to be honest, that the data tools don't seem to have any focus on code; their modalities are generic text and images. For synthetic data generation I'd love to use EBNF-constrained outputs, but SGLang isn't available on macOS. So I feel a bit stuck: downloading a large corpus of Python code, running into APFS issues, sharding, custom classifying, custom cleaning, custom mixing, etc. Maybe I've missed a tool, but I'm surprised there aren't pre-tagged, pre-categorized, pre-filtered datasets for code where I can just tune the curriculum/filters to feed into training.
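For what it's worth, the "custom classifying / custom cleaning" step mostly ends up being per-file heuristics. A minimal sketch of the kind of filter I mean, using only the stdlib `ast` module (the thresholds here are made up for illustration, not tuned):

```python
import ast

def keep_python_file(source: str, min_lines: int = 5, max_lines: int = 2000) -> bool:
    """Crude curation filter: keep files that parse and look like real code."""
    lines = source.splitlines()
    if not (min_lines <= len(lines) <= max_lines):
        return False  # too short (likely a stub) or too long (likely generated)
    try:
        tree = ast.parse(source)  # drop files with syntax errors
    except SyntaxError:
        return False
    # require at least one function or class definition
    return any(
        isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        for n in ast.walk(tree)
    )
```

The frustrating part is that everyone training on code seems to rewrite some version of this, plus dedup, license tagging, and language ID, from scratch.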
We are actually three first-year Master's students. This project is indeed a summary of our learning from this past semester, which we rushed to wrap up right before the Chinese New Year break.
When I mentioned 'Project Lead,' I was referring to a senior PhD candidate in our lab. He acts as a mentor to review our code and ensure quality control, but the learning and implementation are very much ours. And yes, to move fast and polish the English, we did utilize LLMs during the writing process.
My English is not good, so I use GPT to help translate and polish my replies to be polite. Maybe it made them sound too robotic. I am reading every comment myself. Sorry for the wrong impression.
> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure
https://github.com/datascale-ai/data_engineering_book/blob/m...
Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...
Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...
I hope xx123122 won't mind my mentioning that they emailed us about this post, which originally got caught in a spam filter. I invited them to post a comment giving the background to the project but they probably haven't seen my reply yet. Hopefully soon, given that the post struck a chord!
Edit: they did, and I've moved that post to the toptext.
Whether it's GPT or not, it needs rewriting.
Lance[1] (the format, not just LanceDB) is a great example, where you have columnar storage optimized for both analytical operations and vector workloads together with built-in versioning for dataset iteration.
Plus (crucially) random access, which matters for things like sampling and efficient filtering during curation, but also for working with multimodal data, e.g. videos.
Lance is not alone: Vortex[2] is another one, Nimble[3] from Meta yet another, and I might be missing a few more.
[1] https://github.com/lance-format/lance [2] https://vortex.dev [3] https://github.com/facebookincubator/nimble
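To make the random-access point concrete: if the format keeps per-row offsets, fetching arbitrary rows is a seek, not a scan. A toy stdlib sketch with a length-prefixed record buffer (this is just the idea, nothing like Lance's actual on-disk layout):

```python
import io
import random
import struct

def write_records(buf: io.BytesIO, records: list) -> list:
    """Write length-prefixed records; return byte offsets (the 'index')."""
    offsets = []
    for rec in records:
        offsets.append(buf.tell())
        buf.write(struct.pack("<I", len(rec)))  # 4-byte little-endian length
        buf.write(rec)
    return offsets

def read_record(buf: io.BytesIO, offset: int) -> bytes:
    """O(1) random access: seek to the offset, read exactly one record."""
    buf.seek(offset)
    (length,) = struct.unpack("<I", buf.read(4))
    return buf.read(length)

# Random sampling during curation touches only the rows you draw,
# instead of decompressing and scanning the whole dataset.
buf = io.BytesIO()
offsets = write_records(buf, [f"row-{i}".encode() for i in range(1000)])
sample = [read_record(buf, offsets[i]) for i in random.sample(range(1000), 10)]
```

Row-group-only formats like vanilla Parquet make this pattern painful, which is exactly the gap these newer formats are aiming at.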
Oil[0] is fairly useless without being refined as well. Perhaps: "Data is the new oil, you need to refine it"?
We've found keyword search (BM25) often beats semantic search for specific entity names/IDs, while vectors win on concepts. Do you cover hybrid search patterns/re-ranking in the book? That seems to be where most production systems end up.
Thanks for understanding, and Happy New Year!
How is it possible that a Chinese publication gets to the top of HN?
We are pleasantly surprised by the warm reception. We know the project (and our English localization) is still a Work in Progress, but we are committed to improving it to meet the high standards of the HN community. We'll keep shipping updates!