The Problem: I found that learning resources for modern data engineering are often fragmented, scattered across hundreds of Medium articles and disjointed tutorials. It's hard to piece everything together into a coherent system.
The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers shorten their learning curve.
Key Features:
LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.
Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").
Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.
This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!
Check it out:
Online: https://datascale-ai.github.io/data_engineering_book/
GitHub: https://github.com/datascale-ai/data_engineering_book
> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure
https://github.com/datascale-ai/data_engineering_book/blob/m...
Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...
Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...
I hope xx123122 won't mind my mentioning that they emailed us about this post, which originally got caught in a spam filter. I invited them to post a comment giving the background to the project but they probably haven't seen my reply yet. Hopefully soon, given that the post struck a chord!
Edit: they did, and I've moved that post to the toptext.
How is it possible that a Chinese publication gets to the top of HN?
We are pleasantly surprised by the warm reception. We know the project (and our English localization) is still a Work in Progress, but we are committed to improving it to meet the high standards of the HN community. We'll keep shipping updates!