Python Data Science Handbook
108 points | 4 hours ago | 8 comments | jakevdp.github.io | HN
ellisv
2 hours ago
These types of books are always interesting to me because they tackle so many different things. They cover a range of topics at a high level (data manipulation, visualization, machine learning), each of which could have its own book. They balance teaching programming with introducing concepts (and sometimes theory).

In short, I think it's hard to strike an appropriate balance between these, but this seems to be a good intro-level book.
trio8453
40 minutes ago
This book was absolute fire for getting started with data science in 2017-2018. Jake is a great teacher.
__rito__
1 hour ago
This is one of the few books that I read cover-to-cover when I was starting out learning data science in 2020/21. Would recommend.
sschnei8
2 hours ago
Interesting choice of Pandas in this day and age. Maybe he’s after imparting general concepts that you could apply to any tabular data manipulator rather than selecting for the latest shiny tool.
dahcryn
1 hour ago
Why? It's the industry standard as far as my reach goes.

What other framework would you replace it with?

No, Polars or Spark is not a good answer; those are optimized for data engineering performance, not a holistic approach to data science.
crystal_revenge
44 minutes ago
You can assert whatever you want, but Polars is a great answer. The performance improvements are secondary to me compared to the dramatic improvement in interface.

Today all serious DS work will ultimately become data engineering work anyway. The time when DS can just fiddle around in notebooks all day has passed.
porker
1 hour ago
> No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.

Can you expand on why Polars isn't optimised for a holistic approach to data science?
fifilura
21 minutes ago
I have not worked with Polars, but I would imagine any incompatibility with existing libraries (e.g. plotting libraries like plotnine, bokeh) would quickly put me off.

It is a curse, I know. I would also choose a better interface. Performance is meh to me; I use SQL if I want to do something at scale that involves row/column data.
rbartelme
15 minutes ago
This is a non-issue thanks to the Polars DataFrame to_pandas() method. You get all the performance of Polars for cleaning large datasets, and to_pandas() gives you backwards compatibility with other libraries. That said, plotnine is completely compatible with Polars DataFrame objects.
maleldil
15 minutes ago
You can always convert from Polars to Pandas. Plotnine will do it automatically for you, even.
maxnoe
1 hour ago
The book is actually quite old; I'm not sure "this day and age" still applies to it.
msto
2 hours ago
It was originally published in 2016, and I think this is still the first edition.
xenophonf
2 hours ago
What's wrong with Pandas?
clickety_clack
1 hour ago
I probably wouldn’t rewrite an entire data science stack that used Pandas, but most people would use Polars if starting a new project today.
biofox
54 minutes ago
R and Matlab workflows have been fairly stable for the past decade. Why is the Python ecosystem so... unstable? It puts me off investing any time in it.
clickety_clack
20 minutes ago
The R ecosystem went through a similar evolution with the tidyverse; it just happened a little longer ago. As for Matlab, I initially learned statistical programming with it a long time ago, but I’m not sure I’ve ever seen it in the wild. I don’t know what’s going on there.

I’m actually quite partial to R myself, and I used to use it extensively back when quick analysis was more valuable to my career. Things have probably progressed, but I dropped it in favor of Python because Python can integrate into production systems, whereas R was (and maybe still is) geared towards writing reports. IMHO, one of the best things to happen recently in data science is the plotnine library, which brings the grammar of graphics to Python.

The fact is that today, if you want career opportunities as a data scientist, you need to be fluent in Python.
rbartelme
13 minutes ago
Outside Bioconductor or the tidyverse, R can be just as unstable due to CRAN's package requirements.
amelius
24 minutes ago
Pandas turns 10x developers with a lust for life into 0.1x developers with grey hairs.
wiz21c
2 hours ago
I wouldn't say it's a handbook because it's more like an introduction. But it's pretty well written.
synergy20
1 hour ago
It was written 8 years ago, though; there is a 2nd edition of the book by the same author.
phone_book
32 minutes ago
The linked GitHub repo seems to have the 2nd edition in the form of notebooks: https://github.com/jakevdp/PythonDataScienceHandbook/blob/ma... Under the Using Code Examples section, it says: "attribution usually includes the title, author, publisher, and ISBN. For example: 'Python Data Science Handbook, 2nd edition, by Jake VanderPlas (O’Reilly). Copyright 2023...'" compared to the OP's link, which has "The Python Data Science Handbook by Jake VanderPlas (O’Reilly). Copyright 2016..."
BenGosub
3 hours ago
He's a great writer and I miss his blog. He had an awesome post on pivot tables that I think is now a part of this book.
ayhanfuat
1 hour ago
He is also the creator of the Altair visualization library (Vega-Lite in Python https://altair-viz.github.io/). I really like using it.
AI-NoGuardrails
42 minutes ago
Very cool!