Python Data Science Handbook

108 points

by cl3misch

4 hours ago

| 8 comments

| jakevdp.github.io

▲

farhanhubble

1 hour ago

[-]

I loved his Statistics for Hackers talk: https://speakerdeck.com/pycon2016/jake-vanderplas-statistics...

▲

ellisv

2 hours ago

[-]

These types of books are always interesting to me because they tackle so many different things. They cover a range of topics at a high level (data manipulation, visualization, machine learning) and each could have its own book. They balance teaching programming while introducing concepts (and sometimes theory).

In short, I think it's hard to strike an appropriate balance between these, but this seems to be a good intro-level book.

▲

trio8453

40 minutes ago

[-]

This book was absolute fire for getting started with data science in 2017-2018. Jake is a great teacher.

▲

__rito__

1 hour ago

[-]

This is one of the few books that I read cover-to-cover when I was starting out learning Data Science in 2020/21. Will recommend.

▲

sschnei8

2 hours ago

[-]

Interesting choice of Pandas in this day and age. Maybe he’s after imparting general concepts that you could apply to any tabular data manipulator rather than selecting for the latest shiny tool.

▲

dahcryn

1 hour ago

[-]

Why? It's the industry standard as far as my reach goes.

What other framework would you replace it with?

No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.

▲

crystal_revenge

44 minutes ago

[-]

You can assert whatever you want, but Polars is a great answer. The performance improvements are secondary to me compared to the dramatic improvement in interface.

Today all serious DS work will ultimately become data engineering work anyway. The time when DS can just fiddle around in notebooks all day has passed.

▲

porker

1 hour ago

[-]

> No, polars or spark is not a good answer, those are optimized for data engineering performance, not a holistic approach to data science.

Can you expand on why Polars isn't optimised for a holistic approach to data science?

▲

fifilura

21 minutes ago

[-]

I have not worked with Polars, but I would imagine any incompatibility with existing libraries (e.g. plotting libraries like plotnine, bokeh) would quickly put me off.

It is a curse, I know. I would also choose a better interface. Performance is meh to me; I use SQL if I want to do something at scale that involves row/column data.

▲

rbartelme

15 minutes ago

[-]

This is a non-issue thanks to the Polars DataFrame to_pandas() method. You get all the performance of Polars for cleaning large datasets, and to_pandas() gives you backwards compatibility with other libraries. In any case, plotnine is completely compatible with Polars DataFrame objects.

▲

maleldil

15 minutes ago

[-]

You can always convert from Polars to Pandas. Plotnine will do it automatically for you, even.

▲

maxnoe

1 hour ago

[-]

The book is actually quite old, so I'm not sure "this day and age" still applies to it.

▲

msto

2 hours ago

[-]

It was originally published in 2016, and I think this is still the first edition.

▲

xenophonf

2 hours ago

[-]

What's wrong with Pandas?

▲

clickety_clack

1 hour ago

[-]

I probably wouldn’t rewrite an entire data science stack that used pandas, but most people would use Polars if starting a new project today.

▲

biofox

54 minutes ago

[-]

R and Matlab workflows have been fairly stable for the past decade. Why is the Python ecosystem so... unstable? It puts me off investing any time in it.

▲

clickety_clack

20 minutes ago

[-]

The R ecosystem went through a similar evolution with the tidyverse; it just happened longer ago. As for Matlab, I initially learned statistical programming with it a long time ago, but I’m not sure I’ve ever seen it in the wild. I don’t know what’s going on there.

I’m actually quite partial to R myself, and I used to use it extensively back when quick analysis was more valuable to my career. Things have probably progressed, but I dropped it in favor of Python because Python can integrate into production systems, whereas R was (and maybe still is) geared towards writing reports. One of the best things to happen recently in data science, imho, is the plotnine library, which brings the grammar of graphics to Python.

The fact is that today, if you want career opportunities as a data scientist, you need to be fluent in Python.

▲

rbartelme

13 minutes ago

[-]

Outside Bioconductor or the tidyverse, R can be just as unstable due to CRAN's package requirements.

▲

amelius

24 minutes ago

[-]

Pandas turns 10x developers with a lust for life into 0.1x developers with grey hairs.

▲

wiz21c

2 hours ago

[-]

I wouldn't say it's a handbook because it's more like an introduction. But it's pretty well written.

▲

synergy20

1 hour ago

[-]

It was written 8 years ago, though; there is a 2nd edition of the book by the same author.

▲

phone_book

32 minutes ago

[-]

The linked GitHub repo seems to have the 2nd edition in the form of notebooks: https://github.com/jakevdp/PythonDataScienceHandbook/blob/ma... Under the Using Code Examples section, it says "attribution usually includes the title, author, publisher, and ISBN. For example: 'Python Data Science Handbook, 2nd edition, by Jake VanderPlas (O’Reilly). Copyright 2023...'", whereas the OP's link has "The Python Data Science Handbook by Jake VanderPlas (O’Reilly). Copyright 2016..."

▲

BenGosub

3 hours ago

[-]

He's a great writer and I miss his blog. He had an awesome post on pivot tables that I think is now a part of this book.

▲

ayhanfuat

1 hour ago

[-]

He is also the creator of the Altair visualization library (Vega-Lite in Python https://altair-viz.github.io/). I really like using it.

▲

AI-NoGuardrails

42 minutes ago

[-]

very cool!