FilterHN

aorist

11 hours ago

[-]

> Examples include converting boxplots into violins or vice versa, turning a line plot into a heatmap, plotting a density estimate instead of a histogram, performing a computation on ranked data values instead of raw data values, and so on.

Most of this is not about Python, it’s about matplotlib. If you want the admittedly very thoughtful design of ggplot in Python, use plotnine

> I would consider the R code to be slightly easier to read (notice how many quotes and brackets the Python code needs)

This isn’t about Python, it’s about the tidyverse. The reason you can use this simpler syntax in R is because it’s non-standard-evaluation allows packages to extend the syntax in a way Python does not expose: http://adv-r.had.co.nz/Computing-on-the-language.html

jampekka

1 hour ago

[-]

I wonder what the last example of "logistics without libraries" would look like in R. Based on my experience of having to do "low-level" R, it's gonna be a true horror show.

In R it's often that things for which there's a ready made libraries and recipes are easy, but when those don't exist, things become extremely hard. And the usual approach is that if something is not easy with a library recipe, it just is not done.

robot-wrangler

9 hours ago

[-]

>> I would consider the R code to be slightly easier to read (notice how many quotes and brackets the Python code needs)

Oh god no, do people write R like that, pipes at the end? Elixir style pipe-operators at the beginning is the way.

And if you really wanted to "improve" readability by confusing arguments/functions/vars just to omit quotes, python can do that, you'll just need a wrapper object and getattr hacks to get from `my_magic_strings.foo` -> `'foo'`. As for the brackets.. ok that's a legitimate improvement, but again not language related, it's library API design for function sigs.

rtaylorgarlock

7 hours ago

[-]

Upvoted for pipes at the beginning

https://seaborn.pydata.org/tutorial/introduction.html

isolli

1 hour ago

[-]

Or seaborn. It was built exactly for this purpose: abstracting some of the annoying kinks of matplotlib while still offering a rich set of features.

getnormality

7 hours ago

[-]

It's not about Python, it's about how R lets you do something Python can't?

npalli

11 hours ago

[-]

Python is nothing without it’s batteries.

jskherman

11 hours ago

[-]

Python is its batteries.

10 hours ago

[-]

The design and success of e.g. Golang is pretty strong support for the idea that you can't and shouldn't separate a language from its broader ecosystem of tooling and packages.

LtWorf

10 hours ago

[-]

The success of python is due to not needing a broader ecosystem for A LOT of things.

They are of course now abandoning this idea.

lmm

7 hours ago

[-]

> The success of python is due to not needing a broader ecosystem for A LOT of things.

I honestly think that was a coincidence. Perl and Ruby had other disadvantages, Python won despite having bad package management and a bloated standard library, not because of it.

procaryote

2 hours ago

[-]

The bloated standard library is the only reason I kept using python in spite of the packaging nightmare. I can do most things with no dependencies, or with one dependency I need over and over like matplotlib

If python had been lean and needed packages to do anything useful, while still having a packaging nightmare, it would have been unusable

lmm

1 hour ago

[-]

Well, sure, but equally I think there would have been a lot more effort to fix the packaging nightmare if it had been more urgent.

LtWorf

29 minutes ago

[-]

The bloated standard library is the reason why you can send around a single .py file to others and they can execute it instantly.

Most of the python users are not able nor aware of venv, uv, pip and all of that.

rjzzleep

2 hours ago

[-]

It's because Ruby captured the web market and Python everything else, and I get everything is more timeless than a single segment.

1vuio0pswjnm7

10 hours ago

[-]

What language is used to write the batteries

logicprog

10 hours ago

[-]

C/C++, in large part

JPKab

5 hours ago

[-]

These days it's a whole lot of Rust.

volemo

3 hours ago

[-]

These days it’s still a whole lot of Fortran, with some Rust sprinkled on top. (:

pjmlp

49 minutes ago

[-]

Which since Fortran 2003, or even Fortran 95, has gotten rather nice to use.

saboot

10 hours ago

[-]

And below that, FORTRAN :)

throwaway2037

9 hours ago

[-]

I hear this so much from Python people -- almost like they are paid by the word to say it. Is it different from Perl, Ruby, Java, or C# (DotNet)? Not in my experience, except people from those communities don't repeat that phrase so much.

The irony here: We are talking about data science. 98% of "data science" Python projects start by creating a virtual env and adding Pandas and NumPy which have numerous (really: squillions of) dependencies outside the foundation library.

m55au

8 hours ago

[-]

Someone correct me if I'm completely wrong, but by default (i.e. precompiled wheels) numpy has 0 dependencies and pandas has 5, one of which is numpy. So not really "squillions" of dependencies.

pandas==2.3.3

├── numpy [required: >=1.22.4, installed: 2.2.6]

├── python-dateutil [required: >=2.8.2, installed: 2.9.0.post0]

│ └── six [required: >=1.5, installed: 1.17.0]

├── pytz [required: >=2020.1, installed: 2025.2]

└── tzdata [required: >=2022.7, installed: 2025.2]

noitpmeder

4 hours ago

[-]

I don't know about _squillions_, but numpy definitely has _requirements_, even if they're not represented as such in the python graph.

e.g.

  https://github.com/numpy/numpy/blob/main/.gitmodules (some source code requirements)
  https://github.com/numpy/numpy/tree/main/requirements (mostly build/ci/... requirements)
  ...

m55au

3 hours ago

[-]

They're not represented, because those are build-time dependencies. Most users when they do pip install numpy or equivalent, just get the precompiled binaries and none of those get installed. And even if you compile it yourself, you still don't need those for running numpy.

dm319

9 hours ago

[-]

> This isn’t about Python, it’s about the tidyverse.

> it’s non-standard-evaluation allows packages to extend the syntax in a way Python does not expose

Well this is a fundamental difference between Python and R.

evolighting

8 hours ago

[-]

R is more of a statistical software than a programming language. So, if you are a so-called "statistician," then R will feel familiar to you

UniverseHacker

8 hours ago

[-]

No, R is a serious general purpose programming language that is great for building almost any type of complex scientific software with. Projects like Bioconductor are a good example.

evolighting

5 hours ago

[-]

Perhaps a in a context of comparison with Python?

In my limited experience, Using R feels like to using JavaScript in the browser: it's a platform heavily focused on advanced, feature-rich objects (such as DataFrames and specialized plot objects). but you could also just build almost anything with it.

sarusso

6 minutes ago

[-]

The main flaw of this article is comparing a general-purpose language built with production systems in mind (Python) with a domain-specific language designed for interactive analysis (R)... Beware of comparing apples and oranges, because productizing R code typically requires rewriting it in another language.

HelloNurse

7 minutes ago

[-]

Guess what, doing a complex task (filtering and aggregating example penguins) with a specialized and ossified library (Pandas) is better than doing it with basic lists and dicts.

More terse, more efficient, less error prone, hopefully more numerically accurate, as if Python had an ecosystem of well designed libraries on par with R.

RobinL

15 hours ago

[-]

I think a lot of this comes down to the question: Why aren't tables first class citizens in programming languages?

If you step back, it's kind of weird that there's no mainstream programming language that has tables as first class citizens. Instead, we're stuck learning multiple APIs (polars, pandas) which are effectively programming languages for tables.

R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

The root cause seems to be that we still haven't figured out the best language to use to manipulate tabular data yet (i.e. the way of expressing this). It feels like there's been some convergence on some common ideas. Polars is kindof similar to dplyr. But no standard, except perhaps SQL.

FWIW, I agree that Python is not great, but I think it's also true R is not great. I don't agree with the specific comparisons in the piece.

poulpy123

13 minutes ago

[-]

> Why aren't tables first class citizens in programming languages?

Because they were created by before the need for it and maybe before their invention.

Manipulating numeric arrays and matrices in python is a bit clunky because it was not designed as a scientific computing language so they were added as library. It's much more integrated and natural to use in scientific computer languages such as matlab. However the reverse is also true: because matlab wasn't designed to do what python does, it's a bit clunkier to use outside scientific computing

OkayPhysicist

9 hours ago

[-]

There's a number of structures that I think are missing in our major programming languages. Tables are one. Matrices are another. Graphs, and relatedly, state machines are tools that are grossly underused because of bad language-level support. Finally, not a structure per se, but I think most languages that are batteries-included enough to included a regex engine should have a a full-fledged PEG parsing engines. Most, if not all, Regex horror stories derive from a simple "Regex is built in".

What tools are easily available in a language, by default, shape the pretty path, and by extension, the entire feel of the language. An example that we've largely come around on is key-value stores. Today, they're table stakes for a standard library. Go back to 90's, and the most popular languages at best treated them as second-class citizens, more like imported objects than something fundamental like arrays. Sure, you can implement a hash map in any language, or import some else's implementation, but oftentimes you'll instead end up with nightmarish, hopefully-synchronized arrays, because those are built-in, and ready at hand.

HelloNurse

1 minute ago

[-]

> I think most languages that are batteries-included enough to included a regex engine should have a a full-fledged PEG parsing engines

Then there would be more PEG horror stories. In addition, string and indices in regex processing are universal, while a parser is necessarily more framework-like, far more complex and doomed to be mismatched for many applications.

jltsiren

6 hours ago

[-]

When there is no clear canonical way of implementing something, adding it to a programming language (or a standard library) is risky. All too often, you realize too late that you made a wrong choice, and then you add a second version. And a third. And so on. And then you end up with a confusing language full of newbie traps.

Graphs are a good example, as they are a large family of related structures. For example, are the edges undirected, directed, or something more exotic? Do the nodes/edges have identifiers and/or labels? Are all nodes/edges of the same type, or are there multiple types? Can you have duplicate edges between the same nodes? Does that depend on the types of the nodes/edges, or on the labels?

throwaway2037

9 hours ago

[-]

    > There's a number of structures that I think are missing in our major programming languages. Tables are one. Matrices are another.

I disagree. Most programmers will go their entire career and never need a matrix data structure. Sure, they will use libraries that use matrices, but never use them directly themselves. It seems fine that matrices are not a separate data type in most modern programming languages.

OkayPhysicist

9 hours ago

[-]

Unless you think "most programmers" === "shitty webapp developers", I strongly disagree. Matrices are first class, important components in statistics, data analysis, graphics, video games, scientific computing, simulation, artificial intelligence and so, so much more.

And all of those programmers are either using specialized languages, (suffering problems when they want to turn their program into a shitty web app, for example), or committing crimes against syntax like

rotation_matrix.matmul(vectorized_cat)

lock1

7 hours ago

[-]

That's needlessly aggressive. Ignoring webapps, you could do gamedev without even knowing what a matrix is.

You don't even need such construction in most native applications, embedded systems, and OS kernel development.

throwaway2037

4 hours ago

[-]

This is my exactly point. Even in a highly specialised library for pricing securities, the amount of code that uses matrices is surprisingly small.

theamk

4 hours ago

[-]

I am working in embedded. Had to optimize weights for an embedded algorithm, decided to use linear regression and thus needed matrices.

And if you do robotics, the chances of encountering a matrix are very high.

habinero

3 hours ago

[-]

I don't see why the majority of engineers need to cater to your niche use cases. It's a programming language, you can just make the library if it doesn't exist. Nobody's stopping you.

Plus, plenty of third party projects have been incorporated into the Python standard library.

[0] https://en.wikipedia.org/wiki/Q_(programming_language_from_K...

RodgerTheGreat

14 hours ago

[-]

There are a number of dynamic languages to choose from where tables/dataframes are truly first-class datatypes: perhaps most notably Q[0]. There are also emerging languages like Rye[1] or my own Lil[2].

I suspect that in the fullness of time, mainstream languages will eventually fully incorporate tabular programming in much the same way they have slowly absorbed a variety of idioms traditionally seen as part of functional programming, like map/filter/reduce on collections.

[1] https://ryelang.org/blog/posts/comparing_tables_to_python/

[2] http://beyondloom.com/tools/trylil.html

https://www.nushell.sh/book/working_with_tables.html

liveranga

6 hours ago

[-]

Nushell is another one with tables built-in:

middayc

11 hours ago

[-]

Another page about Rye tables: https://ryelang.org/cookbook/working-with/tables/

genidoi

2 hours ago

[-]

This is an interesting observation. One possible explanation for a lack of robust first class table manipulation support in mainstream languages could be due to the large variance in real-world table sizes and the mutually exclusive subproblems that come with each respective jump in order-of-magnitude row size.

The problems that one might encounter in dealing with a 1m row table are quite different to a 1b row table, and a 1b row table is a rounding error compared to the problems that a 1t row table presents. A standard library needs to support these massive variations at least somewhat gracefully and that's not a trivial API surface to design.

riskassessment

10 hours ago

[-]

> R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

Everyone in R uses data.frame because tibble (and data.table) inherits from data.frame. This means that "first class" (base R) functions work directly on tibble/data.table. It also makes it trivial to convert between tibble, data.table, and data.frames.

don-bright

3 hours ago

[-]

Every copy of Microsoft Excel includes Power Query which is in the M language and has tables as a type. Programs are essentially transformations of table columns and rows. Not sure if its mainstream but is widely available. M language is also included in other tools like PowerBI and Power Automate.

paddleon

14 hours ago

[-]

> R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

You're forgetting R's data.table, https://cran.r-project.org/web/packages/data.table/vignettes...,

which is amazing. Tibbles only wins because they fought the docs/onboarding battle better, and dplyr ended up getting industry buy-in.

elehack

11 hours ago

[-]

And readability. data.table is very capable, but the incantations to use it are far less obvious (both for reading and writing) than dplyr.

But you can have the best of both worlds with https://dtplyr.tidyverse.org/, using data.table's performance improvements with dplyr syntax.

extr

11 hours ago

[-]

Yeah data.table is just about the best-in-class tool/package for true high-throughput "live" data analysis. Dplyr is great if you are learning the ropes, or want to write something that your colleagues with less experience can easily spot check. But in my experience if you chat with people working in the trenches of banks, lenders, insurance companies, who are running hundreds of hand-spun crosstabs/correlational analyses daily, you will find a lot of data.table users.

Relevant to the author's point, Python is pretty poor for this kind of thing. Pandas is a perf mess. Polars, duckdb, dask etc, are fine perhaps for production data pipelines but quite verbose and persnickety for rapid iteration. If you put a gun to my head and told me to find some nuggets of insight in some massive flat files, I would ask for an RStudio cloud instance + data.table hosted on a VM with 256GB+ of RAM.

maest

6 hours ago

[-]

> Why aren't tables first class citizens in programming languages?

They are in q/kdb and it's glorious. Sql expressions are also first class citizens and it makes it very pleasant to write code

riidom

9 hours ago

[-]

PyTorch was first only Torch, and in Lua. I didn't follow it too close at the time, but apparently due to popular demand it got redone in Python and voila PyTorch.

brikym

1 hour ago

[-]

This. I really really want some kind of data frame which has actual compile time typing my LSP/IDE can understand. Kusto query language (Azure Data Explorer) has it and the auto completion and error checking is extremely useful. But kusto query language is really just limited to one cloud product.

nextos

14 hours ago

[-]

I don't think this is the real problem. In R and Julia tables are great, and they are libraries. The key is that these languages are very expressive and malleable.

Simplifying a lot, R is heavily inspired by Scheme, with some lazy evaluation added on top. Julia is another take at the design space first explored by Dylan.

Iwan-Zotow

6 hours ago

[-]

R was clone of S

kelipso

15 hours ago

[-]

People use data.table in R too (my favorite among those but it’s been a few years). data.table compared to dplyr is quite a contrast in terms of language to manipulate tabular data.

jna_sh

15 hours ago

[-]

I know the primary data structure in Lua is called a table, but I’m not very familiar with them and if they map to what’s expected from tables in data science.

https://www.lua.org/pil/2.5.html

Jtsummers

14 hours ago

[-]

Lua's tables are associative arrays, at least fundamentally. There's more to it than that, but it's not the same as the tables/data frames people are using with pandas and similar systems. You could build that kind of framework on top of Lua's tables, though.

TheSoftwareGuy

14 hours ago

[-]

IIRC those are basically hash tables, which are first-class citizens in many languages already

IgorPartola

10 hours ago

[-]

SQL is not just about a table but multiple tables and their relationships. If it was just about running queries against a single table then basic ordering, filtering, aggregation, and annotation would be easy to achieve in almost any language.

Soon as you start doing things like joins, it gets complicated but in theory you could do something like an API of an ORM to do most things. With using just operators you quickly run into the fact that you have to overload (abuse) operators or write a new language with different operator semantics:

  orders * customers | (customers.id == orders.customer_id | orders.amount > Decimal(‘10.00’)

Where * means cross product/outer join and | means filter. Once you add an ordering operator, a group by, etc. you basically get SQL with extra steps.

But it would be nice to have it built in so talking to a database would be a bit more native.

sgarland

7 hours ago

[-]

Every time I see stuff like this (Google’s new SQL-ish language with pipes comes to mind), I am baffled. SQL to me is eminently readable, and flows beautifully.

For reference, I think the same is true of Python, so it’s not like I’m a Perl wizard or something.

RA_Fisher

9 hours ago

[-]

R’s the best, bc it’s been a statistical analysis language from the beginning in 1974 (and was built and developed for the purpose of analysis / modeling). Also, the tidyverse is marvelous. It provides major productivity in organizing and augmenting the data. Then there’s ggplot, the undisputed best graphical visualization system + built-ins like barplot(), or plot().

But ultimately data analysis is going beyond Python and R into the realm of Stan and PyMC3, probabilistic programming languages. It’s because we want to do nested integrals and those software ecosystems provide the best way to do it (among other probabilistic programming languages). They allow us to understand complex situations and make good / valuable decisions.

kevinhanson

14 hours ago

[-]

this is my biggest complaint about SAS--everything is either a table or text.

most procs use tables as both input and output, and you better hope the tables have the correct columns.

you want a loop? you either get an implicit loop over rows in a table, write something using syscalls on each row in a table, or you're writing macros (all text).

m_mueller

2 hours ago

[-]

Fortran gives you that and more, it has first class multidimensional arrays, including matrix operations.

127

10 hours ago

[-]

Because there's no obvious universal optimal data structure for heterogeneous N-dimensional data with varying distributions? You can definitely do that, but it requires an order of magnitude more resource use as baseline.

alexnewman

11 hours ago

[-]

APL Is great

smartmic

1 hour ago

[-]

Agreed. I once used it for data preparation for a data science project (GNU APL). After a steep learning curve, it felt very much like writing math formulas — it was fun and concise, and I liked it very much. However, it has zero adoption in today's data science landscape. Sharing your work is basically impossible. If you're doing something just for yourself, though, I would probably give it a chance again.

7thaccount

10 hours ago

[-]

Perfect solution for doing analysis on tables. Wes McKinney (inventor of pandas is rumored to have been inspired by it too).

My problem with APL is 1.) the syntax is less amazing at other more mundane stuff, and 2.) the only production worthy versions are all commercial. I'm not creating something that requires me to pay for a development license as well as distribution royalties.

CivBase

15 hours ago

[-]

What is a table other than an array of structs?

thom

14 hours ago

[-]

It’s not that you can’t model data that way (or indeed with structs of arrays), it’s just that the user experience starts to suck. You might want a dataset bigger than RAM, or that you can transparently back by the filesystem, RAM or VRAM. You might want to efficiently index and query the data. You might want to dynamically join and project the data with other arrays of structs. You might want to know when you’re multiplying data of the wrong shapes together. You might want really excellent reflection support. All of this is obviously possible in current languages because that’s where it happens, but it could definitely be easier and feel more of a first class citizen.

FridgeSeal

9 hours ago

[-]

Well it could be a struct of arrays.

Nitpicking aside, a nice library for doing “table stuff” without “the whole ass big table framework” would be nice.

It’s not hard to roll this stuff by hand, but again, a nicer way wouldn’t be bad.

RobinL

14 hours ago

[-]

I would argue that's about how the data is stored. What I'm trying to express is the idea of the programming language itself supporting high level tabular abstractions/transformations such as grouping, aggregation, joins and so on.

pjc50

2 minutes ago

[-]

Yeah, that's LINQ+EF. People have hated ORMs for so long (with some justification) that perhaps they've forgotten what the use case is.

(and yes there's special language support for LINQ so it counts as "part of the language" rather than "a library")

p1necone

11 hours ago

[-]

Implementing all of those things is an order of magnitude more complex than any other first class primitive datatype in most languages, and there's no obvious "one right way" to do it that would fit everyones use cases - seems like libraries and standalone databases are the way to do it, and that's what we do now.

camdenreslink

14 hours ago

[-]

Sounds a lot like LINQ in .NET (which is usually compatible with ORMs actually querying tables).

CivBase

14 hours ago

[-]

Ah, that makes more sense. Thanks for the clarification.

ModernMech

10 hours ago

[-]

The difference is semantics.

What is a paragraph but an array of sentences? What is a sentence but an array of words? What's a word but an array of letters? You can do this all the way down. Eventually you need to assign meaning to things, and when you do, it helps to know what the thing actually is, specifically, because an array of structs can be many things that aren't a table.

ModernMech

10 hours ago

[-]

It makes sense from a historical perspective. Tables are a thing in many languages, just not the ones that mainstream devs use. In fact, if you rank programming languages by usage outside of devs, the top languages all have a table-ish metaphor (SQL, Excel, R, Matlab).

The languages devs use are largely Algol derived. Algol is a language that was used to express algorithms, which were largely abstractions over Turing machines, which are based around an infinite 1D tape of memory. This model of 1D memory was built into early computers, and early operating systems and early languages. We call it "mechanical sympathy".

Meanwhile, other languages at the same time were invented that weren't tied so closely to the machine, but were more for the purpose of doing science and math. They didn't care as much about this 1D view of the world. Early languages like Fortran and Matlab had notions of 2D data matrices because math and science had notions of 2D data matrices. Languages like C were happy to support these things by using an array of pointers because that mapped nicely to their data model.

The same thing can be said for 1-based and 0-based indexing -- languages like Matlab, R, and Excel are 1-based because that's how people index tables; whereas languages like C and Java are 0-based because that's how people index memory.

constantcrying

12 hours ago

[-]

>Why aren't tables first class citizens in programming languages?

Matlab has them, in fact it has multiple competing concepts of it.

dm319

9 hours ago

[-]

Dplyr is quite happy with data.frame. R is built around tabular data. Other statistical languages are too, such as Stata.

getnormality

7 hours ago

[-]

Saying that SQL is the standard for manipulating tabular data is like saying that COBOL is the standard for financial transactions. It may be true based on current usage, but nobody thinks it's a good idea long term. They're both based on the outdated idea that a programming language should look like pidgin English rather than math.

Iwan-Zotow

6 hours ago

[-]

In R data.table is basically SQL in another shape

willvarfar

2 hours ago

[-]

My experience was that data science was doable but clunky and ugly with pandas. It got slightly better with polars. Only really slightly better. Then, for me at least, it jumped lightyears ahead with duckdb.

These days I run some big query on an OLAP database and download the results to parquet stored on the local disk of a cloud notebook VM and then mine it to bits with duckdb reading straight from these parquet files.

The notebooks end up with very clear SQL queries and results (most notebook servers support SQL cells with highlighting and completion etc), and small pockets of python cells for doing those corner case things that an imperative language makes easier.

So when I get to the bottom of the article where it shows the difference between Python and R, I'm screaming "wouldn't that look better in SQL?!" :)

mettamage

1 hour ago

[-]

Huh, as a frequent polars user, I'll try duckdb.

goatlover

1 hour ago

[-]

So you're saying you prefer SQL to dataframes. I prefer dataframes and staying in the native language.

willvarfar

52 minutes ago

[-]

Duckdb can see and manipulate dataframes too. Duckdb has it's own storage, but other table storage - e.g. the parquet files I mentioned or even csv files or even dataframes from pandas and polars - are first-class citizens. Duckdb lets you query them quickly and expressively.

jakobnissen

15 hours ago

[-]

Excellent article - except that the author probably should have gated their substantiation of the claim behind a cliffhanger, as other commenters have mentioned.

The author's priorities are sensible, and indeed with that set of priorities, it makes sense to end up near R. However, they're not universal among data scientists. I've been a data scientist for eight years, and have found that this kind of plotting and dataframe wrangling is only part of the work. I find there is usually also some file juggling, parsing, and what the author calls "logistics". And R is terrible at logistics. It's also bad at writing maintainable software.

If you care more about logistics and maintenance, your conclusion is pushed towards Python - which still does okay in the dataframes department. If you're ALSO frequently concerned about speed, you're pushed towards Julia.

None of these are wrong priorities. I wish Julia was better at being R, but it isn't, and it's very hard to be both R and useful for general programming.

Edit: Oh, and I should mention: I also teach and supervise students, and I KEEP seeing students use pandas to solve non-table problems, like trying to represent a graph as a dataframe. Apparently some people are heavily drawn to use dataframes for everything - if you're one of those people, reevaluate your tools, but also, R is probably for you.

ActorNightly

12 hours ago

[-]

>Excellent article

Except its not. Data science in python pretty much requires you to use numpy. So his example of mean/variance code is a dumb comparison. Numpy has mean and variance functions built in for arrays.

Even when using raw python in his example, some syntax can be condesed quite a bit:

groups = defaultdict(list) [groups[(row['species'], row['island'])].append(row['body_mass_g']) for row in filtered]

It takes the same amount of mental effort to learn python/numpy as it does with R. The difference is, the former allows you to integrate your code into any other applicaiton.

dragonwriter

9 hours ago

[-]

> Numpy has mean and variance functions built in for arrays.

Even outside of Numpy, the stdlib has the statistics packages which provides mean, variance, population/sample standard deviation, and other statistics functions for normal iterables. The attempt to make Python out-of-the-box code look bad was either deliberately constructed to exaggerate the problems complained of, or was the product of a very convenient ignorance of the applicable parts of Python and its stdlib.

ModernMech

10 hours ago

[-]

I dunno. Numpy has its own data types, its own collections, its own semantics which are all different enough from Python, I think it's fair to consider it a DSL on its own. It'd be one thing if it was just, operator overloading to provide broadcasting for python, but Numpy's whole existence is to patch the various shortcomings Python has in DS.

a_bonobo

3 hours ago

[-]

>I find there is usually also some file juggling, parsing, [...]

I'd say I'm 50/50 Python/R for exactly this reason: I write Python code on HPC or a server to parse many, many files, then I get some kind of MB-scale summary data I analyse locally in R.

R is not good at looping over hundreds of files in the gigabytes, Python is not good at making pretty insights from the summary. A tool for every task.

progval

11 hours ago

[-]

The pure Python code in the last example is more verbose than it needs to be.

    groups = {}
    for row in filtered:
        key = (row['species'], row['island'])
        if key not in groups:
            groups[key] = []
        groups[key].append(row['body_mass_g'])

can be rewritten as:

    groups = collections.defaultdict(list)
    for row in filtered:
        groups[(row['species'], row['island'])].append(row['body_mass_g'])

and

    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    std_dev = math.sqrt(variance)

as:

    std_dev = statistics.stddev(values)

roadside_picnic

8 hours ago

[-]

> (n - 1)

It's also funny that one would write their own standard deviation function and include Bessel's correction. Usually if I'm manually re-implementing a standard deviation function it's because I'm afraid the implementors blindly applied the correction without considering whether or not it's actually meaningful for the given analysis. At the very least, the correct name for what's implemented there should really be `sample_std_dev`.

m55au

8 hours ago

[-]

It is sadly really inconsistent. The stdlib statistics has two separate functions, stdev for sample and pstdev for population. Numpy and pandas both have .std() with ddof (delta degrees of freedom) as a parameter, but numpy defaults to 0 (population) and pandas to 1 (sample).

ashdev

10 hours ago

[-]

Disagree.

In the first instance, the original code is readable and tells me exactly what's what. In your example, you're sacrificing readability for being clever.

Clear code(even if verbose) is better than being clever.

billyoyo

10 hours ago

[-]

Using a very common utility in the standard library is to avoid reinventing the wheel is not "clean code"?

defaultdict is ubiquitous in modern python, and is far from a complicated concept to grasp.

ux266478

8 hours ago

[-]

I don't think that's the right metaphor to use here, it exists at a different level than what I would consider "reinventing the wheel". That to me is more some attempt to make a novel outward-facing facet of the program when there's not much reason to do so. For example, reimplementing shared memory using a custom kernel driver as your IPC mechanism, despite it not doing anything that shared memory doesn't already do.

The difference between the examples is so trivial I'm not really sure why the parent comment felt compelled to complain.

MarsIronPI

10 hours ago

[-]

I think code clarity is subjective. I find the second easier to read because I have to look at less code. When I read code, I instinctively take it apart and see how it fits together, so I have no problem with the second approach. Whereas the first approach is twice as long so it takes me roughly twice as long to read.

explodes

10 hours ago

[-]

The 2nd version is the most idiomatic.

ashdev

7 hours ago

[-]

Interesting! Thanks for the responses. I'm not python native and haven't worked as extensively with python as some of you here.

That said, I'll change my mind here and agree on using std library, but I'd still have separate 'key' assignment here for more clarity.

9 hours ago

[-]

I would keep the explicit key= assignment since it's more than just a single literal but otherwise the second version is more idiomatic and readable.

neuropacabra

1 hour ago

[-]

I expected the author will complain rightfully about the tooling, including linters, formatters and package managers. Things improved drastically over the years with Astral’s ruff, uv and alpha stage ty.

But the article says that very exotic syntax is more readable. I think this is mostly about the libraries, where honestly I equally don’t like matplotlib and R’s ggplot. But I would not think it’s language problem.

I was hoping to find some performance benchmarks or something more than feelings about certain block of code. Don’t get me wrong I am also not a die hard fan of Python although I have written a lot or production code in it. Mentioning bloated, boilerplate code…I am afraid author should look on Java or any modern JavaScript project.

sheepscreek

2 hours ago

[-]

I really didn’t understand the author’s grievances. The only concrete example they illustrated was one where they concluded that Python without Pandas is verbose and ugly to achieve the same outcome, hence Python is not great for Data Science.

That’s a bad argument or a naive and obvious one; depending on how you look at it.

Python wasn’t designed for Data Science. It is not a DSL for it. MATLAB was arguably designed for scientific computing, and yet it’s the most disliked language in the StackOverflow liked/disliked index.

Here’s a different way to look at it. A good programming language is like the weather in a city. I would love to live somewhere where it’s 72F/23C all year round. But if it’s in the middle of nowhere and I’ve got no friends to hang out with, would I? I don’t think so.

FWIW, Python is like Sweden or Finland, with shitty weather for 6 months of the year yet thriving against all odds.

PS: I think the article’s topic is a bit click-batey (not a particularly useful discussion) because it’s polarizing and no one will be 100% right about it. It’s perhaps best thought of as an opinion piece.

rossdavidh

10 hours ago

[-]

Speaking as a python programmer who has occasionally done work in R: yes, of course. Python is not a great language for anything; it's a pretty good language for just about anything. That is, and always has been, its strength.

If you're doing data science all day, you should learn R, even if it's so weird at first (for somebody coming from a C-style language) that it seems way harder; R is made for the way statisticians work and think, not the way computer programmers work and think. If you're doing data science all day, you should start thinking and working like a statistician and working in R, and the fact that it seems to bend your mind is probably at least in part good, because a statistician needs to think differently than a programmer.

I work in python, though, almost all of the time.

psunavy03

10 hours ago

[-]

I'm not sure what that last example is meant to be other than an anti-Python caricature. If you're implementing calculating things like standard deviations by hand, that's not real-world coding, that's the undergraduate harassment package which should end with a STEM bachelor's.

Of course there's a bunch of loops and things; you're exposing what has to happen in both R and Python under the hood of all those packages.

roadside_picnic

8 hours ago

[-]

> that's not real-world coding

It's pretty clear the post is focused on the context of work being done in an academic research lab. In that context I think most of the points are pretty valid, but most of the real world benefit I've experience from using Python is being able to work more closely with engineering (even on non-Python teams).

I shipped R code to a production environment once over my career and it felt incredibly fragile.

R is great for EDA, but really doesn't work well for iteratively building larger software projects. R is has a great package system, but it's not so great when you need abstraction in between.

SubiculumCode

8 hours ago

[-]

Yeah, to me, R has never really been a.language I'd choose to program with...it's a statistical powerhouse to analyze datasets with great packages / SOTA statistical methods, etc, not a roduction tool.

Surac

2 hours ago

[-]

I at the moment try to learn python as a hobby language. I use c c++ and c# to earn my money. MY biggest problem is finding good examples that are up to date. I spent a whole day learning that there a four (I think) ways to do formatting strings. This „bloat“ in syntax makes even a simple print very heavy to digest. I don’t even bother using v2 python only v3. Also using whitespaces to block things together sounds appealing but in reality you need to use editors that can indent and unindent whole blocks or I never get it right

Stratoscope

2 hours ago

[-]

You seem to be making things more difficult for yourself than they need to be.

For the strings, just use f-strings and forget all the others. You can even do things like this for debugging:

  >>> class User:
  ...     pass
  ... user = User()
  ... user.name = "Surac"
  ...
  >>> print(f"{user.name=}")
  user.name='Surac'
  >>>

For the block indenting, what editor are you using? Pretty much every modern editor lets you select a block and indent/unindent with Tab/Shift+Tab.

VS Code and PyCharm are both free and are great for Python coding. They each have a full debugger, which is invaluable when you are learning a language.

lenkite

2 hours ago

[-]

15 years ago, Python programmers used to mock Perl by quoting the Zen of Python: "There should be one - and preferably only one - obvious way to do it.". This was in stark contrast to Perl's TIMTOWTDI motto: "There Is More Than One Way To Do It."

The Zen of Python is sadly now an absolute lie.

IshKebab

2 hours ago

[-]

> but in reality you need to use editors that can indent and unindent whole blocks or I never get it right

What editor are you using that can't do that? Notepad?

[1] https://link.springer.com/article/10.1007/s11336-017-9581-x

jakubmazanec

53 minutes ago

[-]

I wish people used Julia more. Few years ago I reimplemented some MATLAB code for a novel algorithm [1] I wanted to use in my dissertation about psychometrics and Julia was great language to work with - and also the code ran for 20 minutes instead of 60.

maratc

45 minutes ago

[-]

How important was this saving of 40 minutes for the whole timeline of the project of writing your dissertation about psychometrics?

whyenot

15 hours ago

[-]

What makes Python a great language for data science, is that so many people are familiar with it, and that it is an easy language to read. If you use a more obscure language like Clojure, Common Lisp, Julia, etc., many people will not be familiar with the language and unable to read or review your code. Peer review is fundamental to the scientific endeavor. If you only optimize on what is the best language for the task, there are clearly better languages than Python. If you optimize on what is best for science then I think it is hard not to argue that Python (and R) are the best choices. In science, just getting things done is not enough. Other people need to be able to read and understand what you are doing.

BTW AI is not helping and in fact is leading to a generation of scientists who know how to write prompts, but do not understand the code those prompts generate or have the ability to peer review it.

13 hours ago

[-]

I can't speak for Julia - never used it; never used Common Lisp for analyzing data (I don't think it's very "data-oriented" for the modern age and the shape of data), but Clojure is really not "obscure" - it only looks weird for the first fifteen minutes or so; once you start using it - it is one of the most straightforward and reasonable languages out there - it is in fact simpler than Python and Javascript. Immutable-by-default makes it far much easier to reason about the code. And OMG, it is so much more data-oriented - it's crazy that more people don't use it. Most never even heard about it.

7thaccount

9 hours ago

[-]

I tried to get into Clojure, but a lot of the JVM hosted languages require some Java experience. Same thing with Scala and Kotlin or F# on .NET.

The early tooling was also pretty dependent on Vim or Emacs. Maybe it's all easier now with VSCode or something like that.

geokon

4 hours ago

[-]

It doesn't require any Java but the docs do at times sort of assume you understand the JVM to some extent - which was a bit frustrating when first learning the language. It'll use terms like "classpath" without explaining what that is. However nowadays with LLMs these are insignificant speedbumps.

If you want to use Java you also don't really need to know Java beyond "you create instances of classes and call methods on them". I really don't want to learn a dinosaur like Java, but having access to the universe of Java libs has saved me many times. It's super fun and nice to use and poke around mature Java libs interactively with a REPL :)

All that said I'd have no idea how to write even a helloworld in Java

PS: Agreed on Emacs. I love Emacs.. but it's for turbo nerds. Having to learn Emacs and Clojure in parallel was a crazy barrier. (and no, Emacs is not as easy people make it out to be)

8 hours ago

[-]

None of this even remotely true. I've gotten into Clojure without knowing jackshit about Java, almost ten years later, after tons of things successfully built and deployed, still don't know jackshit about Java. Mia, co-host of 'Clojure apropos' podcast was my colleague, we've worked together on multiple teams, she learned Clojure as her very first PL. Later she tried learning some Java and she was shocked how impossibly weird it looked compared to Clojure. Besides, you can use Clojure without any JVM - e.g., with nbb. I use it for things like browser automation with Playwright.

The tooling story is also very solid - I use Emacs, but many of my friends and colleagues use IntelliJ, Vim, Sublime and VSCode, and some of them migrated to it from Atom.

7thaccount

6 hours ago

[-]

It might not be a problem for you, but it has been for many. I did start by reading through 3 Clojure books. The repl and the basic stuff like using lists is all easy of course, but the tooling was pretty poor compared to what I was used to (I like lisp, but Emacs is a commitment). Also, a lot of tutorials at the time definitely assumed java familiarity, especially with debugging java stack traces.

MarsIronPI

10 hours ago

[-]

Common Lisp fan here, but not a data scientist. Why do you say to avoid CL for data analysis? Not trying to flame or anything, just curious about your experience with it.

8 hours ago

[-]

I don't have great experience of using CL for analyzing data, because of "why?", if I already have another Lisp that is simply amazing for data.

Clojure, unlike lists in traditional Lisps, based on composable, unified abstraction for its collections, they are lazy by default and literal readable data structures, they are far easier to introspect and not so "opaque" compared to anything - not just CL (even Python), they are superb for dealing with heterogeneous data. Clojure's cohesive data manipulation story is where Common Lisp's lists-and-symbols just can't match.

dreamcompiler

5 hours ago

[-]

Homework assignments notwithstanding, very few serious Common Lisp programs use lists and symbols as their primary data structures. This has been true since around 1985.

Common Lisp has O[1] vectors, multidimensional arrays, hash-tables (what Clojure calls maps), structs, and objects. It has set operations too but it doesn't enforce membership uniqueness. It also has bignums, several sizes of floats, infinite-precision rationals, and complex numbers. Not to mention characters, strings, and logical operations on individual bits. The main difference from Clojure is that CL data structures are not immutable. But that's an orthogonal issue to the suggestion that CL doesn't contain a rich library of modern data structures.

Common Lisp has never been limited to "List Processing."

flexagoon

2 hours ago

[-]

That's ok, I don't think anyone knows how to properly write Julia. After using it for a while and following the community (watching talks, checking the forum etc), I don't think it has a concept of code quality. You just throw random code at the wall until it starts working. Which makes sense, considering most of the users are scientists.

hekkle

10 hours ago

[-]

> What makes Python a great language for data science, is that so many people are familiar with it

While I agree with you in principal this also leads to what I call the "VB Effect". Back in the day VB was taught at every school as part of the standard curriculum. This made every kid a 'computer wizz'. I have had to fix many a legacy codebase that was started by someone's nephew the whizz kid.

aethor

7 hours ago

[-]

Peer review is fundamental to scientific endeavor but... in ML fields, reviewers almost never check the code and Python package management is hardly reproducible. So clearly we are not there, Python or not.

forgotpwd16

15 hours ago

[-]

Article is well written but fails to address its own thesis by postponing it to a sequel article. At its current state only alludes that Python is not great because requires specialized packages. (And counterexample is R for which also used a package.)

stevenpetryk

15 hours ago

[-]

Totally agree. The author's most significant example is two code snippets that are quite similar and both pretty nice.

pacbard

14 hours ago

[-]

When you think about a data science pipeline, you really have three separate steps:

[Data Preparation] --> [Data Analysis] --> [Result Preparation]

Neither Python or R does a good job at all of these.

The original article seems to focus on challenges in using Python for data preparation/processing, mostly pointing out challenges with Pandas and "raw" Python code for data processing.

This could be solved by switching to something like duckdb and SQL to process data.

As far as data analysis, both Python and R have their own niches, depending on field. Similarly, there are other specialized languages (e.g., SAS, Matlab) that are still used for domain-specific applications.

I personally find result preparation somewhat difficult in both Python and R. Stargazer is ok for exporting regression tables but it's not really that great. Graphing is probably better in R within the ggplot universe (I'm aware of the python port).

mfld

31 minutes ago

[-]

This really calls for an A/B speed programming test of Python vs. R practitioners.

roadside_picnic

8 hours ago

[-]

So I've been writing Python for around 20 years now, and doing data science/ML work for around 15. Despite being a Python programmer first I spent a good 5 years using R exclusively. There's a lot of things I genuinely love about R and I strongly believe that R is unfairly maligned by devs... but there's a good reason I have written exclusively Python for DS work for the last 5 years.

> Python is pretty good for deep learning. There’s a reason PyTorch is the industry standard. When I’m talking about data science here, I’m specifically excluding deep learning.

I've written very little deep learning code over my career, but made very frequent use of the GPU and differentiable programming for non-deep learning specific tasks. In general Python is much easier to write quantitative programs that make use of the hardware, and you have a lot more options when your problem doesn't fit into RAM.

> I have been running a research lab in computational biology for over two decades.

I've been working nearly exclusively in industry for these two decades and a major reason I find Python just better is it's much, much easier to interface with other parts of engineering when you're a using truly general purpose PL. I've actually never worked for a pure Python shop, but it's generally much easier to get production ML/DS solutions into prod when working with Python.

> Data science as I define it here involves a lot of interactive exploration of data and quick one-off analyses or experiments

This re-iterates the previous difference. In my experience I would call this "step one" in all my DS related work. The first step is to understand the problem and de-risk. But the vast majority of code and work is related to delivering a scalable product.

You can say that's not part of "data science", but if you did you'd have a hard time finding a job on most of the teams I've worked on.

All that said, my R vs Python experience has boiled down to: If your end result is a PDF report, R is superior. If your end result is shipping a product, then Python is superior. And my experience has been that, outside of university labs, there aren't a lot of jobs out there for DS folks who only want to deliver PDFs.

keeeba

11 hours ago

[-]

As a fairly extensive user of both Python and R, I net out similarly.

If I want to wrangle, explore, or visualise data I’ll always reach for R.

If I want to build ML/DL models or work with LLM’s I will usually reach for Python.

Often in the same document - nowadays this is very easy with Quarto.

https://www.queryverse.org/

Joel_Mckay

11 hours ago

[-]

Python has a list of issues fundamentally broken in the language, and relies heavily on integrated library bindings to operate at reasonable speeds/accuracy.

Julia allows embedding both R and Python code, and has some very nice tools for drilling down into datasets:

It is the first language I've seen in decades that reduces entire paradigms into single character syntax, often outperforming both C and Numpy in many cases. =3

https://yuri.is/not-julia/

9 hours ago

[-]

Deeply ironic for a Julia proponent to smear a popular language as "fundamentally broken" without evidence.

kelipso

5 hours ago

[-]

This is like one of those people posting Dijkstra’s letter advocating for 0-based indexing without ever having read or understood what they posted.

4 hours ago

[-]

What does indexing syntax have to do with Julia having a rough history of correctness bugs and footguns?

Joel_Mckay

1 minute ago

[-]

Sure, all software is terrible if looking at bug frequency history...

https://bugs.python.org/

Griefers ranting about years old _closed_ tickets on v1.0.5 versions on a blog as some sort of proof of lameness... is a poorly structured argument. Julia includes regression testing features built into even its plotting library output, and thus issues usually stay resolved due to pedantic reproducibility. Also, running sanity-checks in any llvm language code is usually wise.

Best of luck =3

Joel_Mckay

8 hours ago

[-]

Python threading and computational errata issues go back a long time. It is a popular integration "glue" language, but is built on SWiG wrappers to work around its many unresolved/unsolvable problems.

Not a "smear", but rather a well known limitation of the language. Perhaps your environment context works differently than mine.

It is bizarre people get emotionally invested in something so trivial and mundane. Julia is at v1.12.2 so YMMV, but Queryverse is a lot of fun =3

Havoc

2 hours ago

[-]

Realistically it’s winning because it’s accessible rather than perfectly suited

amai

12 hours ago

[-]

The example would better be written in SQL. So according to the author that would make SQL a great language for data science. SQL also supports tables natively. This conclusion is of course ridiculous and shows the shallow reasoning in this article.

gyulai

3 hours ago

[-]

I think, the lesson learned from › Python v. R ‹ is that people prefer doing data science in a general purpose language that is also okay-ish for data science over a language that's purpose-built for data science but suffers from diseconomies. Specifically: Imagine a new database or something like that has just come out. Now, the audience that wants to wire it into applications and the audience that wants to tap it to extract data for analytics put their weight together to create the demand for the Python library. The economies for that work out better than if you had to create two different libraries in two different languages to satisfy those two groups of demand.

LanceH

3 hours ago

[-]

You mention a good point of using Python to put out the results.

I think munging the input into a clean enough data set that you can work on is another place Python excels compared to analysis specific tools like R.

plaidfuji

7 hours ago

[-]

Python is a pretty bad language for tabular data analysis and plotting, which seems to be the actual topic of this post. R is certainly better, hell Tableau, Matlab, JMP, Prism and even Excel are all better in many cases. Pandas+seaborn has done a lot, but seaborn still has frustrating limits. And pandas is essentially a separate programming language.

If your data is already in a table, and you’re using Python, you’re doing it because you want to learn Python for your next job. Not because it’s the best tool for your current job. The one thing Python has on all those other options is $$$. You will be far more employable than if you stick to R.

And the reason for that is because Python is one of the best languages for data and ML engineering, which is about 80% of what a data science job actually entails.

jampekka

1 hour ago

[-]

> And pandas is essentially a separate programming language.

I'd say dplyr/tidyverse is a lot more a separate programming language to R than pandas is to Python.

getnormality

6 hours ago

[-]

...unless your data engineering job happens on a database, in which case R's dbplyr is far better than anything Python has to offer.

janalsncm

7 hours ago

[-]

Python is versatile which is what makes it popular. You can load back and forth from a GPU using well-tested libraries. You can memmap things if you need to. If your loops are too slow you can rewrite the hot loops in rust or C. You can read and write from most file formats in a couple of lines.

niemandhier

14 hours ago

[-]

Python is just a language that:

1. Is easy to read

2. Was easy to extend in languages that people who work with scientific data happen to like.

When I did my masters we hacked around in the numpy source and contributed here and there while doing astrophysics.

Stuff existed in Java and R, but we had learned C in the first semester and python was easier to read and contrary to MATLAB numpy did not need a license.

When data science came into the picture, the field was full of physicists that had done similar things. They brought their tools as did others.

jillesvangurp

3 hours ago

[-]

The main feature of Python is that it is approachable by people who have never programmed before. They might have a vague notion of wanting to instruct a computer to first do this and then do that. Imperative programming is their starting point. And Python delivers that. It was designed as a scripting language whose primary use indeed was to script together other things. It always was good at that and that was the main thing it was used for in the nineties.

It got popular once Linux distributions started relying on a lot of python scripts (e.g. Red Hat and Debian). As a side effect it was present on a lot of Linux and Unix systems early on. Scientists in the early 2000s and late nineties had access to workstations running Linux and Unix. So, Python was simply the approachable thing that was just there already.

And because it's so easy, there are lots of people getting into Python. So it got its own dynamic of generations of researchers in all sorts of fields knowing about Python being the goto thing to reach for. It never really was the best at anything it does. That wasn't even a goal. It's a bit slow. A bit verbose/clumsy compared to some of the alternatives that some data scientists prefer. It lacks a lot of features other languages have. Etc. This doesn't matter because it is simple and easy. The type of users that are new to programming are looking for something simple that they can understand. Not the platonic ideal of a language that mathematicians or computer scientists might prefer.

Python is the modern equivalent of BASIC which had this role before python was created. It wasn't that amazing. But early home computers had it as part of their OS. E.g. the Commodore 64 that was my first computer had an interactive Basic shell with the ability to load games from a tape as the main OS experience.

actuallyalys

9 hours ago

[-]

As much as I like Python and personally prefer it to R, I don’t really disagree. But I’m not sure R is a great language for data science either—it has its own weaknesses, e.g., writing custom loops (or functional equivalents with map or reduce) was pretty clunky last I tried it.

The other thing is that a lot of R’s strengths are really the tidyverse’s. Some of that is to R’s credit as an extensible language that enables a skilled API designer to really shine of course, but I think there’s no reason Python the language couldn’t have similar libraries. In fact it has, in plotnine. (I haven’t tried Polars yet but it does at least seem to have a more consistent API.)

mushufasa

14 hours ago

[-]

Languages inherently have network effects; most people around the world learn English so they can talk with other professionals who also know English, not because they are passionate about Charles Dickens.

My take (and my own experience) is that python won because the rest of the team knows it. I prefer R but our web developers don't know it, and it's way better for me to write code that the rest of our team can review, extend, and maintain.

culebron21

4 hours ago

[-]

This was underwhelming. I work with Python and Pandas, and I can show examples of much clumsier workflows I run into. The most often, you get dataframe[(dataframe.column1 == something) & ~dataframe.column2.isna()] constucts, which show that python syntax falls short here, and isn't suitable for such manipulations. Unfortunately, there's no alternative, and I don't see R as much easier, there are plenty of ugly things as well there.

There's Julia -- it has serious drawbacks, like slow cold start if you launch a Julia script from the shell, which makes it unsuitable for CLI workflows.

Otherwise you have to switch to compiled languages, with their tradeoffs.

markkitti

2 hours ago

[-]

> Unfortunately, there's no alternative, and I don't see R as much easier, there are plenty of ugly things as well there.

Have you tried Polars? It really discourages the inefficient creation of intermediate boolean arrays such as in the code that you are showing.

> There's Julia -- it has serious drawbacks, like slow cold start if you launch a Julia script from the shell, which makes it unsuitable for CLI workflows.

Julia has gotten significantly better over time with regard to startup, especially with regard to plotting. There is definitely a preference for REPL or notebook based development to spread the costs of compilation over many executions. Compilation is increasingly modular with package based precompilation as well as ahead-of-time compilation modes. I do appreciate that typical compilation is an implicit step making the workflow much more similar to a scripting language than a traditionally compiled language.

I also do appreciate that traditional ahead-of-time static compilation to binary executable is also available now for deployment.

After a day of development in R or Python, I usually start regretting that I am not using Julia because I know yesterday's code could be executing much faster if I did. The question really becomes do I want to pay with time today or over the lifetime of the project.

jampekka

1 hour ago

[-]

> Have you tried Polars? It really discourages the inefficient creation of intermediate boolean arrays such as in the code that you are showing.

The problem is not usually inefficiency, but syntactic noise. Polars does remove that in some cases, but in general gets even more verbose (apparently by design), which gets annoying fast when doing explorative data analysis.

UniverseHacker

8 hours ago

[-]

Doing computational biology for several decades in about a dozen languages, I do think R is a much better language for data science, but in practice I end up using Python almost every time because it has more libraries, and it’s easier to find software engineers and collaborators to work on Python. However, R makes for much simpler cleaner code, less silent errors, and the 1 indexing makes dealing with biological sequences much less hassle.

3eb7988a1663

6 hours ago

[-]

Pardon? Less silent errors? R has quite a few foot guns around permissively parsing user intention. Which does make it handy for exploratory analysis, but a lot more fragile when you want production code.

Just a simple one that can get you, R is 1-indexed. Yet if you have a vector, accessing myvec[0] is not an error. Alternatively, if you had say, a vector length of 3 and do myvec[10] that gets NA (an otherwise legal value). Or you could make an assignment past the end of the vector myvec[15] <- 3.14 , which will silently extend the array, inserting NAs

drnick1

11 hours ago

[-]

I suppose it depends on what exactly is meant by "data science." If find that for stochastic simulations, C++ and the Eigen library are unbeatable. You get the readability of high-level code with the performance of low-level code thanks to the "zero-cost abstractions" of Eigen.

If by data science you mean loading data to memory and running canned routines for regression, classification and other problems, then Python is great and mostly calls C/FORTRAN binaries under the hood, so Python itself has relatively little overhead.

15 hours ago

[-]

From many practical points, Clojure is great for data. And you can even leverage python libs via clj-python.

phforms

14 hours ago

[-]

In the past few years I have seen some serious efforts from the Clojure community to make Clojure more attractive for data science. Check out the Scicloj[1] group and their data science stack/toolkit Noj[2] (still in beta) as well as the high-performance tabular data processing library tech.ml.dataset (TMD)[3].

- [1] https://scicloj.github.io

- [2] https://scicloj.github.io/noj

- [3] https://github.com/techascent/tech.ml.dataset

geokon

4 hours ago

[-]

What's worth emphasizing is that you're not marrying in to an ecosystem of libs. There are a lot of separate pieces that you can typically use separately. I do climate data work without most of Scicloj's tools, but I do use tech.ml.dataset extensively

dragonwriter

10 hours ago

[-]

The bare python/stdlib example used (as well as bare python and avoiding add-on data science oriented libraries not being the way most people would use python for data science) is just...bad? (And, by bad here I mean showing signs of deliberately avoiding stdlib features in order to increase the appearance of the things the author then complains about.)

A better stdlib-only version would be:

    from palmerpenguins import load_penguins
    import math
    from itertools import groupby
    from statistics import fmean, stdev

    penguins = load_penguins()

    # Convert DataFrame to list of dictionaries
    penguins_list = penguins.to_dict('records')

    # create key function for grouping/sorting by species/island
    def key_func(x):
        return x['species'], x['island']

    # Filter out rows where body_mass_g is missing and sort by species and island
    filtered = sorted((row for row in penguins_list if not math.isnan(row['body_mass_g'])), key=key_func)

    # Group by species and island
    groups = groupby(filtered, key=key_func)

    # Calculate mean and standard deviation for each group
    results = []
    for (species, island), group in groups:
        values = [row['body_mass_g'] for row in group]
        mean_value = fmean(values)
        sd_value = stdev(values, xbar=mean_value)
        results.append({
            'species': species,
            'island': island,
            'body_weight_mean': mean_value,
            'body_weight_sd': sd_value
        })

taeric

11 hours ago

[-]

I'm heavily inclined to agree with the general thought, but I balk at the low level code showing why a language is bad at something. In this specific case, without the tidyverse, R isn't exactly peaches and cream.

As annoying as it is to admit it, python is a great language for data science almost strictly because it has so many people doing data science with it. The popularity is, itself, a benefit.

https://blog.genesmindsmachines.com/p/python-is-not-a-great-...

spicybbq

14 hours ago

[-]

Part 2 is here:

programmertote

14 hours ago

[-]

Disclaimer: I have nothing against R or Python and I'm not partial to either.

Python, the language itself, might not be a great language for data science. BUT the author can use Pandas or Polars or another data-science-related library/framework in Python to get the job done that s/he was trying to write in R. I could read both her R and Pandas code snippets and understand them equally.

This article reads just like, "Hey, I'm cooking everything by making all ingredients from scratch and see how difficult it is!".

yodsanklai

10 hours ago

[-]

Maybe R is fine for people who use it all the time? but as SWE that occasionally needs to do some data analysis, I find it much easier to rely on tools I know rather than R. R is pretty convoluted as a language.

rdtsc

14 hours ago

[-]

They basically advocate using R. I think it depends what they mean by "data science" and if the person will be doing just data science. If that's the case then R may be better. As in their whole career is going to built on that domain. But let's say they are on a general computer science track, now they'll probably benefit from learning Python more than R, simply because they can use it for other purposes.

> Either way, I’ll not discuss it further here. I’ll also not consider proprietary languages such as Matlab or Mathematica, or fairly obscure languages lacking a wide ecosystem of useful packages, such as Octave.

I feel, to most programming folks R is in the same category. R is to them what Octave is to the author. R is nice nice, but do they really want to learn a "niche" language, even if it has better some features than Python? Is holding a whole new paradigm, syntax, library ecosystem in your head worth it?

yeahwhatever10

15 hours ago

[-]

A little late for this

ASalazarMX

15 hours ago

[-]

"Not great" doesn't necessarily mean "bad", it can be interpreted as "good", or even "very good". An honest title would have explicitly qualified how suitable the author found it was.

That the author avoided saying Python was a bad language outright speaks a great deal of its suitability. Well, that, and the majority data science in practice.

huherto

14 hours ago

[-]

For what is worth. The Kotlin folks have been adding some cool features and tools for data analysis. https://kotlinlang.org/docs/data-analysis-overview.html

NuSkooler

14 hours ago

[-]

You could end it with "Python is not a great language".

Now, is Python a SUCCESSFUL language? Very.

kasperset

15 hours ago

[-]

R data science people generally come to data science field from life science or stats field. Python data science people generally originate from other fields that are mostly engineering focused. Again this may not apply to all the cases but that is my general observation.

Recently I am seeing that Python is heavily pushed for all data science related things. Sometimes objectively Python may not be the best option especially for stats. It is hard to change something after it becomes the "norm" regardless of its usability.

paulfharrison

15 hours ago

[-]

R is so good in part because of the efforts of people like Di Cook, Hadley Wickham, and Yihui Xie to create an software environment that they like working in.

It also helps that in R any function can completely change how its arguments are evaluated, allowing the tidyverse packages to do things like evaluate arguments in the context of a data frame or add a pipe operator as a new language feature. This is a very dangerous feature to put in the hands of statisticians, but it allows more syntactic innovation than is possible in Python.

cb321

15 hours ago

[-]

Like Python, R is a 2 (+...) language system. C/Fortran backends are needed for performance as problems scale up.

Julia and Nim [1] are dynamic and static approaches (respectively) to 1 language systems. They both have both user-defined operators and macros. Personally, I find the surface syntax of Julia rather distasteful and I also don't live in PLang REPLs / emacs all day long. Of course, neither Julia nor Nim are impractical enough to make calling C/Fortran all that hard, but the communities do tend to implement in the new language without much prompting.

[1] https://nim-lang.org/

codeptualize

1 hour ago

[-]

Wait, so there is one example, which shows the R and Python equivalents are pretty much the same..

I was all hyped up, ready to see the amazing examples and arguments that would convince me to pick up R, and it gave me absolutely nothing (except quotes and brackets..).

Disappointing.

serjester

15 hours ago

[-]

Seems like their critique boils down to two areas - pandas limitations and fewer built ins to lean on.

Personally I've found polars has solved most of the "ugly" problems that I had with pandas. It's way faster, has an ergonomic API, seamless pandas interop and amazing support for custom extensions. We have to keep in mind Pandas is almost 20 years old now.

I will agree that Shiny is an amazing package, but I would argue it's less important now that LLMs will write most of your code.

drtournier

8 hours ago

[-]

JavaScript is not a great language for web development either, yet…

solatic

14 hours ago

[-]

Shell is the best language for data science. Pick the best tools for each of getting data, cleaning data, transforming data, and visualizing data, then stitch them together by sheer virtue of the fact that text is the universal interoperable protocol and files are the universal way of saving intermediate stages of data.

Best part is, write a --help, and you can load them into LLMs as tools to help the LLMs figure it out for you.

Fight me.

1. https://redo.readthedocs.io/en/latest/

11 hours ago

[-]

redo[1] with shell scripts has become my goto method of dealing with multi-step data problems. It makes it easy to review each step of data retrieval, clean-up, transformation, etc.

I use mlr, sqlite, rye, souffle, and goawk in the shell scripts, and visidata to interactively review the intermediate files.

morshu9001

5 hours ago

[-]

Data science is the one thing I consider Python especially good at

drchaim

13 hours ago

[-]

Python was a great language for data science, when data science become a mainstream thing.

it was easy to think about the structures (iterators) it was easy to extend. it had a good community.

And for that, people start extending it via libraries.

There are plenty more alternatives now.

huherto

14 hours ago

[-]

Isn't the author saying that Python + Pandas is almost as good as R, but Python without Pandas is less powerful than R.

I can't help to conclude that Python is as good as R because I still have the choice of using Pandas when I need it. What did I get wrong?

paddleon

14 hours ago

[-]

you missed the "almost as" in your first sentence.

also, we didn't define "good".

_ZeD_

5 hours ago

[-]

Sooo... Is this a post about python envy?

moi2388

2 hours ago

[-]

You had me at “Python is not a great language”

exabrial

15 hours ago

[-]

The problem is there's so much momentum behind it that's hard to course correct. PyTorch is now a goliath.

lenerdenator

15 hours ago

[-]

> I think people way over-index Python as the language for data science. It has limitations that I think are quite noteworthy. There are many data-science tasks I’d much rather do in R than in Python.1 I believe the reason Python is so widely used in data science is a historical accident, plus it being sort-of Ok at most things, rather than an expression of its inherent suitability for data-science work.

Python doesn't need to be the best at any one thing; it just has to be serviceable for a lot of things. You can take someone who has expertise in a completely different domain in software (web dev, devops, sysadmin, etc.) and introduce them to the data science domain without making them learn an entirely new language and toolchain.

dmurray

15 hours ago

[-]

That's not why it's used in data science though. Lots of data scientists use Python all day and have no concept of ever working in a different field.

It's used in data science because it's used in data science.

vkazanov

15 hours ago

[-]

It's used in data science because no other language has this level of library support.

And it got this unprecedented level of support because right from the start it made its focus clear syntax and (perceived) simplicity.

There is also a sort of cumulative effect from being nice for algorithmic work.

Guido's long-term strategy won over numerous other strong candidates for this role.

passivegains

13 hours ago

[-]

I think the key thing not obvious to most data scientists is they're not using python because it meets their needs, it's because we've failed them. twice.

1. data scientists aren't programmers, so why do they need a programming language? the tools they should be using don't exist. they'd need programmers to make them, and all we have to offer is... more programming languages.

2. the giant problem at the heart of modern software: the most important feature of a modern programming language is being easy to read and write. this feature is conspicuously absent from most important languages.

they're trapped. they can't do what they need without a programming language but there are only a handful they can possibly use. the real reason python ended up with such good library support is they never really had a choice.

mohaine

15 hours ago

[-]

But data science usually isn't an island.

Use whatever you want on your one off personal projects but use something more non-data science friendly if you ever want your model to run directly in a production workflow.

Productionizing R models is quite painful. The normal way is to just rewrite it not in R.

dmurray

13 hours ago

[-]

I've soured a lot on directly productionizing data science code. It's normally an unmaintainable mess.

If you write it in R and then rewrite it in C (better: rewrite it in English with the R as helpful annotations, then have someone else rewrite it in C), at least there is some chance you've thought about the abstractions and operations that are actually necessary for your problem.

bsder

8 hours ago

[-]

Partially, but it's also because 90% of your work in "data science" isn't direct analysis.

You need to get the data from somewhere. Do you need to scrape that because Python is okay at scraping? Oh, after its scraped, we looked at it and it's in ObtuseBinaryFormat0.0.LOL.Beta and, what do you know, somebody wrote a converter for that for Python. And we need to clean all the broken entries out of that and Python is decent at that. etc.

The trick is that while Python may or may not be anybody's first choice for a particular task, Python is an okay second or third choice for most tasks.

So, you can learn Python. Or you learn <best language> and <something else>. And if <something else> is Python, was <best language> sufficiently better than Python to be worth spending the time learning?

lenerdenator

15 hours ago

[-]

That's probably true now, but at one point, they were looking for people to start doing data science, and were pulling people from other domains.

dcreater

9 hours ago

[-]

Fixed title: Python is not a great language for data science if pandas/polars/ibis did not exist

mike_ivanov

6 hours ago

[-]

Please read the article. It literally shows pandas code as an example.

slowhadoken

4 hours ago

[-]

Sounds like a skill issue

shevy-java

10 hours ago

[-]

R is kind of a super-specialized language. Python is much more general purpose.

R failed to evolve, let's be honest. Python won via jupyter - I see this used ALL the time in universities. R is used too, but mostly for statistics related courses only, give or take.

Perhaps R is better for its niche, but Python has more momentum and in thus, dominates over R. That's simply the reality of the situation. It is like the bulldozer moving forward, at a fast speed.

> I say “This is great, but could you quickly plot the data in this other way?”

Ok so ... he would have to adjust R code too, right? And finding good info on that is simply harder. He says he has experience with universities. Well, I do too, and my experience is that people are WAY better with python than with R. You simply see that more students will drop out from R than from python. That's also simply the reality of the situation.

> They appear to be sufficiently cumbersome or confusing that requests that I think should be trivial frequently are not.

I am sure the reverse also applies. Pick some python library, do something awesome, then tell the R students to do the same. I bet he will have the same problems.

> So many times, I felt that things that would be just a few lines of simple R code turned out to be quite a bit longer and fairly convoluted.

Ok, so here he is trolling. Flat out - I said it.

I wrote a LOT of python and quite a bit of R. There is no way in life that the R code is more succinct than the python code for about 90% of the use cases out there. Sorry, that's simply not the case. R is more verbose.

> Here is the relevant code in R, using the tidyverse approach:

    penguins |>
      filter(!is.na(body_mass_g)) |>
      group_by(species, island) |>
      summarize(

This is like perl. They also don't adapt. R is going to lose grounds.

This professor just hasn't realised that he is slowly becoming a fossil himself, by being unable to see that x is better than y.

oivey

4 hours ago

[-]

> R failed to evolve, let's be honest. Python won via jupyter

Ju = Julia Pyt = Python Er = R

R is not only supported in Jupyter, it was there from the start. I’ve never written a single line of R. It is bizarre how little people know about their tools.

IshKebab

11 hours ago

[-]

Python's not a great language for anything. Maybe for teaching programming I guess (except then you end up with people that only know Python).

thom

13 hours ago

[-]

I think this expectation that data science code is a thing you write basically top to bottom to get some answers out, put them in a graph and move on with your life is not a useful lens through which to evaluate two programming languages. R definitely is an efficient DSL for doing stats this way, but it’s a painful way to build a durable piece of software. Python is nowhere near perfect but I’ve seen fewer codebases that made my eyes bleed, however pretty the graphs might look.

hekkle

10 hours ago

[-]

For those who thought the article was TL;DR, the author argues.

- A General programming language like Python is good enough for data science but isn't specifically designed for it.

- A language that is specifically designed for Data Science like R is better at Data Science.

Who would have thought?

Lyngbakr

15 hours ago

[-]

I was a bit disappointed to discover that this was essentially an R vs. Python article, which is a data science trope. I've been in the field for 20+ years now and while I used to be firmly on team R, I now think that we don't really have a good language for data science. I had high hopes for Julia and even Clojure's data landscape looks interesting, but given the momentum of Python I don't see how it could be usurped at this point.

vkazanov

15 hours ago

[-]

It is EVERYWHERE. I recently had to interview a bunch of data scientists, and only one of them knew SQL. Surely, all of then worked with python. I bet none of them even heard of R.

garciasn

15 hours ago

[-]

SAS > R > Python.

The focus of SAS and R were primarily limited to data science-related fields; however, Python is a far more generic programming language, thus the number of folks exposed to it is wider and thus the hiring pool of those who come in exposed to Python is FAR LARGER than SAS/R ever were, even when SAS was actively taught/utilized in undergraduate/graduate programs.

As a hiring leader in the Data Science and Engineering space, I have extensive experience with all of these + SQL, among others. Hiring has become much easier to go cross-field/post-secondary experience and find capable folks who can hit the ground running.

username135

15 hours ago

[-]

you beat me to it. i understand why sas gets hate but I think that comes with simply not understanding how powerful it is.

garciasn

14 hours ago

[-]

It was a great language, but it was/is extremely cost-prohibitive plus it simply fell out of favor in academia, for many of the same reasons, and thus was supplanted by free alternatives.

Lyngbakr

15 hours ago

[-]

Yikes. Were they experienced data scientists or straight out of school? I find it very odd (and a bit scary) that they didn't know SQL.

garciasn

15 hours ago

[-]

Experienced Data Scientists and/or those straight out of school are EXTREMELY lacking in valuable SQL experience and always have been. Take a DS with 25 years experience in SAS, many of them are great with DATAstep, but have far less experience using PROC SQL for querying the data in the most effective way--even if they were pulling the data down with pass-through via SAS/ACCESS.

Often they'd be doing very simplistic querying and then manipulating via DATAstep prior to running whatever modeling and/or reporting PROCs later, rather than pushing it upstream into a far faster native database SQL pull via pass-through.

Back in 2008/2009, I saved 30h+ runtime on a regular report by refactoring everything in SQL via pass-through as opposed to the data scientists' original code that simply pulled the data down from the external source and manipulated it in DATAstep. Moving from 30h to 3m (Oracle backend) freed up an entire FTE to do more than babysit a long-running job 3x a week to multiple times per day.

SiempreViernes

15 hours ago

[-]

What would it even mean to be a "good language for data science"?

In the first place data science is more a label someone put on bag full of cats, rather than a vast field covered by similarly sized boxes.

username135

15 hours ago

[-]

SAS has entered the chat

semiinfinitely

11 hours ago

[-]

correct, its only the best on that we have

KaiserPro

11 hours ago

[-]

The observation I make here is in that first python example with the penguins, what the fuck is that?

It makes it look like perl, on a bad day, or worse autogenerated javascript.

Why on earth is it so many levels deep in objects?

aussieguy1234

10 hours ago

[-]

I felt forced to use python when I gave langgraph agents a go.

Worked quite well, but the TS/JS langgraph version is way behind. React agents are just a few lines of code, compared to 50 odd lines for the same thing in JS/TS.

Better to use a different language, even one i'm not familiar with, to be able to maintain a few lines of code vs 50 lines.

fnord77

10 hours ago

[-]

Python is not a great language

hekkle

10 hours ago

[-]

Not great at what?

I agree that Python is not great at anything specifically, but it is good at almost everything, and that's what makes it great.

johnea

11 hours ago

[-]

They could have just left the last three words off of that title 8-/

Python is not a great language

First, the white space requirements are a bad flashback to 1970s fortran.

Second, it is the language that is least compatible with itself.

coolThingsFirst

12 hours ago

[-]

Python just has poor aesthetics. __init__(self) is unacceptable in a language in 2025. Ruby would've been a much better choice. Sloppiness in language design is just a bad idea.

stOneskull

8 hours ago

[-]

there's @dataclass in 2025

BiteCode_dev

1 hour ago

[-]

Notice how the article load_penguins() example starts neatly after all the messy parts of data science are done and stops right before the next pain starts.

It lives in a sterile, idealized world.

Python is a great language for data science in practice because it turns out data science is also:

   - gluing a lot of data sources

   - cleaning up a ton of terribly shaped data

   - validation and error handling

   - I/O, networking, and format conversion

   - emboarding non-programmers into programming

   - wrapping a lot of compiled languages' libs or plugging system

   - prototyping stuff and exposing that prototype to some people

   - turning prototypes into more permanent projects

And it turns out Python and its ecosystem are good at those while remaining decent at the other things.

There are other languages excellent at some of those, or some of the other things, but rarely good at most. And because humanity is vast, diverse, and constantly renewing, being the second best at those is eventually always winning.

Because whoever you are, you will be annoyed at not having the best experience at task X. But you would be mortified if you had the worst experience at doing task Y and Z. And task X, Y, and Z change depending on who you ask.

And you want to get things done, while days have 24 hours.

As usual, to understand the Python phenomenon, you have to see the whole picture. Not your little corner of the bubble. Not the ideal world in your head either. Life is not a maths problem with a clearly laid out premise and an elegant answer.

That's the same debate about why PHP won the web in 2000 no matter the size of the spaghetti plate, why Windows stayed used for so long despite it being terrible, why people keep using iphones after all the abuses, etc. There is more to it than the use case you have every day. People have needs you don't haven't thought about.

So it's not "let the language war begin". It's, "dude, get more experience, go work with accountants, ngos, govs and logistic chains, go work in china, africa and south america, go from a startup to schools to corporate, satisfy the geeks, the artists and the business people, than we'll talk".

constantcrying

12 hours ago

[-]

Python is also an embarrassingly bad language for numerics. It comes without support for different floating point types does not have an n-D Array data type and is extremely slow.

At the same time it is an absolute necessity to know if you are doing numerics. What this shows, at least to me, is that it is "good enough" and that the million integrations, examples and pieces of documentation matter more than whether the peculiarities of the language work in favor of its given use case, as long as the shortcomings can be mostly addressed.

slashdave

8 hours ago

[-]

Native python is hopeless for numerics, which is why just about everyone just uses numpy, which solves all of these issues. Of course, a separate package. But the strength of python is that it can fairly seamlessly incorporate these kinds of packages that add core capabilities. Another important example: pytorch.

rob_c

10 hours ago

[-]

Refuses to learn tool so tool is broken... There is no problem with python for this. If you hate boiler plate job the club, get llms to generate it for you and move on to doing real work (or get involved in improving the language or libraries directly)