> By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
In other words, it's free only to trap you.
I nearly made the mistake of merging Akka into a codebase recently; fortunately I double-checked the license and noticed it was the bullshit BUSL, which would have potentially cost my employer tens of thousands of dollars a year [1]. I ended up switching everything to Vert.x, but I really hate how normalized it has become for ostensibly open source projects to sneak scary expensive licenses into things.
[1] Yes I'm aware of Pekko now, and my stuff probably would have worked with Pekko, but I didn't really want to deal with something that by design is 3 years out of date.
Vert.x and other frameworks are far better and easier for most devs to grok.
I would imagine the non-Scala use case to be less than ideal.
In Scala land, Pekko, the open source fork of Akka, is the way to go if you need compatibility. Personally, I'd avoid new versions of Akka like the plague, and just use more modern alternatives to Pekko/Akka anyway.
I'm not sure what Lightbend's target market is? Maybe they think they have enough critical mass to merit the price tag for companies like Sony/Netflix/Lyft, etc. But they've burnt their bridge right into the water with everyone else, so I see them fading into irrelevance over the next few years.
I'm sure that Lightbend feels that their support contract is the bee's knees and worth whatever they charge for it, but it's a complete non-starter for me, and so I look elsewhere.
Vert.x's actor-ish model is a bit different, but it's not that different, and considering that Vert.x tends to perform extremely well in benchmarks, it doesn't really feel like I'm losing a lot by using it instead of Akka, particularly since I'm not using Akka Streams.
[1] Normal disclaimer: I don't hide my employment history, and it's not hard to find, but I politely ask that you do not post it here.
Plus the license isn't stupid.
Basically it's a debate about how many dark patterns can you squeeze next to that "upfront language" before "marketing" slides into "bait-n-switch."
In fact, what's stopping the pandas library from incorporating fireducks code into the mainline branch? pandas itself is BSD.
So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.
I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).
To be clear, this library might be great, it's just a shame for me that there seems no effort to make a Pandas-like thing with better API. Maybe time to roll up my sleeves...
For a comparison, dplyr offers a lot of elegant functionality, and the functional approach in Pandas often feels like an afterthought. If R is cleaner than Python, it tells a lot (as a side note: the same story for ggplot2 and matplotlib).
Another surprise for friends coming from non-Python backgrounds is the lack of column-level type enforcement. You write df.loc[:, "col1"] and hope it works, with all checks happening at runtime. It would be amazing if Pandas integrated something like Pydantic out of the box.
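To make the runtime-only checking concrete, here is a minimal sketch of the kind of guard people end up writing by hand. The `validate_schema` helper and the expected-dtype mapping are hypothetical names invented for this illustration; they are not part of pandas or Pydantic.

```python
import pandas as pd


def validate_schema(df, schema):
    """Hypothetical helper: fail fast if a column is missing or has the wrong dtype."""
    missing = [col for col in schema if col not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    wrong = {
        col: str(df[col].dtype)
        for col, expected in schema.items()
        if str(df[col].dtype) != expected
    }
    if wrong:
        raise TypeError(f"unexpected dtypes: {wrong}")
    return df


df = pd.DataFrame({"col1": [1.0, 2.0, 3.0]})

# Passes: col1 exists and is float64.
checked = validate_schema(df, {"col1": "float64"})

# Fails only when this line actually runs, which is the complaint above.
try:
    validate_schema(df, {"col2": "float64"})
except KeyError as err:
    print("caught at runtime:", err)
```

Nothing here is verified before execution; a static type checker sees every column access as the same untyped lookup.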
I still remember when Pandas first came out—it was fantastic to have a tool that replaced hand-rolled data structures using NumPy arrays and column metadata. But that was quite a while ago, and the ecosystem has evolved rapidly since then, including Python’s gradual shift toward type checking.
That's because it's a bad way to use Pandas, even though it is the most popular and often times recommended way. But the thing is, you can just write "safe" immutable Pandas code with method chaining and lambda expressions, resulting in very Polars-like code. For example:
df = (
    pd
    .read_csv("./file.csv")
    .rename(columns={"value": "x"})
    .assign(y=lambda d: d["x"] * 2)
    .loc[lambda d: d["y"] > 0.5]
)
Plus nowadays with the latest Pandas versions supporting Arrow datatypes, Polars performance improvements over Pandas are considerably less impressive.
Column-level name checking would be awesome, but unfortunately no python library supports that, and it will likely never be possible unless some big changes are made in the Python type hint system.
.loc[lambda d: d["y"] > 0.5]
Is stylistically superior to [df.y > 0.5]
I agree it comes in handy quite often, but that still doesn’t make it great to write compared to what sql or dplyr offers in terms of choosing columns to filter on (`where y > 0.5`, for sql and `filter(y > 0.5)`, for dplyr).
For the rest of your comment: it's the best you can do in python. Sure you could write SQL, but then you're mixing text queries with python data manipulation and I would dread that. And SQL-only scripting is really out of question.
Big problem with pandas is that you still have to load the dataframe into memory to work with it. My data's too big for that and postgres makes that problem go away almost entirely.
(I'm the first to complain about the many warts in Pandas. Have written multiple books about it. This is annoying, but it is much better than [df.y > 0.5].)
You are probably thinking about `df.apply(lambda row: ..., axis=1)` which operates on each row at a time and is indeed very slow since it's not vectorized. Here this is different and vectorized.
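A small sketch of the difference, using made-up data: the callable given to `.loc` receives the whole DataFrame and returns a boolean Series in one vectorized comparison, while the `apply(..., axis=1)` version calls the lambda once per row. Both select the same rows.

```python
import pandas as pd

df = pd.DataFrame({"x": [0.1, 0.3, 0.5]}).assign(y=lambda d: d["x"] * 2)

seen = []  # records what the .loc callable is handed


def mask(d):
    seen.append(type(d).__name__)
    return d["y"] > 0.5  # one vectorized comparison over the whole column


filtered = df.loc[mask]

# Row-wise equivalent: the lambda runs once per row, which is the slow pattern.
by_row = df[df.apply(lambda row: row["y"] > 0.5, axis=1)]

print(seen[0])                  # the callable got a DataFrame, not a row
print(filtered.equals(by_row))  # same rows either way
```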
I find I basically never write myself into a corner with initially expedient but ultimately awkward data structures like I often did with pandas, the expression API makes the semantics a lot clearer, and I don't have to "guess" the API nearly as much.
So even for this usecase, I would recommend trying out polars for anyone reading this and seeing how it feels after the initial learning phase is over.
However, I still find myself using pandas for the timestamps, timedeltas, and date offsets. Even then, I need a whole extra column just to hold time zones: since polars maps everything to a UTC storage zone, you lose the origin / local TZ, which screws up heterogeneous time zone datasets. (And I learned you really need to enforce careful, manual consideration of time zone replacement vs. offsetting at the API level.)
Had to write a ton of code to deal with this. I wish polars had explicit separation of local vs storage zones on the Datetime data type.
IMO Polars sets a different goal of what's the most pandas like thing that we can build that is fast (and leaves open the possibility for more optimization), and clean.
Polars feels like you are obviously manipulating an advanced query engine. Pandas feels like manipulating this squishy datastructure that should be super useful and friendly, but sometimes it does something dumb and slow
My conclusion was that pandas is not for developers, but for one-offs by managers, data scientists, scientists, and so on. And maybe for "hackers" who kludge together stuff until it works and then hopefully never touch it.
Which made me realize such thoughts can come over as smug, patronizing or belittling. But they do show how software can be optimized for different use-cases.
The danger then lies in not recognizing these use-cases when you pull in something like pandas. "Maybe using pandas to map and reduce the CSVs that our users upload to insert batches isn't a good idea at all".
This is often worsened by the tools/platforms/lib devs or communities not advertising these sweet spots and limitations. Not in the case of Pandas though: that's really clear about this not being a lib or framework for devs, but a tool(kit) to do data analysis with. Kudos for that.
I agree that pandas does not have the best designed api in comparison to say dplyr but it also has a lot of functionality like pivot, melt, unstack that are often not implemented by other libraries. It’s also existed for more than a decade at this point so there’s a plethora of resources and stackoverflow questions.
On top of that, these days I just use ChatGPT to generate some of my pandas tasks. ChatGPT and other coding assistants know pandas really well so it’s super easy.
But I think if you get to know Pandas after a while you just learn all the weird quirks but gain huge benefits from all the things it can do and all the other libraries you can use with it.
I 100% agree that pandas addresses all the pain points of data analysis in the wild, and this is precisely why it is so popular. My point is, it doesn't address them well. It seems like a conglomerate of special cases, written for a specific problem its author was facing, with little concern for consistency, generality or other use cases that might arise.
In my usage, any time saved by its (very useful) methods tends to be lost on fixing subtle bugs introduced by strange pandas behaviours.
In my use cases, I reindex the data using pandas and get it to numpy arrays as soon as I can, and work with those, with a small library of utilities I wrote over the years. I'd gladly use a "sane pandas" instead.
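As a rough sketch of that workflow (the data and variable names are invented here, not the commenter's actual utilities): pandas handles the label alignment, then everything downstream is plain NumPy.

```python
import numpy as np
import pandas as pd

a = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)
b = pd.Series(
    [10.0, 30.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-03"]),
)

# Let pandas align both series on a shared index...
common = a.index.union(b.index)
a_arr = a.reindex(common).to_numpy()
b_arr = b.reindex(common).to_numpy()  # NaN where b has no observation

# ...then work on bare arrays, handling missing values explicitly.
total = np.nansum(a_arr * b_arr)
print(total)
```

The appeal is that after `to_numpy()` there is no index left to silently realign on; the cost is that you have to carry the labels yourself.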
I get it doesn't follow best practices, but it does do what it needs to. Speed has been an issue, and it's exciting seeing that problem being solved.
Interesting to see so many people recently saying "polars looks great, but no way I'll rewrite". This library seems to give a lot of people, myself included, exactly what we want. I look forward to trying it.
Considering switching from pandas and want to understand what is my best bet. I am just processing feature vectors that are too large for memory, and need an initial simple JOIN to aggregate them.
I previously had a pandas+sklearn transformation stack that would take up to 8 hours. Converted it to ibis and it executes in about 4 minutes now and doesn't fill up RAM.
It's not a perfect apples-to-apples pandas replacement but really a nice layer on top of SQL. After learning it, I'm almost as fast as I was on pandas with expressions.
You can do the same with Polars, but you have to start messing about with datetimes and convert the simple problem "I want to calculate a monthly sum anchored on the last business day of the month" to SQL-like operations.
Pandas grew a large and obtuse API because it provides specialized functions for 99% of the tasks one needs to do on timeseries. If I want to calculate an exponential weighted covariance between two time series, I can trivially do this with pandas: series1.ewm(...).cov(series2). I welcome people to try and do this with Polars. It'll be a horrible and barely readable contraption.
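For anyone who has not seen the call in question, a minimal runnable sketch with synthetic data. The `span=10` setting is an arbitrary choice for the example, and `series2` is deliberately built as `2 * series1 + 1` so the result can be sanity-checked against twice the exponentially weighted variance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=50, freq="D")

series1 = pd.Series(rng.normal(size=50), index=idx)
series2 = 2.0 * series1 + 1.0  # synthetic: perfectly linearly related

# Exponentially weighted covariance between the two series, one call.
ewm_cov = series1.ewm(span=10).cov(series2)

# Sanity check: cov(x, 2x + 1) should equal 2 * var(x) under the same weights.
ewm_var = series1.ewm(span=10).var()
print(np.allclose(ewm_cov.dropna(), 2.0 * ewm_var.dropna()))
```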
YC is mostly populated by technologists, and technologists are often completely ignorant about what makes pandas useful and popular. It was built by quants/scientists, for doing (interactive) research. In this respect it is similar to R, which is not a language well liked by technologists, but it is (surprise) deeply loved by many scientists.
I've had trouble determining whether one timestamp falls between two others across tens of thousands of rows (with the polars team suggesting I use a massive cross product and filter -- which worked but explodes the memory requirement), whereas in pandas I was able to sort the timestamps and thereby only need to compare against the preceding / following few based on the index of the last match.
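One way the sorted approach can look, as a sketch rather than the commenter's actual code: it assumes the windows do not overlap and uses `numpy.searchsorted` on the sorted start times, so each event is compared against a single candidate window instead of every window. All data below is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical, non-overlapping [start, end] windows, sorted by start.
windows = pd.DataFrame({
    "start": pd.to_datetime(
        ["2024-01-01 09:00", "2024-01-01 12:00", "2024-01-02 09:00"]
    ),
    "end": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 13:00", "2024-01-02 10:00"]
    ),
}).sort_values("start", ignore_index=True)

events = pd.Series(pd.to_datetime([
    "2024-01-01 09:30",  # inside the first window
    "2024-01-01 11:00",  # between windows
    "2024-01-02 09:59",  # inside the last window
    "2023-12-31 23:00",  # before every window
]))

starts = windows["start"].to_numpy()
ends = windows["end"].to_numpy()
ev = events.to_numpy()

# Position of the last window starting at or before each event (-1 if none).
idx = np.searchsorted(starts, ev, side="right") - 1
has_candidate = idx >= 0

# Only that one candidate window needs its end checked.
inside = has_candidate & (ev <= ends[np.clip(idx, 0, None)])
print(inside.tolist())
```

Memory stays linear in the number of events and windows, which is the point of avoiding the cross product.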
The other issue I've had with resampling is with polars automatically dropping time periods with zero events, giving me a null instead of zero for the count of events in certain time periods (which then gets dropped from aggregations). This has caught me a few times.
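For comparison only, a minimal sketch of how pandas treats an empty period when counting events per day. This shows pandas behaviour on invented data; it makes no claim about what Polars returns beyond what is described above.

```python
import pandas as pd

# Three events, none of them on 2024-01-02.
events = pd.Series(
    1,
    index=pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-01 17:00", "2024-01-03 09:00"]
    ),
)

# The empty day is kept as a bin with a count of 0.
daily_counts = events.resample("D").size()
print(daily_counts)
```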
But other than that I've had good luck.
`.join_where()`[1] was also added recently.
[1]: https://docs.pola.rs/api/python/stable/reference/dataframe/a...
> my_df.resample("BME").apply(...)
Done. I don't think it gets any easier than this. Every time I tried something similar with polars, I got bogged down in calendar treatment hell and large and obscure SQL like contraptions.
Edit: original tone was unintentionally combative - apologies.
But I'm guessing it's something like this:
import pandas as pd

def calculate_monthly_business_sum(df, date_column, value_column):
    """
    Calculate monthly sums anchored to the last business day of each month.

    Parameters:
        df: DataFrame with dates and values
        date_column: name of date column
        value_column: name of value column to sum
    Returns:
        DataFrame with sums anchored to last business day
    """
    # Ensure date column is datetime
    df[date_column] = pd.to_datetime(df[date_column])
    # Group by end of business month and sum
    monthly_sum = df.groupby(pd.Grouper(
        key=date_column,
        freq='BME'  # Business Month End frequency ('BM' before pandas 2.2)
    ))[value_column].sum().reset_index()
    return monthly_sum

# Example usage:
df = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-31', '2024-02-29'],
    'amount': [100, 200, 300]
})
result = calculate_monthly_business_sum(df, 'date', 'amount')
print(result)
Which you can run here => https://python-fiddle.com/examples/pandas?checkpoint=1732114...
df.resample("BME").sum()
Done. One line of code and it is quite obvious what it is doing - with perhaps the small exception of BME, but if you want max readability you could do:
df.resample(pd.offsets.BusinessMonthEnd()).sum()
This is why people use pandas.
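A self-contained version of that one-liner, with invented data, for anyone who wants to run it. It uses the spelled-out `pd.offsets.BusinessMonthEnd()` form because the `"BME"` string alias only exists in newer pandas releases (2.2 and later, as far as I know).

```python
import pandas as pd

# Hypothetical daily values indexed by timestamp (all on weekdays).
df = pd.DataFrame(
    {"amount": [100, 200, 300, 400]},
    index=pd.to_datetime(
        ["2024-01-02", "2024-01-31", "2024-02-15", "2024-03-29"]
    ),
)

# Monthly sums labelled with the last business day of each month.
monthly = df.resample(pd.offsets.BusinessMonthEnd()).sum()
print(monthly)
```

March 2024 ends on a Sunday, so its bin is labelled 2024-03-29 rather than 2024-03-31.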
> df.resample("BME").sum()
Assuming `df` is a dataframe (ie table) indexed by a timestamp index, which is usual for timeseries analysis.
"BME" stands for BusinessMonthEnd, which you can type out if you want the code to be easier to read by someone not familiar with pandas.
It's so easy for my analyst team because they use it daily, but my developers probably would never have thought of or known about BME and would have decided to implement the code again.
https://hexdocs.pm/explorer/exploring_explorer.html
It runs on top of Polars so you get those speed gains, but uses the Elixir programming language. This gives the benefit of a simple functional syntax w/ pipelines & whatnot.
It also benefits from the excellent Livebook (a Jupyter alternative specific to Elixir) ecosystem, which provides all kinds of benefits.
Here's my alternative: https://github.com/otsaloma/dataiter https://dataiter.readthedocs.io/en/latest/_static/comparison...
Planning to switch to NumPy 2.0 strings soon. Other than that I feel all the basic operations are fine and solid.
Note for anyone else rolling up their sleeves: You can get quite far with pure Python when building on top of NumPy (or maybe Arrow). The only thing I found needing more performance was group-by-aggregate, where Numba seems to work OK, although a bit difficult as a dependency.
I have not yet used siuba, but would be interested in others' opinions. The activation energy to learn a new set of tools is so large that I rarely have the time to fully examine this space...
Also, having (already a while ago) looked at the implementation of the magic `_` object, it seemed like an awful hack that will serve only a part of use cases. Maybe someone can correct me if I'm wrong, but I get the impression you can do e.g. `summarize(x=_.x.mean())` but not `summarize(x=median(_.x))`. I'm guessing you don't get autocompletion in your editor or useful error messages and it can then get painful using this kind of a magic.
A pity their comparisons don’t include tidyverse or R’s data.table. I think R would look simpler, but for now it remains unclear.
Yeah, Pandas has that early PHP feel to it, probably out of being a successful first mover.
Writing pandas code is a bit redundant. So what?
Who is to say that fireducks won't make their own API?
Polars rocked my world by having a sane API, not by being fast. I can see the value in this approach if, like the author, you have a large amount of pandas code you don't want to rewrite, but personally I'm extremely glad to be leaving the pandas API behind.
Is there anything specific you prefer moving from the pandas API to polars?
Say you want to take an aggregation like "the mean of all values over the 75th percentile" alongside a few other aggregations. In pandas, this means you're gonna be in for a bunch of hoops and messing around with stuff because you can't express it via the api. Polars' api lets you express this directly without having to implement any kind of workaround.
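To show what the pandas side of that looks like, here is one possible workaround as a sketch: the custom aggregation has to be passed as a Python function (`mean_above_p75` is a name made up for this example), which pandas then calls once per group. The data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["a"] * 8 + ["b"] * 4,
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 10, 20, 30, 40],
})


def mean_above_p75(s):
    """Mean of the values strictly above the group's 75th percentile."""
    return s[s > s.quantile(0.75)].mean()


summary = df.groupby("g").agg(
    mean_x=("x", "mean"),
    max_x=("x", "max"),
    top_quartile_mean=("x", mean_above_p75),  # opaque Python callable
)
print(summary)
```

It works, but the interesting aggregation is hidden inside a function the library cannot see into, unlike the built-in `"mean"` and `"max"` next to it.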
Nice article on it here: https://labs.quansight.org/blog/dataframe-group-by
I don't understand what it means. It looks like a contradiction. Does it have a BSD-3 licence or not?
> While the wheel packages are available at https://pypi.org/project/fireducks/#files, and while they do contain Python files, most of the magic happens inside a (BSD-3-licensed) shared object library, for which source code is not provided.
Edit: To use, redistribute, and modify, and distribute modified versions.
Imagine being like "the project is GPL - just the compiled machine code".
GitHub has always been a platform for "We love to host FOSS but we won't be 100% FOSS ourselves", so it makes sense they allow that kind of usage for others too.
I think what you want, is something like Codeberg instead, which is explicitly for FOSS and 100% FOSS themselves.
Is this shitware? It seems to be very high quality code
How would you define "quality" in this context?
One could also say that quality is related to the functional output.
Right, I said nothing that contradicts that ("High quality code isn't just code that performs well when executed, but also ..."). High quality functional output is a necessary requirement, but it isn't sufficient to determine if code is high quality.
My point was that it's all subjective in the end.
Imagine writing a very good program, running it through an obfuscator, and throwing away the original code. Is the obfuscated code "high quality code" now, because the output of the compilation still works as before?
https://fireducks-dev.github.io/files/20241003_PyConZA.pdf
The main reasons are
* multithreading
* rewriting base pandas functions like dropna in c++
* in-built compiler to remove unused code
Pretty impressive especially given you import fireducks.pandas as pd instead of import pandas as pd, and you are good to go
However I think if you are using a pandas function that wasn't rewritten, you might not see the speedups
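The import swap described above, as a sketch. The `try`/`except` fallback to stock pandas is my own addition so the snippet runs whether or not FireDucks is installed; the rest of the code is ordinary pandas and does not change between the two.

```python
try:
    import fireducks.pandas as pd  # the drop-in import mentioned above
except ImportError:
    import pandas as pd  # fallback (an assumption of this sketch)

df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})

# Unchanged pandas code; whether it gets faster depends on what was rewritten.
totals = df.groupby("key")["value"].sum()
print(totals)
```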
They are showing a 20-30% improvement over Polars, Clickhouse and Duckdb. But those 3 tools are SOTA in this area and generally rank near each other in every benchmark.
So 20-30% improvement over that cluster makes me interested to know what techniques they are using to achieve that over their peers.
> Future Plans By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
It's freeware under an open source license. Really misleading.
It looks like something you should stay away from unless you need it REALLY badly. It's a proprietary product with unknown pricing and no indication of what their plans are.
Does the fact that the binary is BSD licensed allow reverse-engineering?
Reversing and re-compiling should count as modification?
I've had the chance to play with it on some of my code; queries that ran in 8+ minutes came down to 20 seconds.
Re-writing in Polars involves more code changes.
However, with Pandas 2.2+ and arrow, you can use .pipe to move data to Polars, run the slow computation there, and then zero copy back to Pandas. Like so...
(df
 # slow part
 .groupby(...)
 .agg(...)
)

to:

def polars_agg(df):
    return (pl.from_pandas(df)
            .group_by(...)
            .agg(...)
            .to_pandas()
           )

(df
 .pipe(polars_agg)
)
Where can I find the code? I don't see it on GitHub.
> contact@fireducks.jp.nec.com
So it's from NEC (a major Japanese computer company), presumably a research artifact?
> https://fireducks-dev.github.io/docs/about-us/ Looks like so.
I wonder how much of this is fundamental to the common approach of writing libraries in Python with the processing-heavy parts delegated to C/C++ -- that the expressive parts cannot be fast and the fast parts cannot be expressive. Also, whether Rust (for polars, and other newer generation of libraries) changes this tradeoff substantially enough.
This really does seem like a rare thing where everything speeds up without breaking compatibility. If you want a fast revised API for your new project (or to rework your existing one) then you have a solution for that with Polars. If you just want your existing code/workloads to work faster, you have a solution for that now.
It's OK to have a slow, compatible, static codebase to build things on then optimize as-needed.
Trying to "fix" the api would break a ton of existing code, including existing plugins. Orphaning those projects and codebases would be the wrong move, those things take a decade to flesh out.
This really doesn't seem like the worst outcome, and doesn't seem to be creating a huge fragmented mess.
Don't come to old web-devs with those complaints, every single one of them had to write at least one open source javascript library just to create their linkedin account!
Is it actually? Do people see that level of compatibility in practice?
It should be pretty close, though.
We found `numpy` and `jax` to be a good trade-off between "too high level to optimize" and "too low level to understand". Therefore in our hedge fund we just build data structures and helper functions on top of them. The downside of the above combination is on sparse data, for which we call wrapped c++/rust code in python.
I wrote a nice article about chaining for Ponder. (Sadly, it looks like the Snowflake acquisition has removed that. My book, Effective Pandas 2, goes deep into my best practices.)
(Disclaimer: I'm a corporate trainer and feed my family teaching folks how to work with their data using Pandas.)
When I teach about "readable" code, I caveat that it should be "readable for a specific audience". I hold that if you are a professional, that audience is other professionals. You should write code for professionals and not for newbies. Newbies should be trained up to write professional code. YMMV, but that is my bias based on experience seeing this work at some of the biggest companies in the world.
My easy guess is that compared to pandas, it's multi-threaded by default, which makes for an easy perf win. But even then, 130-200x feels extreme for a simple sum/mean benchmark. I see they are also doing lazy evaluation and some MLIR/LLVM based JIT work, which is probably enough to get an edge over polars; though its wins over DuckDB _and_ Clickhouse are also surprising out of nowhere.
Also, I thought one of the reasons for Polars's API was that Pandas API is way harder to retrofit lazy evaluation to, so I'm curious how they did that.
```
>>> df['year'].dtype == np.dtype('int32')
True
```
EDIT: I've found some benchmarks https://fireducks-dev.github.io/docs/benchmarks/
Would be nice to know what the internals of FireDucks are.
The promise of a 100x speedup with 0 changes to your codebase is pretty huge, but even a few correctness / incompatibility issues would probably make it a no-go for a bunch of potential users.
I haven’t seen that in other systems like Polars, but maybe I’m wrong.
edit: I know pandas uses numpy under the hood, but "raw" numpy is typically faster (and more flexible), so curious as to why it's not mentioned
At some point I think it's more honest to say "the python ecosystem keeps getting more awesome".
Q: Why do ducks have big flat feet?
A: So they can stomp out forest fires.
Q: Why do elephants have big flat feet?
A: So they can stomp out flaming ducks.
While for most of my jobs I ended up being able to evade the use of HPC by simply being smarter and discovering better algorithms to process information, I recall liking pyspark decently, but preferring the simplicity of ballista over pyspark due to the simpler installation of Rust over managing Java and JVM junk. The constant problems caused by anything using a JVM backend, and the environment config that came with it, were terrible to add to a new system every time I ran a new program.
In this regard, ballista is an enormous improvement. Anything that is a one-line install via pip on any new system, runs local-first without any cloud or telemetry, and requires no change in code to run on a laptop vs HPC is the only option worth even beginning to look into and use.
Unless I had thousands of files to work with, I would be loath to use cluster computing. There's so much overhead, cost, waiting for nodes to spin up, and cloud architecture nonsense.
My "single node" computer is a refurbished tower server with 256GB RAM and 50 threads.
Most of these distributed computing solutions arose before data processing tools started taking multi-threading seriously.