FWIW, aside from manual filter pushdown, I consider the JOIN variant the canonical / "default" way to merge multiple tables; it keeps all the join-related logic in one place, while mixing both joining conditions and filtering conditions in WHERE always felt more error-prone to me.
Pushdown is actually a more subtle problem than people give it credit for; it inhabits this weird space where things are too trivial to be covered in papers but is too hard to cover properly in textbooks (e.g., no coverage of multiple equalities or outer joins).
Note: the dig at dataframe libs is worth some care in case you think that means duckdb can optimize and they cannot
Dask, Polars, and others pick a lazy default in order to make distribution and other optimizations easier. When staying in their pure fragments ('vectorized'), the same scheduler rewriting opportunity is here.
This is a subtle but important distinction when looking at these frameworks. We are making our new graph query language 'gfql' to be dataframe-native so it can run naturally & natively as a step of pipelines people are already doing, but also to ensure we automatically run as optimized CPU/GPU columnar opts. At the same time, because of the intent to allow room for query plan optimization, we are staying declarative / lazy, even if the generated & interpreted code uses an eager DF runtime . I'm optimistic about output target lazy DF systems doing query planner work for us long-term here, but for the eager framework targets, the query planning has to be on our side.
That gives a less subtle clue that it's about databases than looking at the domain.
Seems like a missing "un" here
Compelling article! I've already found DuckDB to be the most ergonomic tool for quick and dirty wrangling, it's good to know it can handle massive jobs too.
SELECT
pickup.zone AS pickup_zone,
dropoff.zone AS dropoff_zone,
cnt AS num_trips
FROM
(select pickup_location_id, dropoff_location_id, count(*) as cnt from taxi_data_2019 group by 1,2) data
INNER JOIN
(SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') pickup
ON pickup.LocationID = data.pickup_location_id
INNER JOIN
(SELECT * FROM zone_lookups WHERE Borough = 'Manhattan') dropoff
ON dropoff.LocationID = data.dropoff_location_id
ORDER BY num_trips desc
LIMIT 5;
┌───────────────────────┬───────────────────────┬───────────┐
│ pickup_zone │ dropoff_zone │ num_trips │
│ varchar │ varchar │ int64 │
├───────────────────────┼───────────────────────┼───────────┤
│ Upper East Side South │ Upper East Side North │ 536621 │
│ Upper East Side North │ Upper East Side South │ 455954 │
│ Upper East Side North │ Upper East Side North │ 451805 │
│ Upper East Side South │ Upper East Side South │ 435054 │
│ Upper West Side South │ Upper West Side North │ 236737 │
└───────────────────────┴───────────────────────┴───────────┘
Run Time (s): real 0.304 user 1.791931 sys 0.132745
(unedited query is similar to theirs, about .9s)And, doing the same to the optimized query drops it to about .265s.
Thanks for the reference; this question has been on my mind for some time.