I was curious what it would take if I approached it the way I do with most CSV transformation tasks that I'm only intending to do once: use Unix command line tools such as cut, sed, sort, and uniq to do the bulk of the work, and then do something in whatever scripting language seems appropriate to put the final output in whatever format is needed.
The first part, using this command [1], produces output lines that look like this:
219,/blog/php-81-before-and-after,2021-06-21
and is sorted by URL path and then date.With 1 million lines that took 9 or 10 seconds (M2 Max Mac Studio). But with 100 million it took 1220 seconds, virtually all of which was sorting.
Turning that into JSON via a shell script [2] was about 15 seconds. (That script is 44% longer than it would have been had JSON allowed a comma after the last element of an array).
So basically 22 minutes. The sorting is the killer with this type of approach, because the input is 7 GB. The output is only 13 MB and the are under 300 pages and the largest page count is under 1000 so building the output up in memory as the unsorted input is scanned and then sorting it would clearly by way way faster.
[1] cut -d / -f 4- | sed -e 's/T..............$//' | sort | uniq -c | sed -e 's/^ *//' -e 's/ /,\//'
[2]
#!/bin/zsh
echo "{"
PAGE=none
while read LINE; do
COLS=("${(@s/,/)LINE}")
COUNT=${COLS[1]}
URL=${COLS[2]}
DATE=${COLS[3]}
if [ $URL != $PAGE ]; then
if [ $PAGE != "none" ]; then
echo
echo " },"
fi
PAGE=$URL
echo " \"\\$URL\": {"
FINISHDATE=no
else
if [ $FINISHDATE = "yes" ]; then
echo ","
fi
fi
echo -n " \"$DATE\": $COUNT"
FINISHDATE=yes
done
echo
echo "}"You mention in the README that the goal is to run things in a standard environment, but then you're using a near bleeding edge PHP version that people are unlikely to be using?
I thought I'd just quickly spin up a container and take a look out of interest, but now it looks like I'll have to go dig into building my own PHP packages, or compiling my own version from scratch to even begin to look at things?
That's why I built a performance challenge for the PHP community
The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (amongst the prize is the very sought-after PhpStorm Elephpant, of which we only have a handful left).
I hope people will have fun with it :)
A Wordpress instance will happily take over 20 seconds to fully load if you disable cache.
If you're talking about a WordPress instance with arbitrary plugins running an arbitrary theme, then sure — but that's an observation about those plugins and themes, not core.
As someone who has to work with WordPress, I have all kinds of issues with it, but "20 seconds to load core with caching disabled" isn't one of them.
Naked Wordpress is plenty fast, but as soon as you start adding sketchy plugins and Themes, things can spiral out of control.
It seems like something like vercel/cloudflare could host the content-side published as a worker for mostly-static content from a larger application and that would be more beneficial and run better with less risk, for that matter. Having the app editing and auth served from the same location is just begging for the issues WP and plugins have seen.
There are tens of thousands of plugins and themes to make a Wordpress website do whatever you want and look however you want, either for free or a very low fee. You have to replace that entire ecosystem for the same price to replace Wordpress.
No matter how many times people get hacked, the perceived value of getting something for nothing outweighs the eventual cost.
The thing about WordPress is you can put it on a box and lock it down so hard you just treat it as an untrusted process on your server.
In other words, every property can be modified through global event callbacks. Some events are called very early in the whole pipeline that let plugins just render whatever they want (e.g. render custom XML sitemaps).
Citation needed? You only need cache if a render is expensive to produce.
What takes 5 days to run
That's a huge improvement! How much was low hanging fruit unrelated to the PHP interpreter itself, out of curiosity? (E.g. parallelism, faster SQL queries etc)
A couple of things I did:
- Cursor based pagination - Combining insert statements - Using database transactions to prevent fsync calls - Moving calculations from the database to PHP - Avoiding serialization where possible
Making your application fast is less about tuning your runtime and more about carefully selecting what you do at runtime.
Runtime choice does still matter, an environment where you can reasonably separate sending database queries and receiving the result (async communication) or otherwise lets you pipeline requests will tend to have higher throughput, if used appropriately, batching queries can narrow the gap though. Languages with easy parallelism can make individual requests faster at least while you have available resources. Etc.
A lot of popular PHP programs and frameworks start by spending lots of time assembling a beautiful sculpture of objects that will be thrown away at the end of the request. Almost everything is going to be thrown away at the end of the request; making your garbage beautiful doesn't usually help performance.
Well on the subject of PHP, I think I've got a nice story.
The more recent one is about Wordpress. One day, I had this conversation:
Boss: "will the blog stay up?"
toast0: "yeah, nobody goes to the blog, it's no big deal"
Boss: "they will"
toast0: "oh, ummmm we can serve a static index.html and that should work"
Later that day, he posted https://blog.whatsapp.com/facebook I took a snapshot to serve as index.html and the blog stayed up. A few months later, I had a good reason to tear out WordPress (which I had been wanting to do for a long time), so I spent a week and made FakePress which only did exactly what we needed and could serve our very exciting blog posts in something like 10-20 ms per page view instead of whatever WordPress took (which was especially not very fast if you hit a www server that wasn't in the same colo as our database servers). That worked pretty well, until the blog was rewritten to run on the FB stack --- page weight doubled, but since it was served by the FB CDN, load time stayed about the same. The process to create and translate blog entries was completely different, and the RSS was non-compliant: I didn't want to include a time with the date, and there is/was no available timeless date field in any of the RSS specs, so I just left the time out ... but it was sooo much nicer to run.
Sadly, I haven't been doing any large scale optimization stuff lately. My work stuff doesn't scale much at the moment, and personal small scale fun things include polishing up my crazierl [1] demo (will update the published demo in the next few days or email me for the release candidate url), added IPv6 to my Path MTU Discovery Test [2] since I have somewhere to run IPv6 at MTU 1500, and I wrote memdisk_uefi [3], which is like Syslinux's MEMDISK but in UEFI. My goal with memdisk_uefi is to get FreeBSD's installer images to be usable with PXE in UEFI ... as of FreeBSD 15.0, in BIOS mode you can use PXE and MEMDISK to boot an installer image; but UEFI is elusive --- I got some feedback from FreeBSD suggesting a different approach than what I have, but I haven't had time to work on that; hopefully soonish. Oh and my Vanagon doesn't want to run anymore ... but it's cold out and I don't seem to want to follow the steps in the fuel system diagnosis, so that's not progressing much... I did get a back seat in good shape though so now it can carry 5 people nowhere instead of only two (caveat: I don't have seat belts for the rear passengers, which would be unsafe if the van was running)
Depending on the SQL engine, there are many PHP Cursor optimizations that save moving around large chunks of data.
Clean cached PHP can be fast for REST transactional data parsing, but it is also often used as a bodge language by amateurs. PHP is not slow by default or meant to run persistently (low memory use is nice), but it still gets a lot of justified criticism.
Erlang and Elixir are much better for clients/host budgets, but less intuitive than PHP =3
Other than the obvious point that writing an enormous JSON file is a dubious goal in the first place (really), while PHP can be very fast this is probably faster to implement in shell with sed/grep, or ... almost certainly better ... by loading to sqlite then dumping out from there. Your optimization path then likely becomes index specification and processing, and after the initial load potentially query or instance parallelization.
The page confirms sqlite is available.
If the judges whinge and shell_exec() is unavailable as a path, as a more acceptable path that's whinge-tolerant, use PHP's sqlite feature then dump to JSON.
If I wanted to achieve this for some reason in reality, I'd have the file on a memory-backed blockstore before processing, which would yield further gains.
Frankly, this is not much of a programming problem, it's more a system problem, but it's not being specced as such. This shows, in my view, immaturity of conception of the real problem domain (likely IO bound). Right tool for the job.
Where do I get my prize? ;)
When people say leetcode interviews are pointless I might share a link to this post. If that sort of optimization is possible there is a structures and algorithms problem in the background somewhere.
They tend to be about the implementation details of specific algorithms and data structures. Whereas the important skill in most real-world scenarios would be to understand the trade-offs between different algorithms and data structures so that you pick an appropriate off-the-shelf implementation to use.
Go is the most battery-included language I've ever used. Instant compile times means I can run tests bound to ctrl/cmd+s every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it just like Python.
Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, it's syntax doesn't make sense to an outsider and you end up with 3rd party packages for a lot of things, but can't beat it's performance and safety. Removes a whole lot of tests as those situations just aren't possible.
If Rust scares you use Go. If Go scares you use Rust.
You don't even have to ask. They will tell you and usually add nothing to the conversation while doing so.
Quite off-putting.
I'm just glad modern languages stopped throwing and catching exceptions at random levels in their call chain. PHP, JavaScript and Java can (not always) have unreadable error handling paths not to mention hardly augmenting the error with any useful information and you're left relying on the stack trace to try to piece together what happened.
{
"\/blog\/11-million-rows-in-seconds": {
"2025-01-24": 1,
"2026-01-24": 2
},
"\/blog\/php-enums": {
"2024-01-24": 1
}
}See page 4, section 9 of the latest ECMA for JSON: https://ecma-international.org/wp-content/uploads/ECMA-404_2...
JSON_PRETTY_PRINT is irrelevant. Escaping slashes is the default behavior of json_encode(). To switch it off, use JSON_UNESCAPED_SLASHES.
So apparently that is what they consider "pretty JSON". I really don't want to see what they would consider "ugly JSON".
(I think the term they may have been looking for is "pretty-printed JSON" which implies something about the formatting rather than being a completely subjective term)
https://github.com/kjdev/php-ext-jq
And replicate this command:
jq -R ' [inputs | split(",") | {url: .[0], date: .[1] | split("T")[0]}] | group_by(.url) | map({ (.[0].url): ( map(.date) | group_by(.) | map({(.[0]): length}) | add ) }) | add ' < test-data.csv
And it will be faster than anything you can do in native php
Edit: I'm assuming none of the urls have a comma with this but it's more about offloading it through an extension, even if you custom built it
Except that the generator script generates dates relative to time() ?
https://dev.to/realflowcontrol/processing-one-billion-rows-i...
> duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"
Takes about 8 seconds on my M1 Macbook. JSON not in the right format, but that wouldn't dominate the execution time.
https://github.com/tempestphp/100-million-row-challenge/pull...
...
> Your parser should store the following output in $outputPath as a JSON file:
{
"\/blog\/11-million-rows-in-seconds": {
"2025-01-24": 1,
"2026-01-24": 2
},
"\/blog\/php-enums": {
"2024-01-24": 1
}
}
They don't define what exactly "pretty" means, but superflous escapes are not very pretty in my opinion.