As the "proper solution" here is of course not using PDFs that are hard-to-parse, but force elections to have machine parseable outputs. And LLMs can "fix in place" stupid solutions.
That's not a hate on the author though. I needed to do some PDF parsing for bank statements before; but also; the proper long-term solution is force banks (by law or by public interest) to have parseable statements, not parse it!
Like putting LLMs to understand bad codebase will not fix the bad codebase, but will build on top of it.
oh well c'est la vie
Tangentially- I appreciate what OpenElections does- however, I wish there was a similar organization that did not limit themselves to officially certified results. There are already other organizations who collect precinct results post-2016, and using only official results basically limits you to 2008 and afterwards, but historical election results are the real intrigue. Not to mention that I have noticed many blatant errors in election results that have supposedly been “certified” by a state/county government. The precinct results Pennsylvania publishes, for example, are riddled with issues.
We went with official precinct results for two main reasons: there are differences between election night and final results (some of them non-trivial) and to make the work more manageable. Agree that historical results are a real problem, and as a PA native I know only too well the errors that the state data contains, which is why we go county-by-county there.
The way that OpenElections handles this, with 'sources' and 'data' directories I think is a good way to bridge the gap.
We do this pretty decently in India - the results of pretty much every election run by the Election commission is updated on https://results.eci.gov.in/# and it's the same for the whole country.
In other news, any bank that does not produce a standard CSV file for their bank statements should be fined $1m per day until they do. It's ridiculous that this isn't the first option when you go to download them.
I had Gemini convert a bunch of charity forms yesterday, and the deviation was significant and problematic. Rephrasing questions, inventing new questions, changing the emphasis; it might be performing a lot better for numerical data sets, but it's rare to have one without a meaningful textual component.
I suspect if I'd done some corrective transformations before LLM scanning the success rate would have been higher, but the cost threshold of the project didn't warrant it.
I just quickly took a scanned document and the transcription looks good.
https://19january2021snapshot.epa.gov/sites/static/files/201...
https://g.co/gemini/share/d315b4047224
It even got the faded partial date stamp.
Thats one of the best scanned documents I've seen in years. Most scanning now is via phone.
This application seems very good - but still a bit amazing that lawmakers haven't just required that all data be uploaded via csv! Even if every csv was slightly different format, it would be way easier for everyone (LLM or not).
Not, you know, for any nefarious purpose…but because what we’ve used forever was good enough for grandpappy, so it’s obviously good enough for us.
/cough
And paid polls that the author claims will replace prediction markets:
Since it's printed it is clearly already in a database somewhere. Why can't that just be made public too.
Seems bizarre to OCR printed documents (although I am aware of many companies doing this to parse invoices, etc.)
One key problem is that the US has tens of thousands of local governments, and each of them get to solve problems in their own way.
Digital literacy of the kind that understands why releasing a CSV file is more valuable than a PDF is rare enough that most of them won't have someone with that level of thinking in a decision making role.
That is an unfair take on it. Come out to the midwest and talk to some of the clerks in the small townships and counties out here. They do know the value of improved data and tech. And they know that investing in better tech can result in a little less money in the bank, which results in less gas to plow the roads, less money to pay someone to mow the ditches, which means on more car wrecked by hitting a deer. So the question is often not about CSV vs. PDF. It is about overall budget to do all the things that matter to the people of their town. Tech sometime just doesn't make the cut.
Besides, elections tend to have their own tech provided by the county or state, so there is standardization and additional help on such critical processes.
People running the smallest of government entities in this country tend to have pretty good heads on their shoulders. They get voted out pdq when they don't.
This isn't meant as shade! I have full respect for people working in those roles. Knowing the difference between a CSV file and a PDF file - and understanding why there are people out there who curse the existence of PDFs and celebrate CSVs - is pretty arcane knowledge.
Also note that I blamed people in "a decision making role" - changing procedures requires buy-in from management. People in management roles are less likely to be thinking about CSVs v.s. PDFs than the people actually executing on the work.
As Derek pointed out in https://news.ycombinator.com/item?id=44320001#44322987 this may often be a vendor limitation - in which case there is a cost factor to consider, and the blame can also be shared between the vendor and the person who made the purchasing decision without understanding the difference between PDF and CSV export.
There's fifty states and almost 4000 counties in the US, not to mention territories. Even if it was only fifty different standards, that's still an overwhelming amount of work and exactly the problem you're replying about.
Don't have to bother with gerrymandering, or slick legal ways to arrest people for voting with the wrong documents. Or just good old fashioned intimidation, like making the polling place the police station or the ICE detention facility.
It's just a lot smoother process when you can simply write some software to manipulate the count.
Who's gonna check?
(No, seriously, Who's gonna check? Because you also need to layoff everyone in that department once you're in power.)
The problem is that once the counts are done and have been reported a lot of places then print those results out on paper and then scan those papers into a PDF for anyone who asks for a copy!