One of my first jobs as an analyst was to clean up messy spreadsheets made by people, even very senior employees, who never bothered to learn Excel properly.
Let alone column sorting and joining of data.
I can already hear people who like CSV coming in now, so to get some of my bottled-up anger about CSV out and to forestall the responses I've seen before:
* It's not standardised
* Yes, I know you found an RFC written long after many generators and parsers already existed. It's not a standard, it's regularly not followed, and it doesn't say how a plain file should declare UTF-8 (lmao, in 2005 no less) or any other character set. I have learned about many new character sets from data submitted by real users. I have had to split up files written in multiple different character sets because users concatenated files.
* "You can edit it in a text editor", which feels like a monkey's-paw wish: "I want to edit the file easily" "Granted - your users can now edit the files easily". Users editing the files in text editors results in broken CSV files, because your text editor isn't checking that the file is standards-compliant or typed correctly, and couldn't even if it wanted to.
* Errors are not even detectable in many cases.
* Parsers are often either strict and so fail to deal with real world cases or deal with real world cases but let through broken files.
* Literally no types. Nice date field you have there, shame if someone were to add a mixture of different dd/mm/yy and mm/dd/yy into it.
* You can blame Excel for being Excel, but at some point if that CSV file leaves an automated data handling system and a user can do something to it, it's getting loaded into Excel and rewritten out. Say goodbye to leading zeros, a variety of gene names, dates and more in a fully unrecoverable fashion.
* "Ah, just use tabs": no, your users will put tabs in. "That's why I use pipes": yes, pipes too. I have written code to use the actual unit separators and record separators that exist in ASCII, and still users found some way of adding those mid-word in some arbitrary data. The only three places I've ever seen these characters are 1. lists of ASCII characters, where I found them, 2. my code, 3. this user's data. It must have been crafted deliberately to break things.
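For anyone who hasn't met the ASCII separators mentioned in the last bullet, here is a minimal sketch of the idea in Python. The function names `encode` and `decode` are made up for illustration and this is not the code referred to above. The point is the check in `encode`: since users will eventually put the separators in their data, the writer has to refuse them rather than trust that nobody types 0x1F.

```python
US = "\x1f"  # ASCII unit separator, used here between fields
RS = "\x1e"  # ASCII record separator, used here to terminate each record


def encode(rows):
    """Serialise rows of string fields, refusing any field that contains a separator."""
    for row_number, row in enumerate(rows):
        for field in row:
            if US in field or RS in field:
                raise ValueError(
                    f"row {row_number}: field contains a separator: {field!r}"
                )
    # Every record is terminated (not just joined), so truncation is detectable.
    return "".join(US.join(row) + RS for row in rows)


def decode(text):
    """Split serialised text back into rows of fields."""
    if text and not text.endswith(RS):
        raise ValueError("input does not end with a record separator (truncated?)")
    # The final split element is the empty string after the last terminator.
    return [record.split(US) for record in text.split(RS)[:-1]]


# Commas, tabs, pipes, quotes and newlines need no escaping at all.
rows = [["id", "note"], ["007", 'commas, tabs\t, | pipes and "quotes"\nare fine']]
assert decode(encode(rows)) == rows
```

It still carries no types and no character set declaration, so it only addresses the delimiter complaint, not the rest of the list.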
This, Excel and other things are enormous issues. The fact that there are any manual steps along the path introduces so many places for errors. People writing things down then entering them into Excel/whatever. Moving data between files. You ran some analysis and got graphs, are those the ones in the paper? Are they based on the same datasets? You later updated something, are all the downstream things updated?
This occurs in all kinds of papers. I've seen clear and obvious issues in datasets covering many billions in spending, in aggregate trillions. I can only assume the same is true in many other fields, since the same processes exist there too.
There is so much scope to improve things, and yet so much of this work is done by people who don't know what the options are, and who are often working late hours in their personal time to sort it out, that it's rarely addressed. My wife was still working on papers for a research position she had left, and was no longer being paid for, years afterwards, because the whole process from research to publication is so slow. What time is there then for learning and designing a better way of tracking and recording data and teaching all the other people how to update & generate stats? I built things which helped but there's only so much of the workflow I could manage.
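One cheap way to answer "are these graphs based on the same datasets?" is to store a hash of every input next to the outputs. The following is a rough sketch under my own assumptions, not a description of any particular lab's setup: the helper names (`sha256_of`, `write_manifest`, `stale_inputs`) and the file names are invented for the example, and only the Python standard library is used.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(inputs, manifest_path):
    """Record the digest of every input file alongside the derived outputs."""
    manifest = {str(path): sha256_of(path) for path in inputs}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def stale_inputs(manifest_path):
    """Return the inputs that are missing or whose contents changed since the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    return sorted(
        name
        for name, digest in manifest.items()
        if not Path(name).exists() or sha256_of(name) != digest
    )


# Demo with a throwaway file: edit the data after the "analysis" and it is flagged.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "measurements.csv"  # hypothetical input file
    manifest = Path(tmp) / "figure1.manifest.json"  # hypothetical output-side record
    data.write_text("id,value\n1,2.5\n")
    write_manifest([data], manifest)
    assert stale_inputs(manifest) == []
    data.write_text("id,value\n1,2.6\n")  # someone "later updated something"
    assert stale_inputs(manifest) == [str(data)]
```

This doesn't stop anyone editing a file by hand, it only makes it visible that a figure was produced from a different version of the data.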
The people who get caught red handed like this are lazy, incompetent and stupid. Makes you wonder what about the ones not getting caught.
Being a cheat significantly correlates with laziness, incompetence and stupidity so there are probably very few cheats smart and diligent enough to not get caught.
Handwaving correlations between cheating/criminality and most personality/intelligence aspects is an error, not least because there is a selection bias problem (eg. who gets caught).
There is no evidence for this.
Expensive tools, expensive test setups, live, gene-altered animals, etc.
In fields such as deep learning or other more digital fields (my field is using a lot of freely available satellite data) replication is often cheaper and actual application of research outcomes is a lot more common.
A LOT of labour goes into making it work. Most scientists I know and work with are very diligent people who care a lot about the outputs being as correct as possible, but wow, their workflows aren't great.
My job is to try and address this in whatever ways are practical for the data and the people doing the science, and it's kind of like SaaS in that you think it should be easy enough to spot problems, solve them, and carry on/become a billionaire, but... The world is much more complicated than that, and it's easier to fail in this endeavour than it is to break even.
The classic "Dropbox is just rsync" or "I could build Airbnb in a weekend" sentiments have their commonalities and counterparts in science, and the reality is similarly defeating and punishing on both sides. Making science go faster while maintaining correctness is exceedingly difficult. There are so many moving parts. So many disparate participants who are wildly technical and capable, or brilliant at studying bacteria in starfish yet terrified to run a command in a terminal. Your user base has virtually nothing in common in terms of ability and willingness to do anything other than get their own work done. It's brutal.
So, I sympathize with the authors of these papers and I hope readers don't assume they're bad at what they do or that it's done in bad faith. It's genuinely difficult.
An anecdote: I created a tool for validating biodiversity data against a specification called Darwin Core. Initially our published data was failing to validate so much that I thought I'd made the tool wrong. Rather, the spec is so complex and vast that the people I work with were unable to manage to get valid data into the public repositories. And yet! They were able to publish, because the public repositories' own validation is... Invalid. That's the state of things.
Granted, the data is still correct enough to be useful, and the errors don't cause the results to indicate anything that they shouldn't. It's more like minor metadata issues, failures to maintain referential integrity across different datasets, etc. But it's a very real, very difficult problem.
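To give a flavour of the two kinds of check mentioned above (missing metadata and broken references between datasets), here is a toy sketch. It is not the validator described in the anecdote and it covers none of the real complexity of Darwin Core. The term names are genuine Darwin Core terms, but treating these four as required is an assumption made for illustration, and the records are made up.

```python
# Assumption for this sketch: these four terms are treated as mandatory.
REQUIRED_TERMS = ("occurrenceID", "basisOfRecord", "scientificName", "eventDate")


def missing_terms(record):
    """Return the assumed-required terms that are absent or blank in one record."""
    missing = []
    for term in REQUIRED_TERMS:
        value = record.get(term)
        if value is None or not str(value).strip():
            missing.append(term)
    return missing


def dangling_references(occurrences, events):
    """Return occurrenceIDs whose eventID does not appear in the events table."""
    known = {event["eventID"] for event in events}
    return [
        occ["occurrenceID"]
        for occ in occurrences
        if occ.get("eventID") and occ["eventID"] not in known
    ]


# Made-up example records, not real published data.
events = [{"eventID": "E1"}]
occurrences = [
    {"occurrenceID": "O1", "basisOfRecord": "HumanObservation",
     "scientificName": "Asterias rubens", "eventDate": "2021-06-01", "eventID": "E1"},
    {"occurrenceID": "O2", "basisOfRecord": "HumanObservation",
     "scientificName": "", "eventDate": "2021-06-02", "eventID": "E9"},
]

assert missing_terms(occurrences[0]) == []
assert missing_terms(occurrences[1]) == ["scientificName"]
assert dangling_references(occurrences, events) == ["O2"]
```

Neither failure in the second record would change a headline result, which is the point made above: the data can be useful and still not be valid.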
Science isn't easy at all. So many hoops to jump through, so much rigor, so much data. Mistakes are inevitable.
In a lot of cases (where data is being collected by humans with a tape measure, say) there is room for error. But one of the things that's getting traction in some fields is open-source publication of both raw datasets and the evaluation/processing methods (in a Jupyter Notebook, say) in a way that lets other people run their analysis on your data, your analysis on their data, or at least re-run your start-to-finish pipeline and look for errors!
As is often the case, the holdups are mostly political: methods papers are less prestigious than the "real science" ones, and it takes journals / funders to mandate these things and provide funding/hosting for datasets for 10+ years, etc - researchers are a time-poor bunch and often won't do things unless there's an incentive to!
There are incentives for these spreadsheets having the values that they do, and also there is no conceivable way that the values are correct, and on top of that, the most likely ways to get these values are to copy and paste large amounts of numbers, and even perturb some of them manually.
If you see this in accounting (where there are also mistakes), it’s definitely fraud. (Awww man - we accidentally inflated our revenue and profit to meet expectations by accidentally duplicating numerous revenue lines and no one internally caught it! Dang interns!) If you see it in science, you ask the authors about it and they shrug and mumble a semi-plausible explanation if you’re lucky? I can totally imagine a lab tech or grad student making a large copy-paste mistake. I can’t imagine them making a series of them in such a way that it bolsters or proves the author’s claim AND goes completely undetected by everyone involved.
The small minority of cases that do fit this pattern get selected to be on the front page of HN. So we aren't drawing from a random sample of mistakes. All the selection effects work against the more common categories of mistakes showing up on the HN front page: author disinterest, reader disinterest, rejection by the journal, and a lack of publicity if the null result is published. The more reliable tell that it's fraud is that the authors didn't respond when the errors were discovered.
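For what it's worth, the copy-and-paste pattern is easy to look for mechanically. Below is a small sketch of one way to do it: slide a window over a column and report any run of consecutive values that shows up again elsewhere. The function name `repeated_runs`, the default window of 4 and the numbers are all invented for the example, and a hit is only a prompt for a human to look, not evidence of intent.

```python
from collections import defaultdict


def repeated_runs(values, run_length=4):
    """Find runs of `run_length` consecutive values that occur at least twice.

    Returns a dict mapping each repeated run to the start indexes where it occurs.
    Runs whose first and last occurrences overlap are skipped, so a constant
    stretch such as 0, 0, 0, 0, 0 is not reported as a copy of itself.
    """
    positions = defaultdict(list)
    for start in range(len(values) - run_length + 1):
        positions[tuple(values[start:start + run_length])].append(start)
    return {
        run: starts
        for run, starts in positions.items()
        if len(starts) > 1 and starts[-1] - starts[0] >= run_length
    }


# Made-up measurements: the block 13.4, 9.8, 10.2, 11.7 appears twice.
column = [12.1, 13.4, 9.8, 10.2, 11.7, 8.3, 13.4, 9.8, 10.2, 11.7, 14.0]
assert repeated_runs(column) == {(13.4, 9.8, 10.2, 11.7): [1, 6]}
```

Manually perturbed copies would slip past an exact match like this, so a real check would need some tolerance, and low-precision or categorical columns will repeat by chance.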
Sounds like a startup idea.
One example of these might be systems like S3 and distributed computing in AWS. Like, huge ideas that take massive initiatives to implement, but make science meaningfully easier. I can't think of many other modern technologies we use that the team doesn't mostly resent (like Slack or Google Drive). They're largely interested in just doing the science, the rest eats into funding (which is increasingly sparse these days).
The solutions these scientists need are bespoke and share little in common. They also have fixed grant funding.
In 2009 I made $15/hr working with some PhDs and grad students in a couple different labs to automate their workflows - I was the highest paid person in the room most of the time.
In one case, we used mdftools to literally use the original Excel spreadsheet as our logic engine.
I can easily imagine after spending years or decades devoted to discovering a scientific breakthrough that some could be tempted to slightly alter the data. I believe there was some scandal about this a few years back with climate data. Fixing this is however something that AI would do fairly well.
Identifying it is something AI could do well, though. It’s very good at finding patterns - that’s kind of essential to how it works.
But AI can also hallucinate data. I am not sure this is an area for an automatic "AI is better than humans". Honesty is very important in science. There were even fake articles generated:
https://www.thelancet.com/journals/lancet/article/PIIS0140-6...
And some other article I forgot, about arsene or some other ion being used in/for DNA or so. Turned out to be totally fabricated. Right now I don't remember the name of the article; was from some years ago.
Recent example I found (semi-accidentally, I was only looking for microscopy related courses):
https://ufind.univie.ac.at/de/course.html?lv=301053&semester...
At the end of the description it has:
"Übersetzt mit DeepL.com (kostenlose Version)"
This means, in English, "translated via DeepL.com (free version)", aka the not-paid-for version. What I found baffling is that even for a single paragraph, some people are too lazy to write stuff on their own, or at least to remove that disclaimer. Other people also pointed out that they saw this in autogenerated brochures/booklets, in the USA for instance; I think I saw this about 3 months ago but I forgot which booklet it was. But the whole booklet was AI-autogenerated. To me this is all spam. I do not want to be bothered to read AI "content" when it is really just glorified slop-spam.