But if institutions are expected to release clean data or nothing, it is almost always the latter.
What is important is to offer as much methodology and as many caveats as possible, even if informally. There is a difference between "data covers 72% of companies registered in..." and letting people assume the data is complete and authoritative when it is not.
(Source: 10 years ago I worked a lot with official data. All data requires cleaning.)
To be clear, I'm not saying that we should accept messy data. Just, reality is messy and it's naive to think we can catch and remove all of reality's messiness -- which includes the bureaucratic slop that led to the data being published in the first place.
On the other hand, I agree that bad (but usually fixable) data is better than no data.
I prefer getting data with swapped lat/lng (a trivial fix), or prices labelled as dollars but actually in cents, to getting no data at all.
Those seem reasonable asks.
Edit to add: the tragedy of the school in Minab is an example of how bad things can go--and it only hints at how much worse bad data can be.
Do you remove those weird implausible outliers? They're probably garbage, but are they? Where do you draw the line?
If you've established the assumption that the data collection can go wrong, how do you know the points which look reasonable are actually accurate?
Working with data like this means unknown error bars, and I've had weird shit happen where I fixed the tracing pipeline, only for the metrics people to complain that they had already corrected for the errors downstream, and now, thanks to those corrections, the whole thing looked out of shape.
This isn't possible to answer generally, but I'm sure you know that.
Look -- I've been in nonstop FOIA litigation for data for the past ten years. During litigation I can definitely push back on messy data, and I have, but if I were to do that on every little "obviously wrong" point, my litigation would get thrown out for me being a twat of a litigant.
Again, I'd rather have the data and publish it with known gotchas.
Here's an example: https://mchap.io/using-foia-data-and-unix-to-halve-major-sou...
Should I have told the Department of Finance to fuck off with their messy data? No -- even if I wanted to. Instead, we learn to work with its awfulness and advocate for cleaner data. Which is exactly what happened here -- once others and I started publishing stuff about ticket data and more journalists got involved, the data became cleaner over time.
Since this is not data you collected, I understand you have to work with what you have. By the way, very interesting post, and nice job!
One problem is that you can't just focus on outliers. Whatever pattern-matching you use to spot outliers will end up introducing a bias in the data. You need to check all the data, not just the data that "looks wrong". And that's expensive.
In clinical drug trials, we have the concept of SDV--Source Data Verification. Someone checks every data point against the official source record, usually a medical chart. We track the % of data points that have been verified. For important data (e.g., Adverse Events), the goal is to get SDV to 100%.
As you can imagine, this is expensive.
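For a sense of what that bookkeeping looks like, here is a minimal sketch (the field names and structure are hypothetical, just to illustrate the coverage metric; real SDV lives in clinical data management systems, not a script):

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    field: str                     # e.g. "adverse_event"
    value: str
    source_verified: bool = False  # set True once checked against the medical chart

def sdv_coverage(points: list[DataPoint], field: str) -> float:
    """Percentage of data points for a given field that have been source-verified."""
    relevant = [p for p in points if p.field == field]
    if not relevant:
        return 0.0
    verified = sum(p.source_verified for p in relevant)
    return 100.0 * verified / len(relevant)

points = [DataPoint("adverse_event", "headache", True),
          DataPoint("adverse_event", "nausea", False)]
print(sdv_coverage(points, "adverse_event"))  # 50.0 -- the goal for critical fields is 100%
```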
Will LLMs help to make this cheaper? I don't know, but if we can give this tedious, detail-oriented work to a machine, I would love it.
Yes, data can contain subtle errors that are expensive and difficult to find. But the 2nd error in the article was so obvious that a bright 10 year old would probably have spotted it.
But sometimes the "provenance" of the data is important. I want to know whether I'm getting data straight from some source (even with errors) rather than having some intermediary make fixes that I don't know about.
For example, in the case where maybe they flipped the latitude and longitude, I don't want them to just automatically "fix" the data (especially not without disclosing that).
What they need to do is verify the outliers with the original gas station and fix the data from the source. But that's much more expensive.
The issue imo is that a person closer to the point where the data was collected or merged is probably better equipped to understand what may be wrong with it than a random person looking into that dataset. So I do not think it is unreasonable to have people in organisations take a second look at the datasets they publish.
> The issue imo is that a person closer to the point where the data was collected or merged is probably better equipped to understand what may be wrong with it
You'd think so, but just like most other systems, these systems are often inherited or not thought through, so the understanding is external and we can't assume expertise within.

This can skew the dataset and lead to misinterpreted results if which rows are wrong is not completely random.
E.g. if all data from a specific location (or year, etc.) comes in wrong, then this kind of cleaning would just completely exclude that location, which depending on the context may or may not be a problem. Or if values above a specific threshold come in wrong. Or any other way in which the errors are not randomly distributed.
Removing data is never a neutral choice, and which data gets removed should always be taken into consideration.
Absolutely. If you have obviously wrong data your choices are generally:
1. Leave the bad data in.
2. Leave the bad data in and flag it as suspect.
3. Omit the bad data.
4. Correct the bad data.
Which is the best choice depends on context and requires judgement. But I find it hard to imagine any situation where option 1 is the right choice.
Obviously the best solution is to do basic validation as the data is entered, so that people can't add a location in the Indian Ocean to a UK dataset. It seems rather negligent that they didn't do this.
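Even a crude bounding-box check would catch that. A minimal sketch in Python (the range values and function name are mine and purely illustrative, not anything from the actual Fuel Finder service):

```python
# Rough bounding box for the UK; values are approximate and for illustration only.
UK_LAT_RANGE = (49.8, 60.9)
UK_LON_RANGE = (-8.7, 2.0)

def validate_uk_location(lat: float, lon: float) -> list[str]:
    """Return validation errors for a submitted location; empty list means it looks OK."""
    errors = []
    if not (UK_LAT_RANGE[0] <= lat <= UK_LAT_RANGE[1]):
        errors.append(f"latitude {lat} is outside the UK range {UK_LAT_RANGE}")
    if not (UK_LON_RANGE[0] <= lon <= UK_LON_RANGE[1]):
        errors.append(f"longitude {lon} is outside the UK range {UK_LON_RANGE}")
    return errors

print(validate_uk_location(51.5, -0.1))   # [] -- central London, accepted
print(validate_uk_location(-0.5, 73.2))   # two errors -- Indian Ocean, rejected at entry
```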
Messy data is a signal. You're wrong to omit signal.
A better solution is to add a field to indicate that "the row looks funny to the person who published the data". Which, I guess is useful to someone?
But deleting data or changing data is effectively corrupting source data, and now I can't trust it.
People who don't heed this advice get to discover it for themselves (I sure did)
If you can't make the data convincing, you'll lose all trust, and nobody will do business with you.
I have also learned that rarely does anyone care if it’s any good, or means anything. This is generally true, but it’s especially true if you are going with the prevailing winds of whatever management fads are going on.
Like, right now, you can definitely get away with inflating the efficacy of “AI” any way you can, in almost any company. Nobody with any authority will call you on it.
Look at what management’s talking about and any pro-that numbers you come up with can be total gibberish, nobody minds. “Oh man, collecting good numbers for this and getting a baseline etc etc is practically impossible” ok so don’t and just use bad numbers that align with what management wants to do anyway. You’ll do great.
If the company were an airplane, upper management was essentially flying it by instruments. It would have been a scandal if the metrics had serious issues.
Some of the metrics less directly tied to business stuff were a bit more 'creative' - as in I could justify why I did them that way, but still not 100% solid.
Stuff like optimizing data pipelines, where data scientist experiments that used to take 1 hour now only took 10 minutes.
I could say that data people were 6x as productive, but it's just as possible they were simply more careless with what they ran -- but whatever, a white lie.
However, saying that stuff takes 1/6th the time when in fact it doesn't is an absolute no-go. Neither is not knowing why there is a run that took 500 hours or 5 seconds, both of which should be impossible.
Doing that stuff destroys the confidence in the rest of the data.
But fake data, or garbage data without the method, is better left unpublished!
I found the errors in a few minutes with a $198 tool.
Clean vs not clean data is the wrong fight.
> Authors should have their work proof read
Agreed.
Opening passage:
> A quick plot of the latitude and longitude shows some clear outliners
"outliners"
Ouch!
Now fixed.
But, hey, we’re all wise after the event. To their credit though, they do seem to be actively reacting to feedback. I also contacted them about the bad data issue, and they are now adding user warnings about bad price values at the point of data entry (according to https://www.developer.fuel-finder.service.gov.uk/release-not...).
"Stop Publishing Garbage Data, It’s Embarrassing"
versus the rather lamer:
"Twice this week, I have come across embarassingly bad data"
?
I have written my own Home Assistant custom component for the UK fuel finder data, and yes, the data really is that bad.
Easy typo to make, but seriously, does no one even take a cursory look at the charts when publishing articles like this? The chart looks _obviously_ wrong, so imagine how many are only slightly wrong and are missed.
The fuel prices one could surely be solved with a tiny bit of validation; are the coordinates even within a reasonable range? Fortunately, in the UK, it's really easy to tell which is latitude and which is longitude due to one of them being within a digit or two of zero on either side.
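That heuristic is simple enough to automate. A rough sketch (the thresholds are my guesses for UK data, nothing official):

```python
def looks_swapped(lat: float, lon: float) -> bool:
    """UK heuristic: latitude should sit around 50-61, longitude within a few degrees
    of zero. If the pair only makes sense the other way round, it was probably swapped."""
    plausible = 49.0 <= lat <= 61.0 and -9.0 <= lon <= 2.0
    plausible_if_swapped = 49.0 <= lon <= 61.0 and -9.0 <= lat <= 2.0
    return not plausible and plausible_if_swapped

# A station reported at (-1.5, 52.4) is almost certainly (52.4, -1.5):
print(looks_swapped(-1.5, 52.4))  # True -- flag it for review rather than silently fixing it
print(looks_swapped(52.4, -1.5))  # False -- already the right way round
```

Note this only flags the row; per the provenance point above, silently "fixing" it without disclosure is its own problem.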
What about people who don't know how their own code works? Despite it working flawlessly? I'm asking because I don't really know.
Yes.
Sure, it is expensive to check every number, but at least some of it can be automated and flagged for human review, no? Swapped lat/long numbers, for example.
And if someone publishes flawless code but has no idea how it works, it's quite clearly not their code, and they should be ashamed if they lie that it is.
It's just, like, my opinion, but I like it :)
Yes. Lying is bad, even if some people are trying hard to normalise it.
>What about people who don't know how their own code works? Despite it working flawlessly?
I think that is fine, as long as you aren't making untrue claims.