My job was to examine blood lead data from our local Hurley Children's Hospital in Flint for spatial patterns, or neighborhood-level clusters of elevated levels, so we could quash the doubts of state officials and confirm our concerns. Unbeknownst to me, this research project would ultimately help blow the lid off the water crisis, vindicating months of activism and outcry by dedicated Flint residents.
As I ran the addresses through a precise parcel-level geocoding process and visually inspected individual blood lead levels, I was immediately struck by the disparity in the spatial pattern. It was obvious Flint children had become far more likely than out-county children to experience elevated blood lead when compared to two years prior.
How had the state so blatantly and callously disregarded such information? To me – a geographer trained extensively in geographic information science, or computer mapping – the answer was obvious upon hearing their unit of analysis: the ZIP code.
ZIP codes – the bane of my existence as a geographer. They confused my childhood friends into believing they lived in an entirely different city. They add cachet to parts of our communities (think 90210) while generating skepticism toward others relegated to less sexy ZIP codes.
A tale to remind the scientists and technologists among us why it's important to do our jobs well.
(Score: 5, Informative) by AthanasiusKircher on Wednesday September 21 2016, @03:08PM
This problem shouldn't be surprising to anyone who has any training in basic stats. ZIP codes are mostly arbitrary divisions. Yes, they're often roughly organized around municipal divisions and such, but that may not always track with the variables you're actually looking at. In this case, you need a division that tracks "attached to city water" vs. "not attached to city water." ZIP codes didn't meet that criterion. Superimposing arbitrary divisions onto a pool of data can mask patterns in the data, or it can make patterns appear which aren't really there. Or it can even make trends in data apparently reverse (known as Simpson's paradox [wikipedia.org]).
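To make the reversal concrete, here is a minimal Python sketch of Simpson's paradox. Every count in it is invented for illustration and is not Flint data; the group labels and the helper `pooled_rate` are likewise hypothetical. The elevated rate rises within each group, yet the pooled rate falls, simply because the mix of who got tested shifted between periods.

```python
# Invented counts, NOT real data: (elevated, tested) per group and period.
counts = {
    "city water": {"before": (20, 100), "after": (6, 25)},
    "other water": {"before": (1, 25), "after": (5, 100)},
}

def rate(elevated, tested):
    """Share of tested children with an elevated result."""
    return elevated / tested

def pooled_rate(period):
    """Rate for one period after lumping both groups together."""
    elevated = sum(group[period][0] for group in counts.values())
    tested = sum(group[period][1] for group in counts.values())
    return elevated / tested

# Within each group the rate goes up between the two periods...
within_group_rises = all(
    rate(*group["after"]) > rate(*group["before"]) for group in counts.values()
)
# ...but the pooled rate goes down, because far fewer city-water
# children were tested in the second period.
pooled_falls = pooled_rate("after") < pooled_rate("before")

for name, group in counts.items():
    print(name, rate(*group["before"]), "->", rate(*group["after"]))
print("pooled", pooled_rate("before"), "->", pooled_rate("after"))
```

Nothing about the underlying children changes between the two views; only the grouping does.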
There's a much broader lesson here than ZIP codes. If you're analyzing data, you need to be certain the way you're grouping it is meaningful to your analysis. Moreover, you should generally check for statistical artifacts by looking at patterns with and without divisions (or with different divisions) to check for robustness in correlations, but also in case your groupings are masking a broader pattern.
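One rough way to run that kind of check is to compute the same rate under more than one grouping and compare. The sketch below does this in Python on invented records, not real data; the labels "ZIP A" and "ZIP B", the water-source field, and the helper `rate_by` are all assumptions made for illustration. It imagines a ZIP code that straddles the city water boundary.

```python
from collections import defaultdict

# Invented records, NOT real data: (zip_label, water_source, elevated).
# "ZIP A" straddles the boundary: 10 children on city water, 30 not.
records = (
    [("ZIP A", "city", True)] * 3
    + [("ZIP A", "city", False)] * 7
    + [("ZIP A", "other", False)] * 30
    + [("ZIP B", "other", True)] * 1
    + [("ZIP B", "other", False)] * 19
)

def rate_by(rows, key_index):
    """Elevated share per group, grouping on the field at key_index."""
    totals = defaultdict(lambda: [0, 0])  # group -> [elevated, tested]
    for row in rows:
        bucket = totals[row[key_index]]
        bucket[0] += row[2]  # True counts as 1, False as 0
        bucket[1] += 1
    return {group: e / n for group, (e, n) in totals.items()}

overall = sum(row[2] for row in records) / len(records)  # no divisions
by_zip = rate_by(records, 0)    # grouping that ignores water source
by_water = rate_by(records, 1)  # grouping that tracks the variable of interest

print("overall:", overall)
print("by ZIP:", by_zip)
print("by water source:", by_water)
```

With these made-up numbers, the two ZIP rates sit close together (7.5% and 5%), while the split by water source is 30% against 2%. The ZIP grouping dilutes the city-water children among their neighbors, which is the masking effect described above.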