Large datasets and predictive analytics software are a fertile field for innovation, but while excellent open source tools like SciPy, R, etc. are freely available, the datasets are not. A Computerworld article notes that large publicly available data collections are so scarce that a dataset Netflix released for a competition half a decade ago is still being used constantly in computer science research.
Australia's government does provide an easy way to find, access and reuse some public datasets, but most public and private databases are siloed away from experimenters. The Open Data Handbook offers some guidelines for defining openness in data, but offers little that would drive organisations to make their datasets available.
So do we need a GPL for data, and if so, what would it look like?
(Score: 4, Insightful) by wantkitteh on Wednesday March 18 2015, @07:41PM
Ah, hells nah! Anonymising data sets like this is a problem [oreilly.com] - anyone who's ever done a logic puzzle [logic-puzzles.org] will know exactly how much it's possible to work out given a comparatively small data set and some well-defined parameters to work in. The more data sets available, the more they can be integrated into each other and the less anonymous all that data will inherently become, no matter how "anonymised" it was when it was downloaded.
(Score: 0) by Anonymous Coward on Wednesday March 18 2015, @08:06PM
Good point. Many data sets at various institutions could be useful if publicized, but often contain various forms of personal information that need to be anonymized, which may be trickier than it seems. In either case, both this cleaning and the exporting and packaging of the data in a suitable format for publishing take time and effort, which translates to labor costs.
Why spend a significant fraction of your operating budget to give outsiders free stuff when you could also spend that towards your mission statement or a better coffee machine? Other than some good PR, the incentives are just not there.
(Score: 2, Informative) by Anonymous Coward on Wednesday March 18 2015, @08:18PM
> Other than some good pr, the incentives are just not there.
It should be made part of the grant requirements. Just like it is starting to be for open access publication. [wikipedia.org]
However, I too am super wary of any 'anonymization' of personal data. Anything that is not personal, like climate data, or number of animals per square meter, etc., is no problem. Personal data, though, is so easy to fuck up when anonymizing, because much of the problem comes from cross-referencing it with other databases: what seems anonymous to the person who only looks at that one data set turns out not to be anonymous once in the full context of the world at large... [techdirt.com] Anything close to that may need a different policy.
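To make the cross-referencing problem concrete, here is a rough sketch of a linkage attack in Python. Everything in it is made up for illustration: the two toy data sets, the field names, the `QUASI_IDENTIFIERS` tuple and the `reidentify` helper are all assumptions, not anything taken from a real release. The idea is only that a "nameless" data set can be joined to a public one on the fields they share.

```python
from collections import defaultdict

# Hypothetical "anonymised" release: names removed, other fields kept.
anonymised = [
    {"zip": "2000", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "2000", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "3000", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public register collected in a completely different context.
public = [
    {"name": "Alice Example", "zip": "2000", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "2000", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "3000", "birth_year": 1990, "sex": "F"},
    {"name": "Dana Example", "zip": "3000", "birth_year": 1990, "sex": "F"},
]

# Fields that appear in both data sets (an assumption of this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Build the join key from the shared fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymised_rows, public_rows):
    """Return {name: diagnosis} for rows that match exactly one person."""
    names_by_key = defaultdict(list)
    for person in public_rows:
        names_by_key[key(person)].append(person["name"])

    matches = {}
    for row in anonymised_rows:
        candidates = names_by_key.get(key(row), [])
        if len(candidates) == 1:  # a unique match means the row is re-identified
            matches[candidates[0]] = row["diagnosis"]
    return matches


reidentified = reidentify(anonymised, public)
print(reidentified)
```

In this toy case two of the three "anonymous" rows come back with a name attached, and the third stays ambiguous only because two people happen to share the same postcode, birth year and sex. Each extra public data set adds more fields to join on, which shrinks that kind of crowd.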
(Score: 2) by kaszz on Wednesday March 18 2015, @11:08PM
The next question is whether the person doing the anonymization is competent enough. How does one measure that? And can the workplace hire and retain such people?
(Score: 3, Insightful) by Nerdfest on Wednesday March 18 2015, @08:18PM
That applies whether the data is 'open' or not. The only difference is that in one case people other than you are making money from it. The data is still out there, and is still out of your control.
(Score: 4, Informative) by wantkitteh on Wednesday March 18 2015, @10:14PM
You don't appear to understand the situation at all. This isn't about making money from data, it's about extrapolating data that no-one asked my permission to collect or hold by using datasets collected in different contexts by different entities. The more open data about me that exists, the more people can make inferences about me that I never intended to be known by any 3rd party at all.
(Score: 2, Insightful) by Nerdfest on Thursday March 19 2015, @12:50AM
Your intentions are not particularly relevant. Once the data is out there, it is available to people wishing to pay for it, or to those that exploit the security of those holding it. Once someone new has it, the problem just gets bigger. No, it's not *about* money, money is just the difference between 'open' and 'not open'. In one case anyone can get the data, and in others, anyone with money, connections, or partnerships can get the data. Yes, it's easier and cheaper when the data is open, but it's really no different in the end. The only things that stop all data from being open are security, privacy policies (where they are respected and actually respect privacy) and laws. Only *one* of those things needs to fail for control of the data to be lost.
(Score: 2) by wantkitteh on Thursday March 19 2015, @04:11AM
So your argument is that, because all someone has to do to get hold of private data is actively commit a crime, all data should be open? That's got to be the most retarded argument for anything being open that I've ever heard in my life. How about you dox yourself to prove your point?
(Score: 2) by Nerdfest on Thursday March 19 2015, @09:59AM
What I'm saying is that disparate data sources can be combined to reduce anonymity whether they're open or not. Just because the data's not open and you don't know who has what data doesn't mean it isn't happening. It just means that you don't know who's doing it.
(Score: 1, Insightful) by Anonymous Coward on Thursday March 19 2015, @12:19PM
And open access data means you still don't know who's doing it and now the barrier for those unknown people to do it is significantly reduced.
(Score: 2) by wantkitteh on Thursday March 19 2015, @02:50PM
Okay - put your data where your mouth is - d0x yourself. Release every scrap of data you can about yourself under a creative commons license. It can all be used whether it's open or not, right? It won't make any difference to you if you just make it more convenient for everyone to access, right? Do it or stfufe.