Large datasets and predictive analytics software are a fertile field for innovation, but while excellent open source tools like SciPy, R, etc. are freely available, the datasets are not. A Computerworld article notes that the scarcity of large, publicly available data collections means a database Netflix released for a competition half a decade ago is still constantly used in computer science research.
Australia's government does provide an easy way to find, access and reuse some public datasets, but most public and private databases are siloed away from experimenters. The Open Data Handbook offers some guidelines for defining openness in data, but says little about how to drive organisations to make their datasets available.
So do we need a GPL for data, and if so, what would it look like?
(Score: 4, Interesting) by TLA on Wednesday March 18 2015, @07:24PM
My idea would be something like:
1. An open-compliant dataset is one which is human-readable in its raw form, where practical, for instance comma-separated values. Examples here would be the Project Gutenberg index or the VAST small molecule database.
That was easy.
And then we can talk about availability:
1. An open dataset is one which is accessible without any convoluted login process, i.e. one which is accessible anonymously.
That was easy, too.
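Taking those two criteria together, here's a minimal sketch of what they buy you in practice: nothing beyond the Python standard library, an anonymous HTTP GET and a CSV parser. The URL is a made-up placeholder, not a real dataset endpoint.

    import csv
    import io
    import urllib.request

    # Hypothetical open dataset: plain CSV, no login, no API key needed.
    URL = "https://example.org/open-data/small-molecules.csv"

    with urllib.request.urlopen(URL) as response:      # anonymous access
        text = io.TextIOWrapper(response, encoding="utf-8")
        for row in csv.reader(text):                   # human-readable raw form
            print(row)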
Now we can talk about what you're allowed to do with the dataset, and here we can apply different licences that already exist:
1. Derivative works (including condensed datasets) must include the original dataset or simple instructions on how to access it (similar to most Creative Commons/GPL licences).
How simple is this?
Oh, and:
2. Any datasets that do not comply with 1, 1, *and* 1, can *not* be referred to as open datasets.
Excuse me, I think I need to reboot my horse. - NCommander
(Score: 2) by kaszz on Wednesday March 18 2015, @07:39PM
Human-readable in its raw form may be hard for really large datasets. But one can document the format properly. And including the original dataset may be quite impractical.
Some governments also publish datasets freely.
(Score: 2, Disagree) by TLA on Wednesday March 18 2015, @07:46PM
that's why I qualified it with "...where practical". :)
Excuse me, I think I need to reboot my horse. - NCommander
(Score: 1, Insightful) by Anonymous Coward on Wednesday March 18 2015, @09:36PM
1. An open-compliant dataset is one which is human-readable in its raw form, where practical, for instance comma-separated values. Examples here would be the Project Gutenberg index or the VAST small molecule database.
That was easy.
That's the current fashion, which could change. It's like saying that all text APIs should use UTF-8. Well, that's certainly a popular choice, but there are special circumstances (people in the Far East might not want to pay the 30-50 percent size penalty if most of their data consists of Han characters), and technology advances.
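For what it's worth, the size penalty is easy to demonstrate: most CJK characters take three bytes in UTF-8 but two in UTF-16, so Han-heavy text comes out roughly 50 percent larger. A quick sketch (the sample string is arbitrary):

    # Most CJK characters are 3 bytes in UTF-8 but 2 bytes in UTF-16.
    sample = "漢字のデータ" * 1000   # arbitrary CJK sample text

    utf8_bytes = len(sample.encode("utf-8"))
    utf16_bytes = len(sample.encode("utf-16-le"))
    print(f"UTF-8: {utf8_bytes} bytes, UTF-16: {utf16_bytes} bytes, "
          f"penalty: {100 * (utf8_bytes - utf16_bytes) / utf16_bytes:.0f}%")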
(Score: 0) by Anonymous Coward on Thursday March 19 2015, @09:52AM
Requirements 1, 1 and 1? I think you want to refine your numbering a bit ;-)
Anyway, I'd put different requirements:
1. The data must be available in a standardized, open and well-documented format which can be read by widely available open-source software and processed by machine.
2. The provider of the data set gives everyone a worldwide, irrevocable, royalty-free license to use the data in any way, and to generate derived data from it.
3. The provider of the data set gives everyone a worldwide, irrevocable, royalty-free license to further distribute the original data set under the same conditions under which he received it.
4. The provider of the data set gives everyone a worldwide, irrevocable, royalty-free license to distribute data sets derived from that data under the same conditions as the original data, as long as he clearly states that the data was derived, and provides any information needed to reproduce that derivation (like additional data, the algorithms used, etc.) under the same conditions.
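A sketch of what requirement 4 could look like in practice, assuming the derived set is just a condensed copy of an open CSV: ship the derived data together with a small provenance record saying where it came from and how it was produced. All file names and fields here are invented for illustration.

    import csv
    import json

    SOURCE = "molecules.csv"          # placeholder for the original open dataset
    DERIVED = "molecule-names.csv"    # the derived (condensed) dataset

    # Derive: keep only the first column of every row.
    with open(SOURCE, newline="", encoding="utf-8") as src, \
         open(DERIVED, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([row[0]])

    # Provenance record: original source, licence terms, and how to reproduce
    # the derivation, distributed alongside the derived data.
    provenance = {
        "derived_from": SOURCE,
        "license": "same conditions as the original data set",
        "derivation": "kept column 0 of every row, no other changes",
    }
    with open(DERIVED + ".provenance.json", "w", encoding="utf-8") as f:
        json.dump(provenance, f, indent=2)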