
posted by janrinok on Wednesday March 18 2015, @06:51PM   Printer-friendly
from the data-is-power dept.

Large datasets and predictive analytics software are a fertile field for innovation, but while excellent open source tools like SciPy, R, etc. are freely available, the datasets are not. A Computerworld article notes that the scarcity of large, publicly available data collections has meant that a database Netflix released for a competition half a decade ago is still in constant use in computer science research.

Australia's government does provide an easy way to find, access, and reuse some public datasets, but most public and private databases are siloed away from experimenters. The Open Data Handbook offers some guidelines for defining openness in data, but offers little in the way of driving organisations to make their datasets available.

So do we need a GPL for data, and if so, what would it look like?

 
This discussion has been archived. No new comments can be posted.
  • (Score: 4, Interesting) by TLA on Wednesday March 18 2015, @07:24PM

    by TLA (5128) on Wednesday March 18 2015, @07:24PM (#159542) Journal

    My idea would be something like:

    1. An open-compliant dataset is one which is human-readable in its raw form, where practical, for instance as comma-separated values. Examples here would be the Project Gutenberg index or the VAST small molecule database.

      That was easy.
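    The "human readable in its raw form" criterion can be illustrated with plain CSV: the same bytes read fine by eye in any text editor and parse with nothing beyond the standard library. (The file contents below, including the molecule names and column headers, are invented for illustration; they are not from the VAST database.)

```python
import csv
import io

# A tiny dataset in CSV form: readable by eye and parseable by
# the csv module from Python's standard library.
raw = """name,formula,molar_mass_g_mol
water,H2O,18.015
ethanol,C2H6O,46.069
glucose,C6H12O6,180.156
"""

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    print(row["name"], row["molar_mass_g_mol"])
```

    The point is that no proprietary reader stands between the raw file and either a human or a program.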

    And then we can talk about availability:

    1. An open dataset is one which is accessible without any convoluted logging-in process, i.e. one which is accessible anonymously.

    That was easy, too.

    Now we can talk about what you're allowed to do with the dataset, and here we can apply different licences that already exist:

    1. Derivative works (including condensed datasets) must include the original dataset or simple instructions on how to access it (similar to most Creative Commons/GPL licences).

    How simple is this?

    Oh, and:

    2. Any datasets that do not comply with 1, 1, *and* 1, can *not* be referred to as open datasets.

    --
    Excuse me, I think I need to reboot my horse. - NCommander
    • (Score: 2) by kaszz on Wednesday March 18 2015, @07:39PM

      by kaszz (4211) on Wednesday March 18 2015, @07:39PM (#159555) Journal

      Human readability in its raw form may be hard for really large datasets, but one can document the format properly. And including the original dataset may be quite impractical.

      Some governments also publish datasets freely.

      • (Score: 2, Disagree) by TLA on Wednesday March 18 2015, @07:46PM

        by TLA (5128) on Wednesday March 18 2015, @07:46PM (#159562) Journal

        that's why I qualified it with "...where practical". :)

        --
        Excuse me, I think I need to reboot my horse. - NCommander
    • (Score: 1, Insightful) by Anonymous Coward on Wednesday March 18 2015, @09:36PM

      by Anonymous Coward on Wednesday March 18 2015, @09:36PM (#159609)

      1. An open-compliant dataset is one which is human-readable in its raw form, where practical, for instance as comma-separated values. Examples here would be the Project Gutenberg index or the VAST small molecule database.

          That was easy.

      That's the current fashion, which could change. It's like saying that all character text APIs should use UTF-8. Well, that's certainly a popular choice, but there are special circumstances (people in the Far East might not want to pay the 30-50 percent size penalty if most of their data consists of Han characters) and technology advances.
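      The size penalty the parent mentions is easy to check: most Han characters encode as 3 bytes in UTF-8 but 2 bytes in UTF-16, so a Han-heavy text pays roughly a 50% premium for UTF-8. A quick sketch (the sample strings are arbitrary):

```python
# Compare encoded sizes of Han text vs ASCII text.
han = "漢字文化圈"       # five Han characters
ascii_text = "hello"     # five ASCII characters

utf8_han = len(han.encode("utf-8"))        # 3 bytes per Han character
utf16_han = len(han.encode("utf-16-le"))   # 2 bytes per Han character
utf8_ascii = len(ascii_text.encode("utf-8"))  # 1 byte per ASCII character

print(utf8_han, utf16_han, utf8_ascii)  # → 15 10 5
```

      For mostly-ASCII data the trade-off reverses, which is exactly why "always use UTF-8" is a fashion rather than a law.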

    • (Score: 0) by Anonymous Coward on Thursday March 19 2015, @09:52AM

      by Anonymous Coward on Thursday March 19 2015, @09:52AM (#159836)

      Requirements 1, 1 and 1? I think you want to refine your numbering a bit ;-)

      Anyway, I'd put different requirements:

      1. The data must be available in a standardized, open and well-documented format which can be read by widely available open-source software and processed automatically.

      2. The provider of the data set gives everyone a worldwide, irrevocable, royalty-free license to use the data in any way, and to generate derived data from it.

      3. The provider of the data set gives everyone a worldwide, irrevocable, royalty-free license to further distribute the original data set under the same conditions he received it under.

      4. The provider of the data set gives everyone a worldwide, irrevocable, royalty-free license to distribute data sets derived from that data under the same conditions as the original data, as long as he clearly states that the data was derived, and provides any information needed to reproduce that derivation (like additional data, used algorithms, etc.) under the same conditions.

  • (Score: 5, Informative) by Phoenix666 on Wednesday March 18 2015, @07:30PM

    by Phoenix666 (552) on Wednesday March 18 2015, @07:30PM (#159549) Journal

    Open data is essential. Private company data, sure, there are reasons to lock that away. But data generated by government, that we have already paid them to collect and publish via our tax dollars, absolutely should be available to the public in electronic form, free of charge. It's ridiculous when they try to charge you fees for the stuff.

    I've been participating with Code for America for the past several years and the developer community has been able to do interesting things with the public data sets they have been able to get their hands on, like heat maps for traffic accidents in NYC that have helped inform public policy (for example, thanks to the better data visualization the Mayor's office recently announced $250M to implement traffic safety measures on the most dangerous roads). Those are actual uses of Open Data to help save actual lives.

    A lot more can be done, too, just in the realm of publicly-funded research. Every research outfit that takes government funding from, say, the NSF or NIH should be required to make all their stuff available to the public, free of charge. I realize some of it is, and some of it isn't, but it should be a universal standard, universally applied. I would even say that we should have an amendment to the Constitution to guarantee it, but, heh, who in government pays any attention to that hackneyed old thing anymore?

    --
    Washington DC delenda est.
    • (Score: 2) by wantkitteh on Friday March 20 2015, @04:36AM

      by wantkitteh (3362) on Friday March 20 2015, @04:36AM (#160268) Homepage Journal

      When you say "available to the public", do you mean that everyone should be able to read everyone else's government data, or everyone should be able to read their own government data?

      • (Score: 2) by Phoenix666 on Friday March 20 2015, @12:44PM

        by Phoenix666 (552) on Friday March 20 2015, @12:44PM (#160358) Journal

        I'm talking about data sets you can get, but which the government currently charges you for. For example, say I want the data set of who voted in the last election from the New York Board of Elections. That's a common thing to do if you're a candidate or organization that wants to find out which voters vote the most (they're called "1's": people who vote in every election, even judicial ones and special elections) so you can focus on winning them over. To get it, you have to spend $50-100 for a *CD* with that info. But that is data we paid them to gather already, in the form of their salaries and benefits. So why should we have to pay them for it again? It should be freely available online without even so much as having to register on their website.

        And consider that it's not only the voter data you need, but maps with neighborhood shape files and demographics (income, household size, etc.) gathered by the US Census. That stuff, I can tell you from experience, can very quickly run into the thousands of dollars. So you see, it doesn't take very long at all before the upfront cost of "public" data becomes prohibitive for the independent developer who wants to build apps or new ways to see, understand, and act on that information.

        It's incredibly irritating. One of my favorite dodges is when the government agency in question claims the data itself is free, but you can only get it packaged from their favorite company X for $500. In essence, it's a corrupt deal between the head of the agency and the CEO of the company, who were roomies at Yale or something.

        So when asked the question, "Does Open Data Need to be the Next Open Source?" I say, "Hell yes!"

        --
        Washington DC delenda est.
        • (Score: 2) by wantkitteh on Friday March 20 2015, @02:51PM

          by wantkitteh (3362) on Friday March 20 2015, @02:51PM (#160412) Homepage Journal

          I don't think that's quite what the original article is referring to - you want data that's already available to be free as in beer, rather than free as in speech, which is more the issue at hand. What I'm trying to ask is whether you include other people's personal, private data in the collection of data that your government has assembled and that you say should be available for free. The argument "I contributed towards my neighbourhood's government subsidised drug rehab and mental health facilities, I should be entitled to all their data!" is indicative of a pretty sick attitude, so I hope that's not what you mean.

          • (Score: 2) by Phoenix666 on Friday March 20 2015, @06:22PM

            by Phoenix666 (552) on Friday March 20 2015, @06:22PM (#160515) Journal

            No, that's not what I mean. Privacy is big with me. I think there would be value in an anonymized db of everyone's DNA, for example, because it would do so much for archaeology, epidemiology, etc., but I have no trust in the government whatsoever so scratch that idea.

            --
            Washington DC delenda est.
            • (Score: 2) by wantkitteh on Friday March 20 2015, @08:43PM

              by wantkitteh (3362) on Friday March 20 2015, @08:43PM (#160567) Homepage Journal

              Ok, misunderstanding cleared up ;) See comments/links elsewhere in this comment section for details on how hard anonymising data really is.

  • (Score: 4, Insightful) by wantkitteh on Wednesday March 18 2015, @07:41PM

    by wantkitteh (3362) on Wednesday March 18 2015, @07:41PM (#159559) Homepage Journal

    Ah, hells nah! Anonymising data sets like this is a problem [oreilly.com] - anyone who's ever done a logic puzzle [logic-puzzles.org] will know exactly how much it's possible to work out given a comparatively small data set and some well-defined parameters to work in. The more data sets available, the more they can be integrated into each other and the less anonymous all that data will inherently become, no matter how "anonymised" it was when it was downloaded.
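    The linkage risk described above can be sketched in a few lines: a release with names stripped still falls to a join against some other public dataset on shared quasi-identifiers such as ZIP code, birth date, and sex. (All records and names below are invented; this is a toy illustration, not a real dataset.)

```python
# "Anonymised" release: names removed, quasi-identifiers kept.
anonymised_release = [
    {"zip": "11217", "dob": "1961-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "10005", "dob": "1984-03-12", "sex": "M", "diagnosis": "asthma"},
]

# A second, public dataset (think voter roll) with names attached.
voter_roll = [
    {"name": "A. Example", "zip": "11217", "dob": "1961-07-31", "sex": "F"},
    {"name": "B. Sample",  "zip": "10308", "dob": "1990-01-01", "sex": "M"},
]

# Join the two on the quasi-identifiers to re-attach names.
reidentified = [
    (v["name"], r["diagnosis"])
    for r in anonymised_release
    for v in voter_roll
    if (r["zip"], r["dob"], r["sex"]) == (v["zip"], v["dob"], v["sex"])
]
print(reidentified)
```

    Every additional open dataset adds more columns to join on, which is the logic-puzzle effect: each constraint alone is harmless, but together they pin down individuals.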

    • (Score: 0) by Anonymous Coward on Wednesday March 18 2015, @08:06PM

      by Anonymous Coward on Wednesday March 18 2015, @08:06PM (#159571)

      Good point. Many data sets at various institutions could be useful if publicized, but they often contain various forms of personal information that need to be anonymized, which may be trickier than it seems. In either case, both this cleaning and the exporting and packaging of the data in a suitable format for publishing take time and effort, which translates to labor costs.

      Why spend a significant fraction of your operating budget giving outsiders free stuff when you could spend it on your mission statement or a better coffee machine? Other than some good PR, the incentives just aren't there.

      • (Score: 2, Informative) by Anonymous Coward on Wednesday March 18 2015, @08:18PM

        by Anonymous Coward on Wednesday March 18 2015, @08:18PM (#159579)

        > Other than some good pr, the incentives are just not there.

        It should be made part of the grant requirements. Just like it is starting to be for open access publication. [wikipedia.org]

        However, I too am super wary of any 'anonymization' of personal data. Anything that is not personal, like climate data or the number of animals per square meter, is no problem. Personal data, though, is so easy to fuck up when anonymizing, because much of the problem comes from cross-referencing it with other databases: what seems anonymous to the person who only looks at that one data set turns out not to be anonymous in the full context of the world at large... [techdirt.com] Anything close to that may need a different policy.

        • (Score: 2) by kaszz on Wednesday March 18 2015, @11:08PM

          by kaszz (4211) on Wednesday March 18 2015, @11:08PM (#159643) Journal

          Next question is whether the person doing the anonymization is competent enough. How does one measure that? And can the workplace hire and retain such people?

    • (Score: 3, Insightful) by Nerdfest on Wednesday March 18 2015, @08:18PM

      by Nerdfest (80) on Wednesday March 18 2015, @08:18PM (#159578)

      That applies whether the data is 'open' or not. The only difference is that in one case people other than you are making money from it. The data is still out there, and is still out of your control.

      • (Score: 4, Informative) by wantkitteh on Wednesday March 18 2015, @10:14PM

        by wantkitteh (3362) on Wednesday March 18 2015, @10:14PM (#159624) Homepage Journal

        You don't appear to understand the situation at all. This isn't about making money from data, it's about extrapolating data that no-one asked my permission to collect or hold by using datasets collected in different contexts by different entities. The more open data about me that exists, the more people can make inferences about me that I never intended to be known by any 3rd party at all.

        • (Score: 2, Insightful) by Nerdfest on Thursday March 19 2015, @12:50AM

          by Nerdfest (80) on Thursday March 19 2015, @12:50AM (#159666)

          Your intentions are not particularly relevant. Once the data is out there, it is available to people wishing to pay for it, or to those that exploit the security of those holding it. Once someone new has it, the problem just gets bigger. No, it's not *about* money; money is just the difference between 'open' and 'not open'. In one case anyone can get the data, and in the other, anyone with money, connections, or partnerships can get the data. Yes, it's easier and cheaper when the data is open, but it's really no different in the end. The only things that stop all data from being open are security, privacy policies (where they are respected and actually respect privacy) and laws. Only *one* of those things needs to fail for control of the data to be lost.

          • (Score: 2) by wantkitteh on Thursday March 19 2015, @04:11AM

            by wantkitteh (3362) on Thursday March 19 2015, @04:11AM (#159748) Homepage Journal

            So your argument is that, because all someone has to do to get hold of private data is actively commit a crime, all data should be open? That's got to be the most ridiculous argument for anything being open that I've ever heard in my life. How about you dox yourself to prove your point?

            • (Score: 2) by Nerdfest on Thursday March 19 2015, @09:59AM

              by Nerdfest (80) on Thursday March 19 2015, @09:59AM (#159840)

              What I'm saying is that disparate data sources can be combined to reduce anonymity whether they're open or not. Just because the data's not open and you don't know who has what data doesn't mean it isn't happening. It just means that you don't know who's doing it.

              • (Score: 1, Insightful) by Anonymous Coward on Thursday March 19 2015, @12:19PM

                by Anonymous Coward on Thursday March 19 2015, @12:19PM (#159910)

                And open access data means you still don't know who's doing it and now the barrier for those unknown people to do it is significantly reduced.

              • (Score: 2) by wantkitteh on Thursday March 19 2015, @02:50PM

                by wantkitteh (3362) on Thursday March 19 2015, @02:50PM (#159970) Homepage Journal

                "...disparate data sources can be combined to reduce anonymity whether they're open or not."

                Okay - put your data where your mouth is - d0x yourself. Release every scrap of data you can about yourself under a Creative Commons license. It can all be used whether it's open or not, right? It won't make any difference to you if you just make it more convenient for everyone to access, right? Do it or stfu.

  • (Score: 1, Informative) by Anonymous Coward on Wednesday March 18 2015, @08:25PM

    by Anonymous Coward on Wednesday March 18 2015, @08:25PM (#159582)

    This is the license you are looking for

    https://creativecommons.org/licenses/by-sa/2.0/ [creativecommons.org]

    • (Score: 2) by wantkitteh on Wednesday March 18 2015, @10:16PM

      by wantkitteh (3362) on Wednesday March 18 2015, @10:16PM (#159627) Homepage Journal

      ...and in the UK, here's the act you'd be prosecuted under for freely publishing personal data under any license at all:

      1998 Data Protection Act [www.gov.uk]

      • (Score: 0) by Anonymous Coward on Wednesday March 18 2015, @10:42PM

        by Anonymous Coward on Wednesday March 18 2015, @10:42PM (#159631)

        Good thing I don't live there...

    • (Score: 3, Insightful) by DeathMonkey on Thursday March 19 2015, @05:43PM

      by DeathMonkey (1380) on Thursday March 19 2015, @05:43PM (#160054) Journal

      I was under the impression that you can't copyright a data set in the first place. So wouldn't that make any license invalid/unenforceable?