Two de-identification methods, k-anonymization and adding a "fuzzy factor," significantly reduced the risk of re-identification of patients in a dataset of 5 million patient records from a large cervical cancer screening program in Norway, according to results published in Cancer Epidemiology, Biomarkers & Prevention, a journal of the American Association for Cancer Research.
"Researchers typically get access to de-identified data, that is, data without any personal identifying information, such as names, addresses, and Social Security numbers. However, this may not be sufficient to protect the privacy of individuals participating in a research study," said Giske Ursin, MD, PhD, director of Cancer Registry of Norway, Institute of Population-based Research.
Patient datasets often contain sensitive data, such as information about a person's health and disease diagnoses, that an individual may not want to share publicly, and data custodians are responsible for safeguarding such information, Ursin said. "People who have the permission to access such datasets have to abide by the laws and ethical guidelines, but there is always this concern that the data might fall into the wrong hands and be misused," she added. "As a data custodian, that's my worst nightmare."
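The summary names the two methods without describing how they work, so here is a rough sketch of the general ideas rather than the study's actual procedure. It assumes k-anonymity means every combination of quasi-identifier values is shared by at least k records, and that a "fuzzy factor" is a small random shift applied to a record's dates. The field names, the 10-year bucket, and the 60-day window are all made up for illustration.

```python
import random
from collections import Counter
from datetime import date, timedelta

def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values(), default=0)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group(records, quasi_identifiers) >= k

def generalize_birth_year(record, bucket=10):
    """Coarsen the (hypothetical) birth_year field to the start of its bucket."""
    out = dict(record)
    out["birth_year"] = record["birth_year"] // bucket * bucket
    return out

def fuzz_dates(record, date_fields, max_shift_days, rng):
    """Shift all dates in one record by the same random offset (assumed 'fuzzy factor')."""
    shift = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    out = dict(record)
    for field in date_fields:
        out[field] = record[field] + shift
    return out

# Toy records; not real data.
records = [
    {"birth_year": 1961, "screen_date": date(2010, 3, 14), "result": "normal"},
    {"birth_year": 1964, "screen_date": date(2011, 6, 2), "result": "normal"},
    {"birth_year": 1968, "screen_date": date(2012, 9, 21), "result": "abnormal"},
    {"birth_year": 1972, "screen_date": date(2010, 11, 5), "result": "normal"},
    {"birth_year": 1975, "screen_date": date(2013, 1, 30), "result": "normal"},
]

raw_ok = is_k_anonymous(records, ["birth_year"], k=2)
generalized = [generalize_birth_year(r) for r in records]
generalized_ok = is_k_anonymous(generalized, ["birth_year"], k=2)

rng = random.Random(0)
fuzzed = [fuzz_dates(r, ["screen_date"], 60, rng) for r in generalized]
```

On the toy records, every exact birth year is unique, so the raw data fails the k=2 check, while bucketing into decades yields groups of three and two. Using one offset per record keeps the intervals between that person's dates intact, which is a design choice of this sketch and not something the summary confirms.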
http://www.aacr.org/Newsroom/Pages/News-Release-Detail.aspx?ItemID=1074
Journal reference:
Giske Ursin, Sagar Sen, Jean-Marie Mottu and Mari Nygård, Protecting Privacy in Large Datasets—First We Assess the Risk; Then We Fuzzy the Data, Cancer Epidemiology, Biomarkers & Prevention, http://dx.doi.org/10.1158/1055-9965.EPI-17-0172
-- submitted from IRC
(Score: 3, Interesting) by Runaway1956 on Wednesday August 02 2017, @12:04AM (1 child)
Why is it necessary to put all that identifying information into the database to start with? Your family doctor can treat you for whatever ails you, taking all of your information. Insurance, address, etc, ad nauseam. All those fields on his forms should just be flagged, so that those data bits never leave his office. If the data is never input into the database, the database can't leak the data.
Of course, it becomes a minor issue to determine what must and must not be included in the data. Age is pertinent to many medical research projects. Ethnic background is important to some others. Medical people often demand information that is probably irrelevant to a lot of research, such as place of birth, number of siblings, and more. Being a twin/trip/octo MIGHT be important to some research, but that bit of data need not be available to the entire world of medical personnel.
Clean up the input, and the output will require a lot less attention for "security".
(Score: 0) by Anonymous Coward on Wednesday August 02 2017, @03:13PM
Ah, just stick it in "The Cloud". Hey, ask IBM, like Sweden (Norway's neighbour) did. Certainly it will be fine, and secure!