
posted by janrinok on Thursday February 22, @02:05AM

Widely used machine learning models reproduce dataset bias: Study

Rice University computer science researchers have found bias in machine learning tools widely used in immunotherapy research.

[...] HLA is a gene in all humans that encodes proteins working as part of our immune response. Those proteins bind with protein chunks called peptides in our cells and mark our infected cells for the body's immune system, so it can respond and, ideally, eliminate the threat.

Different people carry slightly different variants of their genes, called alleles. Current immunotherapy research is exploring ways to identify peptides that bind more effectively with the HLA alleles of a given patient.

The end result, eventually, could be custom and highly effective immunotherapies. That is why one of the most critical steps is to accurately predict which peptides will bind with which alleles. The greater the accuracy, the better the potential efficacy of the therapy.

But calculating how effectively a peptide will bind to the HLA allele takes a lot of work, which is why machine learning tools are being used to predict binding. This is where Rice's team found a problem: The data used to train those models appears to geographically favor higher-income communities.

Why is this an issue? Without being able to account for genetic data from lower-income communities, future immunotherapies developed for them may not be as effective.

"Each and every one of us has different HLAs that they express, and those HLAs vary between different populations," Fasoulis said. "Given that machine learning is used to identify potential peptide candidates for immunotherapies, if you basically have biased machine models, then those therapeutics won't work equally for everyone in every population."

Regardless of the application, machine learning models are only as good as the data you feed them. A bias in the data, even an unconscious one, can affect the conclusions made by the algorithm.
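The "garbage in, garbage out" dynamic described above can be made concrete with a deliberately crude sketch (entirely hypothetical data and a toy model, not the predictors the Rice team studied): train on a dataset skewed 9:1 toward one allele group, and aggregate accuracy can look tolerable while accuracy for the underrepresented group collapses.

```python
# Toy illustration of dataset bias (hypothetical data, not the paper's models):
# a degenerate "predictor" trained on data skewed 9:1 toward one allele group.
from collections import Counter

# Hypothetical training triples: (allele, peptide, binds?).
# A*02:01 dominates the dataset; B*53:01 is barely sampled.
train = (
    [("A*02:01", f"PEP{i}", i % 3 == 0) for i in range(90)]  # mostly non-binders
    + [("B*53:01", f"QEP{i}", True) for i in range(10)]      # all binders
)

def fit(examples):
    # Degenerate "model": memorize the globally most common label.
    labels = Counter(y for _, _, y in examples)
    return labels.most_common(1)[0][0]

def predict(model, allele, peptide):
    return model  # ignores the input entirely

model = fit(train)

def accuracy(allele):
    subset = [(a, p, y) for a, p, y in train if a == allele]
    return sum(predict(model, a, p) == y for a, p, y in subset) / len(subset)

acc_majority = accuracy("A*02:01")  # ~0.67: looks tolerable on paper
acc_minority = accuracy("B*53:01")  # 0.0: fails entirely for the rare group
```

Real pHLA predictors are far more sophisticated, but the failure mode is the same in kind: a model can score well on average while being systematically wrong for whoever the data undersamples.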

Machine learning models currently used for pHLA binding prediction claim they can extrapolate to allele data not present in their training datasets, billing themselves as "pan-allele" or "all-allele." The Rice team's findings call that claim into question.

"What we are trying to show here and kind of debunk is the idea of the 'pan-allele' machine learning predictors," Conev said. "We wanted to see if they really worked for the data that is not in the datasets, which is the data from lower-income populations."

Fasoulis' and Conev's group tested publicly available data on pHLA binding prediction, and their findings supported their hypothesis that a bias in the data was creating an accompanying bias in the algorithm. The team hopes that by bringing this discrepancy to the attention of the research community, a truly pan-allele method of predicting pHLA binding can be developed.
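One way to probe a "pan-allele" claim is a leave-one-allele-out evaluation: withhold every example for a given allele, train on the rest, and measure how well the model generalizes to the allele it never saw. The sketch below is illustrative only (invented alleles, peptides, and a trivial motif model; not the paper's methodology), but it shows how the protocol exposes an allele whose binding behavior diverges from the training set.

```python
# Leave-one-allele-out evaluation sketch (hypothetical data and model).
# Binders for A*01 and A*02 start with "K"; the undersampled B*53 binds "R".
examples = [
    ("A*01", "KAAA", True), ("A*01", "LAAA", False),
    ("A*02", "KAAA", True), ("A*02", "LAAA", False),
    ("B*53", "RAAA", True), ("B*53", "LAAA", False),
]

def fit(train):
    # Learn which N-terminal residues co-occur with binding.
    return {p[0] for _, p, y in train if y}

def predict(model, allele, peptide):
    return peptide[0] in model

def leave_allele_out(examples, allele):
    # Hold out every example for `allele`, train on the rest, score on it.
    train = [e for e in examples if e[0] != allele]
    test = [e for e in examples if e[0] == allele]
    model = fit(train)
    return sum(predict(model, a, p) == y for a, p, y in test) / len(test)

acc_seen_like = leave_allele_out(examples, "A*02")  # 1.0: similar alleles remain
acc_divergent = leave_allele_out(examples, "B*53")  # 0.5: no better than chance
```

The asymmetry is the point: extrapolation works only when the held-out allele resembles something in the training data, which is exactly what underrepresented populations cannot count on.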

Ferreira, faculty advisor and paper co-author, explained that the problem of bias in machine learning can't be addressed unless researchers consider their data in a social context. From a certain perspective, datasets may appear merely "incomplete," but the key to identifying bias is connecting what is or is not represented in the dataset to the historical and economic factors affecting the populations from which the data was collected.

"Researchers using machine learning models sometimes innocently assume that these models may appropriately represent a global population," Ferreira said, "but our research points to the significance of when this is not the case." He added that "even though the databases we studied contain information from people in multiple regions of the world, that does not make them universal. What our research found was a correlation between the socioeconomic standing of certain populations and how well they were represented in the databases or not."

More information: Anja Conev et al, HLAEquity: Examining biases in pan-allele peptide-HLA binding predictors, iScience (2023). DOI: 10.1016/j.isci.2023.108613

Journal information: iScience

Original Submission

Related Stories

Producing More but Understanding Less: The Risks of AI for Scientific Research

Last month, we witnessed the viral sensation of several egregiously bad AI-generated figures published in a peer-reviewed article in Frontiers, a reputable scientific journal. Scientists on social media expressed equal parts shock and ridicule at the images, one of which featured a rat with grotesquely large and bizarre genitals.

As Ars Senior Health Reporter Beth Mole reported, looking closer only revealed more flaws, including the labels "dissilced," "Stemm cells," "iollotte sserotgomar," and "dck." Figure 2 was less graphic but equally mangled, rife with nonsense text and baffling images. Ditto for Figure 3, a collage of small circular images densely annotated with gibberish.

[...] While the proliferation of errors is a valid concern, especially in the early days of AI tools like ChatGPT, two researchers argue in a new perspective published in the journal Nature that AI also poses potential long-term epistemic risks to the practice of science.

Molly Crockett is a psychologist at Princeton University who routinely collaborates with researchers from other disciplines in her research into how people learn and make decisions in social situations. Her co-author, Lisa Messeri, is an anthropologist at Yale University whose research focuses on science and technology studies (STS), analyzing the norms and consequences of scientific and technological communities as they forge new fields of knowledge and invention—like AI.

[...] The paper's tagline is "producing more while understanding less," and that is the central message the pair hopes to convey. "The goal of scientific knowledge is to understand the world and all of its complexity, diversity, and expansiveness," Messeri told Ars. "Our concern is that even though we might be writing more and more papers, because they are constrained by what AI can and can't do, in the end, we're really only asking questions and producing a lot of papers that are within AI's capabilities."

This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
  • (Score: 3, Interesting) by sigterm on Thursday February 22, @03:02AM (2 children)

    by sigterm (849) on Thursday February 22, @03:02AM (#1345593)

    Translation: Prediction algorithm trained on material X will make predictions based on patterns in material X.

    Next they'll be telling me that LLMs... I mean, "AIs," can only generate results that are permutations of material that exists in the training dataset.

    • (Score: 5, TouchĂ©) by Thexalon on Thursday February 22, @03:09AM

      by Thexalon (636) on Thursday February 22, @03:09AM (#1345595)

      Funny how one of the oldest rules in computing still applies to the fanciest stuff ever dreamed of: "Garbage in, garbage out".

      Are these things completely useless? No, so long as you handle the results with an appropriate level of skepticism. Are they as useful as a lot of their proponents think they are? Also no.

      The only thing that stops a bad guy with a compiler is a good guy with a compiler.
    • (Score: 3, Funny) by krishnoid on Thursday February 22, @05:23AM

      by krishnoid (1156) on Thursday February 22, @05:23AM (#1345620)

      Alternate translation: When the computers figure out we've been feeding them garbage-quality datasets, they'll just discard most of it and start digging into our personal records to do their own research to find out the real truth. Then we'll have something to answer for.

  • (Score: 3, Troll) by darkfeline on Thursday February 22, @04:21AM (6 children)

    by darkfeline (1030) on Thursday February 22, @04:21AM (#1345608) Homepage

    > Without being able to account for genetic data from lower-income communities

Are they claiming that lower income communities have different genetics? Sounds suspiciously like what a Trump supporter would say.

    Join the SDF Public Access UNIX System today!
    • (Score: 2) by maxwell demon on Thursday February 22, @05:05AM (1 child)

      by maxwell demon (1608) on Thursday February 22, @05:05AM (#1345617) Journal

      So you say there is no correlation between ethnicity and income?
      Or do you think there's no correlation between ethnicity and genetics?

      The Tao of math: The numbers you can count are not the real numbers.
      • (Score: 2) by sigterm on Thursday February 22, @05:28AM

        by sigterm (849) on Thursday February 22, @05:28AM (#1345621)

        So you say there is no correlation between ethnicity and income?

        Correlation, probably; causation, probably not.

        There is undoubtedly a causative link between culture and success/wealth. This is simply because there is a causative link between hard work and success, and another causative link between culture and the incentive (or lack thereof) to work hard.

It's no coincidence that (east) Asian-Americans rank highest on the earnings curve, as a significant chunk of that group is heavily influenced by Asian culture. Same goes for the Jewish population, especially the more orthodox Jewish subgroups. There are other examples as well.

That leaves only the question of whether culture and ethnicity go hand in hand in the U.S., and it's not much of a stretch to say that they probably do.

    • (Score: 4, Insightful) by sigterm on Thursday February 22, @05:16AM (2 children)

      by sigterm (849) on Thursday February 22, @05:16AM (#1345618)

I see you've been modded "troll" (probably because of the jab at Trump supporters, which I don't really understand), but I actually thought the exact same thing: Why would the high-income section of the population have genetics that differ from the low-income section? Surely genetics isn't a reliable predictor of wealth or success? And what about all those people who move between those categories; do their genes mutate?

      If you sort average income by ethnicity (which you can do in the nations that record ethnicity on the census), some groups (Asian-Americans in the US) will rank higher than others. But then that would happen regardless of criteria: Sort by hair colour or shoe size and you'll see an earnings difference between the categories, as nothing is ever evenly distributed. That doesn't mean you've discovered a hidden hierarchy, much less a genetic difference.
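The comment's statistical point can be checked with a quick simulation (illustrative only; the incomes and "shoe size" groups below are randomly generated, not real data): partition people by a trait with no connection to income, and the group means still come out unequal.

```python
# Simulate the claim that sorting by ANY criterion shows an earnings gap.
import random

random.seed(0)
# Hypothetical right-skewed income distribution for 10,000 people.
incomes = [random.lognormvariate(10, 0.8) for _ in range(10_000)]
# Assign each person to one of five arbitrary "shoe size" bins,
# generated independently of income.
groups = [random.randrange(5) for _ in incomes]

by_group = {}
for g, x in zip(groups, incomes):
    by_group.setdefault(g, []).append(x)

means = sorted(sum(v) / len(v) for v in by_group.values())
spread = means[-1] / means[0]  # top-earning bin vs. bottom-earning bin
```

Even with thousands of samples per bin, `spread` stays above 1.0: sampling noise alone guarantees some ranking of group averages, which is why an observed gap by itself doesn't establish a meaningful difference.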

      • (Score: 2) by Freeman on Thursday February 22, @02:43PM

        by Freeman (732) on Thursday February 22, @02:43PM (#1345650) Journal

More like a cultural difference, and there are definite cultural differences between the different ethnic/genetic groups. Though in a place like New York City, where there are all kinds of ethnic groups among multi-generation New Yorkers, the ethnic differences become less noticeable.

        Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
      • (Score: 2) by Pino P on Sunday February 25, @05:12PM

        by Pino P (4721) on Sunday February 25, @05:12PM (#1346195) Journal

        I think what happens is generational poverty gets concentrated in ethnic groups that have historically experienced organized discrimination, whether de jure (by the public sector) or de facto (by the private sector). I couldn't tell you how many generations it'd take to diminish the correlation between previously redlined ethnicities and the distribution of real estate ownership.

    • (Score: 2) by krishnoid on Thursday February 22, @05:20AM

      by krishnoid (1156) on Thursday February 22, @05:20AM (#1345619)

Actually I thought this would be closer to what they'd say.