SoylentNews is people

posted by janrinok on Tuesday June 14 2022, @06:24AM
from the but-what-do-you-teach-machines? dept.

Over at ACM.org, Orit Hazzan (a professor in the Technion's Department of Education in Science and Technology) and Koby Mike (a Ph.D. student) make the case that machine learning education guides learners to ignore the application domain even when it is relevant for the modeling phase of data science:

From a historical perspective, machine learning was considered, for the past 50 years or so, as part of artificial intelligence. It was taught mainly in computer science departments to scientists and engineers and the focus was placed, accordingly, on the mathematical and algorithmic aspects of machine learning, regardless of the application domain. Thus, although machine learning deals also with statistics, which focuses on data and does consider the application domain, up until recently, most machine learning activities took place in the context of computer science, where it began, and which focuses traditionally on algorithms.

Two processes, however, have taken place in parallel to the accelerated growth of data science in the last decade. First, machine learning, as a sub-field of data science, flourished and its implementation and use in a variety of disciplines began. As a result, researchers realized that the application domain cannot be neglected and that it should be considered in any data science problem-solving situation. For example, it is essential to know the meaning of the data in the context of the application domain to prepare the data for the training phase and to evaluate the algorithm's performance based on the meaning of the results in the real world. Second, a variety of population began taking machine learning courses, people for whom, as experts in their disciplines, it is inherent and essential to consider the application domain in data science problem-solving processes.

[...] For example, consider a researcher in the discipline of social work who took a machine learning course but was not educated to consider the application domain in the interpretation of the data analysis. The researcher is now asked to recommend an intervention program. Since the researcher was not educated to consider the application domain, he or she may ignore crucial factors in this examination and rely only on the recommendation of the machine learning algorithm.

Other examples are education and transportation, fields that everyone feels they understand. As a result of a machine learning education that does not consider the application domain, non-experts in these fields may assume that they have enough knowledge in these fields, and may not understand the crucial role that professional knowledge in these fields plays in decision-making processes that are based on the examination of the output of machine learning algorithms. This phenomenon is further highlighted when medical doctors or food engineers, for example, are not trained or educated in machine learning courses to criticize the results of machine learning algorithms based on their professionalism in medicine and food engineering, respectively.


This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: -1, Offtopic) by Anonymous Coward on Tuesday June 14 2022, @06:32AM (1 child)

    by Anonymous Coward on Tuesday June 14 2022, @06:32AM (#1253115)

    I have no clue what these clowns are arguing about, but I'm pretty sure you gen-zs down to maybe the millennials are fucked.

    Me? I'm a so-called gen-xer, but I don't give a fuck - I'd be dead before you losers can say "hah-ha."

    Funny, damn boomers probably outlive me and other so-called gen-xers.

    Yeah, fuck all'y'all.

    • (Score: -1, Offtopic) by Anonymous Coward on Tuesday June 14 2022, @08:10AM

      by Anonymous Coward on Tuesday June 14 2022, @08:10AM (#1253122)

      You are right, we are all fucked in the end.

      My existence here for the time I am here has been made immensely more enjoyable thanks to the efforts of many of those who came before me.

      That said, I feel that if I can leave behind any useful arts before I go, I should by all means try to pass them on.

      Anything I build will turn to dust, but anything I invent, if it's useful, and shared, will make things even better for those who inherit this world.

      My take: see what's here, try to make it better, and don't make a mess for everyone else! And share.

  • (Score: 0, Flamebait) by shrewdsheep on Tuesday June 14 2022, @07:59AM (11 children)

    by shrewdsheep (5215) on Tuesday June 14 2022, @07:59AM (#1253120)

    Data analysis should indeed disregard the application domain. It is precisely required to distill the abstract data features and then analyze the data without further referral to the application domain. It is true that specialized models work better in some fields but usually they can be absorbed quickly and are common knowledge to trained statisticians. The analyst therefore does not have to have any knowledge of the domain; he only has to be able to elicit the important data features from the domain expert (definition of outcome, dependency structure, etc.), which is what statistical consultation is about.
    The conclusion therefore has to be: teach statistics and stop talking nonsense.

    • (Score: 4, Insightful) by BsAtHome on Tuesday June 14 2022, @08:15AM (2 children)

      by BsAtHome (889) on Tuesday June 14 2022, @08:15AM (#1253123)

      Disregarding the application domain and looking only at data analysis is the equivalent of going on a fishing expedition. You do analysis and get "interesting" results, but have no clue what they mean in the domain context. Then, just like many already do, you read the wrong application-domain meaning into the results because it "looked" right.

      Let me know when you are ready to enter the plane controlled by ML where only data analysis was considered.

      • (Score: 2, Insightful) by shrewdsheep on Tuesday June 14 2022, @08:26AM (1 child)

        by shrewdsheep (5215) on Tuesday June 14 2022, @08:26AM (#1253126)

        There has to be a well defined research question to begin with. This is indeed something that many "data scientists" have never heard about. After the analysis, results go back to the domain expert for interpretation. The statistician has to help with this interpretation to prevent over-interpretation purely from the statistical point of view, again disregarding the domain.

        Fishing expeditions, by definition, are what you get when you change your research question after seeing the data.

        • (Score: 5, Touché) by BsAtHome on Tuesday June 14 2022, @11:30AM

          by BsAtHome (889) on Tuesday June 14 2022, @11:30AM (#1253143)

          If you have defined "a well defined research question", then you have a hypothesis (I certainly hope you do). That means you are looking at data analysis in the context of the application domain. So, you are not doing pure data analysis as your original post suggested.

    • (Score: 0) by Anonymous Coward on Tuesday June 14 2022, @08:27AM (1 child)

      by Anonymous Coward on Tuesday June 14 2022, @08:27AM (#1253127)

      Statistics, calculus, data structures, numerical methods, linear algebra ( matrix math ), C++, physics, chemistry, biochemistry, thermodynamics.

      They all tied together for me.

      And left me in overwhelming awe.

      No wonder I don't seem to get out much.

      I don't play games. I play with the real thing.

      And I see a lot of people right on these forums doing the same thing...just as addicted as I am.

      • (Score: 0) by Anonymous Coward on Tuesday June 14 2022, @06:54PM

        by Anonymous Coward on Tuesday June 14 2022, @06:54PM (#1253260)

        > I don't play games. I play with the real thing.

        u webcam?

    • (Score: 4, Interesting) by MostCynical on Tuesday June 14 2022, @09:10AM

      by MostCynical (2589) on Tuesday June 14 2022, @09:10AM (#1253131) Journal

      data requires data entry
      any data entry that has interpretational aspects (eg: pick one of these criteria (when more than one could apply)) then requires.. interpretation.

      any data that has been entered using a rules-based UI needs interpretation (data entry after job completion, but UI will not allow retrospective start date - get negative job duration)

      any data set that has missing data (MS applications notoriously can't handle missing dates in date fields - so you see lots of 1/1/1900 in Excel - and jobs now took over 100 years)

      any data set that has outcomes without reasons- a no is a no to a data scientist, when a no could be an 'incomplete' or 'inaccessible' or 'unable to work' or 'not good enough'..

      without detailed domain/application knowledge, you won't even know when the data is wrong
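
The data-entry failure modes listed in this comment can be turned into explicit checks, but only by someone who knows the domain rules. A minimal sketch in Python follows. The sentinel date, the one-year plausibility limit, the record layout, and the name `check_job` are all assumptions made up for illustration, not taken from any real system.

```python
from datetime import date

# Assumption: exports fill a missing date with 1900-01-01 (the "1/1/1900" case).
SENTINEL_DATES = {date(1900, 1, 1)}
# Assumption: a domain expert says no job plausibly runs longer than a year.
MAX_PLAUSIBLE_DAYS = 365


def check_job(start, end):
    """Return a list of domain-rule violations for one job record."""
    if start is None or end is None:
        return ["missing date"]
    if start in SENTINEL_DATES or end in SENTINEL_DATES:
        return ["sentinel date (probably missing)"]
    duration = (end - start).days
    if duration < 0:
        # e.g. the UI refused a retrospective start date
        return ["negative duration"]
    if duration > MAX_PLAUSIBLE_DAYS:
        return ["implausibly long duration"]
    return []


# Hypothetical records: one clean, one with a sentinel start, one reversed.
jobs = [
    {"id": 1, "start": date(2022, 3, 1), "end": date(2022, 3, 4)},
    {"id": 2, "start": date(1900, 1, 1), "end": date(2022, 3, 4)},
    {"id": 3, "start": date(2022, 3, 10), "end": date(2022, 3, 4)},
]

report = {job["id"]: check_job(job["start"], job["end"]) for job in jobs}
for job_id, problems in report.items():
    print(job_id, problems or "ok")
```

Every threshold here encodes domain knowledge. Without it, a job lasting more than a century is just another number in the duration column.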

      --
      "I guess once you start doubting, there's no end to it." -Batou, Ghost in the Shell: Stand Alone Complex
    • (Score: 0) by Anonymous Coward on Tuesday June 14 2022, @11:23AM

      by Anonymous Coward on Tuesday June 14 2022, @11:23AM (#1253142)

      > It is precisely required to distill the abstract data features and then analyze the data without further referral to the application domain.

      Without domain knowledge, you don't know which features of the data are relevant and how to extract them.
      If you apply OCR techniques to the linguistic domain or to speech recognition, you are going to get shit out.

    • (Score: 4, Interesting) by Thexalon on Tuesday June 14 2022, @11:32AM (3 children)

      by Thexalon (636) on Tuesday June 14 2022, @11:32AM (#1253144)

      Data analysis can tell you what's happening that might be statistically interesting.

      What it can't tell you is why it's interesting, or what you should and shouldn't do with that information. For example, if your extremely sophisticated algorithms link variable X and Y as being correlated in some way, you're in the classic "correlation != causation" problem and can't determine if there's actually a causal link in one direction or the other, some unknown third variable that might not be a part of your data set, or just plain old coincidence. And that's where understanding the application domain matters, a lot.

      If you skip the step where you understand why the algorithm is telling you what it's telling you, the result is that you have a system that, say, discriminates against people named "Charles" because 100% of the people who led the Manson Family are named "Charles".
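
The "unknown third variable" case described in this comment can be shown with a small simulation. This is a sketch under assumed conditions: the variables `x`, `y`, and `z`, the noise levels, and the sample size are invented for illustration. Here `z` drives both `x` and `y`, which have no direct link to each other.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def partial_corr(x, y, z):
    """Correlation of x and y after controlling for z (first-order partial)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical setup: z is the hidden common cause; x and y never touch.
z = [rng.gauss(0, 1) for _ in range(5000)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson(x, y)
r_xy_given_z = partial_corr(x, y, z)
print(f"corr(x, y)     = {r_xy:.2f}")
print(f"corr(x, y | z) = {r_xy_given_z:.2f}")
```

With these settings the raw correlation should come out strong (around 0.8 in expectation), while the partial correlation should sit close to zero. The correction is only possible because `z` was recorded, and knowing that `z` needs to be recorded is a matter of domain knowledge, not of the algorithm.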

      --
      The only thing that stops a bad guy with a compiler is a good guy with a compiler.
      • (Score: 0) by Anonymous Coward on Tuesday June 14 2022, @12:40PM

        by Anonymous Coward on Tuesday June 14 2022, @12:40PM (#1253157)

        exactly, the machine mind says:
          "I am right, because I say I am right"

      • (Score: 0) by Anonymous Coward on Tuesday June 14 2022, @02:26PM (1 child)

        by Anonymous Coward on Tuesday June 14 2022, @02:26PM (#1253180)

        Hear, hear. Not grokking the domain leads to spurious correlation [wikipedia.org].

        No, not that [tylervigen.com]. But that's teh wierds.

        • (Score: 0) by Anonymous Coward on Wednesday June 15 2022, @04:52AM

          by Anonymous Coward on Wednesday June 15 2022, @04:52AM (#1253379)

          Excellent link (the first one). I had not come across that before!

  • (Score: 3, Interesting) by bzipitidoo on Tuesday June 14 2022, @12:44PM

    by bzipitidoo (4388) on Tuesday June 14 2022, @12:44PM (#1253158) Journal

    That "consider a researcher in ... social work" "example" was pretty bad, basically a non-example. I wonder if the CACM editors chopped out too much. It reads like a typical article that, out of considerations of limited space, omitted a lot of background. Ironic. They are still chopping out details as if we're still stuck with print and don't have the vast space of digital storage. And it's the ACM, of all organizations, doing that. It's further ironic in that one of the core problems in the use of ML is having good and sufficient data to work with, and here these editors have not allowed sufficient data for those of us humans who are not intimately familiar with this area to really grasp what the article is saying.

    For a better example, how would ML be applied to a big and complicated problem such as Global Warming? We could try to ask ML for recommendations, as in, build lots more wind and solar power plants as fast as possible, or improve battery electric vehicles, or maybe stick with combustion and use only biofuels, or work on carbon capture and sequestration facilities. Or maybe it would recommend we build sea walls in preparation for sea level rise. Yet another recommendation might be recognition that a lower world population would help a whole lot with this problem, and what should be done is a "one child" policy on the entire world, despite the extreme difficulties in incentivizing and/or enforcing that.

    Choosing among such options is the sort of question ML can't answer. That takes AI beyond what we have now, not mere ML. The questions need to be narrower. Like, suppose you have a budget of $1 billion, how much should be invested in wind and solar power generation, and how much into planting forests for carbon capture? That's what ML can handle, if it is given good data. And you'd want to feed it lots of variations, to see what results you get under different scenarios. Like, suppose the price of land rises sharply? Another narrower question could be about vehicles: traditional combustion or battery electric? Or fuel cells, ethanol, methanol, propane, hydrogen, flywheels, or some other energy storage or fuel system? The thing about the vehicles question is that maybe what should be considered is how to reduce vehicle usage, for instance, maybe more telecommuting and less suburban sprawl is the best bang for the buck? A simple way to account for that is to include in the propulsion method question an estimate of how much distance will be traveled collectively, without asking ML to pick between less travel and more efficient travel methods. Still other factors are such notions as lower rolling resistance tires, better aerodynamics, public transportation options, and self-driving capability.

    How all these things are modeled is crucial, and ML cannot help with that. We have to build the models, then apply ML. We have to decide whether and how to include in any model a particular factor such as better aerodynamics.

  • (Score: 2) by ikanreed on Tuesday June 14 2022, @02:10PM (1 child)

    by ikanreed (3164) Subscriber Badge on Tuesday June 14 2022, @02:10PM (#1253176) Journal

    Understanding what a given statistical analysis actually means should be really important to people involved in either domain of data analysis, but machine learning has a number of extremely important concepts that don't fit that model. Good luck finding a statistician who understands what attention in a transformer model represents, from just learning data science, for example.

    • (Score: 0) by Anonymous Coward on Wednesday June 15 2022, @04:50AM

      by Anonymous Coward on Wednesday June 15 2022, @04:50AM (#1253378)

      I'm not sure if anyone really understands what attention represents in a transformer. I certainly don't, but maybe that's just because I am single-headed, instead of multi-headed...
