
posted by janrinok on Thursday September 10 2015, @02:42AM   Printer-friendly
from the put-up-or-shut-up dept.

I've been looking into jobs for data analysts, sometimes called data scientists. I see that there is lots and lots of money being thrown at people to take "big data" (eg millions of data points for hundreds of different variables) and plug it into a sort of black box algorithm.

Roughly, these algorithms look at how the various inputs are correlated with each other and some outcome of interest, then assign a set of model parameters (sometimes called coefficients) that minimizes some kind of error metric. Some percent of the dataset is used for training, then the other part is "held out" for testing. It is then called an "accurate prediction" when the model can fit the testing dataset relatively well (ie the error metric is less than some threshold).
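
For concreteness, here is a minimal sketch of that workflow, assuming scikit-learn and a synthetic dataset (the model choice, error metric, and any accuracy threshold are purely illustrative):

# Minimal sketch of the train / hold-out workflow described above.
# Assumes scikit-learn; the synthetic data and choice of error metric are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10_000, n_features=100, noise=10.0, random_state=0)

# Part of the data is used for training; the rest is "held out" for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = Ridge(alpha=1.0)          # the coefficients ("model parameters") are fit here
model.fit(X_train, y_train)

error = mean_squared_error(y_test, model.predict(X_test))
print("hold-out error:", error)   # called an "accurate prediction" if below some threshold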

It is then assumed that the relationship between future input variables and outcome will be similar to that observed for the testing dataset. Based on this, business/policy decisions are made. There are some simpler situations like facial recognition where I would be optimistic regarding this final assumption. However, I have my doubts it is approximately true when it comes to human behavior, and never seem to see any actual predictive skill being assessed. See, for example, the press release associated with this story: https://soylentnews.org/article.pl?sid=15/09/08/1437220 .

What examples are there of these "predictive" machine learning algorithms being accurate? I mean using the same parameters (no tweaking) and new data that was unavailable at the time the model was developed. If you have an algorithm that really worked, there should be a webpage listing all the predictions. It'd be easy to prove you know what you are doing. Right?


Original Submission

Related Stories

Technology That Uses Machine Learning to Quickly Generate Predictive Models From Massive Datasets

Fujitsu Laboratories today announced the development of a machine-learning technology that can generate highly accurate predictive models from datasets of more than 50 million records in a matter of hours.

The new technology estimates machine-learning results from a small set of sample data and from the accuracy of past predictive models, extracts the learning algorithm and configuration combination that produces the most accurate result, and applies it to the larger dataset. This yields highly accurate predictive models from datasets of 50 million records in a few hours. Predictive models produced by this technology can be used to quickly make improvements, such as minimizing membership cancellations on e-commerce websites and improving response times to equipment failures. Details of this technology are being presented at the meeting on Information-Based Induction Sciences and Machine Learning (IBISML), opening Monday, September 14 at Ehime University in Japan.

The popularity of smartphones and other advances make it possible to gather massive quantities of sensor data, and machine learning and other advanced analytic techniques are being used extensively to extract valuable information from that data. Using the access logs of e-commerce websites, for example, it is possible to discover when people are most likely to cancel memberships on a given website, to identify those people quickly, and to take measures to discourage cancellation. Using detailed daily power-consumption data, it is possible to discover patterns of increased or decreased usage and to predict periods and times when power usage will increase. This can lead to a reduction in power costs by applying more precise controls over power generation, transmission, and storage.

Developing predictive models through machine learning is considered an effective way to obtain accurate predictions. There are numerous learning algorithms, however, and the accuracy of a given algorithm's predictions will also depend on fine-tuning its configuration. Therefore, generating an effective predictive model requires examining combinations of algorithms and configurations.

Attempting to examine every possible combination of algorithm and conditions causes the number of combinations to balloon quickly. Furthermore, a single combination can take days of learning time to examine, making it impractical to use machine learning extensively. Instead, algorithms and conditions are typically selected by analysts based on their experience, so the results ultimately depend heavily on the analyst's skill. In cases where the volume of data is great and analysis ends up taking more than one night, examinations are usually limited to a restricted number of combinations, or analysis can only be applied to a small portion of the data, and it is not possible to automatically derive accurate predictive models in a limited period of time.

Fujitsu Laboratories has developed a technology that estimates machine-learning results and is thereby able to generate and automatically tune an accurate predictive model from a small amount of sample data. It has prototyped this technology on Apache Spark, an open-source platform for parallel execution.
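
The announcement gives no implementation details, but the general idea (estimate which algorithm and configuration will do best from a small sample, then spend the expensive training time only on the winner) can be sketched roughly as follows. This is purely illustrative, assumes scikit-learn and a classification task, and is not Fujitsu's actual method:

# Speculative sketch of the idea described above: evaluate candidate
# algorithm/configuration combinations cheaply on a small sample, then fit
# only the winner on the full dataset. NOT Fujitsu's actual method.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

candidates = [
    LogisticRegression(C=1.0, max_iter=1000),
    RandomForestClassifier(n_estimators=100),
    GradientBoostingClassifier(learning_rate=0.1),
]

def pick_and_fit(X_full, y_full, sample_size=50_000):
    X_small, y_small = X_full[:sample_size], y_full[:sample_size]
    # Estimate each candidate's accuracy on the small sample...
    best = max(candidates,
               key=lambda m: cross_val_score(m, X_small, y_small, cv=3).mean())
    # ...then train only the best one on all records.
    return best.fit(X_full, y_full)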


  • (Score: 2) by meisterister on Thursday September 10 2015, @03:03AM

    by meisterister (949) on Thursday September 10 2015, @03:03AM (#234492) Journal

    ...is actually generally pretty good, and has been for a very long time.

    There's currently a very big push in computer vision to make models applicable for the world at large. For examples of this, look at Microsoft's COCO dataset. At the same time, there's a small movement in Open Set Recognition for the algorithms to be able to eventually label and improve themselves so that each predictor could become increasingly accurate.

    --
    (May or may not have been) Posted from my K6-2, Athlon XP, or Pentium I/II/III.
    • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @03:20AM

      by Anonymous Coward on Thursday September 10 2015, @03:20AM (#234500)

      I mention facial recognition in the question, saying I believe it is reasonable there that past relationships between inputs will be similar in the future. The properties of faces are rather stable. That is more like categorization than prediction as used by marketers, etc. Another aspect is that the data (images) are of relatively high quality.

    • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @03:44AM

      by Anonymous Coward on Thursday September 10 2015, @03:44AM (#234507)

      Facial recognition will almost entirely be used to violate our privacy en masse by corporations and the government, so I'm not sure how thrilled I am about that.

    • (Score: 2) by q.kontinuum on Thursday September 10 2015, @07:18AM

      by q.kontinuum (532) on Thursday September 10 2015, @07:18AM (#234562) Journal

      ...is actually generally pretty good

      Indeed. The last time I entered "Facial" in google image search, nearly all pictures matched the criteria.

      --
      Registered IRC nick on chat.soylentnews.org: qkontinuum
  • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @03:10AM

    by Anonymous Coward on Thursday September 10 2015, @03:10AM (#234493)

    This website promises big data analysis, but for smaller companies (big companies have already been doing this for years): http://www.prenostik.com/?page_id=144 [prenostik.com]

    Their black box does all the hard work of analyzing your sales or other data...

  • (Score: 2) by frojack on Thursday September 10 2015, @03:45AM

    by frojack (1554) Subscriber Badge on Thursday September 10 2015, @03:45AM (#234508) Journal

    If you have an algorithm that really worked, there should be a webpage listing all the predictions. It'd be easy to prove you know what you are doing. Right?

    If you have a predictive engine, say for stock trading, would you give that away on a web page? Or would you sell it?
    If it gave accurate military predictions of enemy activity, would you publish that?

    You will never know how many such prediction engines exist, or how accurate they are.
    You might find out about a few of them, but not the most successful ones.

    Mostly, though, they exist everywhere; they just don't stand out as prediction engines or machine-learning examples.

    --
    No, you are mistaken. I've always had this sig.
    • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @04:04AM

      by Anonymous Coward on Thursday September 10 2015, @04:04AM (#234514)

      I agree. Still, shouldn't there be someone somewhere who published the algorithm (including all parameters, etc) so that it could be checked against new data they lacked access to at the time of publication, with people later checking the predictions against that new data? With all the hype, this must have occurred at least a few times...

      -OP anon

      • (Score: 2) by frojack on Thursday September 10 2015, @04:47AM

        by frojack (1554) Subscriber Badge on Thursday September 10 2015, @04:47AM (#234526) Journal

        The thing is, these things tend to change names once they are completed.

        Once completed, they were often called Expert Systems. They are canned and put into production in many different industries. Some examples of mature expert systems [wikipedia.org] can be found in medicine, baseball, engineering, etc.

        Some of these early ones were developed by just picking the brains of experts and trying to encode that knowledge. But that gave way to simply entering hundreds of cases and doing massive amounts of multivariate analysis until they could explain most of the data elements.

        The end results were built into software and deployed, sometimes worldwide.

         

        --
        No, you are mistaken. I've always had this sig.
        • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @04:55AM

          by Anonymous Coward on Thursday September 10 2015, @04:55AM (#234530)

          Thanks for that term. I would expect that at least some of these projects were "completed" after a public demonstration of predictive skill; I am looking for a few examples of this.

    • (Score: 2) by q.kontinuum on Thursday September 10 2015, @05:44AM

      by q.kontinuum (532) on Thursday September 10 2015, @05:44AM (#234540) Journal

      Funny thing is that by publishing the predicted results for the topics you mentioned, they would heavily influence the results as well. Imagine a credible source publishing tomorrow's stock prices: people would react to the news, and the stock price would change based on the prediction. Same for enemy troop movements.

      Such a website could only work for interesting facts with either little incentive or no possibility for the audience to impact the results.

      --
      Registered IRC nick on chat.soylentnews.org: qkontinuum
      • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @12:38PM

        by Anonymous Coward on Thursday September 10 2015, @12:38PM (#234620)

        For stock prices, such predictions might work very well: if they predict a rising stock, people will buy the stock, and thus it will rise. If they predict the stock will fall, people will sell the stock, and it will fall.

        For enemy movements, publishing will not work well, however. You can bet the enemy will do everything to make those predictions as useless as possible.

    • (Score: 2) by q.kontinuum on Thursday September 10 2015, @05:56AM

      by q.kontinuum (532) on Thursday September 10 2015, @05:56AM (#234545) Journal

      One option would be to publish e.g. the predicted stock prices today, but encrypted. The decryption key could then be published after the prediction period.

      <tongue-in-cheek>The best effect can be achieved when the results are encrypted using a one-time pad.</tongue-in-cheek>
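
      Something like this sketch would do it, assuming Python's third-party cryptography package (the prediction text is made up):

      # Publish the ciphertext today, publish the key after the prediction period.
      # Assumes the third-party "cryptography" package; the prediction is made up.
      from cryptography.fernet import Fernet

      key = Fernet.generate_key()                   # keep this secret for now
      prediction = b"ACME Corp closes at 123.45 on 2015-09-11"
      ciphertext = Fernet(key).encrypt(prediction)  # publish this immediately

      # Later, publish `key`; anyone can then verify the committed prediction:
      assert Fernet(key).decrypt(ciphertext) == prediction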

      --
      Registered IRC nick on chat.soylentnews.org: qkontinuum
      • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @01:51PM

        by Anonymous Coward on Thursday September 10 2015, @01:51PM (#234649)

        Yes, good idea. This is the old school approach: smaismrmilmepoetaleumibunenugttauiras
        https://en.wikipedia.org/wiki/Anagram#Establishment_of_priority [wikipedia.org]

        Most of these models have predictions that are dependent on varying input though, so they would not be in the form of a simple number or existence of something. We would need all the data and parameters used to really check them.

        -OP anon

  • (Score: 2, Interesting) by Ethanol-fueled on Thursday September 10 2015, @03:52AM

    by Ethanol-fueled (2792) on Thursday September 10 2015, @03:52AM (#234511) Homepage

    Sounds to me like guerrilla job posting. NSA pre-crime and all.

    But isn't a lot of this kind of thing still verified by humans? For example, captchas with address numbers and other challenges that could easily be seen as training machines?

    Hell, companies like Samsung (and likely Apple with Siri) are still sending juicy bits of your voice conversations [vice.com] to be verified by real meatbags, and there's probably some higher-context human training elsewhere we haven't heard about yet.

    What you need is to get together with geniuses, preferably those with some technical background, who have an intuitive feel of the context of data you're trying to mine. Maybe try to get more psychologists and social scientists along with your drab machine-learning types.

    • (Score: 1, Interesting) by Anonymous Coward on Thursday September 10 2015, @04:13AM

      by Anonymous Coward on Thursday September 10 2015, @04:13AM (#234520)

      Sounds to me like guerrilla job posting. NSA pre-crime and all.

      If I follow you, that is wrong. I am actually the one looking for a job; when I saw what these companies are asking for, I started suspecting they are being sold snake oil. Of course, if all the successful algorithms are secret, who would know?

      What you need is to get together with geniuses, preferably those with some technical background, who have an intuitive feel of the context of data you're trying to mine. Maybe try to get more psychologists and social scientists along with your drab machine-learning types.

      Yes, I suspect a theory-driven approach would perform better. There seems to be no one who actually (knows they) wants to pay for this, though. People think "psychology is so complicated, that is too hard", but if you look at the early days of psychology you find they were making progress in this area. That mostly stopped around 1940.

  • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @04:22AM

    by Anonymous Coward on Thursday September 10 2015, @04:22AM (#234522)

    Autonomous cars' computers mainly use predictive algorithms to try to figure out what all the meat computers surrounding them are about to do. They are not perfect yet (see issues with hipsters riding fixies), but they are pretty darn good -- there still hasn't been an accident that was the car's fault.

    Apparently Netflix is very good at suggesting movies people will like based on past selections.

    Amazon's recommendation engine is pretty good as well.

    Store layouts, product packaging, etc. are all driven by predictions of human behavior based on past examples. And for some of these, for some products (e.g., political candidates for national office), "data scientists" using machine learning are probably involved in the recommendations.

    Humans are probably more predictable than you think.

    • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @04:31AM

      by Anonymous Coward on Thursday September 10 2015, @04:31AM (#234523)

      Yes, there are many press releases claiming such things. Where is the evidence available for any though? I don't expect everyone to be sharing their secret sauce, but it appears exceptionally rare.

      It is relatively straightforward to prove it works, and I really only want a few examples. All it takes is sharing the initial dataset, the code, and the coefficients output by the code at a given date. Then get a new dataset (collected after that date), plug it into the code, and compare the predictions. I am sure a successful example of this exists somewhere...
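
      Roughly what I have in mind, as a sketch (the file names and the hash commitment are just illustration, assuming a scikit-learn-style model saved with pickle):

      # Sketch of a "freeze now, verify later" check. File names and the hash
      # commitment are illustrative; assumes a scikit-learn-style pickled model.
      import hashlib, json, pickle

      # At publication time: freeze the fitted model and commit to its fingerprint.
      with open("model_2015-09-10.pkl", "rb") as f:
          frozen = f.read()
      print("publish this hash today:", hashlib.sha256(frozen).hexdigest())

      # Later, on data collected only after that date: no refitting, no tweaking.
      model = pickle.loads(frozen)
      with open("data_collected_later.json") as f:
          new_data = json.load(f)
      score = model.score(new_data["X"], new_data["y"])  # predictions vs. reality
      print("out-of-time performance:", score)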

  • (Score: 4, Interesting) by c0lo on Thursday September 10 2015, @04:50AM

    by c0lo (156) on Thursday September 10 2015, @04:50AM (#234528) Journal

    If we are to trust Frank Herbert (Dune... remember?), there's no distinction between accurately predicting the future and actually determining it.
    My point? I doubt such an algorithm will ever exist in the hard sense. But...

    This is not to say that predictive technologies cannot be useful in routine/mundane activities... closer to reflex arcs, like "predicting" that a ball suddenly on the road may be followed by a kid running after it; even if the prediction will be wrong, I'd feel better if a self-driven car would slow down. And I reckon the current "predictive" technologies may be quite a good tool for it.

    --
    https://www.youtube.com/watch?v=aoFiw2jMy-0
    • (Score: 2) by Gravis on Thursday September 10 2015, @12:33PM

      by Gravis (4596) on Thursday September 10 2015, @12:33PM (#234618)

      like "predicting" that a ball suddenly on the road may be followed by a kid running after it; even if the prediction will be wrong, I'd feel better if a self-driven car would slow down.

      and sacrifice an easy 10 points? well, you're a better man than i am. #CarmageddonLife

    • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @02:21PM

      by Anonymous Coward on Thursday September 10 2015, @02:21PM (#234667)

      "predicting" that a ball suddenly on the road may be followed by a kid running after it

      This is a lower-level machine vision application (distinguish ball from tumbleweed), which I am more optimistic about. Also, I would bet they hard-code some stuff like that, which would be kind of cheating in the context of this question. That is something that may well be done, though: the programmer hard-codes certain relationships via assumptions or the tweaking process, possibly even unconsciously. Then the actual algorithm does "better than chance" because it has been informed by human intuition and experience (ie "common sense").

        Essentially the team creating the algorithm would be paid for their opinion, but this is obfuscated by all the code and math.

  • (Score: 1, Insightful) by Anonymous Coward on Thursday September 10 2015, @06:32AM

    by Anonymous Coward on Thursday September 10 2015, @06:32AM (#234555)

    The question seems ill-formed. "Machine learning" is just mathematical modeling, albeit usually referring to cases where the equations generated by the model are too complicated to be understood directly. Linear regression, for instance, is a very simple case, which can be quite useful for modeling simple phenomena. If you want a good example of a mathematical model that makes predictions that are constantly checked, then look at weather forecasts.
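
    (As a trivial sketch of that simplest case, assuming numpy and made-up numbers: fit the coefficients on past data, then apply the same coefficients to a new input.)

    # Linear regression as the simplest "machine learning" model.
    # The numbers are made up; numpy is assumed.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 7.8])
    slope, intercept = np.polyfit(x, y, deg=1)   # "training"
    print(slope * 5.0 + intercept)               # "prediction" for a new input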

    • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @01:31PM

      by Anonymous Coward on Thursday September 10 2015, @01:31PM (#234641)

      I agree; these are examples of empirical models, just like linear regression. I would not so easily accept that linear regression has been useful, though. Most of those "models" also would not withstand the test I am asking for. Look at the fields that rely heavily on such non-theory-driven models (eg, medicine and psychology): they have all practically abandoned direct experimental replication because it makes everyone feel bad. When the results are checked, only 10-40% of effects are even in the same direction.

      Weather forecasting I would expect to be theory-driven rather than totally empirical, but thanks for that idea; I will look into it.

      -OP anon

    • (Score: 2) by Fnord666 on Thursday September 10 2015, @03:51PM

      by Fnord666 (652) on Thursday September 10 2015, @03:51PM (#234707) Homepage

      If you want a good example of a mathematical model that makes predictions that are constantly checked, then look at weather forecasts.

      But the OP was asking for models that work.

  • (Score: 3, Informative) by mtrycz on Thursday September 10 2015, @01:18PM

    by mtrycz (60) on Thursday September 10 2015, @01:18PM (#234634)

    I'm fairly familiar with machine learning, both as a grad student and out of personal interest, but I have never done any professional work with it. For my exams, I prepared with the excellent course by Ng on Coursera (the ML course was actually produced to promote the Coursera site; it's really well done, and I highly recommend it).

    You're getting it right: the maths behind all the various models are the same (you train a model to fit a training set and test it with a test set).

    What's difficult is making any sense of the data. Your datapoints are commonly multi-dimensional, so they are a pain to process. You need to focus on the most significant features/dimensions of your dataset to make it processable. To do that you need both a strong competence with ML methods and a strong knowledge of the dataset's domain. Most basically: you have to know (in advance) what to search for. Also, you need to demonstrate that the output makes some sense (ML methods are black boxes most of the time, and this is very difficult).
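
    To make the "focus on the most significant features" step concrete, a typical first pass might look something like this sketch (assuming scikit-learn and synthetic data; picking k is exactly where the domain knowledge comes in):

    # Sketch of narrowing a high-dimensional dataset down to its strongest features.
    # Assumes scikit-learn; the data is synthetic and k=10 is an arbitrary choice.
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectKBest, f_regression

    X, y = make_regression(n_samples=1000, n_features=200, n_informative=10,
                           random_state=0)

    selector = SelectKBest(score_func=f_regression, k=10)
    X_reduced = selector.fit_transform(X, y)     # keep only the 10 strongest features
    print(selector.get_support(indices=True))    # which columns survived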

    Hence: yes, they are being sold snake oil. *But* if there is a domain expert who will be working with you, then maybe it can work. No free lunch, though. There will be a lot of preparatory work and trial and error before getting something right. It's quite empirical, and you certainly need to both know what those little parameters do and have some experience fiddling with them.

    ---
    Look at it this way: having a general method for making sense of data with minimal tweaking is roughly equivalent to having Strong AI. It doesn't exist (and personally, I argue it won't).

    --
    In capitalist America, ads view YOU!
    • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @02:10PM

      by Anonymous Coward on Thursday September 10 2015, @02:10PM (#234660)

      Thanks, what you describe is what I would expect. I do have some experience in this area, for example using symbolic regression. In the end there is an equation like y=(54.564x^3.2 +4.214(z/x)^4.32)/(1-.0005*a) that pretty much fits the data, but even if I know all the variables wtf do the coefficients and relationships between them mean?

      I just have trouble believing such methods are useful beyond detecting possibly interesting patterns, more for hypothesis generation than anything. As mentioned in other posts, I think this may work for some simpler problems like machine vision, which is more about identifying and categorizing patterns than prediction.

      Did that course include any examples of success stories where the data and model is publicly available?

      -OP anon

      • (Score: 2) by mtrycz on Friday September 11 2015, @12:20PM

        by mtrycz (60) on Friday September 11 2015, @12:20PM (#235202)

        Did that course include any examples of success stories where the data and model is publicly available?

        Can't remember, but I don't think so. You might be able to find some around, since ML is getting a lot of hype these years, but since a working model is a valuable asset, I don't think businesses would just share them. Big data processing is entering the not-for-profit world too, but I don't know.

        Anyway, generally, excluding ad hoc expert systems, you'd use a model to discover correlations, and sort out the spurious ones with a domain expert. Then try to find a causal meaning for the correlations. It's an empirical science.

        --
        In capitalist America, ads view YOU!
  • (Score: 2) by Fnord666 on Thursday September 10 2015, @03:57PM

    by Fnord666 (652) on Thursday September 10 2015, @03:57PM (#234711) Homepage
    Have a look at the Netflix Prize Contest [netflixprize.com] for an example of what can be done and how improvements can be made along the way.
  • (Score: 5, Interesting) by AnonymousCowardNoMore on Thursday September 10 2015, @06:17PM

    by AnonymousCowardNoMore (5416) on Thursday September 10 2015, @06:17PM (#234797)

    It is an old joke in AI circles that "a neural network is the second best solution to any problem".

    I can tell you that, while they may not be perfect, neural networks are very well suited to a wide variety of industrial control applications. The usual case is that there is some machine or process that needs to be controlled but the standard textbook models are not a sufficiently good approximation (or there may not be any). Regression models and neural networks (which can be seen as a generalisation of regression modelling) are commonly used in these cases.

    I don't really want to give details from my workplace. But let me give a hypothetical example:

    Let's say that you are a chemical company which produces unobtanium adamantate. During your process, there is a certain amount of undesirable contact with the air, which results in the formation of oxides and dissolved nitrogen in the final and intermediate products. The unobtanium compounds are also quite corrosive, and result in some contamination from and degradation of the reactor vessel. There are also impurities in your feedstock. These and similar problems exist at every step in the process. Now, all of this means that instead of a few clean reactions, you have multiple parasitic processes forming undesirable contaminants and you need to control the rates and equilibria of multiple reactions. You add one reactant to react with a contaminant so that it can be removed, then add another reactant to react with the unwanted by-products caused by the first reactant, several levels deep. During all of this, the equipment is degrading as I have already stated, and its performance changes in subtle but important ways. Also, despite the best efforts of the manufacturers, every piece of equipment even in the same line performs slightly differently.

    You cannot model these subtleties with a simple textbook reaction. You need some sort of machine-learning algorithm to get good performance, unless you want an army of engineers constantly studying the process and making little adjustments by the hour.
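
    A heavily stripped-down, hypothetical sketch of what such a learned process model could look like (assuming scikit-learn's MLPRegressor and made-up sensor data; a real plant would obviously need far more care than this):

    # Hypothetical sketch: predict product purity from sensor readings instead of
    # from a clean textbook reaction model. All data here is made up.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    sensors = rng.normal(size=(5000, 12))            # temperatures, flows, pressures...
    purity = (sensors @ rng.normal(size=12)          # messy, non-textbook relationship
              + 0.1 * np.sin(sensors[:, 0])
              + rng.normal(scale=0.05, size=5000))

    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000)
    model.fit(sensors[:4000], purity[:4000])
    print("R^2 on later batches:", model.score(sensors[4000:], purity[4000:]))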

    • (Score: 0) by Anonymous Coward on Thursday September 10 2015, @09:25PM

      by Anonymous Coward on Thursday September 10 2015, @09:25PM (#234919)

      I think this is another case of a stable situation with high quality data. My pessimism is regarding the assumption that things will remain stable enough to get any use out of it.

      • (Score: 2) by AnonymousCowardNoMore on Saturday September 12 2015, @07:55AM

        by AnonymousCowardNoMore (5416) on Saturday September 12 2015, @07:55AM (#235515)

        stable situation with high quality data

        Hehe. I take it you haven't worked with mass-produced lab results before.

        If you read my post again, you will notice that I specifically mention how the situation is constantly changing, and that this is the reason why you should use machine learning. It is true that you often want to use simple semi-empirical models. It is also true that you can often get away with a very crude application of Gaussian models. But there is a time and a place in industry for some reasonably complicated machine learning methods, even if they don't look as flashy as object recognition.