
posted by janrinok on Monday April 25 2022, @06:46AM   Printer-friendly
from the just-one-blur-I-recognize dept.

MIT's newest computer vision algorithm identifies images down to the pixel:

For humans, identifying items in a scene [...] is as simple as looking at them. But for artificial intelligence and computer vision systems, developing a high-fidelity understanding of their surroundings takes a bit more effort. Well, a lot more effort. Around 800 hours of hand-labeling-training-images effort, if we're being specific. To help machines better see the way people do, a team of researchers at MIT CSAIL, in collaboration with Cornell University and Microsoft, has developed STEGO, an algorithm able to identify images down to the individual pixel.

Normally, creating CV training data involves a human drawing boxes around specific objects within an image — say, a box around the dog sitting in a field of grass — and labeling those boxes with what's inside ("dog"), so that the AI trained on it will be able to tell the dog from the grass. STEGO (Self-supervised Transformer with Energy-based Graph Optimization), conversely, uses a technique known as semantic segmentation, which applies a class label to each pixel in the image to give the AI a more accurate view of the world around it.
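The difference between the two labeling styles can be sketched in a few lines of NumPy. The image, classes, and pixel values below are all invented for illustration: a "dog" blob with a leg poking outside the tightest bounding box, sitting in "grass".

```python
import numpy as np

# Hypothetical 8x8 label image: a "dog" blob in a field of "grass".
GRASS, DOG = 0, 1
mask = np.full((8, 8), GRASS)
mask[2:6, 3:6] = DOG   # core of the dog
mask[3:5, 2] = DOG     # a leg sticking out of the rectangle

# Bounding-box labeling: the tightest box around all dog pixels.
ys, xs = np.where(mask == DOG)
box_area = (ys.max() + 1 - ys.min()) * (xs.max() + 1 - xs.min())

# Semantic segmentation: only the pixels that are actually dog.
dog_pixels = int((mask == DOG).sum())

print(box_area, dog_pixels)  # 16 14 -- the box sweeps in grass pixels too
```

The box necessarily contains non-dog pixels (16 labeled vs. 14 actual dog pixels here); the per-pixel mask does not.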

Whereas a labeled box would have the object plus other items in the surrounding pixels within the boxed-in boundary, semantic segmentation labels every pixel in the object, but only the pixels that comprise the object — you get just dog pixels, not dog pixels plus some grass too. It's the machine learning equivalent of using the Smart Lasso in Photoshop versus the Rectangular Marquee tool.

The problem with this technique is one of scope. Conventional multi-shot supervised systems often demand thousands, if not hundreds of thousands, of labeled images with which to train the algorithm. Multiply that by the 65,536 individual pixels that make up even a single 256x256 image, all of which now need to be individually labeled as well, and the workload required quickly spirals into impossibility.
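The arithmetic behind that "spirals into impossibility" claim is easy to check with the article's own numbers (the image count is an assumption taken from the "hundreds of thousands" figure):

```python
# Back-of-the-envelope labeling workload, using the article's numbers.
images = 100_000                  # "hundreds of thousands" of images
pixels_per_image = 256 * 256      # 65,536 pixels in one 256x256 image
total_pixels = images * pixels_per_image

print(pixels_per_image)  # 65536
print(total_pixels)      # 6,553,600,000 individual pixel labels
```

Even at one label per second with no breaks, that many pixels would take well over two centuries of human effort.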

Instead, "STEGO looks for similar objects that appear throughout a dataset," the CSAIL team wrote in a press release Thursday. "It then associates these similar objects together to construct a consistent view of the world across all of the images it learns from."
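The core idea of grouping similar-looking pixels across a dataset without any labels can be illustrated with a toy clustering sketch. Everything here is invented for illustration (synthetic RGB "pixels", a bare-bones k-means loop); STEGO's actual method is considerably more sophisticated, distilling features from a self-supervised transformer backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pixel features: green-ish "grass" and brown-ish "dog" RGB
# values (made-up colors), pooled across a pretend dataset.
grass = rng.normal([0.2, 0.7, 0.2], 0.05, size=(50, 3))
dog = rng.normal([0.6, 0.4, 0.2], 0.05, size=(50, 3))
pixels = np.vstack([grass, dog])

# Minimal k-means: group pixels by feature similarity, no labels needed.
centers = pixels[[0, -1]].copy()  # crude initialization, one per cluster
for _ in range(10):
    dists = np.linalg.norm(pixels[:, None] - centers[None], axis=2)
    assign = dists.argmin(axis=1)
    for k in range(2):
        centers[k] = pixels[assign == k].mean(axis=0)

# Grass pixels and dog pixels land in different clusters, unsupervised.
print(assign[:50].std() == 0 and assign[50:].std() == 0)  # True
```

The point of the sketch: similarity alone, applied consistently across many images, is enough to recover coherent object groupings without a human ever drawing a box.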

"If you're looking at oncological scans, the surface of planets, or high-resolution biological images, it's hard to know what objects to look for without expert knowledge. In emerging domains, sometimes even human experts don't know what the right objects should be," MIT CSAIL PhD student, Microsoft Software Engineer, and the paper's lead author Mark Hamilton said. "In these types of situations where you want to design a method to operate at the boundaries of science, you can't rely on humans to figure it out before machines do."

[...] Despite its superior performance to the systems that came before it, STEGO does have limitations. For example, it can identify both pasta and grits as "food-stuffs" but doesn't differentiate between them very well. It also gets confused by nonsensical images, such as a banana sitting on a phone receiver. Is this a food-stuff? Is this a pigeon? STEGO can't tell. The team hopes to build a bit more flexibility into future iterations, allowing the system to identify objects under multiple classes.


Original Submission

Related Stories

New AI Model Can “Cut Out” Any Object Within an Image—and Meta is Sharing the Code 15 comments

https://arstechnica.com/information-technology/2023/04/meta-introduces-ai-model-that-can-isolate-and-mask-objects-within-images/

On Wednesday, Meta announced an AI model called the Segment Anything Model (SAM) that can identify individual objects in images and videos, even those not encountered during training, reports Reuters.

According to a blog post from Meta, SAM is an image segmentation model that can respond to text prompts or user clicks to isolate specific objects within an image. Image segmentation is a process in computer vision that involves dividing an image into multiple segments or regions, each representing a specific object or area of interest.
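"Dividing an image into multiple segments or regions" can be demonstrated with the simplest possible segmenter: a flood-fill that labels connected regions of object pixels. This is a generic toy (made-up binary image, 4-connectivity), not SAM's method, which is prompt-driven and learned.

```python
from collections import deque

import numpy as np

# Toy binary image: 1 = object pixel, 0 = background.
img = np.array([
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
])

def segments(img):
    """Label each 4-connected region of 1-pixels via flood fill."""
    h, w = img.shape
    labels = np.zeros((h, w), dtype=int)
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y, x] == 1 and labels[y, x] == 0:
                count += 1
                labels[y, x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny, nx] == 1 and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            queue.append((ny, nx))
    return labels, count

labels, n = segments(img)
print(n)  # 2 -- two separate objects found
```

SAM's advance over this kind of classical segmentation is that it segments semantically meaningful objects it was never trained on, steered by clicks or text prompts rather than raw pixel connectivity.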

The purpose of image segmentation is to make an image easier to analyze or process. Meta also sees the technology as being useful for understanding webpage content, augmented reality applications, image editing, and aiding scientific study by automatically localizing animals or objects to track on video.

Related:
MIT's Computer Vision (CV) Algorithm Identifies Images Down to the Pixel (20220424)
NVIDIA Research's GauGAN AI Art Demo Responds to Words (20211130)
Ask Soylent: Beginning in Artificial Intelligence Methods (20150629)


Original Submission

  • (Score: 1, Funny) by Anonymous Coward on Monday April 25 2022, @09:36AM (1 child)

    by Anonymous Coward on Monday April 25 2022, @09:36AM (#1239290)

    When are they going to invent an AI for creating acronyms for the AI projects?

    • (Score: 2) by Freeman on Monday April 25 2022, @02:05PM

      by Freeman (732) Subscriber Badge on Monday April 25 2022, @02:05PM (#1239333) Journal

      They tried that, but they trained it on the same set of Nazi propaganda that killed Microsoft's chat bot. (Okay, no, but that's kinda funny and creepy all at the same time.)

      --
      Joshua 1:9 "Be strong and of a good courage; be not afraid, neither be thou dismayed: for the Lord thy God is with thee"
  • (Score: 3, Interesting) by The Vocal Minority on Monday April 25 2022, @10:20AM (2 children)

    by The Vocal Minority (2765) on Monday April 25 2022, @10:20AM (#1239294) Journal

    Had a quick look at the paper. Performance isn't great compared to Convolutional Neural Networks, but being able to get a semi-decent segmentation map on unlabeled data is a big plus.

    • (Score: 0) by Anonymous Coward on Monday April 25 2022, @01:44PM (1 child)

      by Anonymous Coward on Monday April 25 2022, @01:44PM (#1239321)

      Just a guess...manufacturers of memory (RAM, SSD, hard drives) are going to love this.

Also, how do they decompose an object? For example, are the dog's eye(s) also identified down to the pixel?

      • (Score: 2) by The Vocal Minority on Saturday April 30 2022, @06:32AM

        by The Vocal Minority (2765) on Saturday April 30 2022, @06:32AM (#1240952) Journal

        Sorry for the late reply, I'm super busy these days and posting on SN isn't at the top of my priority list. All deep learning models require a literal fuckton (that's the technical term) of RAM, and this one is probably no different.

Usually semantic segmentation will identify objects, at the pixel level, as being a specific type of thing. So for something like an autonomous vehicle it is marking cars, roads, pedestrians, etc. From my quick read of the paper, it looks like it is using k-nearest neighbour to do some of the heavy lifting (it's a technique that's been around since the 1950s, but the AI people seem to find it useful). It also feeds in a similar image and a dissimilar image alongside the input image, so I guess it is doing some sort of transformation on the differences between the images to set up appropriate boundaries before feeding the output into the KNN module. Not sure how it is determining the actual labels themselves. I'm not an expert in this area and didn't read the paper very closely.
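The k-nearest-neighbour step the comment mentions can be sketched generically. The feature vectors and classes below are invented toy data; this is plain k-NN majority voting, not STEGO's actual pipeline.

```python
import numpy as np

# Toy labeled feature vectors (e.g. per-pixel features) and their classes.
train = np.array([[0.1, 0.1], [0.2, 0.0], [0.9, 0.8], [1.0, 1.0]])
classes = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train - x, axis=1)
    nearest = classes[np.argsort(dists)[:k]]
    return int(np.bincount(nearest).argmax())

print(knn_predict(np.array([0.15, 0.05])))  # 0
print(knn_predict(np.array([0.95, 0.90])))  # 1
```

Despite its age, k-NN remains popular as a final classification stage because it needs no training of its own: given good features, nearby points simply share a label.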

  • (Score: 1, Interesting) by Anonymous Coward on Monday April 25 2022, @02:57PM (1 child)

    by Anonymous Coward on Monday April 25 2022, @02:57PM (#1239350)

    STEGO does have limitations. For example, it can identify both pasta and grits as "food-stuffs" but doesn't differentiate between them very well. It also gets confused by nonsensical images, such as a banana sitting on a phone receiver.

    It's the mistakes that show you how much actual intelligence was involved in making the decisions. If you make smart mistakes then you were being intelligent at that time. If you make stupid mistakes you were being stupid at that time.

    In this case it shows the system is actually quite stupid.

    Cornell University and Microsoft have developed STEGO, an algorithm able to identify images down to the individual pixel.

    Thinking about stuff at the pixel level is completely the wrong approach.

Use the pixels to figure out the 3D objects in a scene. The system has to build models of the world, then compare its predicted worlds with perceived reality and notice the differences. Only then can the system start making better models and better decisions.

    A dog or even a bird can figure out the difference between a bus and a car without thousands of training samples.

    • (Score: 0) by Anonymous Coward on Monday April 25 2022, @08:44PM

      by Anonymous Coward on Monday April 25 2022, @08:44PM (#1239451)

      > ... such as a banana sitting on a phone receiver.

      So, I wonder how it deals with this 2D banana image?

      https://liveforlivemusic.com/features/velvet-underground-debut-album/ [liveforlivemusic.com]
If it's any good at all, it should figure out that it's a Warhol painting and an LP cover.
