
posted by janrinok on Monday April 10, @05:49PM   Printer-friendly
from the cut-and-paste dept.

Meta introduces AI model that can isolate and mask objects within images

https://arstechnica.com/information-technology/2023/04/meta-introduces-ai-model-that-can-isolate-and-mask-objects-within-images/

On Wednesday, Meta announced an AI model called the Segment Anything Model (SAM) that can identify individual objects in images and videos, even those not encountered during training, reports Reuters.

According to a blog post from Meta, SAM is an image segmentation model that can respond to text prompts or user clicks to isolate specific objects within an image. Image segmentation is a process in computer vision that involves dividing an image into multiple segments or regions, each representing a specific object or area of interest.

The purpose of image segmentation is to make an image easier to analyze or process. Meta also sees the technology as being useful for understanding webpage content, augmented reality applications, image editing, and aiding scientific study by automatically localizing animals or objects to track on video.
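
As a rough sketch of that click-to-segment workflow, here is how a single point prompt might look with Meta's published segment-anything package; the checkpoint file name, image path, and click coordinates below are placeholders, not details from the announcement:

    import numpy as np
    import cv2
    from segment_anything import sam_model_registry, SamPredictor

    # Load a SAM checkpoint (placeholder file name for whichever weights were downloaded).
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # Embed the image once; it can then be prompted repeatedly at low cost.
    image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    # One foreground click at (x, y); label 1 means "this point is on the object".
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return a few candidate masks at different granularities
    )
    best_mask = masks[np.argmax(scores)]  # boolean HxW array isolating the clicked object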

Related:
MIT's Computer Vision (CV) Algorithm Identifies Images Down to the Pixel (20220424)
NVIDIA Research's GauGAN AI Art Demo Responds to Words (20211130)
Ask Soylent: Beginning in Artificial Intelligence Methods (20150629)


Original Submission

Related Stories

Ask Soylent: Beginning in Artificial Intelligence Methods 38 comments

I'm a neuroscientist in a doctoral program, but I have a growing interest in deep learning methods (e.g., http://deeplearning.net/). As a neuroscientist using MR imaging methods, I often rely on tools to help me classify and define brain structures and functional activations. Some of the most advanced tools for image segmentation are being innovated using magical-sounding terms like AdaBoosted weak learners, auto-encoders, Support Vector Machines, and the like.

While I do not have the time to become a computer-science expert in artificial intelligence methods, I would like to establish a basic skill level in the application of some of these methods. Soylenters, "Do I need to know the mathematical foundation of these methods intimately to be able to employ them effectively or intelligently?" and "What would be a good way of becoming more familiar with these methods, given my circumstances?"


Original Submission

NVIDIA Research's GauGAN AI Art Demo Responds to Words 4 comments

NVIDIA Research's GauGAN AI Art Demo Responds to Words:

A picture worth a thousand words now takes just three or four words to create, thanks to GauGAN2, the latest version of NVIDIA Research's wildly popular AI painting demo.

The deep learning model behind GauGAN allows anyone to channel their imagination into photorealistic masterpieces — and it's easier than ever. Simply type a phrase like "sunset at a beach" and AI generates the scene in real time. Add an additional adjective like "sunset at a rocky beach," or swap "sunset" to "afternoon" or "rainy day" and the model, based on generative adversarial networks, instantly modifies the picture.

With the press of a button, users can generate a segmentation map, a high-level outline that shows the location of objects in the scene. From there, they can switch to drawing, tweaking the scene with rough sketches using labels like sky, tree, rock and river, allowing the smart paintbrush to incorporate these doodles into stunning images.

The new GauGAN2 text-to-image feature can now be experienced on NVIDIA AI Demos, where visitors to the site can experience AI through the latest demos from NVIDIA Research. With the versatility of text prompts and sketches, GauGAN2 lets users create and customize scenes more quickly and with finer control.

Direct link to YouTube video.

Kinda makes Turtle graphics from the 70s look rather basic. However, beware Rule 34…


Original Submission

MIT's Computer Vision (CV) Algorithm Identifies Images Down to the Pixel 7 comments

MIT's newest computer vision algorithm identifies images down to the pixel:

For humans, identifying items in a scene [...] is as simple as looking at them. But for artificial intelligence and computer vision systems, developing a high-fidelity understanding of their surroundings takes a bit more effort. Well, a lot more effort. Around 800 hours of hand-labeling training images effort, if we're being specific. To help machines better see the way people do, a team of researchers at MIT CSAIL in collaboration with Cornell University and Microsoft have developed STEGO, an algorithm able to identify images down to the individual pixel.

Normally, creating CV training data involves a human drawing boxes around specific objects within an image — say, a box around the dog sitting in a field of grass — and labeling those boxes with what's inside ("dog"), so that the AI trained on it will be able to tell the dog from the grass. STEGO (Self-supervised Transformer with Energy-based Graph Optimization), conversely, uses a technique known as semantic segmentation, which applies a class label to each pixel in the image to give the AI a more accurate view of the world around it.

Whereas a labeled box would have the object plus other items in the surrounding pixels within the boxed-in boundary, semantic segmentation labels every pixel in the object, but only the pixels that comprise the object — you get just dog pixels, not dog pixels plus some grass too. It's the machine learning equivalent of using the Smart Lasso in Photoshop versus the Rectangular Marquee tool.
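
A toy illustration of that difference on a hypothetical 6x6 image (the shapes below are made up purely for the example): the box labels every pixel inside a rectangle, object and background alike, while the segmentation mask labels only the object's own pixels.

    import numpy as np

    H, W = 6, 6

    # Bounding-box label: every pixel inside the rectangle counts as "dog",
    # including the grass around it.
    box = np.zeros((H, W), dtype=bool)
    box[1:5, 2:6] = True

    # Semantic-segmentation label: only the pixels that actually belong to the dog.
    mask = np.zeros((H, W), dtype=bool)
    mask[2:4, 3:5] = True
    mask[4, 4] = True  # e.g. the tip of the tail

    print(box.sum(), "pixels labelled by the box")    # 16
    print(mask.sum(), "pixels labelled by the mask")  # 5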

The problem with this technique is one of scope. Conventional multi-shot supervised systems often demand thousands, if not hundreds of thousands, of labeled images with which to train the algorithm. Multiply that by the 65,536 individual pixels that make up even a single 256x256 image, all of which now need to be individually labeled as well, and the workload required quickly spirals into impossibility.

Instead, "STEGO looks for similar objects that appear throughout a dataset," the CSAIL team wrote in a press release Thursday. "It then associates these similar objects together to construct a consistent view of the world across all of the images it learns from."

This discussion was created by janrinok (52) for logged-in users only, but now has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 2, Interesting) by Anonymous Coward on Monday April 10, @06:12PM (9 children)

    by Anonymous Coward on Monday April 10, @06:12PM (#1300781)

    Great, even more tools to enable fakery. Exactly what we needed.
    These people are fuckin' lemmings(*), marching blindly but briskly towards the abyss, all the while chanting "technology is morally neutral, it's not our fault (x2), what could possibly go wrong"

    (*) Apologies to all lemmings; we humans use this as a figure of speech. I realize you have a higher sense of self-preservation than those I applied this figure of speech to.

    • (Score: 2) by gznork26 on Monday April 10, @06:30PM (1 child)

      by gznork26 (1159) on Monday April 10, @06:30PM (#1300785) Homepage Journal

      Speaking of lemmings, James Thurber wrote a short story called "Interview with a Lemming" in which a scientist has a discussion with one. The scientist says he has made the study of lemmings his life's work, but there's one thing he doesn't understand: why they run off cliffs to their deaths. The lemming replies that he has made the study of humans his life's work, and he doesn't understand why we don't.

    • (Score: 1) by khallow on Monday April 10, @06:39PM

      by khallow (3766) Subscriber Badge on Monday April 10, @06:39PM (#1300787) Journal

      Great, even more tools to enable fakery. Exactly what we needed.

      Well, I doubt you can show we need more or less fakery, so it's not saying much.

    • (Score: 2) by sjames on Monday April 10, @07:40PM (4 children)

      by sjames (2882) on Monday April 10, @07:40PM (#1300796) Journal

      The jinni is already out of the bottle. My phone can remove objects from photos. StableDiffusion is pretty good at it too, and it's free software.

      The only thing we have left is to make sure our legal system gets itself up to date to understand that pictures can be altered, before it makes even more really bad, life-altering decisions based on fake evidence and testimony (especially 'testilying'). And no, you CANNOT "tell by the pixels".

      • (Score: 0) by Anonymous Coward on Monday April 10, @07:57PM (3 children)

        by Anonymous Coward on Monday April 10, @07:57PM (#1300798)

        Are you sure? I've seen this thing on the telly where you can "zoom in" and "enhance".
        I've seen it here, so clearly it must be true: https://www.youtube.com/watch?v=2aINa6tg3fo [youtube.com]

        • (Score: 3, Informative) by sjames on Monday April 10, @08:13PM (2 children)

          by sjames (2882) on Monday April 10, @08:13PM (#1300801) Journal

          That's where it gets REALLY fun in court. SD and friends CAN do crazy zooms and enhance. The catch is that they generate visually plausible images that way, which may or may not match what was actually in the photo. If the resolution of the photo was too low, the 'zoomed image' almost certainly will not reflect reality.

          • (Score: 2) by maxwell demon on Tuesday April 11, @08:35AM (1 child)

            by maxwell demon (1608) Subscriber Badge on Tuesday April 11, @08:35AM (#1300927) Journal

            What if you repeat the same “zoom in” operation several times on the original image? If the result is different each time, then clearly it was invented by the algorithm. The question is whether the reverse is also true: if, say, ten AI zooms give you the very same image, does this mean that the information is reliable, or just that the AI is consistent in its “hallucinations”?

            --
            The Tao of math: The numbers you can count are not the real numbers.
            • (Score: 2) by sjames on Tuesday April 11, @05:17PM

              by sjames (2882) on Tuesday April 11, @05:17PM (#1300971) Journal

              The latter. Given the same inputs and parameters, the output will be the same. But as a chaotic system, minor perturbations of the inputs may either make no difference or produce wildly different outputs. Unlike with algorithmic zooming and enhancement, no matter how many times the neural net has produced a substantially similar output, the next try could vary wildly.

              You'll never be justifiably more certain of the results than if you have human artists look at the photo and paint photorealistic impressions of the zoomed portion.

              From a law enforcement perspective, it may be good enough to direct investigative effort, but it would never legitimately meet standards of evidence in court.

              I just hope the courts will know that. The track record there is questionable.
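
              A minimal sketch of that point, using the Hugging Face diffusers x4 upscaler as a stand-in for an "AI zoom" (the model name, prompt, and file path are assumptions for illustration, not anything from this thread): re-running with the same seed reproduces the same pixels, while changing only the seed invents different but equally plausible detail.

                  import torch
                  from PIL import Image
                  from diffusers import StableDiffusionUpscalePipeline

                  pipe = StableDiffusionUpscalePipeline.from_pretrained(
                      "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
                  ).to("cuda")

                  low_res = Image.open("blurry_crop.png").convert("RGB")  # hypothetical low-res crop

                  def zoom(seed):
                      # Diffusion-based "enhance": the seed controls the noise the model starts from.
                      gen = torch.Generator("cuda").manual_seed(seed)
                      return pipe(prompt="a licence plate", image=low_res, generator=gen).images[0]

                  a = zoom(0)
                  b = zoom(0)  # same seed: identical to a (consistent, not necessarily truthful)
                  c = zoom(1)  # different seed: a different, equally plausible "enhancement"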

    • (Score: 4, Interesting) by maxwell demon on Tuesday April 11, @08:54AM

      by maxwell demon (1608) Subscriber Badge on Tuesday April 11, @08:54AM (#1300929) Journal

      But this in particular doesn't bring anything a human couldn't already do with Photoshop and enough time and patience. Note that I didn't even mention skill here: while skill certainly helps to get the task done in a reasonable time, it isn't strictly needed, since the only truly relevant ability, figuring out where an object ends, is already present in the average human.

      --
      The Tao of math: The numbers you can count are not the real numbers.
  • (Score: 2) by ikanreed on Monday April 10, @08:59PM (2 children)

    by ikanreed (3164) on Monday April 10, @08:59PM (#1300808) Journal

    Especially the broken LaTeX in the abstract

    We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at \href{https://segment-anything.com}{https://segment-anything.com} to foster research into foundation models for computer vision.

    Definitely speaks to high technology skills.

    • (Score: 2) by jb on Tuesday April 11, @07:56AM (1 child)

      by jb (338) on Tuesday April 11, @07:56AM (#1300922)

      Especially the broken LaTeX in the abstract

      We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at \href{https://segment-anything.com}{https://segment-anything.com} to foster research into foundation models for computer vision.

      Definitely speaks to high technology skills.

      Looks like someone went out of his way to introduce that "error".

      Surely, if the only mistake was \\href instead of \href in the source (which is the error that the above seems to be masquerading as), the two sets of braces would not have rendered at all (as they would simply define block boundaries)?

      • (Score: 3, Insightful) by maxwell demon on Tuesday April 11, @08:44AM

        by maxwell demon (1608) Subscriber Badge on Tuesday April 11, @08:44AM (#1300928) Journal

        If you look at the actual paper, then you'll see the link rendered correctly. I suspect that the web site's code simply copied everything between \begin{abstract} and \end{abstract} verbatim into the HTML, without even attempting to interpret any LaTeX inside.

        I doubt that the authors of the paper are in any way involved in the development of the web site on which the paper is presented.

        --
        The Tao of math: The numbers you can count are not the real numbers.
  • (Score: 3, Interesting) by Ingar on Tuesday April 11, @10:15AM

    by Ingar (801) on Tuesday April 11, @10:15AM (#1300937) Homepage

    Adobe's Jason Levine demoed a similar technology in Intel's CES 2020 keynote [youtube.com].

  • (Score: 1, Touché) by Anonymous Coward on Tuesday April 11, @05:57PM

    by Anonymous Coward on Tuesday April 11, @05:57PM (#1300977)

    ...as their segmentation algorithms become obsolete. Or, more likely, the "new AI model" does an okay job sometimes and not other times, just like all the others.
