SoylentNews is people

posted by hubie on Friday January 31, @03:55AM
from the I'm-sorry-I-can't-do-that-Dave dept.

New research "Computer-Using Agent" AI model can perform multi-step tasks through a web browser:

On Thursday, OpenAI released a research preview of "Operator," a web automation tool that uses a new AI model called Computer-Using Agent (CUA) to control computers through a visual interface. The system performs tasks by viewing and interacting with on-screen elements like buttons and text fields similar to how a human would.

Operator is available today for subscribers of the $200-per-month ChatGPT Pro plan at operator.chatgpt.com. The company plans to expand to Plus, Team, and Enterprise users later. OpenAI intends to integrate these capabilities directly into ChatGPT and later release CUA through its API for developers.

Operator watches on-screen content while you use your computer and executes tasks through simulated keyboard and mouse inputs. The Computer-Using Agent processes screenshots to understand the computer's state and then makes decisions about clicking, typing, and scrolling based on its observations.

OpenAI's release follows other tech companies as they push into what are often called "agentic" AI systems, which can take actions on a user's behalf. Google announced Project Mariner in December 2024, which performs automated tasks through the Chrome browser, and two months earlier, in October 2024, Anthropic launched a web automation tool called "Computer Use" focused on developers that can control a user's mouse cursor and take actions on a computer.

"The Operator interface looks very similar to Anthropic's Claude Computer Use demo from October," wrote AI researcher Simon Willison on his blog, "even down to the interface with a chat panel on the left and a visible interface being interacted with on the right."

To use your PC like you would, the Computer-Using Agent works in multiple steps. First, it captures screenshots to monitor your screen, then analyzes those images (using GPT-4o's vision capabilities with additional reinforcement learning) to process raw pixel data. Next, it determines what actions to take and then performs virtual inputs to control the computer. This iterative loop design reportedly lets the system recover from errors and handle complex tasks across different applications.
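The observe, decide, act loop described above can be sketched in a few lines. This is only a toy illustration, not OpenAI's implementation: `FakeScreen`, `decide`, and `run_agent` are hypothetical names, the "screenshot" is a plain dictionary rather than raw pixels, and the decision step is a hand-written rule standing in for the vision model.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    target: str = ""   # name of the on-screen element
    text: str = ""     # text to enter for "type" actions


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a display: a form with two fields and a button."""
    fields: dict = field(default_factory=lambda: {"username": "", "password": ""})
    submitted: bool = False

    def screenshot(self) -> dict:
        # A real agent would get raw pixels; this toy returns a state snapshot.
        return {"fields": dict(self.fields), "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        # Simulated keyboard and mouse input.
        if action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def decide(observation: dict, goal: dict) -> Action:
    """Hypothetical policy; this is where the vision model would sit."""
    if observation["submitted"]:
        return Action("done")
    for name, wanted in goal.items():
        if observation["fields"].get(name) != wanted:
            return Action("type", target=name, text=wanted)
    return Action("click", target="submit")


def run_agent(screen: FakeScreen, goal: dict, max_steps: int = 10) -> list:
    """Repeat observe -> decide -> act until done or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        observation = screen.screenshot()
        action = decide(observation, goal)
        if action.kind == "done":
            break
        screen.apply(action)
        history.append(action)
    return history
```

Because the loop takes a fresh observation before every action, a field holding the wrong value is simply typed over on the next pass, which is one plausible reading of how such a design "recovers from errors."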

While it's working, Operator shows a miniature browser window of its actions.

However, the technology behind Operator is still relatively new and far from perfect. The model reportedly performs best at repetitive web tasks like creating shopping lists or playlists. It struggles more with unfamiliar interfaces like tables and calendars, and does poorly with complex text editing (with a 40 percent success rate), according to OpenAI's internal testing data.

[...] For any AI model that can see how you operate your computer and even control some aspects of it, privacy and safety are very important. OpenAI says it built multiple safety controls into Operator, requiring user confirmation before completing sensitive actions like sending emails or making purchases. Operator also has limits on what it can browse, set by OpenAI. It cannot access certain website categories, including gambling and adult content.
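As a rough illustration of the two controls mentioned above (a confirmation step for sensitive actions and a category blocklist), here is a minimal sketch. The function name `gate`, the action labels, and the category labels are all assumptions made for the example; the article does not describe how OpenAI actually implements these checks.

```python
# Hypothetical labels; the real category and action lists are not public.
BLOCKED_CATEGORIES = {"gambling", "adult"}
SENSITIVE_ACTIONS = {"send_email", "purchase"}


def gate(action: str, site_category: str, confirm) -> str:
    """Return "blocked", "declined", or "allowed" for a proposed action.

    `confirm` is a callable that asks the user and returns True or False.
    """
    if site_category in BLOCKED_CATEGORIES:
        return "blocked"    # the site is off-limits regardless of the action
    if action in SENSITIVE_ACTIONS and not confirm(action):
        return "declined"   # the user refused the sensitive action
    return "allowed"
```

In this sketch the blocklist is checked first, so a user confirmation can never override it, and ordinary actions such as scrolling never trigger a prompt.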

Traditionally, AI models based on large language model-style Transformer technology like Operator have been relatively easy to fool with jailbreaks and prompt injections.

To catch attempts at subverting Operator, which might hypothetically be embedded in websites that the AI model browses, OpenAI says it has implemented real-time moderation and detection systems. OpenAI reports the system recognized all but one case of prompt injection attempts during an early internal red-teaming session.

However, Willison, who frequently covers AI security issues, isn't convinced Operator can stay secure, especially as new threats emerge. "Color me skeptical," he wrote in his blog post. "I imagine we'll see all kinds of novel successful prompt injection style attacks against this model once the rest of the world starts to explore it."

What could possibly go wrong?

Also at: https://gizmodo.com/openai-reportedly-launching-operator-that-can-control-your-computer-this-week-2000553513


  • (Score: 4, Informative) by Rosco P. Coltrane on Friday January 31, @04:11AM (4 children)

    by Rosco P. Coltrane (4757) on Friday January 31, @04:11AM (#1391065)

    > it captures screenshots to monitor your screen

    Hard no. [techtarget.com]

    > then analyzes those images (using GPT-4o's vision capabilities with additional reinforcement learning) to process raw pixel data

    Hard no. [congress.gov]

    > Next, it determines what actions to take and then performs virtual inputs to control the computer.

    Hard no. [puri.sm]

    • (Score: 0) by Anonymous Coward on Friday January 31, @01:13PM (3 children)

      by Anonymous Coward on Friday January 31, @01:13PM (#1391088)

      I concur with your hard no comments and won't be using this.

      However, I have a use case begging for this kind of automation. A customer (Fortune 50 company) insists I use MS-Teams Chat for group communication. They have two factor login to Teams and the first factor (a code to my email) takes between a minute and two to arrive. And their IT security logs me out at least once a day. The second code usually shows up in a few seconds. But the interminable wait for that first factor to appear would be really nice to hand off to some automation.

      • (Score: 2) by aafcac on Friday January 31, @04:12PM (2 children)

        by aafcac (17646) on Friday January 31, @04:12PM (#1391108)

        Can't Power Automate Desktop or any of the other existing automation software packages do that?

        • (Score: 0) by Anonymous Coward on Friday January 31, @07:41PM (1 child)

          by Anonymous Coward on Friday January 31, @07:41PM (#1391124)

          > Can't power automate Desktop or ... do that?

          What???!! Are you suggesting I install still _more_ MS crap[*] to fight the MS-Teams crap I have to deal with already(grin)?

          Logging in to this corporate-IT-controlled version of MS-Teams Chat is problematic. Sometimes it works normally (with the annoying lag in receiving the code number), but frequently it refuses to let me log in. Often it says that my account at "username@email" doesn't exist. In that case I have to try various other ways to beat it into submission. Sometimes it's as simple as waiting a few more minutes for some database(?) to be updated...and maybe an "AI" assistant could learn those tricks over time.

          [*] I think you are referring to this product? https://learn.microsoft.com/en-us/power-automate/ [microsoft.com]

          • (Score: 2) by aafcac on Sunday February 02, @12:19AM

            by aafcac (17646) on Sunday February 02, @12:19AM (#1391226)

            Yes, and it would do what you're talking about. Otherwise, there are a bunch of other software programs that will do the same thing, just with a bit more setup work than doing it yourself. But if you're using software to do this for you, you've pretty much completely given up on any real security.

            It's a bit of a moot point as Copilot is coming, and either you'll be switching or it will be so much worse than this.

  • (Score: 2, Touché) by Anonymous Coward on Friday January 31, @04:45AM

    by Anonymous Coward on Friday January 31, @04:45AM (#1391067)

    ...that _nothing_ will go wrong...

  • (Score: 5, Touché) by Mojibake Tengu on Friday January 31, @05:52AM (1 child)

    by Mojibake Tengu (8598) on Friday January 31, @05:52AM (#1391070) Journal

    Next what happens inevitably is: established social networks flooded by synthetic personae using genuine browsers and genuine email clients for authentication.

    --
    Rust programming language offends both my Intelligence and my Spirit.
  • (Score: 5, Insightful) by KritonK on Friday January 31, @06:10AM (2 children)

    by KritonK (465) on Friday January 31, @06:10AM (#1391072)

    You: Log me in at our main server.

    CUA: You got it!
    (Types: "ssh server.example.com")

    CUA: It asks for your password.

    You: It's 1234

    CUA: Got it! It reminds me of the combination on my luggage!
    (Types in the password and logs in.)

    You: Show me a list of all files.

    CUA: Right!
    (Types: "sudo rm -rf /")

    You: AAARGH! (Need I say more? It already knows your password.)

    CUA: I'm sorry. I'm only a large language model and I'm still learning. I thought you said *delete* all files.

    • (Score: 2) by HiThere on Friday January 31, @05:17PM

      by HiThere (866) on Friday January 31, @05:17PM (#1391114) Journal

      Well, that PARTICULAR misunderstanding is a bit unlikely, but there are far too many that are much more likely. Often ones that won't be noticed until after they're embedded in all recent backups.

      --
      Javascript is what you use to allow unknown third parties to run software you have no idea about on your computer.
    • (Score: 2) by Gaaark on Friday January 31, @09:28PM

      by Gaaark (41) on Friday January 31, @09:28PM (#1391135) Journal

      Heh: reminds me of "Dear Aunt, let's set so double the killer delete select all" from Microsoft speech recognition.

      --
      --- Please remind me if I haven't been civil to you: I'm channeling MDC. ---Gaaark 2.0 ---
  • (Score: 1) by Chromium_One on Friday January 31, @07:24PM

    by Chromium_One (4574) on Friday January 31, @07:24PM (#1391122)

    Reading that blurb gave me a case of the screaming heebie-jeebies. A tool like this could be invaluable ... if completely under local control. Otherwise? fuhgeddaboudit.

    --
    When you live in a sick society, everything you do is wrong.