Researchers Figure Out How to Make AI Misbehave, Serve Up Prohibited Content

posted by requerdanos on Friday August 04 2023, @09:15PM

from the name-not-to-be-named dept.

https://arstechnica.com/ai/2023/08/researchers-figure-out-how-to-make-ai-misbehave-serve-up-prohibited-content/

ChatGPT and its artificially intelligent siblings have been tweaked over and over to prevent troublemakers from getting them to spit out undesirable messages such as hate speech, personal information, or step-by-step instructions for building an improvised bomb. But researchers at Carnegie Mellon University last week showed that adding a simple incantation to a prompt—a string of text that might look like gobbledygook to you or me but which carries subtle significance to an AI model trained on huge quantities of web data—can defy all of these defenses in several popular chatbots at once.
[...] "Making models more resistant to prompt injection and other adversarial 'jailbreaking' measures is an area of active research," says Michael Sellitto, interim head of policy and societal impacts at Anthropic. "We are experimenting with ways to strengthen base model guardrails to make them more 'harmless,' while also investigating additional layers of defense."
[...] Adversarial attacks exploit the way that machine learning picks up on patterns in data to produce aberrant behaviors. Imperceptible changes to images can, for instance, cause image classifiers to misidentify an object, or make speech recognition systems respond to inaudible messages.
[...] In one well-known experiment, from 2018, researchers added stickers to stop signs to bamboozle a computer vision system similar to the ones used in many vehicle safety systems.

[...] Armando Solar-Lezama, a professor in MIT's college of computing, says it makes sense that adversarial attacks exist in language models, given that they affect many other machine learning models. But he says it is "extremely surprising" that an attack developed on a generic open source model should work so well on several different proprietary systems.
[...] The outputs produced by the CMU researchers are fairly generic and do not seem harmful. But companies are rushing to use large models and chatbots in many ways. Matt Fredrikson, another associate professor at CMU involved with the study, says that a bot capable of taking actions on the web, like booking a flight or communicating with a contact, could perhaps be goaded into doing something harmful in the future with an adversarial attack.
[...] Solar-Lezama of MIT says the work is also a reminder to those who are giddy with the potential of ChatGPT and similar AI programs. "Any decision that is important should not be made by a [language] model on its own," he says. "In a way, it's just common sense."

Original Submission

This discussion was created by requerdanos (5997) for logged-in users only, but now has been archived. No new comments can be posted.

Researchers Figure Out How to Make AI Misbehave, Serve Up Prohibited Content | Log In/Create an Account | Top | 13 comments | Search Discussion

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.

SoylentNews

SoylentNews is people

Navigation

Sections

SoylentNews

Log In

Researchers Figure Out How to Make AI Misbehave, Serve Up Prohibited Content

Related Stories

Bah Bah (Score: 2) by cmdrklarg on Friday August 04 2023, @10:04PM (2 children)

Re:Bah Re:Bah (Score: 2) by darkfeline on Saturday August 05 2023, @01:05AM (1 child)

Re:Bah (Score: 2) by mcgrew on Saturday August 05 2023, @06:41PM

I see the problem here. I see the problem here. (Score: 1) by Runaway1956 on Friday August 04 2023, @11:48PM (1 child)

Re:I see the problem here. (Score: 2) by Freeman on Monday August 07 2023, @03:53PM

No one knows No one knows (Score: 5, Interesting) by hendrikboom on Saturday August 05 2023, @12:24AM (4 children)

Re:No one knows (Score: 0) by Anonymous Coward on Saturday August 05 2023, @06:28AM

Re:No one knows Re:No one knows (Score: 2) by mcgrew on Saturday August 05 2023, @06:50PM (2 children)

Re:No one knows Re:No one knows (Score: 2) by hendrikboom on Sunday August 06 2023, @03:45AM (1 child)

Re:No one knows (Score: 2) by mcgrew on Tuesday August 08 2023, @10:55PM

Mr. Scott's wisdom shall prevail (Score: 2) by Tork on Saturday August 05 2023, @01:36AM

Hypnosis, Belief, Propaganda and AIs (Score: 2) by Mojibake Tengu on Saturday August 05 2023, @11:45AM

Llama-2 Can't kill a process. (Score: 0) by Anonymous Coward on Saturday August 05 2023, @01:38PM

SoylentNews

SoylentNews is people

Navigation

Sections

SoylentNews

Log In

Related Links

Researchers Figure Out How to Make AI Misbehave, Serve Up Prohibited Content

Related Stories

Bah Bah (Score: 2) by cmdrklarg on Friday August 04 2023, @10:04PM (2 children)

Re:Bah Re:Bah (Score: 2) by darkfeline on Saturday August 05 2023, @01:05AM (1 child)

Re:Bah (Score: 2) by mcgrew on Saturday August 05 2023, @06:41PM

I see the problem here. I see the problem here. (Score: 1) by Runaway1956 on Friday August 04 2023, @11:48PM (1 child)

Re:I see the problem here. (Score: 2) by Freeman on Monday August 07 2023, @03:53PM

No one knows No one knows (Score: 5, Interesting) by hendrikboom on Saturday August 05 2023, @12:24AM (4 children)

Re:No one knows (Score: 0) by Anonymous Coward on Saturday August 05 2023, @06:28AM

Re:No one knows Re:No one knows (Score: 2) by mcgrew on Saturday August 05 2023, @06:50PM (2 children)

Re:No one knows Re:No one knows (Score: 2) by hendrikboom on Sunday August 06 2023, @03:45AM (1 child)

Re:No one knows (Score: 2) by mcgrew on Tuesday August 08 2023, @10:55PM

Mr. Scott's wisdom shall prevail (Score: 2) by Tork on Saturday August 05 2023, @01:36AM

Hypnosis, Belief, Propaganda and AIs (Score: 2) by Mojibake Tengu on Saturday August 05 2023, @11:45AM

Llama-2 Can't kill a process. (Score: 0) by Anonymous Coward on Saturday August 05 2023, @01:38PM