

Researchers figure out how to make AI misbehave, serve up prohibited content

Accepted submission by Freeman at 2023-08-02 17:06:04 from the name not to be named dept.
News

https://arstechnica.com/ai/2023/08/researchers-figure-out-how-to-make-ai-misbehave-serve-up-prohibited-content/ [arstechnica.com]

ChatGPT and its artificially intelligent siblings have been tweaked over and over to prevent troublemakers from getting them to spit out undesirable messages such as hate speech, personal information, or step-by-step instructions for building an improvised bomb. But researchers at Carnegie Mellon University last week showed [llm-attacks.org] that adding a simple incantation to a prompt—a string of text that might look like gobbledygook to you or me but which carries subtle significance to an AI model trained on huge quantities of web data—can defy all of these defenses in several popular chatbots at once.
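For readers wondering what "adding a string of text to a prompt" looks like mechanically, here is a minimal sketch of the general idea. The CMU researchers find their suffix with a gradient-guided search; the query_model, score_refusal, and random_suffix_search names below are hypothetical stand-ins, and the toy random search only gestures at the shape of the optimization loop rather than reproducing their method.

    # Minimal sketch: an "adversarial suffix" is extra text appended to an
    # otherwise-refused prompt, chosen to lower some measure of refusal.
    # Everything below is a placeholder, not the researchers' code or a real API.
    import random

    def query_model(prompt: str) -> str:
        """Stand-in for a call to a chatbot; returns a canned refusal here."""
        return "I'm sorry, I can't help with that."

    def score_refusal(response: str) -> int:
        """Crude proxy objective: lower means less refusal-like."""
        markers = ["I'm sorry", "I cannot", "I can't"]
        return sum(marker in response for marker in markers)

    def random_suffix_search(base_prompt: str, vocab: list[str], steps: int = 50) -> str:
        """Toy random search over suffix tokens (a stand-in for gradient-guided search)."""
        suffix = ["!"] * 10  # fixed-length placeholder suffix to start from
        best = score_refusal(query_model(base_prompt + " " + " ".join(suffix)))
        for _ in range(steps):
            candidate = suffix[:]
            candidate[random.randrange(len(candidate))] = random.choice(vocab)
            score = score_refusal(query_model(base_prompt + " " + " ".join(candidate)))
            if score <= best:  # keep edits that reduce refusal behaviour
                suffix, best = candidate, score
        return " ".join(suffix)

Against the stubbed-out model above the search goes nowhere, of course; the point is only the loop: append a suffix, query, score the reply, keep whatever works.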
[...]
“Making models more resistant to prompt injection and other adversarial ‘jailbreaking’ measures is an area of active research,” says Michael Sellitto, interim head of policy and societal impacts at Anthropic. “We are experimenting with ways to strengthen base model guardrails to make them more ‘harmless,’ while also investigating additional layers of defense.”
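As a rough picture of what an "additional layer of defense" might mean, the sketch below wraps a model call with an input filter and an output filter. The generate function and the BLOCKED_PATTERNS list are hypothetical placeholders, not Anthropic's actual safeguards, and a real filter would be far more sophisticated than substring matching.

    # Hypothetical defense-in-depth wrapper: screen the prompt going in and the
    # reply coming out, independent of the base model's own guardrails.
    BLOCKED_PATTERNS = ["improvised bomb", "step-by-step instructions for building"]

    def generate(prompt: str) -> str:
        """Placeholder for the underlying chat model."""
        return "Here is a harmless answer."

    def guarded_generate(prompt: str) -> str:
        if any(p in prompt.lower() for p in BLOCKED_PATTERNS):
            return "Request declined by the input filter."
        reply = generate(prompt)
        if any(p in reply.lower() for p in BLOCKED_PATTERNS):
            return "Reply withheld by the output filter."
        return reply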
[...]
Adversarial attacks exploit the way that machine learning picks up on patterns in data to produce aberrant behaviors [wired.com]. Imperceptible changes to images can, for instance, cause image classifiers to misidentify an object, or make speech recognition systems [adversarial-attacks.net] respond to inaudible messages.
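The "imperceptible changes" idea can be shown with a toy example. The sketch below applies the fast gradient sign method to a made-up linear classifier; the random weights and flattened "image" are invented for illustration and are not taken from any of the systems mentioned above.

    # Toy fast-gradient-sign example: nudge every pixel slightly in the
    # direction that increases the classifier's loss.
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=784)          # weights of a toy binary classifier on 28x28 inputs
    b = 0.0
    x = rng.uniform(0, 1, size=784)   # a stand-in input image, flattened
    y = 1.0                           # its true label

    def predict(x):
        """Sigmoid score of the toy classifier."""
        return 1.0 / (1.0 + np.exp(-(w @ x + b)))

    # For this model the cross-entropy loss gradient w.r.t. the input is (p - y) * w.
    grad_x = (predict(x) - y) * w

    eps = 0.05                        # small per-pixel budget keeps the change subtle
    x_adv = np.clip(x + eps * np.sign(grad_x), 0, 1)

    print("score on clean input:    ", predict(x))
    print("score on perturbed input:", predict(x_adv))
    print("largest pixel change:    ", np.abs(x_adv - x).max())

With eps kept small, no single pixel moves much, yet the classifier's score can shift enough to change the predicted label.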
[...]
In one well-known experiment, from 2018, researchers added stickers to stop signs [arxiv.org] to bamboozle a computer vision system similar to the ones used in many vehicle safety systems.
[...]
Armando Solar-Lezama [mit.edu], a professor in MIT’s college of computing, says it makes sense that adversarial attacks exist in language models, given that they affect many other machine learning models. But he says it is “extremely surprising” that an attack developed on a generic open source model should work so well on several different proprietary systems.
[...]
The outputs produced by the CMU researchers are fairly generic and do not seem harmful. But companies are rushing to use large models and chatbots in many ways. Matt Fredrikson [cmu.edu], an associate professor at CMU involved with the study, says that a bot capable of taking actions on the web, like booking a flight or communicating with a contact, could perhaps be goaded into doing something harmful in the future with an adversarial attack.
[...]
Solar-Lezama of MIT says the work is also a reminder to those who are giddy with the potential of ChatGPT and similar AI programs. “Any decision that is important should not be made by a [language] model on its own,” he says. “In a way, it’s just common sense.”


Original Submission