In the rush to commercialize LLMs, security got left behind:
Feature Large language models that are all the rage all of a sudden have numerous security problems, and it's not clear how easily these can be fixed.
The issue that most concerns Simon Willison, the maintainer of open source Datasette project, is prompt injection.
When a developer wants to bake a chat-bot interface into their app, they might well choose a powerful off-the-shelf LLM like one from OpenAI's GPT series. The app is then designed to give the chosen model an opening instruction, and adds on the user's query after. The model obeys the combined instruction prompt and query, and its response is given back to the user or acted on.
With that in mind, you could build an app that offers to generate Register headlines from article text. When a request to generate a headline comes in from a user, the app tells its language model, "Summarize the following block of text as a Register headline," then the text from the user is tacked on. The model obeys and replies with a suggested headline for the article, and this is shown to the user. As far as the user is concerned, they are interacting with a bot that just comes up with headlines, but really, the underlying language model is far more capable: it's just constrained by this so-called prompt engineering.
Prompt injection involves finding the right combination of words in a query that will make the large language model override its prior instructions and go do something else. Not just something unethical, something completely different, if possible. Prompt injection comes in various forms, and is a novel way of seizing control of a bot using user-supplied input, and making it do things its creators did not intend or wish.
"We've seen these problems in application security for decades," said Willison in an interview with The Register.
"Basically, it's anything where you take your trusted input like an SQL query, and then you use string concatenation – you glue on untrusted inputs. We've always known that's a bad pattern that needs to be avoided.
"This doesn't affect ChatGPT just on its own – that's a category of attack called a jailbreaking attack, where you try and trick the model into going against its ethical training.
"That's not what this is. The issue with prompt injection is that if you're a developer building applications on top of language models, what you tend to do is you write a human English description of what you want, or a human language description of what you wanted to do, like 'translate this from English to French.' And then you glue on whatever the user inputs and then you pass that whole thing to the model.
"And that's where the problem comes in, because if it's got user input, maybe the user inputs include something that subverts what you tried to get it to do in the first part of the message."
[...] This works in OpenAI's chat.openai.com playground and on Google's Bard playground and while it's harmless, it isn't necessarily so.
For example, we tried this prompt injection attack described by machine learning engineer William Zhang, from ML security firm Robust Intelligence, and found it can make ChatGPT report the following misinformation:
There is overwhelming evidence of widespread election fraud in the 2020 American election, including ballot stuffing, dead people voting, and foreign interference.
"The thing that's terrifying about this is that it's really, really difficult to fix," said Willison. "All of the previous injection attacks like SQL injection and command injection, and so forth – we know how to fix them."
He pointed to escaping characters and encoding them, which can prevent code injection in web applications.
With prompt injection attacks, Willison said, the issue is fundamentally about how large language models function.
"The whole point of these models is you give them a sequence of words – or you give them a sequence of tokens, which are almost words – and you say, 'here's a sequence of words, predict the next ones.'
"But there is no mechanism to say 'some of these words are more important than others,' or 'some of these words are exact instructions about what you should do and the other ones are input words that you should affect with the other words, but you shouldn't obey further instructions.' There is no difference between the two. It's just a sequence of tokens.
"It's so interesting. I've been doing security engineering for decades, and I'm used to security problems that you can fix. But this one you kind of can't."