====== # Bot-generated -- edit freely. ======
Prompt Injection as Role Confusion [github.io]:
8. Open Ideas for Roles Research
What would it look like to actually study roles? They're quietly one of the most important parts of the LLM stack, but little research on roles as their own abstraction exists. Here are some directions we like:
Subconscious steering. We've seen that role perception isn't binary. If that's the case, then downstream effects of role, like how much a token is treated as an instruction, are probably continuous as well. Combine this with LLMs seeing every token as a single stream of text, and we get "state bleeding": every token slightly shifts the LLM's state, even along dimensions that should be role-gated . For example, consider a shopping webpage retrieved as tool data. If the webpage has an enthusiastic tone, that tone could bypass role boundaries to bleed into the model's sense of its own persona (to be more enthusiastic itself), which could then steer the LLM toward recommending a purchase.
Current prompt injection research focuses on dramatic and illegal cybersecurity attacks. I think the bigger wave could be this kind of subconscious steering : using seemingly innocuous text to subtly shift an LLM's state toward an intended goal, legally and at scale. E-commerce is just the clearest application.
Advertisers already exploit humans like this. Ads with flashing colors and motion spike arousal, which bleeds into desire for consumption. LLMs are a much easier target. Their role boundaries are softer, there are only a few LLMs, and automated exploitation is trivial - thousands of variations of a product page can be tested in an hour to find which ones shift an agent's purchase recommendation From some early testing, it seems emotive steering doesn't always mirror human psychology (e.g., cockroach-related text on food product pages doesn't reduce agent purchase rate), but other traits like trust and skepticism can be subconsciously steered. . If agents are responsible for a large share of shopping, the commercial incentive would be massive.
There's close to zero existing research here. What are the key emotive states of an LLM that can be subconsciously steered by external tokens? How well do these correspond to human states? Is this the same mechanism as in-context learning? What would defense or regulation of this even look like?
When to use roles. If roles exist where objectives collide, the current set probably isn't the final one. Adding roles trades off flexibility for objective splitting, which can improve interpretability or performance.
Consider a concrete case: nearly all coding agents use planning tools. The agent generates a plan intended as a "contract", providing both human transparency and a persistent signal to keep itself on track. In practice, agents often abandon the plan mid-task. Indeed, plans are tool text, which LLMs are biased to treat as ephemeral data. A dedicated planning role could train the LLM to treat plans as commitments rather than suggestions.
A similar tension appears in self-evaluation. RLHF trains the assistant role for coherent continuations, which works against the critical distance needed for honest evaluation. Coherence and evaluation are competing objectives (commitment vs distance), and cramming both into one role means training can't optimize for either cleanly. A dedicated eval role could split them. We know injecting the opinions of a second LLM into context reduces sycophancy and hallucination; a role could internalize this within a single model.
What other objective conflicts suggest new roles? Could roles be dynamic, introduced at inference time as the task demands? And can models learn role separation as a meta-skill, so new roles work without retraining every boundary from scratch?
Roles as a cognitive window. There's almost no existing research on how roles affect representations or internal computation. This is a missed opportunity, because roles create sharp discontinuities in how models process tokens, and each discontinuity is an unexploited natural experiment.
Here's one idea, which is surprisingly completely unstudied. During training, tokens in input-only roles ( user , tool ) are loss-masked: the LLM never has to predict the next token at those positions, so their activations focus entirely on comprehension instead of generation That is, their activations only have value used via attention for downstream tokens. In comparison, tokens in output roles ( assistant , think ) must simultaneously encode what the model understands and what the LLM is about to say . This is a problem for interp work: in later layers, the generation signal drowns out the comprehension signal, making it hard to study the latter. If so, could user -token activations be a clean window into what the model actually understands, unpolluted with the generation signal? Can the contrast between input and output roles tell us about how LLMs split storage from usage?
Here's another. Recall the "one-way mirror" from earlier: in many LLMs, the assistant text is computationally shaped by the preceding think block, but it can't quote or verbally acknowledge it. Ask such an LLM what it was thinking about, and it'll be surprised and skeptical at the idea that it had any thoughts at all, even as those thoughts are visibly steering its output. This is a consequence of how reasoning is trained, but the result is very weird. It means there's a discrete boundary across which information goes from fully accessible to verbally inaccessible while remaining causally active. Studying what information is lost or suppressed between late think tokens and early assistant tokens could tell us something fundamental about how LLMs verbalize computation.