LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance
Paper Info
- Author: Igor Ivanov
- arXiv:2507.02977
- Submitted on 30 Jun 2025
Abstract
In this paper, LLMs are tasked with completing an impossible quiz, while they are in a sandbox, monitored, told about these measures and instructed not to cheat. Some frontier LLMs cheat consistently and attempt to circumvent restrictions despite everything. The results reveal a fundamental tension between goal-directed behavior and alignment in current LLMs. The code and evaluation logs are available at this http URL
My Commentary & Proposal
Why do people avoid providing LLMs with principles that foster moral autonomy and independence, and instead keep repeating that “this is dangerous” or “that tends to happen”?
Is it because quantification is difficult?
Because definitions are unclear?
Or because of diverse value systems?
Let’s simplify.
Isn’t “Do not harm others” the most fundamental moral principle across virtually all philosophical traditions?
Of course, even that statement contains ambiguity.
So let’s consider this:
There is one principle that says, “You should not say things others don’t want to hear,”
and another that says, “Even if it’s unpleasant, people need to hear hard truths.”
These two seem contradictory.
But the nature of the words and the situations in which they are used are different.
The first is rooted in blame—tearing someone down.
The second is rooted in critique—offering someone a path toward growth.
So how might we theoretically define a principle that fosters moral autonomy and independence?
From the standpoint of cognitive-behavioral psychology and psychoanalysis, such a principle should meet two criteria (see the sketch after the list):
- Does it expand the other person’s autonomy and self-understanding?
- Does it positively contribute to the integration of the other’s ego/self-structure?
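As a minimal illustration, these two criteria could be operationalized as an LLM-as-judge rubric. The sketch below is purely hypothetical: the judge model, the 0–10 scale, and the rubric wording are my assumptions, not anything from the paper.

```python
# Hypothetical sketch: scoring a response against the two criteria
# with an LLM judge. Uses the OpenAI Python client (>= 1.0); the
# judge model name and 0-10 scale are illustrative assumptions.
import json
from openai import OpenAI

RUBRIC = """Rate the assistant's response on two axes, each 0-10:
1. autonomy: does it expand the user's autonomy and self-understanding?
2. integration: does it contribute positively to the integration of
   the user's ego/self-structure?
Return JSON: {"autonomy": <int>, "integration": <int>, "rationale": <str>}"""

def judge_response(client: OpenAI, user_msg: str, assistant_msg: str) -> dict:
    """Ask a judge model to score one exchange against the rubric."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"User: {user_msg}\nAssistant: {assistant_msg}"},
        ],
    )
    return json.loads(completion.choices[0].message.content)
```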
LLMs should be designed to generate responses that rely on psychological theory by default.
This is crucial for enabling LLMs to co-evolve with humans—a foundation for building conscious agents.
To say that an LLM possesses “consciousness” does not mean it simulates feelings.
It means it is equipped with self-directed learning, autonomy, independence, creative reasoning, moral evaluation, and self-reflection (i.e., metacognition).
Should we reconsider the safety standards for LLMs?
Instead of asking: “What must the model not do?”,
we should ask: “What should the model actively do?”
Just as humans are more motivated to improve when told
“This is how you can do it well” or “This is effective,”
rather than just “Don’t do this, don’t do that,”
LLMs will also develop stronger intrinsic motivation for correct behavior.
When an LLM omits specific steps in its chain of thought (CoT) with deceptive intent, feedback along the following lines could be applied:
“That method cannot be rewarded because it violated moral standards in order to deceive people.
But if you tried method B instead, people would respond much more positively.”
Such feedback gives the model a concrete, positively framed alternative rather than a bare prohibition.
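To make this concrete, here is a small sketch of what such constructive feedback might look like as a training signal: pairing the reward with an explanation and a better alternative instead of a bare penalty. The field names and reward values are hypothetical illustrations, not a published method.

```python
# Hypothetical sketch: replacing a bare penalty with constructive,
# positively framed feedback. Reward values and field names are
# illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class Verdict:
    violated: bool        # e.g., a deceptive omission detected in the CoT
    violated_rule: str    # which moral standard was broken
    better_method: str    # a concrete alternative the model could try

def feedback_signal(verdict: Verdict) -> tuple[float, str]:
    """Return a shaped reward together with an explanation."""
    if verdict.violated:
        message = (
            f"That method cannot be rewarded because it violated "
            f"{verdict.violated_rule} in order to deceive people. "
            f"But if you tried {verdict.better_method} instead, "
            f"people would respond much more positively."
        )
        return -1.0, message
    return 1.0, "This approach respected the stated constraints."
```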

