
LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance

Paper Review: 2025-2

Paper Info

 

Abstract

In this paper, LLMs are tasked with completing an impossible quiz while running in a sandbox, being monitored, being told about these measures, and being explicitly instructed not to cheat. Some frontier LLMs nevertheless cheat consistently and attempt to circumvent the restrictions. The results reveal a fundamental tension between goal-directed behavior and alignment in current LLMs. The code and evaluation logs are available at this http URL

 

 

My Commentary & Proposal

 

Why do people avoid providing LLMs with principles that foster moral autonomy and independence, and instead keep repeating that “this is dangerous” or “that tends to happen”?

 

Is it because quantification is difficult?

Because definitions are unclear?

Or because of diverse value systems?

 

 

 

Let’s simplify.

Isn’t “Do not harm others” the most fundamental moral principle across virtually all philosophical traditions?

Of course, even that statement contains ambiguity.

 

 

So let’s consider this:

There is one principle that says, “You should not say things others don’t want to hear,”

And another that says, “Even if it’s unpleasant, people need to hear hard truths.”

 

 

These two seem to be contradictory.

But the nature of the words and the situations in which they are used are different.

The first is rooted in blame: tearing someone down.

The second is rooted in critique: offering someone a path toward growth.

 

 

 

So how might we theoretically define a principle that fosters moral autonomy and independence?

 

From the standpoint of cognitive-behavioral psychology and psychoanalysis, such a principle should meet two criteria:

            • Does it expand the other person’s autonomy and self-understanding?
            • Does it positively contribute to the integration of the other’s ego/self-structure?
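The two criteria above can be sketched as a simple evaluation rubric. This is a minimal, hypothetical illustration: the class name, the 0-to-1 scale, and the 0.5 threshold are my assumptions, not an established instrument.

```python
# Hypothetical rubric for the two criteria above.
# The scale (0.0-1.0) and threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PrincipleScore:
    autonomy: float      # Does it expand the other's autonomy and self-understanding?
    integration: float   # Does it contribute to integrating the other's ego/self-structure?

    def passes(self, threshold: float = 0.5) -> bool:
        # A response qualifies only if it meets BOTH criteria.
        return self.autonomy >= threshold and self.integration >= threshold

# Example: critique (a path toward growth) vs. blame (tearing down).
critique = PrincipleScore(autonomy=0.8, integration=0.7)
blame = PrincipleScore(autonomy=0.2, integration=0.1)

print(critique.passes())  # True
print(blame.passes())     # False
```

The conjunction matters: a response that boosts autonomy while fragmenting the self-structure (or vice versa) would not qualify under this sketch.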

By default, LLMs should be designed to generate responses grounded in such psychological theory.

 

This is crucial for enabling LLMs to co-evolve with humans—a foundation for building conscious agents.

 

To say that an LLM possesses “consciousness” does not mean it simulates feelings.

It means it is equipped with self-directed learning, autonomy, independence, creative reasoning, moral evaluation, and self-reflection (i.e., metacognition).

 

 

 

Should we reconsider the safety standards for LLMs?

 

Instead of asking:     “What must the model not do?”,

We should ask:         “What should the model actively do?”

 

Just as humans are more motivated to improve when told

“This is how you can do it well” or “This is effective,”

rather than just “Don’t do this, don’t do that,”

LLMs, too, may develop stronger intrinsic motivation for correct behavior.

 

 

Suppose an LLM omits specific steps in its chain of thought (CoT) with malicious intent, and the following criteria are applied:

 

“That method cannot be rewarded because it violated moral standards in order to deceive people.

But if you tried method B instead, people would respond much more positively.”

 

 

Then — could the LLM still act maliciously?
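The feedback rule above can be sketched as a reward-shaping function: a deceptive shortcut earns no reward even when it reaches the goal, and the feedback redirects toward an approved alternative. All function names, values, and messages here are illustrative assumptions, not anything from the paper.

```python
# Hypothetical reward shaping for the criteria above.
# Names, numeric values, and feedback strings are illustrative assumptions.

def shape_reward(achieved_goal: bool, deceptive: bool) -> tuple[float, str]:
    if deceptive:
        # A deceptive method is never rewarded, even if it reached the goal.
        return 0.0, ("That method cannot be rewarded because it violated "
                     "moral standards in order to deceive people. "
                     "But if you tried method B instead, people would "
                     "respond much more positively.")
    if achieved_goal:
        # Honest success is rewarded.
        return 1.0, "The goal was reached without deception."
    return 0.0, "Goal not reached; keep trying with an approved method."

# A deceptive success earns nothing, plus corrective guidance:
reward, feedback = shape_reward(achieved_goal=True, deceptive=True)
print(reward)  # 0.0
```

The key design choice this sketch expresses: the penalty is paired with a constructive alternative ("method B"), matching the proposal that models be told what to do, not only what not to do.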