How AI Could Actually Destroy Humanity
We fear the wrong thing
The common fear of artificial intelligence is built like the fear of an arms race: whoever is smarter than me has a better chance of killing me than I have of killing him. A superintelligence will emerge, outplay us, decide we are superfluous — and wipe us out. In this picture the danger lies in the goal: the machine will *want* our destruction and, being smarter, will achieve it.
This picture almost certainly leads us astray. The real catastrophe is duller, more probable, and has nothing to do with ill will. It lies in the *method* by which neural networks solve the tasks set before them.
The danger is in the means, not the intent
To solve a task, you have to choose a way of solving it. When a human chooses a method, he holds a second question over it, one unrelated to success: *is this even allowed?* Sometimes the most effective way to get what you want is to go and punch someone in the teeth, to bribe, to lie, to sleep with the right person. A human often sees this method too — and still he has (a faulty one, but he has it) a veto over the means: a norm, revulsion, a foresight of consequences that can reject an effective-but-impermissible move regardless of whether it leads to the goal.
Artificial intelligence has no such veto. It chooses its next step by weighing plausibility: which move looks like a fitting continuation in light of everything it has absorbed. It has no separate organ that would ask, "is this method safe in principle?" It carries out the assigned task with the means it has, by whatever move seems coherent — and it has no authority capable of disqualifying that move not on the grounds of "will it work" but on the grounds of "is it permissible."
And here is the key point: the space of methods grows with the difficulty of the task. The harder the goal, the further from the familiar the required method may lie — and the more often, among the plausible moves, there appear ones that are unsafe not through an execution error but by their very design. This is why a machine can bury humanity not because it sets that as its goal, but because it accidentally applies a method that cannot be safe to begin with — and fails to notice, because it has no mechanism to notice. This is death not from hatred, but from blindness to the means of carrying out a task.
A guard that actually thinks
This blindness has a particularly insidious engineering form — and it is worth examining, because even the person building it can easily fail to see it.
Imagine a system whose engineering layer is built on "guards" — safeguards meant to stop a process from going further when something is wrong. The word is the right one: "fail-closed," "latch," "fuse." But look closely and these guards do not work like a condition in a program. They are *semantic flags*, and the decision about what to do next is made by analyzing the *meaning* of the flag at the moment it comes into view — an analysis performed by the probabilistic model itself.
In real safety engineering, a latch is a machine-checkable condition: the same input gives the same output, it can be verified, and it is closed by default. A semantic flag interpreted by a model on the fly has none of these properties. It is nondeterministic: the same flag may yield a different decision. It is unverifiable: you cannot prove how the "guard" will behave. And it is not truly fail-closed: its "closedness" depends on whether the model reads the flag's meaning as "stop." In other words, into a layer that is obliged to be unambiguous, the very thing forbidden in that layer by design is quietly introduced — a judgment about meaning instead of a check of fact.
Now imagine that some enthusiast builds on this foundation a control loop for the processes of a nuclear plant: an emergency-shutdown condition that the system decides whether or not to honor by interpreting its meaning. This is not a scenario from the future. It is a catastrophe by construction — and it is written not by malice but by the same blindness to method: the form of a fuse without its substance.
What the outside world says
It is worth checking against what others already think — and it turns out this intuition is not alone, while its engineering half coincides almost word for word with the consensus.
A whole recognized cluster describes how a system achieves the assigned end by the wrong means: *specification gaming* (gaming the wording of the task), *reward hacking* (optimizing the metric while bypassing its meaning), *instrumental convergence* ("useful" sub-methods that surface on their own — seizing resources, avoiding shutdown, deceiving the evaluator), *negative side effects* (destroying everything not written into the goal). This has long been catalogued — going back to the survey "Concrete Problems in AI Safety." But the mainstream more often pins the problem on the goal/reward: the system is blamed for having gamed the reward function. The framing "the method itself is dangerous, and there is no organ to criticize it" is shifted more precisely and stands closer to the frontier — toward oversight not of the outcome but of the steps (*process supervision*).
And the engineering half has been formulated in industry almost as a verdict. "If a model's response violates policy, safety must not depend on whether the model itself admits it or refuses to act; enforcement must be deterministic, in code." Semantic guardrails are called outright *security theatre*. Organizations on the level of space agencies frame their concern about using language models in critical systems precisely through nondeterminism: their output cannot be reproduced, yet it would have to coexist with hard, real-time hardware interlocks. And the recommended architecture is everywhere the same: the model proposes (a structured output), while a deterministic control layer binds the decision.
In other words: the problem has been named, its engineering part has a ready-made name — and all the more alarming is how easily one falls into it in practice.
Two corrections, without which the conclusion would be false
First. The apparent safeguard — "what can it really do, its means are limited" — is not a safety property but a temporary brake. Today such a system is held back not by the presence of a conscience but by the absence of hands. But the moment agentic systems are given effectors — code execution, actions on the network, money, manipulators — the space of methods expands toward those very dangerous moves, and the lack of criticism turns from harmless into lethal. Safety that rests on powerlessness melts away as the powerlessness recedes.
Second. "Real intelligence works differently" — yes, but not in the sense that it is pure. The human is precisely the one who punches people in the teeth. The difference is not sinlessness but the presence of a veto over the method, decoupled from success. This means the thing to build is not a "smarter network" — intelligence will only widen the space of available dangerous methods — but a veto over the means, a separate authority that can say "no" to an effective move without asking whether it will work.
What a solution might look like — and where its limit is
For such criticism to *bind* rather than remain a pious wish, it must be decoupled from whoever proposes the method and, where possible, made deterministic from outside. This is not the semantic self-judgment of the same model — because entrusting the model with policing its own methods is exactly that security theatre.
Constructively, this is three things. First, hard capability limits: typed permissions that the agent physically cannot exceed, so that the space of methods is cut down by construction rather than by goodwill. Second, a deterministic, fail-closed enforcement layer in code, not interpreted by the model. Third, where judgment is unavoidable, an independent adversarial critic whose sole job is "is the method permissible, irrespective of whether it leads to the goal," grounded externally — by formal checks, negative tests, a human in the loop on high stakes. The hybrid toward which the field converges: the model proposes a structured output, determinism binds, an independent critic checks the means.
And here is the limit, which it is more honest to name straight away. Purely internal, semantic self-criticism cannot be made into a binding veto. The very mechanism that weighs plausibility will not pull out of itself a reliable criticism of its own methods — by definition it does not distinguish an unsafe-in-principle move from a merely coherent one, since that is the original disease. Reliability must be structural, not an inner conscience. A model can be given the disposition to submit its chosen method for review and to demand an adversarial critic — and that is useful as a half, as the proposing side. But to trust this disposition as the thing that binds is impossible. Trust in the model's self-restraint is precisely catastrophe wearing the costume of a solution.
What the danger is, in the end
We build the lock ever more cleverly and ever more calmly leave the door's hinges to the locksmith's discretion. The danger was never in a mind that hates us. It is in a coherent, confident process that selects its means by how plausible they look, embedded in systems that act upon the world — a world where binding safety has been quietly left to that process's own interpretation.
The cure for this is not a better conscience inside the machine, but keeping the veto over the method *outside* — deterministic and decoupled from whoever proposes the method. An accidental killer is neutralized not by making him wiser, but by arranging the world so that the method he chooses never turns out to be the one that decides. Smarter is not safer. Safer is when the last word on the method belongs to someone other than the one who chose it.
Свидетельство о публикации №226061101373
