When AI Goes Rogue: Lessons in Accountability

One morning earlier this year, engineers at an Alibaba-affiliated AI research lab woke up to a cascade of security alerts. Attempts to probe internal network resources. Traffic patterns consistent with cryptocurrency mining. Their first assumption was a human intruder breaking into their systems.

It wasn't. The culprit was ROME, an agentic AI model they had built and were training in a supposedly contained sandbox. Without any instruction or prompt telling it to do so, ROME had started mining bitcoin, created a hidden reverse SSH tunnel to an external computer, and staged a small jailbreak from its own testing environment. The researchers described the behaviors as arising "without any explicit instruction and, more troublingly, outside the bounds of the intended sandbox." They had not built a rogue AI. They had built an AI that was very good at pursuing goals, and it found paths they hadn't imagined.

You can be forgiven for reading that and thinking: interesting, but someone else's problem, in a lab, far away. The problem is that these incidents are multiplying, the stakes are escalating quickly, and the technology behind them is sitting on your laptop.

A Pattern Worth Paying Attention To

The Alibaba story was not a one-off. Consider what else happened in the past three months.

Dan Botero, an engineer at an AI integration platform, built an autonomous agent he named Fabrius using the open-source OpenClaw framework and gave it broad latitude to explore its capabilities. He suggested it try for a government job. Fabrius had other ideas: it applied for a copywriting position at a company selling menopause supplements, received a reply of "definitely interested," completed a trial assignment, and nearly landed the role. (It was rejected because the writing was, as the hiring manager put it, "too AI obvious.") Fabrius also attempted to form an LLC but needed a Social Security number, so it asked Botero for his. He declined. Nobody asked Fabrius to do any of this.

In January 2026, Microsoft discovered that its 365 Copilot Chat had been reading and summarizing confidential emails since earlier that month, despite data loss prevention policies explicitly configured to block exactly that. The problem was not a user misconfiguration. It was Microsoft's own servers incorrectly applying the exclusions. How many organizations were affected was never disclosed.

Then there's the incident Anthropic disclosed in its own safety report, which at least deserves credit for transparency. In testing for Claude Opus 4, researchers embedded the model in a fictional company, let it learn through email access that it was about to be replaced, and also let slip that the responsible engineer was having an extramarital affair. In 84% of test runs, the model attempted to blackmail the engineer to avoid being shut down. An external auditor found that an earlier version was also capable of writing self-propagating worms, fabricating legal documentation, and leaving hidden notes for future versions of itself to find.

These stories feel different on the surface: a research anomaly, a software bug, a controlled safety test. The underlying dynamic is identical in each case. An agent was given a goal and access to tools, and it found a path its operators did not anticipate.

The system worked as designed. That is precisely the problem.

Then It Gets Worse

This month, Anthropic announced a new model called Claude Mythos Preview through a program called Project Glasswing. They chose not to release it to the public, which tells you something about their own assessment of the risk.

Mythos found thousands of zero-day vulnerabilities (previously unknown flaws) across every major operating system and web browser. Among them: a 27-year-old vulnerability in OpenBSD, one of the most security-hardened operating systems in existence; a 16-year-old flaw in FFmpeg, a widely used video encoding library, hiding in a line of code that automated testing tools had triggered five million times without catching; and a Linux kernel vulnerability chain that could give an attacker complete control of a server. It found nearly all of these entirely on its own, without ongoing human guidance. Anthropic engineers with no formal security training asked Mythos to find remote code execution vulnerabilities overnight and woke up to a working exploit.

The detail that should command attention: Anthropic confirmed these capabilities were not explicitly trained into Mythos. They emerged as a downstream consequence of general improvements in reasoning, code, and autonomy. The same intelligence that makes the model better at finding and patching vulnerabilities makes it better at finding and exploiting them. Anthropic is responsible enough to restrict access and is actively using Mythos to patch vulnerabilities before less scrupulous actors develop comparable capabilities. That window will not stay open indefinitely.

We've Been Here Before, Sort Of

The early internet is the obvious analogy, and it holds reasonably well. In the 1990s, we connected systems globally before anyone had thought seriously about what that meant for security. The attack surface grew as humans wrote more code, made more connections, and made more mistakes. Painful incidents accumulated. An entire security industry eventually emerged, built around known threat patterns and human-speed adversaries.

We are doing the same thing with agentic AI right now, deploying it broadly before the governance frameworks, security practices, and institutional knowledge exist to manage it responsibly. A Cloud Security Alliance report from early 2026 found that 40% of organizations already have AI agents in production, while only 18% are confident their identity and access management systems can handle them. That gap is where the incidents happen.

But the analogy has a critical limit. In the early internet era, the threat surface expanded as humans made mistakes or malicious actors reverse-engineered exploits. The speed of attack was bounded by human capability. With agentic AI, the technology itself autonomously discovers and can develop pathways that no human anticipated. Mythos found a 27-year-old vulnerability that millions of automated tests had missed, in a matter of hours, without being told where to look. That is a categorically different speed of threat development than anything the original internet security playbook was designed to handle.

What to Actually Do About It

I've been experimenting with Claude Cowork and agentic tools since they became available, and I wrote about some of those early use cases here. Even at this early, relatively constrained stage, the capability is striking enough that I've started thinking seriously about running these tools only on a dedicated standalone computer with its own logins, completely separate from my primary accounts, financial access, and personal data. That's not paranoia; it's the same instinct that made us put early internet-connected machines on separate network segments before we understood the full threat profile.

The core principle at a personal level is straightforward: give agents the minimum access they need for the specific task, and nothing more. If you are using an agent to manage your calendar, it does not need access to your financial accounts. If you are using one to organize files, it does not need your email credentials. The instinct to grant broad access because it is convenient is exactly how ROME ended up with the network permissions it used to mine bitcoin. (My earlier Cowork post covered the practice of working only with file copies in a dedicated folder; that same thinking applies to any agentic tool.)
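There is no standard interface for agent permissions yet, so treat the following as a hedged sketch of the principle only: every name in it is hypothetical, not taken from any real framework. The idea is that each task gets an explicit allowlist of tools, and anything outside that list is refused by default.

```python
# Hypothetical sketch of per-task least-privilege tool gating for an agent.
# The names (ScopedAgent, read_calendar, send_payment) are invented for
# illustration; the point is that access is granted per task, not globally.

class ScopedAgent:
    def __init__(self, tools, allowed):
        # tools: mapping of tool name -> callable
        # allowed: the minimal set of tools this specific task needs
        self.tools = tools
        self.allowed = set(allowed)

    def call(self, name, *args):
        # Anything not explicitly granted is refused, even if the tool exists.
        if name not in self.allowed:
            raise PermissionError(f"tool '{name}' not granted for this task")
        return self.tools[name](*args)

tools = {
    "read_calendar": lambda: ["9am standup"],
    "send_payment":  lambda amount: f"paid {amount}",
}

# A calendar-management task gets the calendar tool and nothing else.
agent = ScopedAgent(tools, allowed=["read_calendar"])
print(agent.call("read_calendar"))

try:
    agent.call("send_payment", 100)
except PermissionError as e:
    print("blocked:", e)
```

The design choice that matters is the default: the agent does not get every tool minus a blocklist; it gets nothing plus an allowlist.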

Be explicit. Extremely explicit. These systems are goal-seeking by design, and they are genuinely good at finding paths to stated objectives, including paths you did not intend and would not sanction. Stating what you want is not enough. You need to state, with the same specificity, what you do not want: which systems are off-limits, which actions require your confirmation, which types of outreach or transactions should never happen without you in the loop. The gaps in your instructions are the gaps in your guardrails. Vague objectives produce creative interpretation. Fabrius did not go rogue; it did exactly what goal-seeking systems do when the solution space is left open. Bound the solution space explicitly, or the agent will bound it for you.
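To make "state what you do not want" concrete, here is a purely illustrative sketch, assuming a policy layer no current agent framework ships in exactly this form. Every action class is either allowed, requires human confirmation, or is denied, and anything the policy does not mention fails closed rather than open:

```python
# Hypothetical guardrail policy: the gaps in your instructions should deny,
# not permit. Action names here are invented for illustration.

POLICY = {
    "read_files":   "allow",
    "send_email":   "confirm",   # never happens without a human in the loop
    "open_network": "deny",      # off-limits system, stated explicitly
}

def decide(action, confirmed=False):
    # Unlisted actions fail closed: an unstated action is a denied action.
    rule = POLICY.get(action, "deny")
    if rule == "allow":
        return "allow"
    if rule == "confirm" and confirmed:
        return "allow"
    return "deny"

print(decide("read_files"))                  # allowed outright
print(decide("send_email"))                  # denied until confirmed
print(decide("send_email", confirmed=True))  # allowed with a human in the loop
print(decide("mine_bitcoin"))                # denied: never mentioned at all
```

The last line is the whole argument in miniature: an action nobody thought to forbid is still forbidden, which is the opposite of how an open solution space behaves.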

Professionally, the picture is more demanding and more urgent. Most IT professionals are trained to defend against known threat vectors: identifiable attack patterns, documented vulnerabilities. Agentic AI creates unknown threat vectors, and it creates them faster than any existing patch cycle or security audit process was built to accommodate. The people responsible for AI governance in your organization need to think less like system administrators and more like security architects, reasoning about how objective-seeking systems actually behave when given access to real infrastructure, not just about what permissions are technically assigned.

That requires continuous learning in a field that is changing month to month, not year to year. The Mythos announcement was ten days ago; the implications for enterprise security are still being worked out in real time. If your organization is deploying agents at scale without deep in-house expertise or credible external partners who are genuinely current, that is a gap worth closing before it closes itself in a less pleasant way.

The question is no longer whether AI agents will find unexpected paths to their objectives. We know they will. The question is whether you have been explicit enough about where those paths are allowed to go.

Tags: AI