The incident did not start with a sophisticated exploit but with something far more common in companies: old credentials for a service account that nobody remembered, still valid and carrying “temporary” permissions that were never scaled back. Initial access was silent, and by the time the team detected anomalous activity, the attacker had already had plenty of time to move around unhurried.
This postmortem describes the failure, the root cause analysis, the signals that were ignored (or not seen), and the operational response to contain and prevent recurrence. The focus is practical: what to review, what to configure, and how to validate that the closure is real, not just “apparent”.
What went wrong: old credentials that were still alive
The entry point was a set of programmatic access keys associated with a service account created for a one-off integration. At the time, urgency justified it: “if we don’t do it today, the project stops”. The problem was not creating the credential, but that it became an invisible dependency: it was copied to a legacy system, replicated into another environment, and never came back onto anyone’s radar.
The typical pattern in companies is that those keys end up in several places: environment variables on a server, a secret in a poorly governed vault, or an old pipeline. When attempts are made to clean up years later, nobody dares to rotate them because “something critical could break”. That operational fear keeps the exposure alive.
The real consequence was that the attacker did not need to exploit anything: they authenticated as a legitimate principal. From there on, the security platform did not see an exploit; it saw “a service account making calls”, and the difference between normal traffic and abuse lay in the details of behavior and context.
Root cause analysis: weak IAM governance and accumulated operational debt
The incident had multiple factors, but the root was organizational and control-related: there was no reliable inventory of non-human identities (service accounts), nor an expiration/rotation process that was actually followed. The company had “written” policies, but no automation that enforced them.
Another element was excessive permissions. The service account had broad permissions to “avoid tickets”: reading secrets, enumerating resources, and the ability to assume roles that, in theory, were only used for maintenance. This is common when the decision is delegated to the platform team “so it doesn’t bother the business” and an overly open policy gets approved with no expiration date.
A common anti-pattern here is using a single “wildcard” service account for several systems. When something fails, it is impossible to isolate: rotating breaks too many pieces at once; restricting permissions generates incidents; and the result is paralysis. At the forensic level, traceability is also lost: too many processes share the same identity and the signals get diluted.
Early signals that arrived late: telemetry without operational context
The signals existed, but they were not connected to decisions. There were unusual calls to IAM and Secrets APIs from unfamiliar IP ranges and at atypical times, but the thresholds were tuned to “not generate noise”. Many organizations prefer to tolerate false negatives rather than deal with alerts that nobody handles.
Reconnaissance patterns also appeared: enumeration of roles, policies, and resources in short bursts. That is easily confused with internal inventory scripts if you lack clear tags, naming, and ownership to distinguish corporate automation from intruder behavior.
- Access from new geolocations or non-corporate ASNs
If the service account normally operates from a VPC with controlled egress or from a specific runner, any deviation should trigger an immediate investigation. In the incident, that data was in the logs, but there was no rule to turn it into action.
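As an illustration, here is a minimal sketch (Python with boto3) of what such a rule could look like: it pulls recent CloudTrail events for the identity and flags calls whose source IP falls outside the ranges that identity is expected to use. The account name and CIDR ranges are placeholders, not values from the incident.

```python
# Minimal sketch (assumed names/CIDRs): flag CloudTrail calls by one service account
# whose source IP is outside the ranges that identity normally uses.
import ipaddress
import json
from datetime import datetime, timedelta, timezone

import boto3

SERVICE_ACCOUNT = "svc-legacy-integration"            # hypothetical principal name
EXPECTED_CIDRS = ["10.20.0.0/16", "203.0.113.0/24"]   # VPC egress + corporate NAT (examples)

cloudtrail = boto3.client("cloudtrail")
networks = [ipaddress.ip_network(c) for c in EXPECTED_CIDRS]

pages = cloudtrail.get_paginator("lookup_events").paginate(
    LookupAttributes=[{"AttributeKey": "Username", "AttributeValue": SERVICE_ACCOUNT}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
)

for page in pages:
    for event in page["Events"]:
        detail = json.loads(event["CloudTrailEvent"])  # the full event is a JSON string
        source = detail.get("sourceIPAddress", "")
        try:
            ip = ipaddress.ip_address(source)
        except ValueError:
            continue  # calls made on the identity's behalf by AWS services show a service name
        if not any(ip in net for net in networks):
            print(f"[ALERT] {detail['eventTime']} {detail['eventName']} from unexpected IP {source}")
```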
- Use of identity APIs as an initial step
Early calls like “who am I” and “what can I do” (for example, identity queries, listing roles/policies, reading secrets) are a signal of exploration. In a legitimate application flow, that pattern is usually stable and repeatable; in abuse, it shows up as a new, short sequence.
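A rough sketch of how that burst could be surfaced, assuming CloudTrail is the event source; the event names, window, and threshold below are illustrative choices, not a canonical reconnaissance signature.

```python
# Rough sketch: detect a short burst of "who am I / what can I do" calls by one identity.
# Event names, window and threshold are illustrative assumptions.
from collections import deque
from datetime import datetime, timedelta, timezone

import boto3

RECON_EVENTS = {"GetCallerIdentity", "ListRoles", "ListPolicies",
                "ListAttachedUserPolicies", "ListSecrets"}
WINDOW = timedelta(minutes=5)   # assumed burst window
THRESHOLD = 10                  # assumed number of calls that counts as a burst

cloudtrail = boto3.client("cloudtrail")
timestamps = []
pages = cloudtrail.get_paginator("lookup_events").paginate(
    LookupAttributes=[{"AttributeKey": "Username", "AttributeValue": "svc-legacy-integration"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
)
for page in pages:
    for event in page["Events"]:
        if event["EventName"] in RECON_EVENTS:
            timestamps.append(event["EventTime"])

timestamps.sort()
window = deque()
for ts in timestamps:
    window.append(ts)
    while ts - window[0] > WINDOW:
        window.popleft()
    if len(window) >= THRESHOLD:
        print(f"[ALERT] {len(window)} enumeration calls in {WINDOW} ending at {ts}")
        break
```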
The operational takeaway was clear: it is not enough to “have logs”. You need normal-operation context per identity. Without a baseline per service account, everything tends to be treated as “noise” until the impact becomes visible (cost, leak, or outage).
Containment and eradication: cutting access without breaking the business
The first decision was to assume that the credential was compromised and act as if there were persistence. In practice, the challenge was to contain without causing a massive outage: the old key fed processes that nobody remembered. That is why containment had to be gradual and verified, not a “delete and done”.
The team started by limiting the blast radius: blocking usage from unexpected locations, reducing permissions to the minimum necessary, and enforcing traceability. In parallel, they investigated the source: where the key was stored (servers, repositories, pipelines, ITSM tools, etc.) and who was using it.
How to do it in practice
- Deactivate the compromised access key and create a new one with a controlled window
First, deactivate (do not delete) the key so you can revert if you break a critical process; then create a new key only if there is no alternative (ideally migrating to temporary credentials). Validate the impact by reviewing the error rate of dependent services and the failed-authentication logs associated with that identity.
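A minimal sketch of that first step with boto3; the user name and key id are placeholders for the compromised identity.

```python
# Minimal sketch: deactivate (not delete) a compromised access key so the change is reversible.
import boto3

iam = boto3.client("iam")
USER = "svc-legacy-integration"          # hypothetical service account
COMPROMISED_KEY_ID = "AKIAEXAMPLE1234"   # placeholder key id

# Record when and where the key was last used before touching it, for the incident timeline.
last_used = iam.get_access_key_last_used(AccessKeyId=COMPROMISED_KEY_ID)
print("Last used:", last_used["AccessKeyLastUsed"])

# Deactivate: calls signed with this key start failing, but it can be re-enabled
# if a critical dependency breaks while it is being migrated.
iam.update_access_key(UserName=USER, AccessKeyId=COMPROMISED_KEY_ID, Status="Inactive")

# Only if there is truly no alternative to a long-lived key, mint a replacement and
# treat it as temporary, with an owner and a review date tracked out of band:
# new_key = iam.create_access_key(UserName=USER)["AccessKey"]
```

After the change, watch the error rate of dependent services and the failed-authentication events for this identity: silence is not proof, but a sudden spike points directly at a consumer you had not mapped yet.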
- Apply a temporary explicit deny policy (guardrail) while you investigate
When you cannot rotate immediately, a guardrail reduces damage: it limits sensitive actions (for example, reading secrets, IAM changes, role assumption) while you identify legitimate consumers. The key is that it is temporary, reviewable, and has a clear owner.
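One possible shape for such a guardrail, applied as an inline deny policy with boto3; the action list and names are assumptions for illustration.

```python
# Sketch: attach a temporary inline deny policy to the service account while
# legitimate consumers are identified. Action list and names are illustrative.
import json

import boto3

iam = boto3.client("iam")
USER = "svc-legacy-integration"  # hypothetical service account

guardrail = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "IncidentGuardrailTemporary",
        "Effect": "Deny",
        "Action": [
            "secretsmanager:GetSecretValue",  # block reading further secrets
            "iam:*",                          # block persistence via IAM changes
            "sts:AssumeRole",                 # block pivoting into other roles
        ],
        "Resource": "*",
    }],
}

# Inline policy so it is obvious, easy to find, and easy to remove when the
# investigation closes; it still needs an owner and an expiry tracked outside IAM.
iam.put_user_policy(
    UserName=USER,
    PolicyName="incident-guardrail-temp",
    PolicyDocument=json.dumps(guardrail),
)
```

Because an explicit deny overrides any allow, the guardrail works even while the old, broad policies remain attached; removing it at the end of the investigation is a single, visible step.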
- Search for real identity usage in logs to find hidden dependencies
Identify which services, resources, and time windows are associated with legitimate usage. A typical finding is that there are “orphan” jobs (cron scripts, old runners, third-party integrations) that still rely on those credentials. Without that map, eradication becomes roulette.
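A sketch of that mapping using CloudTrail lookup events; the lookback window and key id are assumptions, and each distinct consumer fingerprint found is a candidate dependency to migrate before rotating.

```python
# Sketch: summarize what the identity actually does, to surface hidden consumers
# before rotating. Key id and window are placeholders.
import json
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")
usage = Counter()

pages = cloudtrail.get_paginator("lookup_events").paginate(
    LookupAttributes=[{"AttributeKey": "AccessKeyId", "AttributeValue": "AKIAEXAMPLE1234"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=30),
)
for page in pages:
    for event in page["Events"]:
        detail = json.loads(event["CloudTrailEvent"])
        agent = detail.get("userAgent", "unknown")
        usage[(event["EventSource"], event["EventName"], agent)] += 1

# Each distinct (service, action, user agent) tuple is a candidate dependency:
# a cron job, an old runner, or a third-party integration still using the key.
for (source, name, agent), count in usage.most_common(20):
    print(f"{count:6d}  {source:35s} {name:30s} {agent}")
```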
This is the point where many companies fail: they “fix” it by rotating and restoring quickly to put out the fire, but they do not eliminate the cause. Real eradication requires removing the mechanism that allows forgotten credentials to keep operating without an owner.
Real prevention: eliminating forgotten credentials as a class of problem
The most effective measure was migrating integrations to temporary credentials and well-delimited trust relationships, so there are no long-lived keys left to forget. When you move to temporary tokens, the risk changes: a leaked secret stops being a permanent key and becomes a bounded window of exposure.
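For example, a trust relationship for a CI/CD workload federated through OIDC could look roughly like the sketch below; the account id, provider, repository, and role names are placeholders, and the identity provider is assumed to be registered already.

```python
# Sketch: a role trust policy scoped to a single CI/CD identity via an OIDC provider,
# so the workload obtains short-lived STS credentials instead of a stored key.
# Provider, repository, audience and role name are placeholders.
import json

import boto3

TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Federated": "arn:aws:iam::111122223333:oidc-provider/token.actions.githubusercontent.com"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {
                # Only tokens minted for this audience are accepted.
                "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
            },
            "StringLike": {
                # Only this repository and branch can assume the role (example value).
                "token.actions.githubusercontent.com:sub": "repo:example-org/legacy-integration:ref:refs/heads/main"
            },
        },
    }],
}

iam = boto3.client("iam")
iam.create_role(
    RoleName="legacy-integration-ci",     # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    MaxSessionDuration=3600,              # session credentials expire after one hour
)
```

Session credentials obtained this way expire on their own, so “forgetting” the integration no longer leaves a usable secret behind.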
In parallel, controls were implemented that did not depend on human memory: expiration by design, mandatory inventory, ownership, and automated reviews. If the organization accepts that “things get forgotten”, it must ensure that forgetting is not equivalent to exposure.
Example of technical control (AWS IAM) to enforce temporary credentials
If a workload must access AWS, prioritize roles with STS (for example, via IAM Roles for Service Accounts in Kubernetes, instance profiles in EC2, or roles assumed from an IdP). Instead of allowing access keys on IAM users for automation, restrict their creation and use. A common approach is to block the creation and use of access keys at the organization level except for approved exceptions, and to require that non-human identities be assumable roles with conditions (source, audience, tags, etc.).
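A sketch of that organizational guardrail as a Service Control Policy created with boto3; the exception tag key and the target OU id are assumptions, not a prescription.

```python
# Sketch: a Service Control Policy that blocks creating new IAM access keys across the
# organization unless the calling principal carries an approved-exception tag.
# The tag key and OU target are assumptions for illustration.
import json

import boto3

SCP = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyNewLongLivedAccessKeys",
        "Effect": "Deny",
        "Action": ["iam:CreateAccessKey"],
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {
                # Hypothetical escape hatch: principals tagged by the platform team
                # after an explicit review can still create keys.
                "aws:PrincipalTag/access-key-exception": "approved"
            }
        },
    }],
}

org = boto3.client("organizations")
policy = org.create_policy(
    Name="deny-long-lived-access-keys",
    Description="Block creation of IAM access keys except approved exceptions",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(SCP),
)
# Attach to the OU (or root) that should inherit the guardrail; the target id is a placeholder.
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-example-12345678",
)
```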
How to validate it is well applied
- Inventory of active access keys and their age
Verify how many access keys are still active, how long it has been since they were rotated, and which identities own them. The practical goal is for that number to trend to zero and for any exception to have an owner, a reason, and a review date.
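One way to produce that inventory with boto3; the rotation threshold is an assumed policy value.

```python
# Sketch: inventory every active IAM access key, its age and last use.
from datetime import datetime, timezone

import boto3

MAX_AGE_DAYS = 90  # assumed rotation policy
iam = boto3.client("iam")
now = datetime.now(timezone.utc)

for user_page in iam.get_paginator("list_users").paginate():
    for user in user_page["Users"]:
        keys = iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]
        for key in keys:
            if key["Status"] != "Active":
                continue
            age_days = (now - key["CreateDate"]).days
            last = iam.get_access_key_last_used(AccessKeyId=key["AccessKeyId"])
            last_used = last["AccessKeyLastUsed"].get("LastUsedDate")  # None if never used
            flag = "STALE" if age_days > MAX_AGE_DAYS else "ok"
            print(f"{flag:5s} {user['UserName']:30s} {key['AccessKeyId']} "
                  f"age={age_days}d last_used={last_used}")
```

Every line that comes out of a script like this should map to an owner, a reason, and a review date; anything that does not is the next forgotten credential.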
- Review of effective permissions of the compromised principal
Do not stop at “attached policies”: validate the effective permission (including inherited policies, permissions via groups/roles, and limits such as permission boundaries). In real incidents, excess privilege is often hidden in a combination of accumulated legitimate permissions.
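A sketch using the IAM policy simulator, which evaluates attached, inline, and group policies together with permission boundaries; the principal ARN and action list are examples.

```python
# Sketch: check what the principal can *effectively* do using the IAM policy simulator.
# Principal ARN and sensitive actions are example values.
import boto3

iam = boto3.client("iam")
PRINCIPAL_ARN = "arn:aws:iam::111122223333:user/svc-legacy-integration"  # placeholder
SENSITIVE_ACTIONS = [
    "secretsmanager:GetSecretValue",
    "iam:CreateAccessKey",
    "sts:AssumeRole",
]

response = iam.simulate_principal_policy(
    PolicySourceArn=PRINCIPAL_ARN,
    ActionNames=SENSITIVE_ACTIONS,
)
for result in response["EvaluationResults"]:
    # EvalDecision reflects the combination of policies the principal actually carries.
    print(f"{result['EvalActionName']:35s} -> {result['EvalDecision']}")
```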
- Behavior-based alerts per identity
Configure detections that look at “this identity does this from here and in this pattern” instead of global thresholds. That is what differentiates an actionable alert from a full inbox.
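As one possible implementation, the sketch below creates a per-identity metric filter and alarm over a CloudTrail log group; the log group name, identity, event names, and alert destination are all assumptions.

```python
# Sketch: a per-identity metric filter + alarm over a CloudTrail log group, so the alert
# condition is "this identity doing these calls", not a global account-wide count.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

LOG_GROUP = "org-cloudtrail"           # assumed CloudTrail log group
IDENTITY = "svc-legacy-integration"    # hypothetical service account

# Match sensitive calls made specifically by this identity (example event names).
pattern = (
    '{ ($.userIdentity.userName = "svc-legacy-integration") && '
    '(($.eventName = "GetSecretValue") || ($.eventName = "AssumeRole") || '
    '($.eventName = "CreateAccessKey")) }'
)

logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName=f"{IDENTITY}-sensitive-calls",
    filterPattern=pattern,
    metricTransformations=[{
        "metricName": f"{IDENTITY}-sensitive-calls",
        "metricNamespace": "Security/IdentityBaseline",
        "metricValue": "1",
    }],
)

cloudwatch.put_metric_alarm(
    AlarmName=f"{IDENTITY}-sensitive-calls",
    Namespace="Security/IdentityBaseline",
    MetricName=f"{IDENTITY}-sensitive-calls",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,  # for this identity, any occurrence is worth a look
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:security-alerts"],  # placeholder topic
)
```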
Recommendations for corporate environments
This compromise did not happen due to an advanced technique, but due to forgotten credentials and accumulated permissions in a service account without a real owner. The operational lesson is that non-human identities need the same lifecycle as any critical asset: inventory, ownership, expiration, and automated controls that prevent debt from becoming persistent access.
The “late” signals were the result of telemetry without context: there were records, but no baseline per identity and no rules to turn deviations into immediate investigation. When the attacker uses a legitimate principal, the difference is in behavior; if you do not model it, normal and malicious look too similar.
In the response, effective containment was the one that balanced security and continuity: deactivate/rotate with verification, reduce permissions, and map dependencies before cutting. And in prevention, the corporate goal must be for long-lived credentials to disappear from day-to-day operations, replaced by temporary credentials and guardrails that make it hard to “forget” something in production.