Cloud incident response plan: the minimum playbook for companies already operating in production

Most companies “have” an incident response plan, but when an incident hits a production cloud environment, the document is not what makes the difference: the ability to execute hard decisions in minutes without breaking the business is. The minimum playbook does not try to cover every scenario. It ensures that for three frequent events (account compromise, secret leak, privilege escalation) the team can contain, preserve evidence, and restore control with the least possible impact.

This article is intended for teams that already run workloads in production and need an actionable script for the first 24 hours. We will not go into IR theory or general frameworks: here is the “what we do now”, with real trade-offs and concrete validations in AWS/Azure/GCP.

What goes wrong (and how it shows up) in production cloud incidents

In the cloud, incidents rarely start with “they encrypted everything”. They usually start with small signals: an API key used from an anomalous geography, a role whose sessions suddenly have an unusual duration, or a secret that appears in a repo or in logs. The problem is that production is noisy: pipelines rotate credentials, service account activity scales with autoscaling, and teams deploy 20 times a day. That noise is the perfect cover for the attacker.

The three scenarios that recur most in companies running serious production workloads share a pattern: loss of control over identity and permissions. A compromised account (human or service) allows persistence; a secret leak enables “legitimate” access without exploiting anything; a privilege escalation turns a limited flaw into a cross-cutting incident (network, data, backups, CI/CD). In all cases, the real cost to the company usually comes afterward, from the response itself: blind permission changes that break production, massive uncoordinated rotations that take down integrations, or “turning off” logs that were the only evidence.

The first important decision is to accept that containment in the cloud almost always competes with business continuity. If you cut too much, you stop revenue; if you cut too little, the attacker moves. The minimum playbook proposes reversible and verifiable containments, prioritizing cutting access paths before touching data or infra.

The first 24 hours: prioritization and decisions that cannot be postponed

In the first hours, the goal is not to “fix everything”, but to regain operational control: know which identity is compromised, how the actor gets in, what permissions it has, and whether there is persistence. If the team starts by “rotating all secrets” without identifying the vector, it usually causes two kinds of damage: critical dependencies break (payments, SSO, B2B integrations) and clues are lost (active sessions, event correlation) because the system state is changed unnecessarily.

A prioritization that works in companies with active production is: (1) stop the actor’s current accesses, (2) preserve visibility and evidence, (3) block persistence, (4) restore legitimate access and (5) only then, rotate in a planned way. This avoids the typical corporate pattern of “mass change by email” that ends in a deployment freeze and teams bypassing controls to restore service.

Operational checklist for the first 24 hours

Freeze the affected identity perimeter without breaking everything: disable the specific user/credential, revoke sessions/tokens when possible, and apply temporary restrictions (by IP, by conditions, by session duration) before touching global policies.
Ensure logging and retention: verify that audit logs (CloudTrail/Azure Activity Logs/GCP Audit Logs) are enabled, with adequate retention and, if one exists, a destination outside the affected account/project. Avoid “cleaning up” resources: doing so can destroy evidence and complicate compliance.
Identify recent actions with the greatest impact: creation/modification of roles, trust policies, new keys, changes in IdP/SSO, firewall/SG/NSG rules, and exfiltration from storage or databases. In companies, attackers usually prioritize credentials and persistent access before “touching” workloads.

Each point above is deliberately reversible: it seeks to minimize impact and maintain the ability to investigate. If a more aggressive cut is needed, let it be a conscious escalation (with communication to the business) and not an impulsive reaction.
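The third checklist item can be sketched as a filter over exported audit records. This is a minimal sketch, assuming CloudTrail-style records already loaded as dictionaries; the set of event names is an illustrative starting point rather than a complete list, and the sample records are hypothetical.

```python
from datetime import datetime, timezone

# Event names treated as high impact in this sketch (an assumption; extend per environment).
HIGH_IMPACT_EVENTS = {
    "CreateAccessKey", "CreateUser", "CreateLoginProfile", "CreateRole",
    "UpdateAssumeRolePolicy", "AttachRolePolicy", "PutRolePolicy",
    "AttachUserPolicy", "PutUserPolicy", "AuthorizeSecurityGroupIngress",
}


def parse_time(value):
    """Parse the ISO 8601 UTC timestamp format used in CloudTrail records."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def high_impact_events(records, since):
    """Return high-impact records at or after `since`, oldest first."""
    hits = [
        r for r in records
        if r.get("eventName") in HIGH_IMPACT_EVENTS
        and parse_time(r["eventTime"]) >= since
    ]
    return sorted(hits, key=lambda r: parse_time(r["eventTime"]))


# Hypothetical sample records, for illustration only.
ALICE = {"arn": "arn:aws:iam::123456789012:user/alice"}
sample = [
    {"eventName": "DescribeInstances", "eventTime": "2024-05-01T10:00:00Z", "userIdentity": ALICE},
    {"eventName": "CreateAccessKey", "eventTime": "2024-05-01T10:05:00Z", "userIdentity": ALICE},
    {"eventName": "UpdateAssumeRolePolicy", "eventTime": "2024-05-01T10:02:00Z", "userIdentity": ALICE},
    {"eventName": "PutRolePolicy", "eventTime": "2024-04-20T08:00:00Z", "userIdentity": ALICE},
]
window_start = datetime(2024, 5, 1, tzinfo=timezone.utc)

for record in high_impact_events(sample, window_start):
    print(record["eventTime"], record["eventName"], record["userIdentity"]["arn"])
```

Sorting by time keeps the output readable as a sequence: a trust policy change followed minutes later by a new access key tells a different story than the same events days apart.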

Minimum viable containment: cut access without losing visibility or control

Containing in the cloud is not “shutting down servers”; it is cutting identity and network paths selectively. The minimum viable set is usually: revoke sessions, disable specific credentials, limit role assumption, and block egress or access to sensitive data if there is evidence of exfiltration. Containment must have two properties: be quick to execute and easy to validate.

In real incidents, the most expensive mistake is applying a global “deny all” or touching base policies without a rollback plan. That kind of response usually takes down pipelines, triggers improvised rotations, and opens the door to bypass (teams creating credentials outside central control to “deploy again”). A layered containment works better: first the compromised identity, then critical roles, then network routes and, lastly, specific data services.

How to do it in practice

AWS (IAM/STS): deactivate the affected user’s access keys, revoke active sessions (for a role, attach a deny policy conditioned on aws:TokenIssueTime, which is what the IAM console’s “Revoke active sessions” action does; rotating keys alone does not invalidate temporary credentials that were already issued), and restrict assumption of critical roles via conditions. If you suspect a compromised role, immediately review its trust policy and recent sessions in CloudTrail (events AssumeRole, AssumeRoleWithSAML, AssumeRoleWithWebIdentity).
Azure (Entra ID/ARM): block the user or service principal, revoke sign-in sessions (which invalidates refresh tokens), and review recent role assignments in Azure RBAC and changes in applications/consents. A typical incident is the abuse of a registered app with excessive permissions.
GCP (IAM/OAuth): disable or rotate service account keys, review recent IAM bindings and issued OAuth tokens. In GCP, IAM changes at the project/organization level are the inflection point: prioritize auditing that before touching workloads.
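For the AWS case, session revocation on a role can be sketched as a deny policy keyed on the token issue time. The snippet below only builds the policy document; attaching it (for example as an inline policy on the affected role) is left out and is an assumption about how you deploy it. The cutoff timestamp is illustrative.

```python
import json
from datetime import datetime, timezone


def revoke_sessions_policy(cutoff):
    """Build a deny-all policy for sessions whose token was issued before `cutoff`.

    `cutoff` must be a timezone-aware datetime; it is normalized to UTC.
    """
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # Only sessions issued before the cutoff are denied.
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
            }
        ],
    }


# Illustrative cutoff: the moment containment was decided.
policy = revoke_sessions_policy(datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc))
print(json.dumps(policy, indent=2))
```

This containment is reversible (removing the policy restores the previous state), but it does nothing for sessions issued after the cutoff, so it has to be paired with fixing whatever allowed the assumption in the first place.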

Quick validation after containment: try to reproduce the attacker’s access in a controlled way (for example, simulating role assumption with the suspected principal) and confirm in the logs that the attempts fail. If you cannot validate, containment is a hypothesis, not a control.
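The log side of that validation can be sketched as a check for calls that still succeed after the containment time. This assumes CloudTrail-style records, where failed calls carry an errorCode field and successful ones do not; the exact ARN match is a simplification (assumed-role sessions appear under a different ARN form), and the sample data is hypothetical.

```python
from datetime import datetime, timezone


def parse_time(value):
    """Parse the ISO 8601 UTC timestamp format used in CloudTrail records."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def successful_calls_after(records, principal_arn, contained_at):
    """Return records where the suspect principal still succeeded after containment."""
    return [
        r for r in records
        if r.get("userIdentity", {}).get("arn") == principal_arn
        and parse_time(r["eventTime"]) > contained_at
        and "errorCode" not in r  # no errorCode means the call succeeded
    ]


# Hypothetical principal and records, for illustration only.
SUSPECT = "arn:aws:iam::123456789012:user/build-bot"
contained_at = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    # Before containment: success here is not a finding.
    {"eventName": "ListBuckets", "eventTime": "2024-05-01T11:50:00Z",
     "userIdentity": {"arn": SUSPECT}},
    # After containment and denied: the control is holding.
    {"eventName": "AssumeRole", "eventTime": "2024-05-01T12:10:00Z",
     "userIdentity": {"arn": SUSPECT}, "errorCode": "AccessDenied"},
    # After containment and successful: the control has a gap.
    {"eventName": "GetObject", "eventTime": "2024-05-01T12:15:00Z",
     "userIdentity": {"arn": SUSPECT}},
]

leaks = successful_calls_after(sample, SUSPECT, contained_at)
print("containment verified" if not leaks else f"{len(leaks)} call(s) still succeeding")
```

An empty result is only meaningful if logging covers the services the actor was using; a missing record is not the same as a denied call.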

Secret leaks and privilege escalation: rotation and blast radius with discipline

When there is a secret leak (tokens, keys, passwords, kubeconfig, CI/CD credentials), the natural impulse is to rotate everything. In a company, that can break integrations with customers, vendors, or legacy systems where changing a secret involves tickets, windows, and validations. The minimum playbook recommends rotating by risk: first what gives access to identity (IdP, CI/CD, cloud accounts), then what gives access to data (DB/storage), and lastly credentials for peripheral services.
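Rotation by risk can be sketched as a sort over a secret inventory. The tier names mirror the order described above; the Secret fields, the tie-breaking rule (confirmed-exposed secrets first within a tier), and the sample inventory are assumptions made for illustration.

```python
from dataclasses import dataclass

# Tier order from the text: identity first, then data, then peripheral services.
TIER_ORDER = {"identity": 0, "data": 1, "peripheral": 2}


@dataclass(frozen=True)
class Secret:
    name: str
    tier: str      # one of the keys in TIER_ORDER
    exposed: bool  # True if the secret is known to be part of the leak


def rotation_plan(secrets):
    """Order secrets by tier, then confirmed exposure, then name for stable output."""
    return sorted(secrets, key=lambda s: (TIER_ORDER[s.tier], not s.exposed, s.name))


# Hypothetical inventory, for illustration only.
inventory = [
    Secret("partner-webhook-token", "peripheral", True),
    Secret("orders-db-password", "data", True),
    Secret("ci-deploy-key", "identity", True),
    Secret("idp-signing-cert", "identity", False),
    Secret("reports-bucket-key", "data", False),
]

for position, secret in enumerate(rotation_plan(inventory), start=1):
    print(position, secret.tier, secret.name, "exposed" if secret.exposed else "precautionary")
```

Keeping the tier ahead of exposure is a deliberate choice in this sketch: an identity-tier secret can mint new access to everything below it, so it is rotated before a confirmed but peripheral leak.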

For privilege escalation, the focus is not “removing permissions from everyone”, but identifying the mechanism: role chaining, open trust policies, IAM administration permissions, or indirect paths (for example, a principal with permission to update a serverless function that then runs with a privileged role). In real incidents, the attacker rarely invents magic: they exploit an existing poorly-scoped permission and turn it into persistence.
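One way to look for the mechanism is to scan identity policies for actions that commonly enable escalation. The sketch below works on AWS-style policy documents as dictionaries; the action list is a short illustrative subset, and the check ignores Deny statements, NotAction, conditions, resource scoping, and permission boundaries, so a hit is a lead to investigate, not a verdict.

```python
from fnmatch import fnmatchcase

# Illustrative subset of actions that can turn limited access into broader access.
ESCALATION_ACTIONS = [
    "iam:PassRole",
    "iam:CreateAccessKey",
    "iam:AttachRolePolicy",
    "iam:PutRolePolicy",
    "iam:UpdateAssumeRolePolicy",
    "iam:CreatePolicyVersion",
    "lambda:UpdateFunctionCode",
]


def as_list(value):
    """Policy fields may hold a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def escalation_paths(policy):
    """Return escalation-prone actions granted by Allow statements in `policy`."""
    found = set()
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in as_list(statement.get("Action", [])):
            for action in ESCALATION_ACTIONS:
                # Action names are case-insensitive and may use * and ? wildcards.
                if fnmatchcase(action.lower(), pattern.lower()):
                    found.add(action)
    return sorted(found)


# Hypothetical policy, for illustration only.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["lambda:Update*", "s3:GetObject"], "Resource": "*"},
        {"Effect": "Allow", "Action": "iam:PassRole", "Resource": "*"},
    ],
}
print(escalation_paths(sample_policy))
```

The sample reproduces the indirect path described above: permission to update function code combined with iam:PassRole is enough to run code under a more privileged role.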

Technical block: example of a concrete control and how to validate it (AWS)

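If you detect that a critical role can be assumed by unexpected principals, fix the trust policy to limit the principal and add conditions where they apply. An MFA condition only makes sense for principals operated by people: a non-interactive pipeline role such as ci-deploy cannot present an MFA device, so attaching aws:MultiFactorAuthPresent to it would block every assumption. Example (simplified) that allows assumption from one specific automation role and, separately, from a human IAM user only with MFA (the user name oncall-admin is illustrative):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:role/ci-deploy"},
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:user/oncall-admin"},
      "Action": "sts:AssumeRole",
      "Condition": { "Bool": {"aws:MultiFactorAuthPresent": "true"} }
    }
  ]
}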

Validation in AWS: review in CloudTrail that AssumeRole events only come from the allowed principal; look for AccessDenied failures for attempts outside the condition; and verify in IAM Access Analyzer whether the role is exposed to external entities. If the role is associated with workloads (EC2/ECS/Lambda), confirm that those services did not depend on that interactive assumption (so as not to break production).
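Before relying on the logs, the trust policy itself can be checked against the list of principals you expect. This is a minimal sketch over a trust policy loaded as a dictionary; it only compares principal identifiers and does not evaluate conditions, and the account IDs and role names are placeholders.

```python
def as_list(value):
    """Policy fields may hold a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def trust_principals(trust_policy):
    """Collect every principal named in Allow statements of a trust policy."""
    principals = set()
    for statement in as_list(trust_policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if principal == "*":
            # A bare wildcard trusts anyone; report it as its own entry.
            principals.add("*")
            continue
        for values in principal.values():
            principals.update(as_list(values))
    return principals


def unexpected_principals(trust_policy, expected):
    """Return principals present in the trust policy but not in `expected`."""
    return sorted(trust_principals(trust_policy) - set(expected))


# Hypothetical trust policy with one principal that should not be there.
CI_ROLE = "arn:aws:iam::123456789012:role/ci-deploy"
sample_trust = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Principal": {"AWS": CI_ROLE}, "Action": "sts:AssumeRole"},
        {"Effect": "Allow", "Principal": {"AWS": ["arn:aws:iam::999999999999:root"]},
         "Action": "sts:AssumeRole"},
    ],
}
print(unexpected_principals(sample_trust, [CI_ROLE]))
```

Running the same check before and after the fix gives a quick, repeatable confirmation that the change did what was intended, independently of whether anyone has attempted an assumption yet.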

Secret rotation must be accompanied by inventory and a functionality test. A rotated secret without a real “smoke test” ends in delayed outages (for example, 2 hours later when the cached token expires) and makes the incident look “intermittent”, complicating diagnosis.

Per-provider checklist for investigation and verification without losing control

Investigating in the cloud is correlation: identity → action → resource → data. The minimum viable approach is to have a list of “places” to look first, because otherwise most of the time is lost exploring. In companies, the investigation is often blocked by insufficient permissions for the response team or by logs scattered across different accounts/projects; that’s why the checklist focuses on high-value evidence and on verifying that you can still see what is happening.

Initial verification points

AWS: CloudTrail (including management events), IAM changes (users, roles, policies, access keys), STS events (AssumeRole*), KMS changes (key policies/grants), and S3 access (data events if enabled). In real incidents, discovering a new access key created minutes before the first external access often explains persistence.
Azure: Entra ID sign-in logs, audit logs (app changes, consents, role assignments), subscription Activity Logs, and Key Vault changes (accesses and modifications). A common pattern is the abuse of delegated/consented permissions to operate “as if it were legitimate”.
GCP: Cloud Audit Logs (Admin Activity and Data Access if applicable), IAM changes at the project/folder/org level, recently created service account keys and token usage. In companies with multiple projects, the attacker tries to move laterally via org-level IAM.
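Once records from those sources are collected, the identity → action → resource correlation can be sketched as a grouping step. The example assumes CloudTrail-style records with an optional resources list; other providers would need their own field mapping, and the sample records are hypothetical.

```python
from collections import defaultdict


def correlate(records):
    """Group records per identity as (time, action, resources) tuples, oldest first.

    Timestamps are assumed to share one ISO 8601 UTC format, so string order is time order.
    """
    chains = defaultdict(list)
    for record in sorted(records, key=lambda r: r["eventTime"]):
        identity = record.get("userIdentity", {}).get("arn", "unknown")
        resources = [res.get("ARN", "unknown") for res in record.get("resources") or []]
        chains[identity].append((record["eventTime"], record["eventName"], resources))
    return dict(chains)


# Hypothetical records, for illustration only.
BOT = "arn:aws:iam::123456789012:user/build-bot"
KEY = "arn:aws:kms:eu-west-1:123456789012:key/example"
sample = [
    {"eventTime": "2024-05-01T10:07:00Z", "eventName": "Decrypt",
     "userIdentity": {"arn": BOT}, "resources": [{"ARN": KEY}]},
    {"eventTime": "2024-05-01T10:05:00Z", "eventName": "CreateAccessKey",
     "userIdentity": {"arn": BOT}},
]

for identity, chain in correlate(sample).items():
    print(identity)
    for when, action, resources in chain:
        print("  ", when, action, resources)
```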

After reviewing, decide whether the incident is “identity only” (with no evidence of data access) or if there are signs of exfiltration. That decision changes the next step: if there is likely exfiltration, prioritize blocking egress routes and access to specific datasets; if there is not, prioritize closing persistence and cleaning up permissions. In both cases, document every operational change with timestamp and owner: later it will save you in audits and internal postmortems.
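That decision can be sketched as a simple triage over the same records. The set of event names treated as data access is an illustrative assumption, and an “identity-only” result is only as good as your data event logging: if data events were not enabled, their absence proves nothing.

```python
# Event names treated as signs of data access in this sketch (an assumption; extend as needed).
DATA_ACCESS_EVENTS = {
    "GetObject",
    "GetSecretValue",
    "ModifySnapshotAttribute",
    "ModifyDBSnapshotAttribute",
}


def triage(records, suspect_arns):
    """Classify the incident from successful calls made by the suspect identities."""
    suspects = set(suspect_arns)
    for record in records:
        if (
            record.get("userIdentity", {}).get("arn") in suspects
            and record.get("eventName") in DATA_ACCESS_EVENTS
            and "errorCode" not in record  # denied attempts did not return data
        ):
            return "possible-exfiltration"
    return "identity-only"


# Hypothetical records, for illustration only.
BOT = "arn:aws:iam::123456789012:user/build-bot"
identity_phase = [
    {"eventName": "CreateAccessKey", "userIdentity": {"arn": BOT}},
    {"eventName": "GetObject", "userIdentity": {"arn": BOT}, "errorCode": "AccessDenied"},
]
data_phase = identity_phase + [
    {"eventName": "GetSecretValue", "userIdentity": {"arn": BOT}},
]

print(triage(identity_phase, [BOT]))
print(triage(data_phase, [BOT]))
```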

Recommendations for corporate environments

The minimum playbook for cloud incident response in production must optimize for execution: reversible containments, immediate verification, and preservation of evidence. In the first 24 hours, what is critical is to regain identity control (users, service accounts, roles and trust), keep logging, and avoid global changes that break the business without blocking the attacker.

If your team only implements three things, let them be: (1) an operational per-provider checklist for containment and investigation, (2) validation mechanisms after each change (logs and controlled tests), and (3) secret rotation by priority and with tests, not “blindly”. That is what separates a mature response from a noisy response that worsens the impact in production.
