The Real Danger of “Public Buckets” in Data Analytics and Machine Learning

In Data Analytics and Machine Learning teams, storage in buckets (S3, GCS, Azure Blob) becomes the data “bus”: that’s where extracts, intermediate datasets, features, training artifacts, and results land. The pressure to iterate fast often pushes toward shortcuts: sharing a bucket “just for the team,” opening it “for a while” for a vendor, or assuming that an analysis environment is isolated because it lives in a separate account.

The problem is that a public bucket is not a theoretical flaw. It is often a blind spot where gigabytes of sensitive information (PII, customer data, logs with tokens) accumulate and, in the worst case, a channel through which third parties can write data that influences your models or your costs. The key nuance is to distinguish public read (exfiltration) from public write (manipulation and abuse), because the impact and operational urgency are not the same.

What went wrong in practice: “it was a lab bucket”

The pattern repeats: a team creates a bucket to share datasets with analysts and managed notebooks. To avoid friction with credentials and fine-grained permissions, someone enables “temporary” public access during an import or so that a third party can upload files. The sprint continues, the bucket keeps its permissive ACLs or policies and, weeks later, it contains real data because the pipeline reuses it for convenience.

In an enterprise, the damage rarely shows up as “our bucket was hacked.” It shows up as DLP alerts for PII exposure, as audit findings (“customer data accessible from the Internet”), or as integrity incidents: datasets that change without traceability, features that stop matching up, and models that start degrading for reasons nobody understands.

When the bucket is public-read, the immediate risk is silent extraction: a crawler or an opportunistic actor can list/download if the configuration allows it. When it is also public-write, the scenario becomes more dangerous: anyone could upload objects that your pipeline consumes automatically, causing data contamination or triggering unexpected costs.

Public read vs. public write: why write is another level

Public read turns the bucket into an exposed repository. In analytics/ML, the impact is not only reputational: it can include datasets with PII, training snapshots with derived data, or exports with identifiers that seemed “anonymized” but are re-identifiable. Also, many organizations underestimate what’s in “outputs”: predictions, explanations, inference logs, and samples for debugging often contain sensitive information.

Public write enables integrity attacks and operational abuse. A realistic example: a training pipeline takes “the latest dataset” from a fixed path (e.g., s3://ml-lab/datasets/current/). If a third party can write there, they can introduce a tampered dataset or simply a massive volume of data that drives up processing time and costs. If automated processes also decompress or transform objects without validation, this opens the door to logical denial of service (for example, an archive that expands to an enormous size) and to failures in parsing libraries triggered by malformed input.
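One way to narrow this risk is for the pipeline to refuse any object it cannot match against a manifest kept somewhere the public cannot write. The following is a minimal sketch under stated assumptions: the manifest format (object key mapped to an expected SHA-256 digest), the function name `verify_object`, and the size cap are all hypothetical and not part of any library.

```python
import hashlib

# Hypothetical size cap; tune it to what the pipeline legitimately expects.
MAX_BYTES = 5 * 1024**3


def verify_object(key: str, data: bytes, manifest: dict[str, str],
                  max_bytes: int = MAX_BYTES) -> bool:
    """Accept an object only if it is listed in the manifest, fits the
    size cap, and its SHA-256 digest matches the expected value."""
    expected = manifest.get(key)
    if expected is None:
        # Unknown object: someone may have dropped it into the path.
        return False
    if len(data) > max_bytes:
        # Oversized object: possible cost or resource abuse.
        return False
    return hashlib.sha256(data).hexdigest() == expected
```

The manifest only helps if it lives in a location with stricter write permissions than the dataset path itself.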

Public read: PII leakage, exposure of customer/financial data, and loss of competitive advantage through disclosure of datasets, features, or artifacts.
Public write: dataset manipulation, model contamination, data “poisoning,” and cost abuse by forcing compute/storage and unnecessary retrainings.

In corporate environments, public write is more likely to go unnoticed because the bucket “works”: jobs keep running. The symptom arrives late, when the ML team detects strange drift or when FinOps asks about a cost increase in storage and processing jobs.

Where blind spots hide in Analytics and ML platforms

Exposed buckets rarely start out “maliciously public.” They are created in moments of urgency: dataset migrations, PoCs that become production, or integrations with BI/ETL tools that recommend broad permissions so “it works.” The typical blind spot is assuming that a “data sandbox” account is equivalent to network isolation; if the bucket allows public access, the Internet does not respect your internal boundaries.

There are repeated operational signals: buckets with obvious names (data-lake-dev, ml-artifacts), presence of dumps with dates, large .parquet/.csv files, and “shared/” or “temp/” paths. In ML it is very common to find experiment artifacts: model.pkl, metrics.json, feature_store_export, which seem harmless but can reveal data structure, sensitive variables, or even endpoints and credentials if they were logged poorly.
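The signals above can be turned into a rough triage heuristic over an object listing. This is only a sketch: the prefix list, file names, date pattern, and size threshold are illustrative assumptions taken from the examples in this article, and `risk_signals` is a hypothetical helper, not an AWS API.

```python
import re
from pathlib import PurePosixPath

DATA_EXTENSIONS = {".parquet", ".csv"}
ARTIFACT_NAMES = {"model.pkl", "metrics.json"}
SHARED_PREFIXES = {"shared", "temp", "tmp", "staging", "external"}
DATE_PATTERN = re.compile(r"\d{4}-?\d{2}-?\d{2}")  # e.g. 2024-01-15 or 20240115
LARGE_FILE_BYTES = 100 * 1024**2  # illustrative threshold (100 MiB)


def risk_signals(key: str, size_bytes: int) -> list[str]:
    """Return the heuristic signals that an object key and size trigger."""
    path = PurePosixPath(key)
    signals = []
    # Any parent "directory" that looks shared or temporary.
    if any(part.lower() in SHARED_PREFIXES for part in path.parts[:-1]):
        signals.append("shared-or-temp-path")
    if path.suffix.lower() in DATA_EXTENSIONS and size_bytes >= LARGE_FILE_BYTES:
        signals.append("large-data-file")
    if path.name.lower() in ARTIFACT_NAMES:
        signals.append("ml-artifact")
    if DATE_PATTERN.search(path.name):
        signals.append("dated-dump")
    return signals
```

A hit is a reason to look at the object, not proof that it contains sensitive data.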

“Temporary” paths that become permanent: /tmp, /staging, /external. They are usually the first to be opened in a hurry and the last to be reviewed.
Third-party integrations: vendors that ask for “public read” to download or “public write” to upload. In practice, this ends up replacing a correct mechanism (roles, federated identities, signed URLs) with an open door.

Both cases share a characteristic: nobody “owns” the bucket operationally. It is in no man’s land between Data, Platform, and Security, and that’s why there are no periodic reviews or exposure tests from outside.

How to do it in practice: detect and validate exposure (with examples in AWS)

To operate this in an enterprise, the first thing is to distinguish “permissive configuration” from “effective exposure.” In AWS, a bucket can have a policy that allows Principal: "*", but be mitigated by Block Public Access. Or the other way around: it can look closed at the IAM level, but an old ACL or a poorly written policy makes it public. Validation must be systematic and repeatable.
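The distinction between permissive configuration and effective exposure can be sketched as a small model. It is deliberately simplified and makes several assumptions: the inputs (`policy_is_public`, `acl_is_public`) are already computed elsewhere, access points and cross-account grants are ignored, and the class and function names are hypothetical. It relies on two documented behaviors of Block Public Access: `RestrictPublicBuckets` limits access to buckets with public policies, and `IgnorePublicAcls` makes S3 ignore public ACLs, while the two "Block" flags only reject new public configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockPublicAccess:
    block_public_acls: bool = False
    ignore_public_acls: bool = False
    block_public_policy: bool = False
    restrict_public_buckets: bool = False


def combine(account: BlockPublicAccess, bucket: BlockPublicAccess) -> BlockPublicAccess:
    """S3 applies the most restrictive combination of account and bucket settings."""
    return BlockPublicAccess(
        block_public_acls=account.block_public_acls or bucket.block_public_acls,
        ignore_public_acls=account.ignore_public_acls or bucket.ignore_public_acls,
        block_public_policy=account.block_public_policy or bucket.block_public_policy,
        restrict_public_buckets=(account.restrict_public_buckets
                                 or bucket.restrict_public_buckets),
    )


def effective_exposure(policy_is_public: bool, acl_is_public: bool,
                       bpa: BlockPublicAccess) -> dict[str, bool]:
    """Separate 'looks permissive' from 'is reachable from the Internet'."""
    # An existing public policy is only neutralized by RestrictPublicBuckets.
    via_policy = policy_is_public and not bpa.restrict_public_buckets
    # An existing public ACL is only neutralized by IgnorePublicAcls.
    via_acl = acl_is_public and not bpa.ignore_public_acls
    return {
        "permissive_config": policy_is_public or acl_is_public,
        "effectively_public": via_policy or via_acl,
    }
```

A bucket that is permissive but not effectively public is still worth fixing, because it becomes exposed the moment someone relaxes Block Public Access.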

Concrete actions: create an inventory of buckets and automate exposure checks. In AWS, rely on S3 Block Public Access, IAM Access Analyzer, and AWS Config (managed rules for public S3 buckets, such as s3-bucket-public-read-prohibited and s3-bucket-public-write-prohibited). At the daily operations level, require that each bucket carries owner, environment, and data classification tags, because without them the finding stays at “public bucket” with no context for prioritization.
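The tagging requirement is easy to check mechanically. A minimal sketch, assuming the tags arrive in the `[{"Key": ..., "Value": ...}]` shape that S3 uses for a bucket's tag set; the tag key names and the helper `missing_tags` are hypothetical conventions, not an AWS standard.

```python
# Hypothetical tag keys; use whatever naming your organization has agreed on.
REQUIRED_TAGS = ("owner", "environment", "data-classification")


def missing_tags(tag_set: list[dict[str, str]],
                 required: tuple[str, ...] = REQUIRED_TAGS) -> list[str]:
    """Return the required tag keys that are absent or have an empty value.

    Keys are compared case-insensitively, so "Owner" satisfies "owner".
    """
    present = {
        tag["Key"].lower()
        for tag in tag_set
        if tag.get("Value", "").strip()
    }
    return [name for name in required if name not in present]
```

A bucket with a non-empty result can be routed to a default owner instead of being left unprioritized.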

Review Block Public Access: validate at the account level and at the bucket level that all four flags are enabled, especially “BlockPublicPolicy” and “RestrictPublicBuckets.”
Look for policies with public principal: identify statements with "Principal": "*" or lax conditions. A typical public-read example is allowing s3:GetObject to any principal.
Detect public write: prioritize any permission to s3:PutObject (and also s3:DeleteObject) with a public principal; it tends to be less common but much more damaging.
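The policy checks above can be sketched as a small scanner over a bucket policy document. This is a rough illustration, not a replacement for Access Analyzer: it assumes the policy is available as a JSON string, does not handle `NotPrincipal` or `NotAction`, only flags (rather than evaluates) conditions, and approximates IAM wildcards with `fnmatch`. The function and constant names are hypothetical.

```python
import json
from fnmatch import fnmatchcase

READ_ACTIONS = {"s3:getobject", "s3:listbucket"}
WRITE_ACTIONS = {"s3:putobject", "s3:deleteobject"}


def _as_list(value):
    """Policy fields may be a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _is_public_principal(principal) -> bool:
    if principal == "*":
        return True
    if isinstance(principal, dict):
        return "*" in _as_list(principal.get("AWS"))
    return False


def _grants_any(patterns: list[str], targets: set[str]) -> bool:
    # Action patterns may contain wildcards such as "s3:*" or "s3:Put*".
    return any(fnmatchcase(target, pattern)
               for pattern in patterns for target in targets)


def classify_public_statements(policy_json: str) -> list[dict]:
    """List Allow statements with a public principal, split by read/write."""
    policy = json.loads(policy_json)
    findings = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        if not _is_public_principal(stmt.get("Principal")):
            continue
        actions = [a.lower() for a in _as_list(stmt.get("Action"))]
        findings.append({
            "sid": stmt.get("Sid"),
            "public_read": _grants_any(actions, READ_ACTIONS),
            "public_write": _grants_any(actions, WRITE_ACTIONS),
            # A condition may narrow access; review these manually.
            "has_condition": "Condition" in stmt,
        })
    return findings
```

Findings with `public_write` set deserve the faster response, in line with the prioritization argued in this article.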

Example policy (AWS S3) that enables public read of objects, something that appears in labs when a dataset is shared “to download”:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadObjects",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::mi-bucket-analitica/*"
    }
  ]
}

Example of a dangerous policy for allowing public write (abuse and integrity scenario). Even if it seems “just so they can upload files,” in automated pipelines this is a direct door to contamination:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicWrite",
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:PutObject", "s3:AbortMultipartUpload"],
      "Resource": "arn:aws:s3:::mi-bucket-analitica/inbox/*"
    }
  ]
}

Validating that it was properly closed: in addition to seeing “green” in the console, verify with Access Analyzer that there are no remaining public access findings and review the bucket’s effective “public access” status. In internal audits, the weak point is usually that the policy gets closed while historical bucket ACLs, or individual objects that were uploaded with a public ACL, are forgotten.
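Leftover ACL grants can be checked with a small helper. The sketch below assumes the ACL is available as a dictionary in the `{"Grants": [{"Grantee": ..., "Permission": ...}]}` shape that S3 returns for bucket and object ACLs; the two group URIs are the predefined S3 groups for everyone and for any authenticated AWS user. The function name `public_grants` is hypothetical.

```python
# Predefined S3 groups that make a grant public or near-public.
PUBLIC_GROUPS = {
    "http://acs.amazonaws.com/groups/global/AllUsers": "AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers": "AuthenticatedUsers",
}


def public_grants(acl: dict) -> list[tuple[str, str]]:
    """Return (group, permission) pairs for grants made to public groups."""
    findings = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        uri = grantee.get("URI")
        if grantee.get("Type") == "Group" and uri in PUBLIC_GROUPS:
            findings.append((PUBLIC_GROUPS[uri], grant.get("Permission", "")))
    return findings
```

Running it over both the bucket ACL and a sample of object ACLs covers the case where the policy is closed but older grants remain.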

Recommendations for corporate environments

“Public buckets” in analytics and ML are usually born from speed and lack of ownership, not from technical ignorance. The real risk is not only data leakage: when there is public write, the impact shifts to data and model integrity, with costs and degradation that are hard to attribute.

Operationally, what works in an enterprise is to treat exposure as a continuous control: inventory, detect configuration and effective exposure, and close public access with appropriate sharing mechanisms. If your platform allows a “lab” bucket to end up feeding pipelines, that bucket must meet the same standard as production.

The difference between public read and public write must be reflected in prioritization and response. Closing public read reduces exfiltration; closing public write protects the data/ML chain from manipulation and operational abuse that later manifests as drift, quality incidents, or cost blowups.
