
Poisoning Web-Scale Training Datasets is Practical (2024, arXiv)

Published May 6, 2024 on arXiv.

View on arXiv

5 stars (1 review)

Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content -- such as Wikipedia -- where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we notified the maintainers of each affected dataset and recommended several low-overhead defenses.
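To make the frontrunning attack concrete, here is a minimal sketch (not from the paper) of the timing argument: a malicious edit only needs to land closer to the snapshot cut-off than the typical moderator revert latency. The snapshot schedule and the 30-minute revert figure below are illustrative assumptions, not measurements from the paper.

```python
from datetime import datetime, timedelta

# Illustrative assumption: the snapshot is cut at a known, published time.
NEXT_SNAPSHOT = datetime(2024, 6, 1, 0, 0)

# Illustrative assumption: a vandalised page is reverted after a median
# latency of ~30 minutes; edits made closer to the snapshot than this are
# likely to be captured in the dump before any revert lands.
MEDIAN_REVERT_LATENCY = timedelta(minutes=30)


def edit_is_likely_captured(edit_time: datetime) -> bool:
    """Return True if a malicious edit made at `edit_time` would plausibly
    still be live when the snapshot is cut (frontrunning the moderators)."""
    window = NEXT_SNAPSHOT - edit_time
    return timedelta(0) <= window < MEDIAN_REVERT_LATENCY


if __name__ == "__main__":
    for minutes_before in (5, 25, 120):
        t = NEXT_SNAPSHOT - timedelta(minutes=minutes_before)
        print(f"edit {minutes_before:>3} min before snapshot -> "
              f"captured: {edit_is_likely_captured(t)}")
```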

1 edition

Empirical evidence of LLM attacker economics

5 stars

With the race to collect and train on ever more data (and to re-train on the latest data more quickly), the ability of LLM creators to perform even cursory checks for training-set corruption is almost nil. This paper shows two ways an attacker can corrupt 0.01-1% of an LLM training dataset for a reasonable sum. Prior work has shown that, for a specific desired error state, poisoning just 0.01% of the training data can yield a 60-90% chance of successfully tampering with model behavior.
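As a rough back-of-the-envelope check on those attacker economics (a sketch using only the figures quoted in the abstract above; the variable names are mine):

```python
# Back-of-the-envelope attacker economics, using figures from the abstract:
# poisoning 0.01% of LAION-400M for roughly $60 USD.
DATASET_SIZE = 400_000_000      # LAION-400M image/text pairs
POISON_FRACTION = 0.0001        # 0.01% of the dataset
ATTACK_COST_USD = 60            # cost quoted in the abstract

poisoned_examples = int(DATASET_SIZE * POISON_FRACTION)
cost_per_example = ATTACK_COST_USD / poisoned_examples

print(f"poisoned examples: {poisoned_examples:,}")             # 40,000
print(f"cost per poisoned example: ${cost_per_example:.4f}")   # $0.0015
```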

There are two core primitives presented in this paper:

1. The corpora release a metadata archive of URLs, and the content at those URLs is fetched later. Enough of the domains in that metadata have since expired that an attacker can re-register them and thereby corrupt a percentage of the URLs being scraped (a rough way to measure this exposure is sketched after this review).
2. Wikipedia is converted into a timestamped dump (e.g., a ZIM file) in a predictable order and on a predictable schedule. By changing …
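A minimal sketch of how one might measure exposure to the first primitive, as referenced above: walk the dataset's URL metadata and count how many URLs sit on domains that no longer resolve and could therefore potentially be re-registered by an attacker. The one-URL-per-line metadata format and the DNS-only liveness check are simplifying assumptions (real releases such as LAION-400M ship parquet metadata, and a non-resolving domain is not always actually available to purchase).

```python
import socket
from urllib.parse import urlparse

# Assumed (hypothetical) metadata file: one image URL per line.
METADATA_FILE = "dataset_urls.txt"


def domain_resolves(domain: str) -> bool:
    """Best-effort check: a domain that no longer resolves may have expired
    and could be re-registered to serve attacker-controlled content."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False


def estimate_poisonable_fraction(path: str) -> float:
    """Fraction of dataset URLs whose domains no longer resolve."""
    total = 0
    dead = 0
    seen: dict[str, bool] = {}  # cache one DNS lookup per domain
    with open(path) as fh:
        for line in fh:
            url = line.strip()
            if not url:
                continue
            total += 1
            domain = urlparse(url).netloc
            if domain not in seen:
                seen[domain] = domain_resolves(domain)
            if not seen[domain]:
                dead += 1
    return dead / total if total else 0.0


if __name__ == "__main__":
    frac = estimate_poisonable_fraction(METADATA_FILE)
    print(f"URLs on non-resolving (potentially expired) domains: {frac:.4%}")
```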