May 27, 2026 · 6 min read
A Carnegie Mellon CyLab Study Showed That Google's Topics API and Other Privacy-Preserving Interest Buckets Still Re-Identify Individuals—34% Accuracy on Web Browsing and Over 95% on Music Listening—Because a Transformer Reading the Sequence of Your Topics Over Time Is All It Takes to Pick You Out
On April 28, 2026, Carnegie Mellon's CyLab published findings from a paper that attacks the central premise of the post-cookie advertising stack: that sorting users into broad interest topics, instead of tagging them with individual identifiers, protects their identity. Using a transformer trained to read the sequence of topics a person is assigned over time, the team re-identified users with roughly 34 percent accuracy on web browsing data and more than 95 percent accuracy on music listening behavior, far beyond prior attacks.
Key Takeaways
- Carnegie Mellon's CyLab disclosed on April 28, 2026 that a transformer-based machine learning model re-identified individuals from topic-based interest profiles with roughly 34 percent accuracy on web browsing data and over 95 percent accuracy on music listening behavior.
- The target is the design behind Google's Topics API and the broader Privacy Sandbox, the system meant to replace third-party cookies by grouping people into broad interest categories such as cooking, sports, or news instead of assigning persistent individual identifiers.
- The attack works by reading the sequence of topics a person is assigned across multiple weeks, not a single snapshot. Repeated behavior over time forms a distinctive timeline that the model matches back to a person.
- Lead author Saranya Vijayakumar, with Norman Sadeh and Matt Fredrikson, argues the clustering approach is a weak privacy idea because it carries no formal guarantee, unlike differential privacy, and collapses under temporal composition. The paper won the Best Student Paper Award at ICISSP 2026.
- Google retired the Topics API and most Privacy Sandbox APIs in late 2025, but the finding outlives the product: any scheme that swaps identifiers for behavioral buckets inherits the same flaw.
What Is Google's Topics API?
The Topics API was Google's proposed replacement for the third-party cookie, the central piece of a larger Chrome effort called the Privacy Sandbox. Rather than letting advertisers follow a unique identifier across every site you visit, the browser itself would observe your browsing locally, sort it into a fixed list of broad interest categories, and hand a small number of those topics to the sites you visit. An advertiser would learn that you are interested in, say, fitness equipment and travel this week, without ever receiving a stable per-user ID.
The pitch was that topics are coarse and rotating. Each category is shared by a large population of users, only a handful of topics are exposed per period, and the set refreshes on a weekly cadence. On the surface that looks like anonymity by crowd: if millions of people are tagged "cooking," knowing you are tagged "cooking" should not single you out. That intuition is exactly what the Carnegie Mellon work set out to test, and it is the assumption that distinguishes this story from raw browser fingerprinting, where the attacker reconstructs the actual sites you loaded rather than the sanitized buckets Topics was supposed to expose.
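To make the bucketing idea concrete, here is a minimal sketch of how a browser might reduce a week of visits to a few coarse topics. Everything in it is an assumption for illustration: the TAXONOMY table, the weekly_topics function, the value of k, and the "most visited wins" rule are hypothetical and are not taken from Chrome's implementation, whose classifier and selection rules this article does not describe.

```python
from collections import Counter

# Hypothetical host-to-topic table (assumption, not Chrome's real taxonomy).
TAXONOMY = {
    "recipes.example": "cooking",
    "bakery.example": "cooking",
    "scores.example": "sports",
    "headlines.example": "news",
    "flights.example": "travel",
}


def weekly_topics(visited_hosts, taxonomy=TAXONOMY, k=2):
    """Return up to k of the most-visited topics for one week of browsing.

    Hosts that are missing from the taxonomy are ignored, so the site
    receiving the result never sees the raw hostnames, only topic labels.
    """
    counts = Counter(taxonomy[h] for h in visited_hosts if h in taxonomy)
    # Sort by visit count (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in ranked[:k]]


if __name__ == "__main__":
    week = ["recipes.example", "bakery.example", "scores.example", "unknown.example"]
    print(weekly_topics(week))
```

Any single output of this function is shared by everyone with similar habits that week. The study's question is what happens when the same user produces such an output week after week.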
What Did the CMU Study Find?
The study found that the coarse, rotating topic buckets are not anonymous once an attacker watches them accumulate over time. The CyLab team, led by Ph.D. candidate Saranya Vijayakumar with faculty Norman Sadeh and Matt Fredrikson, built a transformer-based machine learning framework, the same class of model behind modern language systems, and pointed it at the temporal sequence of topics assigned to each user across multiple time periods.
On web browsing data, the model re-identified individuals with close to 34 percent accuracy. On music listening behavior, where a person's taste is narrow and unusually consistent from week to week, it exceeded 95 percent. Both numbers substantially beat earlier re-identification attempts against topic-based systems. The work was presented at the 12th International Conference on Information Systems Security and Privacy (ICISSP 2026) in Marbella, Spain, where it received the conference's Best Student Paper Award.
The framing the authors stress is the important part. A single week of topics may look anonymous. Stack week after week, and the ordered list of how your interests shift becomes a behavioral signature, in the same family as the result that a person's browsing pattern across just four domains identifies them with 95 percent accuracy. Aggregation hides the detail of any single moment but preserves the shape of the whole, and the shape is what gives you away.
Why Does Grouping Users Into Topics Fail to Anonymize Them?
Grouping fails because membership in a crowd at one instant says nothing about how a sequence of memberships composes over time. Vijayakumar's blunt summary, as CyLab reported, is that "the clustering privacy mechanism itself is a weak idea, because it lacks formal guarantees." That sentence is doing real work. The Topics design reasons about privacy informally: topics are broad, so the population per bucket is large, so a bucket cannot identify you. But informal reasoning about one snapshot does not bound what an attacker can learn from many snapshots in a row.
The technical name for the gap is temporal composition. Each topic period leaks a little information about you. Privacy proponents assumed those leaks were independent and harmless. In reality they correlate, because real people return to the same interests, and the correlation is precisely what a sequence model is built to exploit. A transformer does not need to deanonymize you from one observation; it accumulates weak signals across a timeline until they converge on one person.
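A toy simulation shows the effect without any machine learning. The sketch below is not the paper's transformer and uses no real data: it assumes a hypothetical population in which each user has a few favorite topics and mostly reports one of them each week, then counts how many users have a topic sequence that nobody else shares. The function names and every parameter (num_users, num_topics, favorites, loyalty) are assumptions chosen for illustration.

```python
import random
from collections import Counter


def simulate_sequences(num_users=2000, num_topics=30, weeks=8,
                       favorites=3, loyalty=0.8, seed=0):
    """Return one weekly topic sequence (a tuple) per simulated user.

    Hypothetical behavior model: each user has `favorites` preferred topics
    and reports one of them with probability `loyalty`, otherwise a topic
    drawn uniformly at random.
    """
    rng = random.Random(seed)
    sequences = []
    for _ in range(num_users):
        favs = rng.sample(range(num_topics), favorites)
        seq = tuple(
            rng.choice(favs) if rng.random() < loyalty
            else rng.randrange(num_topics)
            for _ in range(weeks)
        )
        sequences.append(seq)
    return sequences


def unique_fraction(sequences, weeks):
    """Fraction of users whose first `weeks` topics match no other user."""
    prefixes = [seq[:weeks] for seq in sequences]
    counts = Counter(prefixes)
    return sum(1 for p in prefixes if counts[p] == 1) / len(prefixes)


if __name__ == "__main__":
    seqs = simulate_sequences()
    for w in (1, 2, 4, 8):
        print(f"weeks={w}: {unique_fraction(seqs, w):.1%} of users are unique")
```

The unique fraction can only grow as weeks are added, because a prefix that is already unique stays unique when extended. A real attacker faces noisier data than this model, which is where a learned sequence model earns its keep, but the direction of the effect is the same.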
Contrast this with differential privacy, which the authors hold up as the standard the design failed to meet. Differential privacy provides a mathematical bound on how much any output can reveal about any individual, and that bound degrades predictably and accountably as you make more queries. Topic bucketing offers no such bound. It markets the feeling of anonymity without the math, and the study is what happens when someone checks the math.
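For contrast, here is a minimal sketch of what an accountable bound can look like. It uses k-ary randomized response, a textbook local differential privacy mechanism, together with basic sequential composition, under which the epsilons of successive reports about the same user add up. The study does not propose this mechanism as a fix; it is swapped in here purely as an illustration, and the names randomized_response and PrivacyBudget are hypothetical.

```python
import math
import random


def randomized_response(true_topic, num_topics, epsilon, rng):
    """Report one topic index under epsilon-local differential privacy.

    The true topic is reported with probability e^eps / (e^eps + k - 1);
    otherwise one of the other k - 1 topics is reported uniformly.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + num_topics - 1)
    if rng.random() < p_truth:
        return true_topic
    # Draw from the k - 1 other topics by skipping over the true one.
    other = rng.randrange(num_topics - 1)
    return other if other < true_topic else other + 1


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record one more report, refusing it if the budget would be exceeded."""
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


if __name__ == "__main__":
    rng = random.Random(0)
    budget = PrivacyBudget(total_epsilon=2.0)
    for week in range(4):
        budget.charge(0.5)  # each weekly report costs epsilon = 0.5
        print(week, randomized_response(3, 30, 0.5, rng))
    print("epsilon spent:", budget.spent)
```

The point of the design is the ledger: every weekly report has a stated cost, and the system can say exactly when it must stop emitting. Plain topic bucketing has no equivalent of the charge call.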
What Does This Mean for the Death of Third-Party Cookies?
It means the marquee replacement for the third-party cookie was leaking identity through the very mechanism that was supposed to make it safe. There is a timing twist worth stating plainly. Google retired the Topics API and most of the Privacy Sandbox APIs, including Protected Audience and Attribution Reporting, in late 2025, citing low industry adoption and ongoing regulatory pressure, and the UK Competition and Markets Authority closed its oversight shortly after. So the specific product the study attacks is already gone from Chrome.
That does not retire the lesson. Mozilla, Apple, and the UK Information Commissioner's Office had already warned that Topics could let large ad networks re-identify users by aggregating interests across many sites; the CMU result is the empirical demonstration of the criticism those bodies raised on paper. The design pattern itself (replace the identifier with a behavioral bucket and call the bucket anonymous) is not unique to Google and is being proposed again under new names across the advertising industry. Every such proposal inherits the same temporal-composition weakness unless it is built on a formal privacy guarantee from the start.
The broader pattern is that behavioral surveillance keeps finding new surfaces, from browser extensions quietly selling user data to smart TVs fingerprinting what you watch with automatic content recognition. Topic bucketing was supposed to be the privacy-conscious exception. The study shows it was not.
What Can You Do About It?
For engineers and privacy researchers, the practical response is to stop treating aggregation and anonymization as synonyms and to demand formal guarantees before trusting a privacy claim.
- Assume sequences re-identify. If your system emits a per-user output every period, even a coarse one, model what an attacker learns from the ordered series, not from one observation. A clustered or bucketed output is not a safe output by default.
- Insist on a formal privacy mechanism. Prefer differential privacy or another method that gives a stated, composable bound on disclosure. Treat "the category is broad, so it is anonymous" as a marketing claim until it is backed by math.
- Read the paper, not the press release. CyLab's announcement summarizes the work, and the paper itself was peer-reviewed at ICISSP 2026. Evaluating any new cookie replacement against the temporal composition test it describes is now table stakes.
- As an end user, keep reducing the surface. The fewer sites that observe your behavior, the shorter and less distinctive your sequence becomes. Tracker blocking, strict cookie settings, and minimizing the browsers and extensions that watch you all shrink the timeline an attacker has to work with.
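For the first item in the list above, a quick self-audit can be sketched in a few lines. The harness below is a deliberately weak baseline, not the paper's transformer: it ignores ordering and simply links each "anonymous" window of outputs to the known user whose earlier topic histogram overlaps it most. The function names and the input shape (a dict from user ID to a list of emitted topics) are assumptions for illustration. If even this baseline beats chance on your system's outputs, a sequence model will likely do better.

```python
from collections import Counter


def overlap(a, b):
    """Histogram intersection: how much topic mass two windows share."""
    return sum(min(count, b[topic]) for topic, count in a.items())


def linkage_accuracy(known, anonymous):
    """Fraction of anonymous windows linked back to the correct user.

    known / anonymous: dict of user_id -> list of topics, taken from two
    disjoint observation windows for the same set of users.
    """
    known_hists = {uid: Counter(seq) for uid, seq in known.items()}
    correct = 0
    for uid, seq in anonymous.items():
        target = Counter(seq)
        # Guess the known user whose histogram overlaps the target the most.
        guess = max(known_hists, key=lambda k: overlap(target, known_hists[k]))
        correct += guess == uid
    return correct / len(anonymous)


if __name__ == "__main__":
    # Tiny hand-made example; topic IDs are arbitrary integers.
    known = {"a": [1, 1, 2, 1], "b": [3, 3, 4, 3], "c": [5, 6, 5, 5]}
    anonymous = {"a": [1, 2, 1, 1], "b": [3, 4, 4, 3], "c": [5, 5, 6, 6]}
    chance = 1 / len(known)
    print(f"linkage accuracy: {linkage_accuracy(known, anonymous):.0%} "
          f"(chance: {chance:.0%})")
```

Run it on two non-overlapping time windows of your own system's per-user outputs and compare the result against the chance rate of one over the number of users.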
The headline of the study is not that one Google API was broken. It is that the industry's preferred story about privacy, that you can keep profiling people as long as you do it in groups, does not survive contact with a model that reads behavior over time.