May 12, 2026 · 11 min read
Four Canadian Privacy Regulators Just Ruled OpenAI Violated Five Privacy Laws to Train ChatGPT—and the "Publicly Available Information" Defense Failed in All Four Jurisdictions
A three-year joint investigation by the federal Privacy Commissioner and three provincial counterparts concluded on May 6 that OpenAI's scraping of "trillions of words" to train GPT-3.5 and GPT-4 was overbroad and lacked valid consent. The decision sets the most detailed precedent yet on what AI training can and cannot take from the public web.
On May 6, 2026, the Office of the Privacy Commissioner of Canada published PIPEDA Findings #2026-002 — a joint decision with Quebec's Commission d'accès à l'information, British Columbia's OIPC, and Alberta's OIPC. After almost exactly three years of investigation, all four regulators agreed: OpenAI violated five separate privacy statutes when it scraped the public web to train ChatGPT.
The specific contraventions are listed in the finding. PIPEDA subsection 5(3) at the federal level. Sections 2, 11, 14, and 17 of British Columbia's Personal Information Protection Act. Sections 11, 16, and 19 of Alberta's PIPA. Section 5 of Quebec's Act respecting the protection of personal information in the private sector. Five laws across four jurisdictions, with the same conclusion: the way ChatGPT was built crossed legal lines that have been on the books for decades.
The line most worth remembering from the decision is the regulators' direct rejection of the most common defense AI companies have offered for indiscriminate scraping: "The fact that personal information is accessible does not represent a carte blanche to collect and use it without limits." That is the new floor in Canada, and the AI industry has been operating far below it.
What the Regulators Found in the Training Pipeline
The investigation focused on the data OpenAI used to train GPT-3.5 and GPT-4, the models that powered ChatGPT from its public launch in late 2022 through most of 2024. The regulators examined OpenAI's documented filtering practices and concluded that, measured against the scale of the scraping, they were almost nonexistent.
According to the joint finding, OpenAI applied only a narrow set of exclusions to its training corpus:
- Pirated content sites
- Adult content sites
- For GPT-4 only, sites that aggregate personal data into searchable directories
That is the entire filter list the regulators identified. Not excluded: social media platforms, discussion forums, children's websites, websites for vulnerable populations, sites containing detailed medical conversations, religious confessional content, or political organizing content. The trillions of words OpenAI scraped therefore "necessarily included" sensitive personal information across every category the law treats as requiring express consent.
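To see how coarse a filter of that shape is, consider a minimal sketch of a domain-level blocklist, the kind of exclusion mechanism the finding describes. The domain names below are hypothetical placeholders, not taken from the decision; the point is that a document is dropped only if its source domain is on the list, so a medical confession posted to a general forum passes through untouched.

```python
# A minimal sketch of a domain-level exclusion filter: documents are
# dropped only if their source domain appears on a small blocklist.
# Everything else, including forums, social media, medical discussions,
# and children's sites, passes straight through.
# Domain names are hypothetical placeholders, not from the decision.
from urllib.parse import urlparse

BLOCKED_DOMAINS = {
    "piratedbooks.example",   # pirated-content sites
    "adultsite.example",      # adult-content sites
    "peoplefinder.example",   # personal-data aggregators (GPT-4 era only)
}

def keep_document(url: str) -> bool:
    """Return True if a scraped document survives the exclusion filter."""
    domain = urlparse(url).netloc.lower()
    return domain not in BLOCKED_DOMAINS

docs = [
    "https://forum.example/thread/my-cancer-diagnosis",  # sensitive, kept
    "https://peoplefinder.example/profiles/jane-doe",    # aggregator, dropped
]
kept = [d for d in docs if keep_document(d)]
print(kept)  # the medical forum thread passes the filter untouched
```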
The regulators were particularly direct about the lack of identifying information masking. "Given the absence of specific mitigation measures aimed at detecting and masking private identifying information in the GPT-3.5 and 4 pretraining datasets," the decision reads, "we find that these datasets necessarily included sensitive information." That sentence — that finding of fact — is what the rest of the legal analysis hangs on.
Why Implied Consent Did Not Work
The most legally interesting part of the decision is the consent analysis. OpenAI argued that posting information publicly online constituted implied consent to its later use. The regulators rejected that on four separate grounds.
- Sensitivity overrides implication. Canadian privacy law requires express consent for sensitive categories — medical, financial, religious, political, and information about children. Implied consent never applies. The scraped data contained all five categories. The legal analysis essentially ends there.
- Reasonable expectations. Most of the underlying data was posted between 2010 and 2022, when generative AI training was not a publicly known concept. The regulators held that "individuals would not reasonably have expected" their posts to be used for that purpose, because the purpose did not exist in popular consciousness when they posted.
- No meaningful choice. Implied consent under Canadian law requires the individual to have had a reasonable opportunity to decline. The web at scale offers no such opportunity. The regulators called this a "lack of meaningful choice about participation."
- Third-party posting. Much of the personal information in the training set was posted by someone other than the subject: a friend's photograph caption, a forum thread mentioning a coworker, a tagged comment. The regulators found it unreasonable to assume "all information made publicly accessible without restrictions was provided with the knowledge and consent of the individual."
The "publicly available information" exemption in each statute also did not apply. The decision walks through each provincial regulation in detail: social media profiles, professional pages, and most online content are not enumerated as eligible public sources. The Alberta analysis is especially sharp — PIPA-AB requires a reasonable belief that individuals themselves provided the information, and scraped social media data plainly does not satisfy that test.
The Data Subject Rights Gap
A separate section of the decision addressed how OpenAI handles requests to access, correct, or delete personal information that may be in its training data. The regulators found the existing process inadequate.
OpenAI's removal mechanism required individuals to provide "verified personal information" sufficient for OpenAI to "directly and uniquely associate the requester to the information in question." In practice, this is almost impossible to satisfy. The training data is not indexed by individual. The model does not retain documents in retrievable form. An individual asking for their data to be removed has no way to specify which entries among the trillions of words concern them, and OpenAI has no way to look them up. The result is a rights mechanism that exists on paper but cannot be exercised in practice.
The decision identifies this as a structural problem rather than a procedural one. It is the same gap that has plagued GDPR enforcement against large language models in Europe — the right to erasure presumes the data can be located, and current LLM training pipelines treat data as fungible weights rather than retrievable records. Canada's regulators have now formally classified that gap as a privacy law violation.
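The structural nature of the gap is easier to see side by side. The sketch below, with illustrative names throughout, contrasts erasure in a conventional keyed record store with erasure against a trained model, where documents lose their boundaries at tokenization and survive only as diffuse contributions to shared weights.

```python
# A sketch of why the erasure mechanism fails structurally. In a record
# store, deletion is a keyed lookup; in a trained model, the same text
# survives only as a diffuse contribution to shared weights, so there is
# no per-person handle to delete against. All names are illustrative.

# Conventional system: erasure is trivially expressible.
records = {"jane@example.com": {"posts": ["..."]}}

def erase(subject_id: str) -> None:
    records.pop(subject_id, None)  # keyed lookup, then delete

# LLM training pipeline: the subject disappears as an addressable unit
# the moment documents are tokenized, shuffled, and folded into weights.
corpus = ["a forum post mentioning Jane", "an unrelated news article"]
tokens = [tok for doc in corpus for tok in doc.split()]  # no doc boundaries
# weights = train(tokens)  -> billions of floats, none attributable to Jane

def erase_from_model(subject_id: str) -> None:
    raise NotImplementedError(
        "no mapping exists from a person to the weights their data shaped"
    )
```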
What OpenAI Has Agreed to Change
In response to the preliminary findings, OpenAI accepted a series of corrective commitments. The decision describes them in detail, which is unusual — Canadian privacy decisions typically summarize remediation rather than document it.
- New filtering tool. OpenAI built a detection model for private individuals that runs before training. The decision reports 98 to 99 percent recall on the PII Masking benchmark with a 3 to 6 percent false positive rate. The tool distinguishes private individuals from public figures and fictional characters, allowing public business information to pass through. (The arithmetic behind those two metrics is sketched after this list.)
- Model deprecation. GPT-3.5 and GPT-4 have been retired. The new mitigation only applies to models that came after. The data those earlier models were trained on still exists in the weights but is no longer being served.
- Transparency to Canadian users. OpenAI committed to a Canadian privacy blog post, to media promotion of its practices, and to notifying users that interactions may be reviewed for training. Users will also be advised against sharing sensitive information.
- Data partnership reforms. OpenAI says it now avoids datasets with "sensitive or personal information, or information that belongs to a third party," and requires contractual assurances from licensors that data was obtained lawfully.
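To make the reported filter figures concrete, here is a minimal sketch of the evaluation arithmetic behind them: recall is the share of true PII spans a filter catches, and the false positive rate is the share of clean spans it wrongly flags. The counts below are hypothetical numbers chosen to fall inside the decision's reported ranges; the detector and benchmark themselves are not public.

```python
# Recall and false positive rate as reported in the decision: recall is
# the share of true PII spans caught; false positive rate is the share
# of non-PII spans wrongly flagged. Counts below are hypothetical values
# consistent with the reported 98-99% recall and 3-6% FPR ranges.

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def false_positive_rate(fp: int, tn: int) -> float:
    return fp / (fp + tn)

# The filter catches 985 of 1,000 true PII spans (98.5% recall) and
# wrongly flags 45 of 1,000 clean spans (4.5% false positive rate).
print(f"recall: {recall(985, 15):.1%}")                             # 98.5%
print(f"false positive rate: {false_positive_rate(45, 955):.1%}")   # 4.5%
```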
The regulators classified the issue of appropriate purpose as "well-founded and conditionally resolved" — meaning the original violation stands as a finding but the remediation, if maintained, satisfies their concerns going forward. The consent issue was classified as "well-founded" without conditional resolution. That is a stronger conclusion. The consent failure is on the record permanently.
Why This Decision Matters Outside Canada
The Canadian decision is not binding outside Canada, but the analytical framework will travel. Three things make this finding more significant than the typical national regulator action.
- It is the most detailed AI training privacy decision globally. The Italian Garante's 2023 ChatGPT order ran a few pages. The Canadian joint finding is a substantial document with line-by-line analysis of training pipeline practices. Other regulators will use it as a template.
- It treats "publicly available" as a non-defense. The U.S. legal debate has tended to treat public web data as fair game absent specific statutory restrictions. Canada has now articulated a working theory of why public posting does not authorize repurposing, and the theory does not depend on novel statutes, just on standard consent doctrine.
- It is multi-jurisdictional within one country. Federal, Quebec, BC, and Alberta regulators reached identical conclusions on identical facts. That makes it much harder for OpenAI or other AI companies to argue the result is an outlier.
The finding mirrors a pattern in adjacent enforcement areas. California's $12.75 million GM settlement turned on a data minimization theory: that retaining and reselling information beyond operational necessity is itself a violation. The Canadian ChatGPT finding turns on the same theory applied to model training. The regulators agreed that pulling in more than the training task required, without filtering or consent, made the entire training pipeline unlawful.
Why Email Users Should Pay Attention Anyway
The ChatGPT decision is about web scraping, not email, but the legal logic applies directly to a category of email infrastructure that has been quietly absorbing user data for a similar reason: the assumption that anything in an inbox is fair game for training.
Several major email platforms have begun running on-device or in-cloud AI features that read inbox content to generate summaries, suggested replies, calendar entries, and priority categorization. Gmail's Gemini integration is the most visible example, and Google's contractual stance is that personal Gmail content is not used to train external models. But that carve-out does not extend to behavioral signals: how often you open promotional messages, which sender domains you reply to most, or the categorical fingerprint of your relationship graph. A federal class action filed earlier this year challenges that distinction.
The Canadian framework — sensitive data requires express consent, "publicly available" is not a free pass, and reasonable expectations are evaluated at the time the data was originally generated — maps directly onto the inbox case. Most inbox training programs were designed with the assumption that aggregated, anonymized behavioral signals fall outside the consent regime. If Canada's analysis spreads, that assumption stops working.
There is also a second-order effect for the tracking-pixel ecosystem. Email marketing platforms increasingly use AI to optimize send time, subject lines, and content based on aggregated open and click behavior. The training data for those models is the same open and click telemetry that France's CNIL just gave marketers 90 days to stop collecting without consent. If both regulators are right, the AI optimization pipeline is downstream of a consent gap that is closing fast.
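For concreteness, a send-time optimizer of the kind described here can be as simple as the sketch below: aggregate each recipient's pixel-derived open events into an hour-of-day preference, then schedule the next campaign accordingly. The data and field names are illustrative; the point is that the model is purely downstream of the telemetry whose collection CNIL's order restricts.

```python
# A sketch of a simple send-time optimization model: per-recipient open
# telemetry is aggregated into an hour-of-day preference, then used to
# schedule the next send. Data and names are illustrative.
from collections import Counter

# Each event is (recipient, hour_opened): exactly the pixel-derived
# telemetry at issue in the CNIL order.
open_events = [("u1", 8), ("u1", 9), ("u1", 8), ("u2", 20), ("u2", 21)]

def best_send_hour(recipient: str, events) -> int:
    """Pick the hour at which this recipient historically opens most."""
    hours = Counter(h for r, h in events if r == recipient)
    return hours.most_common(1)[0][0]

print(best_send_hour("u1", open_events))  # 8: schedule morning sends
# Remove the consent-gated telemetry and the model has nothing to train on.
```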
What Privacy Researchers Should Read Next
Three artifacts from this decision deserve attention from anyone working in AI policy or privacy law.
- The four-part implied consent analysis in paragraphs 80 through 110 of the finding. It is the cleanest articulation any regulator has produced of why public posting does not constitute training consent.
- The data subject rights gap discussion. The decision essentially holds that an LLM that cannot honor erasure requests is not compliant on its face, no matter what the front-end process looks like. This is the foundation of an argument that will be central to GDPR enforcement going forward.
- The remediation specificity. The 98 to 99 percent recall PII filter, the contractual data partnership reforms, and the model deprecation timeline together form a roadmap that other regulators can demand from other AI companies. The decision functions as a de facto compliance benchmark.
The commissioners closed with a sentence worth quoting in full: "There is generally no obvious connection between an individual's posting of personal information online for a specific purpose...and the subsequent scraping and use of that personal information to train AI models." That sentence is the policy lever. Every AI training case from here forward will turn on whether courts and regulators in other jurisdictions adopt the same starting position.
Sources
- PIPEDA Findings #2026-002: Joint Investigation of OpenAI OpCo, LLC — Office of the Privacy Commissioner of Canada
- News release: Joint investigation by Canadian privacy regulators into OpenAI's ChatGPT — OPC
- Probe finds ChatGPT's model training violated Canada's federal, provincial privacy laws — IAPP
- OpenAI violated Canadian privacy laws in developing first ChatGPT model — Globe and Mail