A database has leaked online that was fed into a large language model (LLM) to automatically flag content deemed “sensitive” by the Chinese government, TechCrunch reports.
China has developed an AI system to “enhance its already powerful censorship machine,” the publication writes. The dataset contains 133,000 examples, covering topics that extend far beyond traditional taboos such as the Tiananmen Square events. Among them:
- complaints about poverty in rural areas of the country;
- a news report on a bribed member of the Communist Party;
- corrupt police officers persecuting entrepreneurs.
The system is primarily aimed at filtering information shared by Chinese internet users, but it can be put to other uses as well; TechCrunch cites enhancing the censorship capabilities of domestic AI models as one example.
Xiao Qiang, a researcher at the University of California, Berkeley, who reviewed the document, said it underscores the authorities’ intent to use LLMs to strengthen repression.
“Unlike traditional censorship mechanisms, which rely on human labor for keyword filtering and manual review, an LLM trained with such instructions will significantly increase the efficiency and detail of state information control,” he said.
The situation once again shows how quickly authoritarian regimes are mastering the latest technologies, TechCrunch’s journalists noted.
LLM for Detecting Dissent
The document was discovered by a security researcher using the pseudonym NetAskari in an unsecured Elasticsearch database hosted on a Baidu server. It is not known who created the dataset; the latest entries date from December 2024.
The system’s creator tasked an unnamed LLM with determining whether content touches on sensitive political topics, public life, or the military. Such material is to be treated as highest priority and flagged immediately.
Among the topics are scandals related to environmental pollution and food safety, financial fraud, and labor disputes that could lead to public protests.
Any form of “political satire” is explicitly targeted: if someone uses historical analogies to comment on “current political figures,” for example, the content must be flagged immediately. The same applies to “Taiwan policy” and military topics, including troop movements, exercises, and armaments. The Chinese word 台湾 (Taiwan) appears in the database more than 15,000 times.
One fragment mentions an anecdote about the transience of power, a particularly sensitive topic for China given its authoritarian political system, TechCrunch notes.
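The leaked dataset does not include the system’s actual prompts or code, and the model involved is unnamed. Still, the flagging step TechCrunch describes maps onto a standard LLM-classification pattern. Below is a minimal, purely illustrative sketch assuming an OpenAI-compatible chat-completions API; the function name, model id, and category list are all hypothetical, with the categories paraphrased from topics the article mentions.

```python
# Illustrative sketch only: the real system's code, model, and prompts are
# unknown. Assumes the openai Python client and an API key in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical categories, paraphrased from topics named in the article.
CATEGORIES = [
    "politics",   # sensitive political topics, political satire, Taiwan policy
    "social",     # pollution, food safety, labor disputes
    "military",   # troop movements, exercises, armaments
    "none",       # nothing sensitive detected
]

def flag_if_sensitive(text: str) -> str:
    """Classify a piece of content into exactly one category."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the leaked system's model is unnamed
        temperature=0,        # deterministic labels suit a filtering pipeline
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the user's text into exactly one of these "
                    f"categories: {', '.join(CATEGORIES)}. Political, social, "
                    "and military topics are highest priority. Reply with the "
                    "category name only."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()
```

Content labeled anything other than “none” would then be queued for review or blocked, which is what makes this approach cheaper to scale than human moderation.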
Designed for “Public Opinion Management”
The document contains no information about its creator, but it states that the dataset is intended for “public opinion management.” That is a strong hint that the system serves the goals of the Chinese government, said Michael Caster, head of the Asia program at the human rights organization Article 19.
He emphasized that “public opinion management” falls under China’s powerful internet regulator, the Cyberspace Administration of China (CAC), and usually refers to censorship and propaganda.
The ultimate goal, he added, is to protect the Chinese government’s narratives online and suppress any alternative views.
Repression Becomes Smarter
In February, OpenAI published a report stating that an unknown actor, likely operating from China, had used generative artificial intelligence to monitor conversations on social media. The tool analyzed posts by people calling for protests over human rights violations in the country and forwarded them to the Chinese authorities.
OpenAI also found that the technology had been used to generate comments highly critical of the prominent Chinese dissident Cai Xia.
Traditional censorship methods rely on basic algorithms that automatically block content mentioning terms from a blacklist such as “Tiananmen massacre” or “Xi Jinping.” Many users encountered this when first using DeepSeek.
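For contrast, keyword blocking of the kind just described fits in a few lines. The toy sketch below is not any platform’s actual code; the two blacklist terms are simply the ones named in the article.

```python
# Toy illustration of traditional keyword censorship: block any text containing
# a blacklisted term. Real systems use far larger lists, but the mechanism is
# this simple, and it is trivially evaded by indirect phrasing.
BLACKLIST = {"tiananmen massacre", "xi jinping"}

def is_blocked(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLACKLIST)

assert is_blocked("What happened in the Tiananmen massacre?")
# A historical analogy aimed at a current leader sails straight through:
assert not is_blocked("The last emperor also silenced his critics.")
```

The second assertion is exactly the gap an LLM-based filter is meant to close.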
But new AI technologies could make censorship more effective, according to TechCrunch: they can detect even subtle criticism and keep improving over time.
“I think it’s very important to highlight how AI-driven censorship is evolving, making state control over public opinion even more sophisticated, especially at a time when Chinese models like DeepSeek are gaining momentum,” said Qiang.
Following the rapid rise in popularity of its AI models, DeepSeek has caught the attention of the Chinese authorities. Employees now work under new, stricter conditions, and some have had their passports confiscated.
In March, OpenAI recommended that the US government ban AI models from the Chinese lab, arguing that the project is “state-subsidized” and “state-controlled.”
