
MIT develops system for automatic cleaning of messy data

Researchers at the Massachusetts Institute of Technology have created PClean, a system that automatically cleans ‘messy’ tabular data by fixing typos, duplicates, missing values, misspellings and inconsistencies.

The algorithm uses a knowledge-based approach. The user supplies background knowledge about the database and describes the kinds of errors it is likely to contain.

PClean then combines this knowledge with probabilistic reasoning to infer the most likely correct values. For example, given additional information about typical rents, PClean can fill in a missing state in an apartment-listings table and correctly identify Beverly Hills as the city in California, rather than a city of the same name in Florida or Texas.
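The Beverly Hills example can be sketched as a small Bayesian calculation. This is only an illustration of the idea, not PClean's code: the candidate states, the priors, the rent figures and the Gaussian rent model are all made-up assumptions, and the function names are hypothetical.

```python
import math

# Hypothetical candidates for a listing whose city is "Beverly Hills"
# but whose state is missing. All numbers are illustrative assumptions,
# not real rent statistics.
CANDIDATES = {
    "CA": {"prior": 1 / 3, "mean_rent": 4000.0, "std_rent": 1200.0},
    "FL": {"prior": 1 / 3, "mean_rent": 1300.0, "std_rent": 400.0},
    "TX": {"prior": 1 / 3, "mean_rent": 1100.0, "std_rent": 350.0},
}


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used as an assumed rent model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def posterior_over_states(rent, candidates=CANDIDATES):
    """Bayes' rule: P(state | rent) is proportional to P(state) * P(rent | state)."""
    unnormalized = {
        state: p["prior"] * gaussian_pdf(rent, p["mean_rent"], p["std_rent"])
        for state, p in candidates.items()
    }
    total = sum(unnormalized.values())
    return {state: weight / total for state, weight in unnormalized.items()}


def infer_state(rent):
    """Return the most probable state for the observed monthly rent."""
    posterior = posterior_over_states(rent)
    return max(posterior, key=posterior.get)
```

Under these assumed numbers, a listing with a high rent is attributed to California, while a listing with a low rent is attributed to one of the other states. The knowledge about typical rents is what resolves the ambiguity in the city name.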

Alex Lew, a co-author of the paper and a PhD student in Electrical Engineering and Computer Science, said PClean lets people enlist a computer's help in the same way they turn to one another for help.

“PClean lets you tell the computer what you know about the problem by encoding the same basic knowledge you would explain to a person. […] I can also point to hints and tricks that are already known for faster problem solving,” the researcher added.

The developers say that PClean is the first data-cleaning system capable of combining domain knowledge with logical reasoning to automatically clean millions of records, thanks to three innovations:

  • a scripting language lets users encode what they know, which improves the model’s accuracy;
  • the inference algorithm works in two stages: it first processes records one at a time, making informed guesses about how to clean each, and then revisits those guesses to correct mistakes;
  • a specialized compiler generates fast code, allowing the program to handle databases with millions of records.
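The two-stage idea from the list above (guess sequentially, then revisit) can be illustrated with a toy example. This is not PClean's inference algorithm: the scoring rule, the `TYPO_PENALTY` constant, the one-edit typo limit and all function names are assumptions chosen to keep the sketch short.

```python
from collections import Counter

TYPO_PENALTY = 0.3  # assumed cost of explaining a record as a one-character typo


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def best_clean_value(observed, counts):
    """Pick the clean value that best explains `observed` given current counts."""
    candidates = {observed} | {c for c in counts if edit_distance(observed, c) <= 1}

    def score(candidate):
        # Frequent values are more plausible; each edit is penalized.
        return (counts[candidate] + 1) * TYPO_PENALTY ** edit_distance(observed, candidate)

    # Sorting makes tie-breaking deterministic.
    return max(sorted(candidates), key=score)


def two_stage_clean(records):
    counts = Counter()
    cleaned = []
    # Stage 1: process records in order, using only what has been seen so far.
    for observed in records:
        value = best_clean_value(observed, counts)
        counts[value] += 1
        cleaned.append(value)
    # Stage 2: revisit every record with knowledge of all the others.
    for i, observed in enumerate(records):
        counts[cleaned[i]] -= 1
        if counts[cleaned[i]] == 0:
            del counts[cleaned[i]]
        value = best_clean_value(observed, counts)
        counts[value] += 1
        cleaned[i] = value
    return cleaned
```

For the input `["Bostn", "Boston", "Boston", "Boston"]`, the first stage has no evidence yet that the first record is a typo and keeps it as is. The second stage sees three other records saying "Boston" and revises the first record to match.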

According to the researchers, PClean makes it easier and cheaper to merge messy, incompatible databases into clean records without large investments in staff or software.

While there are potential social benefits, the developers warned of risks, including privacy intrusions and de-anonymisation by merging incomplete information from several public sources.

PClean is available to everyone. The developers have published the system’s source code on GitHub.

In May, scientists used AI to speed up simulations of the Universe 1,000-fold.

In April, Rice University researchers developed a method for training neural networks on CPUs, which runs 15 times faster than on GPUs.
