Data Curation

Businesses can’t stop talking about “Big Data” for good reasons. Today, there’s a treasure trove of analyzable data points from various sources, including smart devices, Internet of Things sensors, data lakes, and many others.

However, all that information comes in a variety of formats and may or may not be structured. Some may even be misleading or flawed. If machine learning developers want to leverage large pools of data, they must prepare it for use in effective model training. This procedure is known as data curation.

How Does the Curation of Data Work?

Curation as a general term refers to the gathering, organization, and management of assets to make them more accessible and understandable for their intended purpose.

In the context of machine learning, data curation means organizing datasets to facilitate model training. It covers everything from data gathering to filtering, annotation, categorization, and cleansing, and it’s an essential component of supervised learning procedures.

What Happens When You Use Uncurated Data?

The performance and accuracy of machine learning implementations depend greatly on the quality of the training and verification data. If developers fail to curate data, the model cannot learn effectively and may perform worse after deployment.

Curation must go beyond gathering the right datasets; organization and annotations are still essential to boosting ML accuracy. Think about how difficult it would be to find a book if the library failed to sort its titles on the shelf. Without a structured dataset, you can’t gain insights no matter how much useful information is around you.

What Are Real-World Applications of Curated Data?

Nearly any industry that operates with machine learning algorithms benefits from curating data. For instance, the medical industry can develop new drugs and bring them to market more quickly through ML-powered research and development and safety testing.

Vehicle manufacturers likewise train algorithms to develop driving assistance systems. Those systems are more precise with curated data.

What Processes Make Up Data Curation?

The focus of data curation is on datasets (e.g., tables and files) rather than collections of individual data points. There are several steps to curating those sets.

  • Collecting data. Gather the relevant information that can help accomplish the task you want to teach your model. Businesses often maintain data lakes and servers for this purpose. Just keep in mind that gathering data is only the first step of the curation process.
  • Organizing the data and making it accessible. The next step is ensuring the data is easy to search through and understand.
  • Gathering metadata. Metadata describes the context around the data, and curation naturally includes the management of metadata. Developers wishing to make the data accessible to non-technical users might invest in data catalogs for this purpose.
  • Interpreting the information. Machine learning algorithms can’t extrapolate predictions from a dataset that contains quality issues, missing values, or biases. Developers may publish documentation detailing data structure and contents to help users make sense of the data.
  • Cleansing the data. Cleansing is the most prevalent curation step. It means checking for and resolving issues such as missing values and incorrect labels.
  • Annotations. Adding annotations or labels renders the dataset into a format that the machine can interpret for training. Annotations are necessary in supervised learning settings, though unsupervised procedures often don’t require manual annotations.
  • Evaluation. Because model performance depends heavily on data curation efforts, monitoring that performance is part of the job. It shows whether your input data is serving you well.
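
As a rough illustration of the cleansing and metadata steps above, here is a minimal Python sketch that uses only the standard library. The record fields ("text", "label"), the allowed label set, and the helper names cleanse and summarize are all hypothetical, chosen for the example rather than taken from any particular tool. The duplicate check is an extra assumption beyond the missing values and incorrect labels mentioned above.

```python
from collections import Counter

# Hypothetical raw records: field names and labels are assumptions
# made for illustration only.
RAW_RECORDS = [
    {"text": "great product", "label": "positive"},
    {"text": "arrived broken", "label": "negative"},
    {"text": "", "label": "positive"},               # missing value
    {"text": "works fine", "label": "postive"},      # mistyped label
    {"text": "works fine", "label": "positive"},     # valid
    {"text": "great product", "label": "positive"},  # exact duplicate
]

ALLOWED_LABELS = {"positive", "negative"}


def cleanse(records, allowed_labels):
    """Split records into clean rows and rejected rows with a reason."""
    clean, rejected, seen = [], [], set()
    for record in records:
        text = (record.get("text") or "").strip()
        label = (record.get("label") or "").strip().lower()
        if not text:
            rejected.append((record, "missing text"))
        elif label not in allowed_labels:
            rejected.append((record, "unknown label"))
        elif (text, label) in seen:
            rejected.append((record, "duplicate"))
        else:
            seen.add((text, label))
            clean.append({"text": text, "label": label})
    return clean, rejected


def summarize(clean, rejected):
    """Build simple metadata describing the curated dataset."""
    return {
        "rows_kept": len(clean),
        "rows_rejected": len(rejected),
        "label_counts": dict(Counter(row["label"] for row in clean)),
        "rejection_reasons": dict(Counter(reason for _, reason in rejected)),
    }


clean_rows, rejected_rows = cleanse(RAW_RECORDS, ALLOWED_LABELS)
metadata = summarize(clean_rows, rejected_rows)
print(metadata)
```

Keeping the rejected rows alongside a reason, instead of silently dropping them, gives curators something to review and feeds the documentation and evaluation steps. In practice a team would likely try to correct a mistyped label rather than discard the row; this sketch only flags it.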

Data curation is such a prevalent task in machine learning development that many businesses have dedicated teams and professionals on board to assist.

Who Is Responsible for Data Curation in Modern Organizations?

Companies whose workflows revolve around machine learning implementations almost always have dedicated teams of data curators whose members fulfill distinct, essential roles in data curation. The most common roles are lead curators, curation collaborators, and data stewards.

Lead Curators

A few lead curators moderate data catalogs to maintain quality metadata. For instance, a business might keep an internal wiki to document data quality assurance.

Curation Collaborators

Gathering the right data and providing proper context for it often involves expertise in a variety of topics. That’s why curation collaborators bring together subject matter experts across the company to effectively crowdsource and share knowledge to help with curation.

For instance, employees in the procurement department can help with a machine learning model that works with financial data. They can share vital contextual information to help data analysts understand the “why?” behind the numbers and apply them properly to training sets.

Data Stewards

Stewards often have roles that overlap with those of data curators, but their responsibilities are still distinct. Stewards manage databases as a whole and help organize a company’s overall data strategy. Curators mainly focus on ensuring the quality of specific datasets.