Data Selection

What is Data Selection?

In data mining, data selection is the process of identifying the best instruments for data collection along with the most appropriate data type and data source. It may also refer to the process of selecting specific samples, subsets, or data points in a data set for further analysis and processing.

Data selection should not be confused with selective data reporting, which occurs when a researcher excludes from their results data that fails to support a particular hypothesis. It’s also distinct from active and interactive data selection, which involves the use of collected data for monitoring or secondary analysis.

How Does Data Selection Work?

Data selection is a precursor to data collection, serving to guide and refine that process. Successful data selection generally requires that one define:

  • The reason for data collection. Are you seeking to answer a research question, gain insights about customer behavior, or train a machine learning model?
  • The scope of data that should be collected and whether any data should automatically be excluded from the collection process.
  • The party responsible for making selection decisions. Is it a single individual, a team, or a community? Are there any regulatory or legislative concerns that might impact collection?
  • The technical aspects of the data to be collected, including format and metadata.
  • Any time, capital, and resource costs or constraints associated with data collection.
  • The type of data that should be collected and the source it should be collected from.
  • Data collection tools and methods.

Data Types and Data Sources

Generally speaking, there are two primary types of data:

  • Quantitative data is concrete and measurable. It’s typically expressed in numbers. Examples include biometric markers, statistics, and measurements.
  • Qualitative data is based on observation and interpretation. Examples include video footage, images, and raw text.

Neither type of data is necessarily superior to the other. Some projects may require collecting both qualitative and quantitative data, using the former to contextualize the latter.

Data sources are far broader and may include, but are not limited to:

  • Social media.
  • Email.
  • Journals.
  • Research publications.
  • Chat logs.
  • A survey or study.
  • Live camera footage.
  • Pre-existing data sets.
  • Internal reports.
  • Polls and/or interviews.
  • Government or institutional records.
  • Publicly available information.
  • Focus groups.
  • Observations.
  • Online databases.

When assessing your data sources, it’s important to consider how data will be preserved and stored and how you’ll differentiate between primary and secondary data. The former is data you collect firsthand from the original source for the project at hand, while the latter has already been collected, and often processed, by someone else for another purpose.

Common Data Selection Methods

Once an organization has broadly defined its target data type and source, the next step is to select the data sets that will be collected.

The methods you use at this stage depend largely on your reasons for collecting the data in the first place. For instance, if your goal is social sentiment analysis, you need only define your collection criteria – specific keywords and interactions that will flag a post as relevant. From there, it’s just a matter of setting up a tool that allows you to monitor and collect that information in real time.
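
As a rough illustration of such collection criteria, the sketch below flags a post as relevant when it mentions a tracked keyword and meets a minimum interaction count. The Post class, the is_relevant function, the "acmephone" keyword, and the like threshold are all hypothetical names and values chosen for this example; a real monitoring tool would apply its own rules to a live feed.

```python
from dataclasses import dataclass


@dataclass
class Post:
    text: str
    likes: int


def is_relevant(post: Post, keywords: set[str], min_likes: int = 0) -> bool:
    """Flag a post that mentions any keyword and meets the interaction threshold."""
    # Normalize each word: strip common punctuation and lowercase it.
    words = {w.strip(".,!?#@").lower() for w in post.text.split()}
    return bool(words & keywords) and post.likes >= min_likes


# Hypothetical posts standing in for a social media feed.
posts = [
    Post("Loving the new AcmePhone camera!", likes=12),
    Post("Traffic was terrible today", likes=40),
    Post("acmephone battery died again...", likes=3),
]

keywords = {"acmephone"}
selected = [p for p in posts if is_relevant(p, keywords, min_likes=5)]

for post in selected:
    print(post.text)
```

Here only the first post is selected: the second never mentions the keyword, and the third falls below the interaction threshold.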

If you’re selecting data for a machine learning model, things get somewhat more complicated. Once you’ve selected a data source, you’ll need to ensure that source:

  • Does not contain redundant data.
  • Is contextually relevant.
  • Is unbiased.
  • Sufficiently represents all edge cases and corner cases.
  • Is varied enough to prevent overfitting.
  • Meets any other data requirements your organization may have.

Whether a source meets these criteria is typically assessed through a process known as sampling. This involves selecting entries from a data source, usually at random, in an effort to obtain a representative data set. This can be done in several different ways, including simple random sampling, stratified sampling, systematic sampling, and cluster sampling.
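
The four sampling methods named above can be sketched with nothing more than Python's standard library. This is a minimal illustration rather than a production sampler: the function names, the "region" field, and the 100-record population are assumptions made for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(items, k, rng):
    """Every record has the same chance of being picked."""
    return rng.sample(items, k)


def systematic_sample(items, step, rng):
    """Pick a random starting point, then take every step-th record."""
    start = rng.randrange(step)
    return items[start::step]


def stratified_sample(items, key, fraction, rng):
    """Sample the same fraction from each subgroup (stratum)."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


def cluster_sample(items, key, n_clusters, rng):
    """Pick whole clusters at random and keep every record in them."""
    clusters = defaultdict(list)
    for item in items:
        clusters[key(item)].append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


# Hypothetical population: 100 records, a quarter of them from the "north" region.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [
    {"id": i, "region": "north" if i % 4 == 0 else "south"} for i in range(100)
]


def by_region(record):
    return record["region"]


simple = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(population, by_region, 0.2, rng)
clustered = cluster_sample(population, by_region, 1, rng)

print(len(simple), len(systematic), len(stratified), len(clustered))
```

Note that the stratified sample keeps the population's 1:3 split between regions, which a small simple random sample is not guaranteed to do.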

In some cases, it may even be more feasible to pare a larger data set down into a smaller, more representative sample.

You might also consider incorporating machine learning into the sampling process. Typically, this will involve some combination of self-supervised learning and active learning with either diversity-based or uncertainty-based sampling. A side benefit of this approach is that it allows you to more readily automate your data pipeline, introducing new samples as needed.
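
To make uncertainty-based sampling more concrete, here is a minimal sketch that assumes a model has already produced class probabilities for a pool of unlabeled samples. It ranks the samples by prediction entropy and picks the ones the model is least sure about as the next candidates for labeling. The sample IDs, the probabilities, and the select_most_uncertain function are hypothetical; entropy is only one of several common uncertainty measures.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(predictions, k):
    """Return the k sample IDs whose predicted distributions have the highest entropy."""
    ranked = sorted(
        predictions, key=lambda sample_id: entropy(predictions[sample_id]), reverse=True
    )
    return ranked[:k]


# Hypothetical model outputs: class probabilities for four unlabeled samples.
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident prediction
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.34, 0.33, 0.33],  # nearly uniform, so very uncertain
}

to_label = select_most_uncertain(predictions, k=2)
print(to_label)
```

In an active learning loop, the selected samples would be labeled, added to the training set, and the model retrained before the next round of selection.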

Sampling also plays an important role in assessing data collected for research purposes.

Why is Data Selection Important?

Done right, data selection helps to ensure that collected data is:

  • Valuable, relevant, and reliable.
  • Accurate.
  • High quality.
  • Unbiased.
  • Representative.

Improper data selection, meanwhile, can lead to a multitude of issues, most notably data selection bias. Typically the result of a flawed sampling process, data selection bias produces inaccurate results because non-random, unrepresentative samples are selected for analysis. Other potential problems include data redundancy, unreliable or anomalous results, and the collection of low-quality or inaccurate data.
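
The effect of data selection bias is easy to see with a toy example. The sketch below uses an invented population of 1,000 customer spend values and compares the average from a non-random sample (only the highest spenders) with the average from a random sample. All names and figures are illustrative assumptions, not real data.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: monthly spend for 1,000 customers, from 1 to 1000.
population = list(range(1, 1001))

# Non-random selection: only customers who spent more than 800 are included,
# for example because only the most engaged customers answered a survey.
biased_sample = [spend for spend in population if spend > 800]

# Random selection: every customer has the same chance of being included.
random_sample = rng.sample(population, 200)

population_mean = mean(population)
biased_mean = mean(biased_sample)
random_mean = mean(random_sample)

print(f"population mean:    {population_mean}")
print(f"biased sample mean: {biased_mean}")
print(f"random sample mean: {random_mean}")
```

The biased sample overstates average spend by a wide margin, while the random sample lands much closer to the true population average.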

In short, data selection is an essential part of any data-focused initiative. Without a documented data selection procedure in place, you cannot guarantee the validity and reliability of any findings, even those generated by a machine learning model. Consequently, you also cannot guarantee that acting on that data would be an informed decision that works to your overall benefit.

In the worst-case scenario, this could result in decisions that are actively harmful to your organization.

Common Data Selection Use Cases

Data selection and data collection are ultimately two sides of the same coin. In light of this, data selection may be applied to a wide range of use cases. These include, but are not limited to:

  • Pre-processing and/or selecting data sets to train machine learning models, such as those used for facial recognition, natural language processing, or computer vision.
  • Feature selection as part of the process of building a predictive model.
  • Performing scientific or market research.
  • Analyzing customer behavior.
  • Collecting data to assist in workflow or process optimization.
  • Supporting data exploration as part of a larger data initiative.
  • Identifying sensitive data that needs to be removed or redacted to comply with data privacy regulations.
  • Performing database queries.
  • Identifying and sanitizing redundant data present in databases or file systems.
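
As one illustration of the use cases above, identifying sensitive data for redaction can be sketched as a simple pattern match. The example below assumes email addresses are the sensitive field and uses a deliberately simplified regular expression; the records and function names are hypothetical, and real compliance work would need far broader detection than this.

```python
import re

# Simplified email pattern for illustration only; it will not catch every valid address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


def redact_emails(text, placeholder="[REDACTED]"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


# Hypothetical records standing in for rows in a database or log file.
records = [
    "Contact jane.doe@example.com for a refund",
    "Order 1042 shipped on time",
]

flagged = [record for record in records if find_emails(record)]
cleaned = [redact_emails(record) for record in records]

for line in cleaned:
    print(line)
```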