Undercoverage bias - Does your data have it? If so, then here's how to remove it.

Undercoverage bias occurs when a survey sample does not adequately represent the population it is meant to describe, usually because some groups are missing from the sample or appear in it far less often than they do in the population. This can happen for a variety of reasons, including low response rates, self-selection bias, and non-random sampling. When undercoverage bias exists, it can lead to inaccurate conclusions about the population being studied.

This is a serious problem because decisions and policies are often based on survey results. If those results are biased, then the decisions that are made based on them can be as well. Undercoverage bias can have far-reaching and potentially negative consequences.

That’s why it’s so important to be aware of this type of bias and take steps to avoid it in surveys.

Using Data Science & Machine Learning to spot undercoverage bias

There are a number of ways to detect undercoverage bias in data sets. Here’s a quick list of the most popular methods:

Compare the distribution of characteristics between the target population and the sample. If there are significant differences, it is likely that the dataset has undercoverage bias.
Use simulation methods. This involves creating a model of the target population and then simulating data from that population. The simulated data can then be compared to the actual dataset to see if there are any discrepancies.
Statistical tests can be used as well. These tests usually involve comparing means or proportions between the target population and the sample. If there are significant differences, it is likely that there is some level of undercoverage bias in your dataset. Common choices include a chi-square test for categorical characteristics and a t-test or z-test for means and proportions.
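Here is a minimal sketch of how the comparison, the simulation, and the statistical test can fit together. It assumes you already know the population shares for one characteristic (the age groups, shares, and sample counts below are made up for illustration). It computes a chi-square statistic for the sample and then estimates a p-value by simulating samples drawn fairly from the target population.

```python
import random
from collections import Counter

def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)

def coverage_check(sample, population_shares, n_sim=2000, seed=0):
    """Return (statistic, Monte Carlo p-value) for a sample vs. known population shares."""
    rng = random.Random(seed)
    n = len(sample)
    categories = list(population_shares)
    weights = [population_shares[c] for c in categories]
    expected = {c: n * population_shares[c] for c in categories}
    counts = Counter(sample)
    observed = {c: counts.get(c, 0) for c in categories}
    stat = chi_square_stat(observed, expected)
    # Simulate samples drawn fairly from the target population and count
    # how often they look at least as unusual as the real sample.
    extreme = 0
    for _ in range(n_sim):
        sim = Counter(rng.choices(categories, weights=weights, k=n))
        sim_observed = {c: sim.get(c, 0) for c in categories}
        if chi_square_stat(sim_observed, expected) >= stat:
            extreme += 1
    return stat, (extreme + 1) / (n_sim + 1)

# Hypothetical data: age groups in a sample vs. census-style population shares.
population_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
sample = ["18-34"] * 50 + ["35-54"] * 40 + ["55+"] * 10
stat, p_value = coverage_check(sample, population_shares)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

A small p-value here suggests the sample's age mix is unlikely to have come from a fair draw, which in this made-up case points at the 55+ group being undercovered.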

How to safely remove undercoverage bias from your dataset

There are a few ways that you can try to remove undercoverage bias from your dataset. One way is to oversample the minority class. This will create more instances of the minority class in your dataset, which can help reduce bias.
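As a rough sketch of random oversampling, the snippet below duplicates randomly chosen rows from smaller groups until every group matches the largest one. The `region` field and the 90/10 split are hypothetical, and the helper name `oversample` is just for illustration.

```python
import random
from collections import defaultdict

def oversample(rows, label_key, seed=0):
    """Duplicate random rows of smaller groups until all groups match the largest."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical dataset where rural respondents are underrepresented.
rows = [{"region": "urban"}] * 90 + [{"region": "rural"}] * 10
balanced = oversample(rows, "region")
```

Keep in mind that oversampling only repeats information you already have. It cannot tell you anything new about the people who were never sampled.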

Another way to deal with undercoverage bias is to use a weighted algorithm. This means that each instance of the minority class is given a higher weight, so that it has more influence on the model.
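One possible way to compute such weights is the common "balanced" heuristic, where each class gets a weight of n_samples / (n_classes * class_count). The sketch below uses made-up labels. The resulting per-row weights could then be passed to any training routine that accepts sample weights.

```python
from collections import Counter

def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical labels with a 90/10 split.
labels = ["urban"] * 90 + ["rural"] * 10
class_weight = balanced_weights(labels)
# One weight per row, so rarer rows count for more during training.
sample_weight = [class_weight[y] for y in labels]
print(class_weight)
```

With this heuristic the two classes end up with the same total weight, even though one has nine times as many rows.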

Finally, you could also try to create a synthetic dataset. This means creating new data points that are similar to existing ones in your dataset, but not identical. This can help improve the diversity of your dataset and reduce bias.
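To make the idea concrete, here is a simplified, SMOTE-style sketch: each new point is placed somewhere on the line between an existing point and its nearest neighbor. This is an illustration only, not the full SMOTE algorithm, and the points below are invented.

```python
import random

def synthesize(points, n_new, seed=0):
    """Create points on the segment between a random point and its nearest neighbor."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        base = rng.choice(points)
        others = [p for p in points if p is not base]
        # Nearest neighbor by squared Euclidean distance.
        neighbor = min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        t = rng.random()  # position along the segment, 0 <= t < 1
        new_points.append(tuple(a + t * (b - a) for a, b in zip(base, neighbor)))
    return new_points

# Hypothetical feature vectors for an underrepresented group.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.6), (1.2, 2.4)]
synthetic = synthesize(minority, n_new=6)
```

Because every synthetic point lies between two real ones, the new data stays close to what was observed. That is also its limit: it cannot invent behavior the original sample never captured.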

The best way to remove undercoverage bias is to collect more data. This will help to ensure that your dataset is more representative of the population as a whole. However, this is not always possible or practical.

Survey Research

What is Survey Research

Survey research is a common methodological tool for collecting data about attitudes, preferences, and beliefs. Researchers use surveys to generate knowledge about target populations and to produce reliable, generalizable findings.

There are two main types of survey research: quantitative and qualitative. In quantitative research, scientists use surveys to collect numerical data from respondents. This type of data can be used to test hypotheses and answer predetermined questions. Qualitative research gathers subjective opinions, for example city-based parents’ views on childcare, through in-depth interviews or focus groups. Scientists conducting this type of research are often interested in exploring a particular phenomenon or generating new insights about a topic.

Why it matters for Data Science & ML processes

There are a few reasons why survey research is important for data science:

Surveys can provide a wealth of information about a target population. This information can be used to build models that predict behaviors or outcomes of interest.
Surveys can be used to collect data that is otherwise difficult to obtain. This might include data on people’s attitudes, beliefs, or preferences.
Surveys can be helpful in measuring the impact or success of interventions or programs.

Collecting Data

When you are collecting data with surveys, there are two major biases you will encounter: survey sampling bias and volunteer bias (also called voluntary response bias). More details on survey sampling bias are below.

FYI: if you work with time series problems, the data collection problem you will most likely encounter is look-ahead bias, so it is worth learning how to prevent its damage.

Survey Sampling

Survey sampling is the practice of selecting a subset of a population to answer survey questions. A well-chosen subset lets you draw conclusions about the opinions or behaviors of the population as a whole without having to survey everyone.

There are many different types of survey sampling techniques, but the most common are simple random sampling and stratified sampling. In simple random sampling, every member of the population has an equal chance of being selected for the survey. Stratified sampling involves dividing the population into groups (strata) and selecting a sample from each group in order to achieve a more representative sample.

Advantages of Simple Random Sampling

Simple random sampling is the most basic and straightforward form of random sampling. In a simple random sample, every unit in the population has an equal chance of being selected. This means that the selection process is completely random – there’s no way to predict which units will be selected and which will not.

One of the main advantages of using simple random sampling is that it gives you a relatively unbiased estimate of the population parameters. Because each unit in the population has an equal chance of being selected, no group is systematically favored, so selection errors tend to average out rather than push the estimate in one direction. As a result, the estimates generated by simple random sampling are less likely to be distorted by the way the sample was chosen.
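A quick way to see this is to draw many simple random samples from a made-up population and check that their means cluster around the true population mean. The population below is simulated, and the sizes (1,000 units, samples of 50, 500 repetitions) are arbitrary choices for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population of 1,000 measurements.
population = [rng.gauss(50, 10) for _ in range(1000)]

# Draw 500 simple random samples (without replacement) of 50 units each.
sample_means = [mean(rng.sample(population, k=50)) for _ in range(500)]

print(f"population mean:         {mean(population):.2f}")
print(f"average of sample means: {mean(sample_means):.2f}")
```

Any single sample can still miss by a fair amount. The point is that the misses are not pulled consistently in one direction.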

Advantages of Stratified Sampling

The advantage of stratified sampling over simple random sampling is that it is better at representing the population from which we are trying to draw inferences. By stratifying, we ensure that each stratum (sub-group) is proportionately represented in our sample. This means that any characteristics of the population that are related to the stratification variable(s) are also more likely to be represented in our sample.

For example, if we want to estimate the diameter of trees in a forest, we could stratify by species and then take a simple random sample within each species. This would give us a better representation of all the different tree species in the forest than if we had just taken a simple random sample.
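The tree example could be sketched roughly like this, using proportional allocation so that each species contributes the same fraction of its members. The species counts and the 10% sampling fraction are invented for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(units, stratum_of, fraction, seed=0):
    """Take a simple random sample of the same fraction within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        # At least one unit per stratum so that no group is dropped entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical forest: (species, tree_id) pairs.
trees = (
    [("oak", i) for i in range(600)]
    + [("pine", i) for i in range(300)]
    + [("birch", i) for i in range(100)]
)
sample = stratified_sample(trees, stratum_of=lambda tree: tree[0], fraction=0.1)
print(Counter(species for species, _ in sample))
```

With a plain simple random sample of 100 trees, the number of birches would vary from draw to draw. Here it is fixed by design.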

Which is better to use

Generally speaking, you will want to use stratified sampling, because it guarantees that each group you are trying to study appears in the sample in a known proportion. In other words, it helps you manage class imbalance.

Class imbalance is a problem that often arises in machine learning, particularly when working with datasets where one class is much more represented than another. For example, imagine you had a dataset of 100 images, 90 of which were pictures of cats and 10 of which were pictures of dogs. If your goal was to train a classifier to identify animal species in new images, the “class imbalance” would be very significant since there would be 9 times as many cat images as dog images. Fixing this issue is important for the following reason:

If left untreated, class imbalance can lead to poor performance by machine learning algorithms. This is because most algorithms are trained to minimize overall error, so they can score well by favoring the majority class while rarely predicting the minority class correctly.
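The cat and dog example makes this easy to demonstrate. In the sketch below, a "classifier" that always predicts the majority class looks accurate overall but never finds a single dog. The labels are the hypothetical 90/10 split from above.

```python
# Hypothetical labels: 90 cats and 10 dogs.
labels = ["cat"] * 90 + ["dog"] * 10

# A lazy baseline that always predicts the majority class.
predictions = ["cat"] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
dog_hits = sum(1 for p, y in zip(predictions, labels) if y == "dog" and p == "dog")
dog_recall = dog_hits / labels.count("dog")

print(f"accuracy   = {accuracy:.2f}")    # accuracy   = 0.90
print(f"dog recall = {dog_recall:.2f}")  # dog recall = 0.00
```

This is why accuracy alone can be misleading on imbalanced data, and why per-class metrics such as recall are worth checking.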

Voluntary Response Bias

Voluntary response bias is one of the most common and troublesome biases in survey research. It occurs when people who volunteer to participate in a study are not representative of the population as a whole. This can happen for a number of reasons, but the most common one is that people who are more interested or invested in the topic being studied are more likely to volunteer.

This bias matters because it can completely distort the results of a study. If the people who choose to participate are not representative of the population, then the conclusions drawn from the study may be inaccurate or misleading.

An open survey is technically fair, since anyone can respond to it, but it does a poor job of fair representation because the people most likely to respond are often the ones with a complaint. This can massively impact the research results you draw from your dataset. Here is how data science and machine learning can be used to spot if your dataset has some voluntary response bias.

Using Data Science principles to detect it

There are a few ways to spot if your data has voluntary response bias. One way is to look at the distribution of responses. If there is a noticeable difference in the distribution of responses between different groups (e.g. people who respond “Yes” vs people who respond “No”), then this could be an indication that there is response bias. Another way to check for bias is to look at the relationship between the response and other variables in the data set. If there is a strong correlation between the response and other variables, then this could be an indication that there is response bias. Finally, you can use machine learning techniques to detect patterns in the data that may indicate response bias.
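One simple check along these lines, assuming you know who was invited and which group each invitee belongs to, is to compare response rates across groups. The group names and counts below are made up, and `response_rates` is a hypothetical helper.

```python
from collections import defaultdict

def response_rates(invited):
    """invited: list of (group, responded) pairs; returns the response rate per group."""
    totals, responded = defaultdict(int), defaultdict(int)
    for group, did_respond in invited:
        totals[group] += 1
        responded[group] += int(did_respond)
    return {group: responded[group] / totals[group] for group in totals}

# Hypothetical invitation list for a customer survey.
invited = (
    [("filed_complaint", True)] * 40
    + [("filed_complaint", False)] * 10
    + [("no_complaint", True)] * 30
    + [("no_complaint", False)] * 170
)
rates = response_rates(invited)
print(rates)
```

A large gap between groups, as in this made-up case, is a warning sign that the people who answered differ from the people who did not.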

How to safely remove it

Data science can be used to remove voluntary response bias from data by using techniques like weighting and stratification. Weighting adjusts for voluntary response bias (VRB) by assigning different weights to different groups of respondents. Stratification involves dividing the population into strata, or groups, and selecting a representative sample from each stratum. This ensures that each group is represented proportionately in the final sample.
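As a sketch of the weighting idea, the snippet below gives each respondent a weight equal to their group's population share divided by its share of the sample, then compares the raw and weighted averages. It assumes the population shares are known from an outside source. The groups, shares, and scores are invented for illustration.

```python
from collections import Counter

def poststratification_weights(groups, population_shares):
    """Weight each respondent by population share / sample share of their group."""
    n = len(groups)
    sample_shares = {g: count / n for g, count in Counter(groups).items()}
    return [population_shares[g] / sample_shares[g] for g in groups]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical survey: frequent users are 20% of customers but 70% of respondents.
groups = ["frequent_user"] * 70 + ["occasional_user"] * 30
scores = [2.0] * 70 + [4.0] * 30  # satisfaction scores, simplified to one value per group
population_shares = {"frequent_user": 0.2, "occasional_user": 0.8}

weights = poststratification_weights(groups, population_shares)
raw_mean = sum(scores) / len(scores)
adjusted_mean = weighted_mean(scores, weights)
print(f"raw mean = {raw_mean:.2f}, weighted mean = {adjusted_mean:.2f}")
```

Weighting only corrects for the characteristics you weight on. If volunteers differ from non-volunteers within a group, that difference remains.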

Another way to remove it is to resample the data using stratified sampling, over-sampling underrepresented groups where needed. In theory, this should help reduce the voluntary response bias.
