Online Course

NRSG 795: BIOSTATISTICS FOR EVIDENCE-BASED PRACTICE

Module 1: Variables, Values, and Spreadsheets as Databases

Missing Data

The quality of the evidence can be seriously compromised when there are values missing for some study participants. Data are often incomplete because a researcher is simply unable to collect an observation despite their best effort to collect all the information on the variables they will need. Datafiles from surveys, experiments, and secondary sources often have some data missing.

Data are missing for many reasons:

Surveys: participants refuse, do not know the answer, or accidentally skip an item.
Longitudinal/experimental studies: participants drop out before the study is completed because they have moved out of the area, died, no longer see a personal benefit in participating, or do not like the effects of the treatment. Bad weather may make observation impossible in field experiments. A researcher may become sick, or equipment may fail.

Record review: pages or files are missing, or observations/responses were not recorded.
Data may be missing in any type of study because of an accident or a data entry error. A researcher drops a tray of test tubes. A data file becomes corrupted.

Missing values can also be thought of as system missing or user missing.

  • system missing values are values that are completely absent from the data. Possible causes:
    • some questions weren't offered to all respondents;
    • some respondents skipped some of the questions;
    • a technical failure occurred.
  • user missing values are values that are present in the data but must be excluded from calculations. Here the user is specifying the value is missing. Possible reasons:
    • Nominal/Ordinal variables may contain values that reflect answers such as “don't know” and “no opinion”
    • Extremely high or low values that possibly don't correspond to reality (outliers).

The impact of missing data on the results of a statistical analysis depends on the mechanism that caused the data to be missing and on how the data analyst deals with it. In a dataset, a missing response may be entered as a ‘.’ or given a code such as ‘9’. When a code is assigned, two cautions apply: (1) the code representing missing cannot be a possible response (e.g., when asking the age at which a symptom was first experienced, if an age of nine is a feasible answer, then you cannot use ‘9’ to represent missing), and (2) the code must be excluded from the analysis (e.g., when calculating a mean, any 9s left in the data would be treated as real responses and distort the estimate).
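As a small illustration of the second caution, the sketch below uses invented responses to a 1-5 item in which 9 is assumed to be the missing code. The data and variable names are hypothetical; the point is only to compare the mean with and without the code.

```python
from statistics import mean

MISSING_CODE = 9  # assumed code for "missing"; valid responses are 1-5

# Hypothetical responses to a 1-5 item; two participants gave no answer.
responses = [3, 4, 9, 2, 5, 9, 4]

# Wrong: every 9 is treated as if it were a real answer.
naive_mean = mean(responses)

# Better: drop the missing code before calculating.
valid = [r for r in responses if r != MISSING_CODE]
correct_mean = mean(valid)

print(f"Mean including the 9s: {naive_mean:.2f}")
print(f"Mean excluding the 9s: {correct_mean:.2f}")
print(f"Valid n = {len(valid)} of {len(responses)}")
```

With the 9s left in, the "mean" falls outside the 1-5 range of the item, which is a useful warning sign that a missing code has leaked into the calculation.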

Why it Matters

Missing values can cause serious problems. Most statistical procedures automatically eliminate cases with missing values, so you may not have enough data to perform the statistical analysis. Alternatively, although the analysis might run, the results may not be statistically significant because of the small amount of input data. Missing values can also cause misleading results by introducing bias. Factors to consider are the quantity (extent of missingness), the pattern, the role of the variable and the level of measurement. When a data set is incomplete, the data analyst has to decide how to deal with it.

The most common decision is to use complete case analysis (also called listwise deletion): analyzing only the cases with complete data. Individuals with data missing on any variable are dropped from the analysis.

  • Extent of missing data: Missing values are handled differently when only a few values are missing than when, say, 20% are missing. It is usually considered acceptable if less than 5% of the data are missing (but this also depends on the total sample size).
  • Advantages: it is simple to use and is the default in most statistical packages.
  • Limitations: it can substantially lower the sample size, leading to a severe lack of power. This is especially true if there are many variables involved in the analysis, each with data missing for a few cases. It can also lead to biased results, depending on why the data are missing, and it can diminish generalizability.
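The sample-size limitation can be seen in a minimal sketch of listwise deletion. The records, variable names, and the use of None to mark a missing value are all assumptions made for illustration. Each variable is missing for only one person, yet half of the cases are lost.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 54, "sbp": 128, "bmi": 27.1},
    {"id": 2, "age": None, "sbp": 135, "bmi": 30.4},
    {"id": 3, "age": 61, "sbp": None, "bmi": 24.8},
    {"id": 4, "age": 47, "sbp": 118, "bmi": None},
    {"id": 5, "age": 39, "sbp": 122, "bmi": 22.9},
    {"id": 6, "age": 58, "sbp": 141, "bmi": 29.5},
]
variables = ["age", "sbp", "bmi"]


def complete_cases(rows, names):
    """Keep only rows with an observed value on every named variable."""
    return [row for row in rows if all(row[name] is not None for name in names)]


kept = complete_cases(records, variables)

# Report missingness one variable at a time, then the combined effect.
for name in variables:
    n_missing = sum(row[name] is None for row in records)
    print(f"{name}: {n_missing} of {len(records)} missing")
print(f"Complete cases: {len(kept)} of {len(records)}")
```

Because a case is dropped if it is missing on any one of the variables, the loss accumulates as more variables enter the analysis.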

Types of missing data (patterns of missingness)

The types of missing data are based on the relationship between the missing data mechanism and the missing and observed values. These classes are important to understand because the problems caused by missing data and the solutions to these problems are different for the three classes.

  • Missing Completely at Random (MCAR). MCAR means that the missing data mechanism is unrelated to the values of any variables, whether missing or observed. The propensity for a data point to be missing is completely random. Because the observed values are then essentially a random sample of the full data set, complete case analysis gives, on average, the same results as the full data set would have, although with a smaller sample. Unfortunately, most missing data are not MCAR.
  • Non-Ignorable (NI) or missing not at random (MNAR). NI means that the missing data mechanism is related to the missing values. It commonly occurs when people do not want to reveal something very personal or unpopular about themselves. For example, if individuals with higher incomes are less likely to reveal them on a survey than are individuals with lower incomes, the missing data mechanism for income is non-ignorable. Whether income is missing or observed is related to its value. Complete case analysis can give highly biased results for NI missing data. If proportionally more low and moderate income individuals are left in the sample because high income people are missing, an estimate of the mean income will be lower than the actual population mean.
  • Missing at Random (MAR). MAR means the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data. Missing income data may be unrelated to the actual income values, but are related to education. Perhaps people with more education are less likely to reveal their income than those with less education.
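The income examples above can be explored with a small simulation. This is only a sketch under invented assumptions: the population, the income figures (in $1,000s), the missingness probabilities, and the helper names delete_income and mean_income are all hypothetical. It deletes incomes under each mechanism and compares the mean of what remains with the mean of the full data.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated population (all numbers invented): income in $1,000s,
# with college graduates earning more on average.
people = []
for _ in range(20000):
    college = rng.random() < 0.5
    income = 40 + (20 if college else 0) + rng.gauss(0, 10)
    people.append((college, income))


def delete_income(p_missing):
    """Keep each person with probability 1 - p_missing(college, income)."""
    return [(col, inc) for col, inc in people
            if rng.random() >= p_missing(col, inc)]


def mean_income(rows, college=None):
    """Mean income, optionally restricted to one education group."""
    return mean([inc for col, inc in rows if college is None or col == college])


# MCAR: missingness ignores everything.
mcar = delete_income(lambda col, inc: 0.3)
# MAR: missingness depends only on education, which is always observed.
mar = delete_income(lambda col, inc: 0.6 if col else 0.1)
# MNAR: missingness depends on the income value itself.
mnar = delete_income(lambda col, inc: 0.6 if inc > 50 else 0.1)

print(f"Full data mean:        {mean_income(people):.1f}")
print(f"MCAR complete cases:   {mean_income(mcar):.1f}")
print(f"MNAR complete cases:   {mean_income(mnar):.1f}")
print("College graduates only:")
print(f"  Full data mean:      {mean_income(people, True):.1f}")
print(f"  MAR complete cases:  {mean_income(mar, True):.1f}")
print(f"  MNAR complete cases: {mean_income(mnar, True):.1f}")
```

In this setup the MCAR mean should land close to the full-data mean, while the MNAR mean should fall below it. Under MAR, the mean among college graduates should stay close to the full-data value for that group, because within an education group the missingness is unrelated to income. Under MNAR the bias should remain even within the group.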

A key distinction is whether the missingness is ignorable (i.e., MCAR or MAR) or non-ignorable. There are excellent techniques for handling ignorable missing data (you may read about deletions, substitutions, and imputations), but these become technically complex and often require specialized software. Non-ignorable missing data are more challenging and require a different approach (improved study design and collection procedures).

We won’t be doing any of these techniques in class, but it is important to understand the concepts and think about how missing values may influence your interpretation of the findings. At minimum, a routine habit should be to:

  • Look for missing values and determine the extent (are there many or only a few?).
  • Note how missing values are coded: do you need to be careful in doing your analysis?
  • Inform the reader about any missing value issues when describing your findings.
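The first two habits can be sketched as a quick check run before any analysis. The rows, the variable names, and the codebook below are hypothetical. The sketch assumes that "." means missing for every variable and that "9" means missing only for the 1-5 satisfaction item, so a genuine age of 9 is not miscounted.

```python
# Hypothetical raw rows as they might appear in a spreadsheet export.
rows = [
    {"age": "34", "satisfaction": "4"},
    {"age": ".", "satisfaction": "9"},
    {"age": "51", "satisfaction": "2"},
    {"age": "9", "satisfaction": "5"},  # a real 9-year-old, not missing
    {"age": "47", "satisfaction": "."},
]

# Assumed codebook: which codes count as missing for each variable.
missing_codes = {"age": {"."}, "satisfaction": {".", "9"}}


def missing_summary(data, codes):
    """Return {variable: (count missing, percent missing)}."""
    summary = {}
    for var, var_codes in codes.items():
        n = sum(row[var] in var_codes for row in data)
        summary[var] = (n, 100 * n / len(data))
    return summary


report = missing_summary(rows, missing_codes)
for var, (n, pct) in report.items():
    print(f"{var}: {n} missing ({pct:.0f}%)")
```

The counts and percentages from a check like this are also what you would report to the reader when describing your findings.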

This website is maintained by the University of Maryland School of Nursing (UMSON) Office of Learning Technologies. The UMSON logo and all other contents of this website are the sole property of UMSON and may not be used for any purpose without prior written consent. Links to other websites do not constitute or imply an endorsement of those sites, their content, or their products and services. Please send comments, corrections, and link improvements to nrsonline@umaryland.edu.