Online Course

NRSG 795: BIOSTATISTICS FOR EVIDENCE-BASED PRACTICE

Module 10: Other Common Uses of Statistics and the Need for Good Data

Measuring Quality of Data

When reading research studies, you should always carefully review the quality of the measures. You want to feel confident that the variable used in the analysis reflects reality and measures what it was supposed to (click here for examples of two types of measurement errors). There are many ways to evaluate quality. You might ask yourself: how well does the measure quantify the concept it is intended to represent? This is validity. How consistent and reproducible is the measure? This is reliability.

An understanding of the concepts of reliability and validity is fundamental to selecting appropriate measures for a study. While these topics were likely covered in other research design classes (click here for a mini review), in this module we point out that you now know the statistical tests that can be used to assess reliability and validity. For example, Pearson correlation and linear regression may be used to assess convergent validity.

Reliability

The reliability of a test is the extent to which the test is consistent or dependable.

In measurement terms, it is the extent to which a test is free from random error. Thus, we use statistics to quantify and test for reliability. There are several common approaches to assessing reliability:

  1. Test-retest reliability. The extent to which scores on the same measure, administered at two different times, correlate with each other.
  2. Equivalent forms reliability. The extent to which scores on similar, but not identical, measures administered at two different times correlate with each other.
  3. Internal consistency reliability. The extent to which scores on the items of a scale correlate with each other. This is usually assessed with a coefficient alpha, the most common of which is Cronbach’s alpha. It is interpreted like a Pearson correlation, where 0 = no agreement and 1 = perfect agreement. As a rule of thumb, a Cronbach’s alpha of .7 or higher is required for a test to be considered “good.” Higher values (>.8 or even >.9) may be required for instruments used in clinical practice, such as making a diagnosis based on an instrument.
  4. Interrater reliability. The extent to which the ratings of two or more judges correlate with each other. If the measure is a scale (e.g., agitation), then coefficient alpha can be used. If the measure is nominal (e.g., socializing appropriately, yes/no), then a kappa statistic can be used. This reflects the degree of agreement among judges. Like Cronbach’s alpha, kappa ranges from 0 (indicating that agreement among the judges is no better than chance) to 1 (indicating perfect agreement). A brief computational sketch of both Cronbach’s alpha and kappa follows this list.
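
The following is a minimal computational sketch of the two statistics named above, written in Python with numpy. All of the respondent data, judges' ratings, and variable names are invented for illustration only; in practice these values would come from your study, and statistical software reports these coefficients directly.

    import numpy as np

    # Hypothetical data: 6 respondents answering 4 items of a scale (invented values)
    items = np.array([
        [3, 4, 3, 4],
        [2, 2, 3, 2],
        [5, 4, 5, 5],
        [1, 2, 1, 2],
        [4, 4, 4, 5],
        [3, 3, 2, 3],
    ])

    # Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_score_variance = items.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_score_variance)
    print(f"Cronbach's alpha: {alpha:.2f}")

    # Cohen's kappa for two judges rating the same cases (1 = yes, 0 = no; invented values)
    judge1 = np.array([1, 0, 1, 1, 0, 1, 0, 1])
    judge2 = np.array([1, 0, 1, 0, 0, 1, 0, 1])
    observed = np.mean(judge1 == judge2)  # proportion of exact agreement
    expected = (judge1.mean() * judge2.mean()
                + (1 - judge1.mean()) * (1 - judge2.mean()))  # agreement expected by chance
    kappa = (observed - expected) / (1 - expected)
    print(f"Cohen's kappa: {kappa:.2f}")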

Validity

A good measure must not only be reliable, but also valid. The validity of a test reflects the extent to which the test actually measures what it claims to.

This relates to how well the concept you are measuring is operationalized. If you conducted a study using a Fitbit (which records physical activity), the measure would likely be reliable if you followed the directions for use. However, if you claimed that it was assessing the conceptual variable “overall health status,” the Fitbit data would likely not be a valid measure, since physical activity does not capture overall health.

There are several types of validity, and their labeling and categorization can vary slightly among scholars. However, the following are commonly used validity types, some of which are assessed with statistics:

  • Face validity. The extent to which the measured variable appears to be an adequate measure of the concept. This is often assessed subjectively.
  • Content validity. The extent to which the measured variable appears to adequately cover the full domain of the conceptual variable. This too may be assessed subjectively, with a panel of experts who know the domains of the conceptual variable. However, statistical testing (i.e., factor analysis) can be used to examine how individual items relate to the subscales and to the total scale.
  • Construct validity. The extent to which inferences can legitimately be made from how the construct was operationalized in a study to the theoretical constructs that were proposed based on theory. There are two forms of construct validity:
    • Convergent validity. The extent to which a measured variable is found to be related to another measured variable designed to measure the same concept. If the two measures meet the assumptions of the test, a Pearson correlation can be used.
    • Discriminant validity. The extent to which a measured variable is found to be unrelated to other measured variables designed to measure other conceptual variables. A Pearson r can also be used here if it is a simple comparison.
  • Criterion validity. The extent to which one type of measure (e.g., activity measured by self-report) correlates with a behavioral measure (e.g., activity measured by a Fitbit).

    There are two forms of criterion validity:
    • Concurrent validity is the extent to which one measurement is backed up by a related measurement obtained at about the same point in time.
    • Predictive validity is the extent to which a current measure correlates with a future behavior or outcome.

In both cases, correlation and regression can be used; a brief sketch follows.
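
The following is a minimal Python sketch of these analyses, assuming numpy and scipy are available. A Pearson correlation compares two hypothetical measures of the same concept (convergent validity), and a simple linear regression relates a current baseline measure to a hypothetical future outcome (predictive validity). All scores and variable names are invented for illustration; in practice you would substitute the actual study measures.

    import numpy as np
    from scipy import stats

    # Convergent validity: hypothetical scores on two measures of the same concept,
    # e.g., an established activity questionnaire and a new activity questionnaire
    established = np.array([12, 25, 31, 18, 40, 22, 35, 15, 28, 20])
    new_measure = np.array([14, 22, 33, 20, 38, 25, 30, 13, 27, 23])
    r, p_value = stats.pearsonr(established, new_measure)
    print(f"Convergent validity: r = {r:.2f}, p = {p_value:.3f}")

    # Predictive validity: does a current (baseline) measure predict a future outcome?
    baseline = np.array([10, 14, 9, 20, 16, 12, 18, 11, 15, 13])
    future_outcome = np.array([22, 30, 20, 41, 33, 26, 38, 24, 31, 27])
    result = stats.linregress(baseline, future_outcome)
    print(f"Predictive validity: slope = {result.slope:.2f}, "
          f"r = {result.rvalue:.2f}, p = {result.pvalue:.3f}")

A strong positive r between the two measures would support convergent validity, while a near-zero r with a measure of a different concept would support discriminant validity.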

Required Readings and Videos

Learning Activity

  • After reading the following abstracts, identify whether particular statements are providing evidence of reliability or validity, and identify the specific type/approach used.
  • Check your responses here.
