Biostatistics and Research Design: Measurement Scales, 2x2 Tables, Study Hierarchy, and Bias

BASIC · EP 05 · BIOSTATISTICS

Before You Listen

Episode Setup

Topic in one line: the foundational biostatistics that the American Board of Physical Medicine and Rehabilitation (ABPMR) Part I tests every cycle, namely the four measurement scales (nominal, ordinal, interval, ratio) and why the Functional Independence Measure (FIM) is ordinal (so it requires non-parametric tests); central tendency and variability with the standard-deviation versus standard-error trap; the normal distribution and the 68-95-99.7 rule with Z-scores versus T-scores and the T-score ≤ −2.5 osteoporosis threshold; the 2x2 contingency table for sensitivity (SnNout) and specificity (SpPin), positive predictive value (PPV) and negative predictive value (NPV) and their dependence on prevalence, parallel (OR rule, screen) versus serial (AND rule, confirm) testing, and likelihood ratios (LR+ greater than 10 strong; LR− less than 0.1 strong); the evidence hierarchy from meta-analysis down to case reports; the measure-of-association map (relative risk for randomized controlled trials and cohorts; odds ratio for case-control; hazard ratio for survival analysis); incidence versus prevalence; internal versus external validity; and the bias catalog (selection, Berkson, recall, observer, lead-time, length-time, Hawthorne, attrition, publication, confounding).
Prerequisites: comfort with the Functional Independence Measure (FIM) as a clinical instrument, basic graphical interpretation of bell curves, and the conceptual difference between absolute and relative risk.
Runtime: approximately 35 minutes for Part 1.
Scope boundary: Part 1 builds the foundation (measurement scales, descriptive statistics, diagnostic-test math, study-design hierarchy, and the bias catalog). Part 2 turns those foundations into decision tools: Type I and Type II errors, power, p-values, confidence intervals, number needed to treat (NNT) and number needed to harm (NNH), the statistical-test selection grid, reliability and validity (intraclass correlation coefficient, Cohen's kappa), minimal clinically important difference, and the four pillars of bioethics with informed consent.

Vignette. A new screening test for early hand osteoarthritis has a sensitivity of 95 percent and a specificity of 90 percent. The test is applied to a low-prevalence community population (true disease prevalence 1 percent) of 10,000 patients and to a high-prevalence rheumatology clinic population (true disease prevalence 50 percent) of 1,000 patients.

Calculate the positive predictive value (PPV) in each population. Why is the same test, with identical sensitivity and specificity, so much less useful in the low-prevalence community setting? How would a likelihood ratio describe the same test performance without the prevalence dependence?

(Answer at the end of this chapter)

Section 1: Measurement Scales, Central Tendency, and the Normal Distribution

BASIC-05 · ~02:30

Bottom line: there are four measurement scales (nominal, ordinal, interval, ratio), and which scale your data live on dictates which statistical tests you may legitimately run. The single most board-tested point in PM&R biostatistics is that the Functional Independence Measure (FIM) is ordinal, not interval, so its central-tendency measure is the median and its comparisons require non-parametric tests (Mann-Whitney U for two independent groups, Wilcoxon signed-rank for paired data, Kruskal-Wallis for three or more groups). Central tendency is summarized by the mean (sensitive to outliers), the median (resistant to outliers, preferred for skewed data), or the mode. In a positive skew the tail extends right and mean > median > mode; in a negative skew the tail extends left and mean < median < mode. Standard deviation describes the spread of the data; standard error (SD divided by the square root of n) describes the precision of the sample mean. The 68-95-99.7 rule defines the normal distribution: approximately 68 percent of values fall within one standard deviation of the mean, 95 percent within two, and 99.7 percent within three. The T-score compares the patient to a young healthy reference population and defines osteoporosis at ≤ −2.5, while the Z-score compares the patient to age-matched peers.

Biostatistics begins before any test is chosen, because the type of data determines what mathematics are legitimate. The four measurement scales form a strict hierarchy.

Nominal data are categories with no inherent order. Blood type (A, B, AB, O), sex, eye color, and diagnosis are nominal. You can count nominal data, calculate frequencies, and report the mode, but you cannot rank, add, or subtract. Blood type A is not “more than” blood type B.

Ordinal data are categories with a meaningful order but unequal intervals between categories. The most heavily tested example in PM&R is the Functional Independence Measure. Each FIM item is scored from 1 through 7, and a score of 7 represents greater independence than a score of 5, but the difference between a 1 and a 2 is not the same magnitude of functional change as the difference between a 5 and a 6. Moving a patient from total assistance (1) to maximal assistance (2) is not the same physical and neurologic distance as moving them from supervision (5) to modified independence (6). The intervals are unequal by clinical definition.

That non-equivalence carries an unavoidable statistical consequence. Because the gaps between the numbers are not equal, the appropriate measure of central tendency for FIM scores is the median, and the appropriate tests are non-parametric: the Mann-Whitney U test for two independent groups, the Wilcoxon signed-rank test for paired data (admission versus discharge in the same patient), and the Kruskal-Wallis test for three or more groups. Parametric tests such as the t-test are not legitimate on raw FIM data, even though clinics sum FIM scores pragmatically every day to track progress and justify insurance reimbursement. The same ordinal-not-interval rule applies to other classic PM&R scales: the Modified Ashworth Scale, manual muscle testing (0 through 5), the Likert scales used on patient-reported outcomes, the Glasgow Coma Scale, the ASIA Impairment Scale, the Barthel Index, and the modified Rankin Scale.
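As a rough illustration of why rank-based methods suit ordinal scores, the sketch below reports medians and computes the Mann-Whitney U statistic by direct pairwise counting. The FIM item scores are invented for illustration, the function name `mann_whitney_u` is hypothetical, and the code stops at the U statistic without deriving a p-value.

```python
from statistics import median

# Hypothetical FIM item scores (1-7) for two independent groups.
# These values are made up for illustration; they are not study data.
group_a = [3, 4, 4, 5, 6, 6, 7]
group_b = [2, 3, 3, 4, 4, 5, 5]


def mann_whitney_u(x, y):
    """Return (U_x, U_y) by counting pairwise comparisons.

    Each (x_i, y_j) pair scores 1 for x if x_i > y_j and 0.5 on a tie.
    Only the ORDER of the scores is used, never the size of the gaps,
    which is why the statistic is appropriate for ordinal data.
    """
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0
            elif xi == yj:
                u_x += 0.5
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


u_a, u_b = mann_whitney_u(group_a, group_b)
print("median A:", median(group_a), "| median B:", median(group_b))
print("U_A:", u_a, "| U_B:", u_b)
```

In practice a statistics package would be used to obtain the p-value; the point of the sketch is only that the calculation never adds or averages the raw scores.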

Interval data have equal intervals between values but no true zero. Temperature in Fahrenheit or Celsius is canonical: the difference between 30 and 40 degrees equals the difference between 70 and 80, but 0 degrees does not mean the absence of heat. You can add and subtract interval data, but you cannot say 80 degrees is “twice as hot” as 40, because there is no true zero from which to start a ratio.
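A small sketch can make the "no true zero" point concrete. It converts the Fahrenheit values to Kelvin, a temperature scale that does have a true zero (an addition for illustration, not something the episode discusses), and compares the naive ratio with the physically meaningful one. The helper name `fahrenheit_to_kelvin` is an assumption of this example.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert Fahrenheit (interval scale) to Kelvin (ratio scale)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Differences are meaningful on an interval scale.
diff_low = 40.0 - 30.0
diff_high = 80.0 - 70.0

# Ratios are not: dividing Fahrenheit readings gives a misleading 2.0.
naive_ratio = 80.0 / 40.0
kelvin_ratio = fahrenheit_to_kelvin(80.0) / fahrenheit_to_kelvin(40.0)

print("equal differences:", diff_low == diff_high)
print("naive ratio:", naive_ratio)
print("ratio on a true-zero scale:", round(kelvin_ratio, 3))
```

On the true-zero scale, 80 degrees Fahrenheit is only about 8 percent "hotter" than 40, not twice as hot.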

Ratio data have equal intervals plus a true zero. Weight, height, range of motion in degrees, and time are ratio scaled. Zero kilograms truly means no mass; zero degrees of elbow flexion means the joint is fully extended. Multiplication and division are meaningful: a 100 kg patient is exactly twice as heavy as a 50 kg patient. Nominal and ordinal together are categorical data; interval and ratio together are continuous data.

Figure 5.1 — Four Measurement Scales and the FIM Ordinal Rule

Once the scale is identified, descriptive statistics summarize the dataset. The three measures of central tendency are the mean (arithmetic average, sensitive to outliers), the median (the middle value, the 50th percentile, resistant to outliers), and the mode (the most common value). In a perfectly normal distribution the three are equal and sit at the center of the bell. When the distribution is skewed they separate predictably. In a positive skew the long tail extends to the right toward larger values, pulling the mean rightward: mean > median > mode. A useful clinical analogy is a busy outpatient clinic where most patients have short wait times of 10 to 15 minutes, but one catastrophically complex patient takes 3 hours; the lone outlier yanks the mean upward and the median becomes the more honest representation of the typical wait. In a negative skew the long tail extends to the left toward smaller values, pulling the mean leftward: mean < median < mode. Easy exams where most students score high and a few bomb produce a negative skew. For any skewed distribution the median is the preferred summary because it is not distorted by extreme values.
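The skew ordering above can be checked on small made-up samples. The wait times and exam scores below are invented to mirror the two analogies; they are not real data.

```python
from statistics import mean, median, mode

# Positive skew: most waits are short, one complex patient takes 3 hours.
wait_minutes = [10, 11, 12, 12, 13, 14, 15, 180]

# Negative skew: an easy exam where most score high and one student bombs.
exam_scores = [40, 85, 88, 90, 92, 92, 95]

for label, data in (("wait times", wait_minutes), ("exam scores", exam_scores)):
    print(
        f"{label}: mean={mean(data):.2f}, "
        f"median={median(data)}, mode={mode(data)}"
    )
```

In the wait-time sample the single 180-minute outlier pulls the mean well above the median, while in the exam sample the one low score pulls the mean below it.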

Variability is described by range, variance, and standard deviation. Standard deviation is, roughly, the typical distance of individual data points from the sample mean (formally, the square root of the variance) and describes the spread within the data. A large standard deviation means data are widely scattered; a small standard deviation means data cluster tightly around the mean. The classic board trap is to confuse standard deviation with the standard error of the mean, which equals the standard deviation divided by the square root of the sample size and describes the precision of the sample mean as an estimate of the population mean. As the sample size grows, the standard error shrinks even when the underlying variability does not change. Standard deviation answers the question of how spread out the data are; standard error answers how precisely the sample has located the true population mean.
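A minimal numeric sketch of the trap: hold the standard deviation fixed and watch the standard error fall as the sample grows. The standard deviation of 15 and the sample sizes are arbitrary illustrative choices, and `standard_error` is a hypothetical helper name.

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / sqrt(n)


sd = 15.0  # spread of the data, assumed unchanged across samples
sample_sizes = (25, 100, 400)
errors = [standard_error(sd, n) for n in sample_sizes]

for n, se in zip(sample_sizes, errors):
    print(f"n={n:>3}  SD={sd}  SE={se}")
```

Quadrupling the sample size halves the standard error, while the standard deviation stays exactly where it was.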

The normal (Gaussian) distribution is the bell curve in which data are distributed symmetrically around the mean. The 68-95-99.7 (empirical) rule defines the proportions: approximately 68 percent of observations fall within one standard deviation of the mean, approximately 95 percent within two, and approximately 99.7 percent within three. A population with a mean score of 100 and a standard deviation of 15 has approximately 68 percent of its members scoring between 85 and 115. A related distinction that bone-density questions exploit is the one between Z-scores and T-scores. Both express a measurement as a number of standard deviations from a reference mean; they differ only in the reference population. A Z-score compares an individual’s value to age-matched and sex-matched peers. A T-score compares an individual’s value to a young, healthy reference population. Osteoporosis is defined using the T-score, not the Z-score: a T-score of ≤ −2.5 at the lumbar spine, femoral neck, or total hip defines osteoporosis. A T-score between −1.0 and −2.5 defines osteopenia. The Z-score is reserved for premenopausal women, men under 50, and children, settings in which comparison to age-matched peers is more clinically meaningful.
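The empirical rule and the T-score versus Z-score distinction can be sketched with the standard library. Everything numeric about bone density below (the patient value, reference means, and standard deviations) is invented for illustration, the function names are hypothetical, and treating a T-score of exactly −1.0 as normal is an assumption of this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def fraction_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


def standardized_score(value, ref_mean, ref_sd):
    """Number of SDs that value lies from the chosen reference mean."""
    return (value - ref_mean) / ref_sd


def classify_t_score(t):
    """Apply the thresholds from the text (boundary at -1.0 is assumed)."""
    if t <= -2.5:
        return "osteoporosis"
    if t < -1.0:
        return "osteopenia"
    return "normal"


for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.4f}")

# Hypothetical bone mineral density values in g/cm^2 (illustrative only).
patient_bmd = 0.68
t_score = standardized_score(patient_bmd, ref_mean=1.00, ref_sd=0.12)  # young reference
z_score = standardized_score(patient_bmd, ref_mean=0.80, ref_sd=0.12)  # age-matched peers

print(f"T-score={t_score:.2f} ({classify_t_score(t_score)}), Z-score={z_score:.2f}")
```

The same measurement yields two different scores because only the reference population changes; the diagnostic label comes from the T-score.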

High Yield — Measurement scales, central tendency, normal distribution

Four scales: nominal, ordinal, interval, ratio; categorical = nominal + ordinal; continuous = interval + ratio.
FIM is ordinal, not interval — use median for central tendency and non-parametric tests (Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis).
Other PM&R ordinal scales: Modified Ashworth, manual muscle testing, Glasgow Coma, ASIA, Barthel, Rankin, Likert.
Mean is sensitive to outliers; median preferred for skewed data; mode is the most common value.
Positive skew: tail right; mean > median > mode. Negative skew: tail left; mean < median < mode.
Standard deviation = spread of data. Standard error = SD / √n = precision of the sample mean.
68-95-99.7 rule: 68 percent within 1 SD, 95 percent within 2 SD, 99.7 percent within 3 SD.
T-score vs young reference; osteoporosis ≤ −2.5, osteopenia −1.0 to −2.5. Z-score vs age-matched peers.

Mnemonic — “FIM is ordinal”

If you memorize only one biostatistics fact for the PM&R board examination, make it that the Functional Independence Measure is an ordinal scale. That one distinction dictates the central-tendency measure (median, not mean) and the legitimate statistical tests (Mann-Whitney U for two independent groups, Wilcoxon signed-rank for paired comparisons, Kruskal-Wallis for three or more groups). The Barthel Index and the modified Rankin Scale share the property. Sum scores reported in clinics are pragmatically useful but mathematically impure.

The Functional Independence Measure is an ordinal scale. Not an interval scale. Ordinal. And that one simple distinction completely changes how you are mathematically and honestly legally allowed to analyze your patient’s functional data.

— BASIC-05-a podcast, ~0:09