Physical Medicine & Rehabilitation: Principles and Practice
4th Edition

Chapter 53
Principles and Applications of Measurement Methods
Steven R. Hinderer
Kathleen A. Hinderer
Objective measurement provides a scientific basis for communication between professionals, documentation of treatment efficacy, and scientific credibility within the medical community. Federal, state, private third-party payer, and consumer organizations increasingly are requiring objective evidence of improvement as an outcome of treatment. Empirical clinical observation is no longer an acceptable method without objective data to support clinical decision making. The lack of reliability of clinicians’ unaided measurement capabilities is documented in the literature (1,2,3,4,5,6,7), further supporting the importance of objective measures. In addition, comparison of alternative evaluation or treatment methods, when more than one possible choice is available, requires appropriate use of measurement principles (8,9,10,11).
Clinicians and clinical researchers use measurements to assess characteristics, functions, or behaviors thought to be present or absent in specific groups of people. The application of objective measures uses structured observations to compare performances or characteristics across individuals (i.e., to discriminate), or within individuals over time (i.e., to evaluate), or for prognostication based on current status (i.e., to predict) (12,13). It is important to understand the principles of measurement and the characteristics of good measures to be an effective user of the tools. Standards for implementation of tests and measures have been established within physical therapy (14,15), psychology (16), and medical rehabilitation (17) to address quality improvement and ethical issues for the use of clinical measures.
The purpose of this chapter is to discuss the basic principles of tests and measurements and to provide the reader with an understanding of the rationale for assessing and selecting measures that will provide the information required to interpret test results properly. A critical starting point is to define what is to be measured, for what purpose, and at what cost. Standardized measurements meeting these criteria should then be assessed for reliability and validity pertinent to answering the question or questions posed by the user. Measurements that are shown not to be valid or reliable provide misleading information that is ultimately useless (18).
The initial section of this chapter discusses the psychometric parameters used to evaluate tests and measures. Principles of evaluation, testing, and interpretation are detailed in the second section. The third section provides guidelines for objective measurement when a standardized test is not available to measure the behavior, function, or characteristic of interest.
The complexity and diversity of the tests and measures used in rehabilitation medicine clinical practice and research preclude itemized description in a single chapter. Appendix A, which appears at the end of this chapter, provides sources of available objective tests and measures and serves as a resource for the reader to seek further information on measures in their domain or domains of interest. Reviewing the references provided in Appendix A in conjunction with the principles provided in this chapter will enable the reader to become a more sophisticated user of objective measurement tools. Although there are several good measures listed in Appendix A, there is much developmental work that needs to be completed for many of these tests. A measurement is not objective unless adequate levels of reliability have been demonstrated (18). Therefore, it is imperative that the user be able to recognize the limitations of these tests to avoid inadvertent misuse or misinterpretation of test results.
PSYCHOMETRIC PARAMETERS USED TO EVALUATE TESTS AND MEASURES
The methods developed primarily in the psychology literature to evaluate objective measures generally are applicable to the standardized tests and instruments used in rehabilitation medicine. The topics discussed in this section are the foundation for all useful measures. Measurement tools must have defined levels of measurements for the trait or traits to be assessed and a purpose for obtaining the measurements. Additionally, tests and measures need to be practical, reliable, and valid.
Levels of Measurement
Tests and measures come in multiple forms because of the variety of parameters measured in clinical practice and research.
Despite the seemingly overwhelming number of measures, there are defined levels of measurement that determine how test results should be analyzed and interpreted (19). The four basic levels of measurement data are nominal, ordinal, interval, and ratio. Nominal and ordinal scales are used to classify discrete measures because the scores produced fall into discrete categories. Interval and ratio scales are used to classify continuous measures because the scores produced can fall anywhere along a continuum within the range of possible scores.
A nominal scale is used to classify data that do not have a rank order. The purpose of a nominal scale is to categorize people or objects into different groups based on a specific variable. An example of nominal data is diagnosis.
Ordinal data are operationally defined to assign individuals to categories that are mutually exclusive and discrete. The categories have a logical hierarchy, but it cannot be assumed that the intervals are equal between each category, even if the scale appears to have equal increments. Ordinal scales are the most commonly used level of measurement in clinical practice. Examples of ordinal scales are the manual muscle test scale (20,21,22,23,24) and functional outcome measures (e.g., Functional Independence Measure) (25).
Interval data, unlike nominal and ordinal scales, are continuous. An interval scale has sequential units with numerically equal distances between them. Interval data often are generated from quantitative instrumentation as opposed to clinical observation. An example of an interval measurement is range-of-motion scores reported in degrees.
A ratio scale is an interval scale on which the zero point represents a total absence of the quantity being measured. An example is force scores obtained from a quantitative muscle strength testing device.
Interval and ratio scales are more sophisticated and complex than nominal and ordinal scales. The latter are more common because they are easier to create. However, analysis of nominal and ordinal scales requires special consideration to avoid misinference from test results (26,27). The major controversies surrounding the use of these scales are the problems of unidimensionality and whether scores of items and subtests can be summed to provide an overall score. Continuous scales have a higher sensitivity of measurement and allow more rigorous statistical analyses to be performed.
Purpose of Testing
After the level of the measure has been selected, the purpose of testing must be examined. Tests generally serve one of two purposes: screening or in-depth assessment of specific traits, behaviors, or functions.
SCREENING TESTS
Screening tests have three possible applications:
  • To discriminate between “suspect” and “normal” patients
  • To identify people needing further assessment
  • To assess a number of broad categories superficially
One example of a screening test is the Test of Orientation for Rehabilitation Patients, administered to individuals who are confused or disoriented secondary to traumatic brain injury, cerebrovascular accident, seizure disorder, brain tumor, or other neurologic events (28,29,30,31). This test screens for orientation to person and personal situation, place, time, schedule, and temporal continuity. Another well-developed screening test is the Miller Assessment for Preschoolers (MAP) (32). This test screens preschoolers for problems in the following areas: sensory and motor, speech and language, cognition, behaviors, and visual-motor integration.
The advantages of screening tests are that they are brief and sample a broad range of behaviors, traits, or characteristics. They are limited, however, because of an increased frequency of false-positive results that is due to the small sample of behaviors obtained. Screening tests should be used cautiously for diagnosis, placement, or treatment planning. They are used most effectively to indicate the need for more extensive testing and treatment of specific problem areas identified by the screening assessment.
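Although the chapter does not work through the arithmetic, the trade-off between screening breadth and false-positive results can be made concrete with a 2 × 2 comparison against a criterion assessment. The following minimal Python sketch uses entirely hypothetical counts; the numbers are illustrative, not drawn from any published screening study.

```python
# Hypothetical screening results compared against a gold-standard assessment.
true_positives = 18    # screen "suspect", criterion abnormal
false_positives = 12   # screen "suspect", criterion normal
true_negatives = 160   # screen "normal", criterion normal
false_negatives = 4    # screen "normal", criterion abnormal

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
false_positive_rate = 1 - specificity

print(f"Sensitivity: {sensitivity:.2f}")                   # 0.82
print(f"Specificity: {specificity:.2f}")                   # 0.93
print(f"False-positive rate: {false_positive_rate:.2f}")   # 0.07
```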
ASSESSMENT TESTS
Assessment tests have four possible applications:
  • To evaluate specific behaviors in greater depth
  • To provide information for planning interventions
  • To determine placement into specialized programs
  • To provide measurements to monitor progress
An example of an assessment measure is the Boston Diagnostic Aphasia Examination (33). The advantages of assessment measures are that they have a lower frequency of false-positive results; they assess a representative set of behaviors; they can be used for diagnosis, placement, or treatment planning; and they provide information regarding the functional level of the individual tested. The limitations are that an extended amount of time is needed for testing, and they generally require specially trained personnel to administer, score, and interpret the results.
Criterion-Referenced versus Norm-Referenced Tests
Proper interpretation of test results requires comparison with a set of standards or expectations for performance. There are two basic types of standardized measures: criterion-referenced and norm-referenced tests.
CRITERION-REFERENCED TESTS
Criterion-referenced tests are those for which the test score is interpreted in terms of performance on the test relative to the continuum of possible scores attainable (18). The focus is on what the person can do or what he or she knows rather than how he or she compares with others (34). Individual performance is compared with a fixed expected standard rather than a reference group. Scores are interpreted based on absolute criteria, for example, the total number of items successfully completed. Criterion-referenced tests are useful to discriminate between successive performances of one person. They are conducted to measure a specific set of behavioral objectives. The Tufts Assessment of Motor Performance (which has undergone further validation work and has been renamed the Michigan Modified Performance Assessment) is an example of a criterion-referenced test (35,36,37,38,39). This assessment battery measures a broad range of physical skills in the areas of mobility, activities of daily living, and physical aspects of communication.
NORM-REFERENCED TESTS
Norm-referenced tests use a representative sample of people who are measured relative to a variable of interest. Norm referencing permits comparison of a single person’s measurement with those scores expected for the rest of the population. The normal values reported should be obtained from, and reported for, clearly described populations. The normal population should be the same as those for whom the test was designed to detect abnormalities (34). Reports of norm-referenced test results should use scoring procedures that reflect the person’s position relative to the normal distribution (e.g., percentiles, standard scores). Measures of central tendency (e.g., mean, median, mode) and variability (e.g., standard deviation, standard error of the mean) also should be reported to provide information on the range of normal scores, assisting with determination of the clinical relevance of test results. An example of a norm-referenced test is the Peabody Developmental Motor Scale (40). This developmental test assesses fine and gross motor domains. Test items are classified into the following categories: grasp, hand use, eye-hand coordination, manual dexterity, reflexes, balance, nonlocomotor, locomotor, and receipt and propulsion of objects.
Practicality
A test or instrument should ideally be practical, easy to use, insensitive to outside influences, inexpensive, and designed to allow efficient administration (41). For example, it is not efficient to begin testing in a supine position, switch to a prone position, then return to supine. Test administration should be organized to complete all testing in one position before switching to another. Instructions for administering the test should be clear and concise, and scoring criteria should be clearly defined. If equipment is required, it must be durable and of good quality. Qualifications of the tester and additional training required to become proficient in test administration should be specified. The time to administer the test should be indicated in the test manual. The duration of the test and level of difficulty need to be appropriate relative to the attention span and perceived capabilities of the patient being tested. Finally, the test manual should provide summary statistics and detailed guidelines for appropriate use and interpretation of test scores based on the method of test development.
Reliability and Agreement
A general definition of reliability is the extent to which a measurement provides consistent information (i.e., is free from random error). Granger and associates (42) provide the analogy “it may be thought of as the extent to which the data contain relevant information with a high signal-to-noise ratio vs. irrelevant static confusion.” In contrast, agreement is defined as the extent to which identical measurements are made. Reliability and agreement are distinctly different concepts and are estimated using different statistical techniques (43). Unfortunately, these concepts and their respective statistics often are treated synonymously in the literature.
The level of reliability is not necessarily congruent with the degree of agreement. It is possible for ratings to cluster consistently toward the same end of the scale, resulting in high-reliability coefficients, and yet these judgments may or may not be equivalent. High reliability does not indicate whether the raters absolutely agree. It can occur concurrently with low agreement when each rater scores patients differently, but the relative differences in the scores are consistent for all patients rated. Conversely, low reliability does not necessarily indicate that raters disagree. Low-reliability coefficients can occur with high agreement when the range of scores assigned by the raters is restricted or when the variability of the ratings is small (i.e., in a homogeneous population). In instances in which the scores are fairly homogeneous, reliability coefficients lack the power to detect relationships and are often depressed, even though agreement between ratings may be relatively high. The reader is referred to Tinsley and Weiss for examples of these concepts (44). Both reliability and agreement must be established on the target population or populations to which the measure will be applied, using typical examiners. There are five types of reliability and agreement:
  • Interrater
  • Test-retest
  • Intertrial
  • Alternate form
  • Population specific
Each type will be discussed below, along with indications for calculating reliability versus agreement and their respective statistics.
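Before examining each type, the distinction between reliability and agreement described above can be demonstrated numerically. The following minimal Python sketch (hypothetical ratings) shows two raters whose scores differ by a constant offset: their relative orderings correlate perfectly, yet they never assign identical scores.

```python
import numpy as np

# Hypothetical ratings of five patients by two examiners.
# Rater B consistently scores 2 points higher than rater A.
rater_a = np.array([3, 5, 7, 9, 11])
rater_b = rater_a + 2

# Reliability (consistency of relative standing): perfect.
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"Pearson r (reliability): {r:.2f}")   # 1.00

# Agreement (identical scores): none.
exact_agreement = np.mean(rater_a == rater_b) * 100
print(f"Exact agreement: {exact_agreement:.0f}%")   # 0%
```

This is the situation described earlier in which high reliability coexists with low agreement.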
INTERRATER RELIABILITY AND AGREEMENT
Interrater or interobserver agreement is the extent to which independent examiners agree exactly on a patient’s performance. In contrast, interrater reliability is defined as the degree to which the ratings of different observers are proportional when expressed as deviations from their means; that is, the relationship of one rated person to other rated people is the same, although the absolute numbers used to express the relationship may vary from rater to rater (44). The independence of the examiners in the training they receive and the observations they make is critical in determining interrater agreement and reliability. When examiners have trained together or confer when performing a test, the interrater reliability or agreement coefficient calculated from their observations may be artificially inflated.
An interrater agreement or reliability coefficient provides an estimate of how much measurement error can be expected in scores obtained by two or more examiners who have independently rated the same person. Determining interrater agreement or reliability is particularly important for test scores that largely depend on the examiner’s skill or judgment. An acceptable level of interrater reliability or agreement is essential for comparison of test results obtained from different clinical centers. Interrater agreement or reliability is a basic criterion for a measure to be called objective. If multiple examiners consistently obtain the same absolute or relative scores, then it is much more likely that the score is a function of the measure, rather than of the collective subjective bias of the examiners (18).
Pure interrater agreement and reliability are determined by having one examiner administer the test while the other examiner or examiners observe and independently score the person’s performance at the same point in time. For some parameters, however, the skill of the examiner administering the test plays a vital role (e.g., sensory testing, range-of-motion testing), or hands-on testing by each examiner is required (e.g., strength), making it impossible to assess pure interrater agreement and reliability. In these instances, each examiner must test the individual independently. Consequently, these interrater measures are confounded by factors of time and variation in patient performance.
TEST-RETEST RELIABILITY AND AGREEMENT
Test-retest agreement is defined as the extent to which a patient receives identical scores during two different test sessions when rated by the same examiner. In contrast, test-retest reliability assesses the degree of consistency in how a person’s score is rank ordered relative to other people tested by the same examiner during different test sessions. Test-retest reliability is the most basic and essential form of reliability. It provides an estimate of the variation in patient performance on a different test day, when retested by the same examiner. Some of the error in a test-retest situation also may be attributed to variations in the examiner’s performance. It is important to determine the magnitude of day-to-day fluctuations in performance so that true changes in the parameters of interest can be determined. Variability of the test or how it is administered should not be the source of observed changes over time. Additionally, with quantitative measuring instruments, the examiner must be knowledgeable in the method of and frequency required for instrument calibration.
The suggested test-retest interval is 1 to 3 days for most physical measures and 7 days for maximal effort tests in which muscle fatigue is involved (45). The test-retest interval should not exceed the expected time for change to occur naturally. The purpose of an adequate but relatively short interval is to minimize the effects of memory, practice, and maturation or deterioration on test performance (46).
INTERTRIAL RELIABILITY AND AGREEMENT
Intertrial agreement provides an estimate of the stability of repeated scores obtained by one examiner within a test session. Intertrial reliability assesses the consistency of one examiner rank-ordering repeated trials obtained from patients using the same measurement tool and standardized method for testing and scoring results within a test session. Intertrial agreement and reliability also are influenced by individual performance factors such as fatigue, motor learning, motivation, and consistency of effort. Intertrial agreement and reliability should not be confused with test-retest agreement and reliability. The latter involves test sessions usually separated by days or weeks as opposed to seconds or minutes for intertrial agreement and reliability. A higher level of association is expected for results obtained from trials within a test session than those from different sessions.
ALTERNATE FORM RELIABILITY AND AGREEMENT
Alternate form agreement refers to the consistency of scores obtained from two forms of the same test. Equivalent or parallel forms are different test versions intended to measure the same traits at a comparable level of difficulty. Alternate form reliability refers to whether the parallel forms of a test rank order people’s scores consistently relative to each other. A high level of alternate form agreement or reliability may be required if a person must be tested more than once and a learning or practice effect is expected. This is particularly important when one form of the test will be used as a pretest and a second as a posttest.
POPULATION-SPECIFIC RELIABILITY AND AGREEMENT
Population-specific agreement and reliability assess the degree of absolute and relative reproducibility, respectively, that a test has for a specific group being measured (e.g., Ashworth scale scores for rating severity of spasticity from spinal cord injury). A variation of this type of agreement and reliability refers to the population of examiners administering the test (18).
INTERPRETATION OF RELIABILITY AND AGREEMENT STATISTICS
Because measures of reliability and agreement are concerned with the degree of consistency or concordance between two or more independently derived sets of scores, they can be expressed in terms of correlation coefficients (34). The reliability coefficient is usually expressed as a value between 0 and 1, with higher values indicating higher reliability. Agreement statistics can range from -1 to +1, with +1 indicating perfect agreement, 0 indicating chance agreement, and negative values indicating less than chance agreement. The coefficient of choice varies, depending on the data type analyzed. The reader is referred to Bartko and Carpenter (47), Hartmann (48), Hollenbeck (49), Liebetrau (50), and Tinsley and Weiss (44) for discussions of how to select appropriate statistical measures of reliability and agreement. Table 53-1 provides information on appropriate statistical procedures for calculating interrater and test-retest reliability and agreement for discrete and continuous data types. No definitive standards for minimum acceptable levels of the different types of reliability and agreement statistics have been established; however, guidelines for minimum levels are provided in Table 53-1. The acceptable level varies, depending on the magnitude of the decision being made, the population variance, the sources of error variance, and the measurement technique (e.g., instrumentation versus behavioral assessments). If the population variance is relatively homogeneous, lower estimates of reliability are acceptable. In contrast, if the population variance is heterogeneous, higher estimates of reliability are expected. Critical values of correlation coefficients, based on the desired level of significance and the number of subjects, are provided in tables in measurement textbooks (51,52). It is important to note that a correlation coefficient that is statistically significant does not necessarily indicate that adequate reliability or agreement has been established, because the significance level only provides an indication that the coefficient is significantly different from zero (see Table 53-1).
TABLE 53-1. Interrater Reliability, Test-Retest Reliability, and Agreement Analysis: Appropriate Statistics and Minimum Acceptable Levels

                    Reliability Analysis        Agreement Analysis
Data Type           Statistic       Level       Statistic       Level
Discrete
   Nominal          ICC or κW       >0.75       κ               >0.60
   Ordinal          ICC             >0.75       κW              >0.60
Continuous
   Interval         ICC             >0.75       χ2 and T        P < 0.05
   Ratio            ICC             >0.75       χ2 and T        P < 0.05

References: ICC—discrete (47,55), ordinal (47), continuous (44,47), minimal acceptable level (56); Cohen’s κ—κ (44,47,57,58), κW (47,59,60), κW equivalence with ICC for reliability analysis of nominal data (61,62,63,64), minimal acceptable level (65); Lawlis and Lu’s χ2 and T—statistics and minimal acceptable level (43,44).
ICC, intraclass correlation; κ, kappa; κW, weighted kappa; T, T index.
Agreement and reliability both are important for evaluating patient ratings. As discussed earlier, these are distinctly different concepts and require separate statistical analysis. Several factors must be considered to determine the relative importance of each. Decisions that carry greater weight or impact for the people being assessed may require more exact agreement. If the primary need is to assess the relative consistency between raters, and exact agreement is less critical, then a reliability measure alone is a satisfactory index. In contrast, whenever the major interest is either the absolute value of the score, or the meaning of the scores as defined by the points on the scale (e.g., criterion-referenced tests), agreement should be reported in addition to the reliability (44). Scores generated from instrumentation are expected to have a higher level of reliability or agreement than scores obtained from behavioral observations.
A test score actually consists of two different components: the true score and the error score (34,53). A person’s true score is a hypothetical construct, indicating a test score that is unaffected by chance factors. The error score refers to unwanted variation in the test score (54). All continuous scale measurements have a component of error, and no test is completely reliable. Consequently, reliability is a matter of degree. Any reliability coefficient may be interpreted directly in terms of percentage of score variance attributable to different sources (18). A reliability coefficient of 0.85 signifies that 85% of the variance in test scores depends on true variance in the trait measured and 15% depends on error variance.
SPECIFIC RELIABILITY AND AGREEMENT STATISTICS
There are several statistical measures for estimating interrater agreement and reliability. Four statistics commonly used to determine agreement are the frequency ratio, point-by-point agreement ratio, kappa (κ) coefficients, and Lawlis and Lu’s χ2 and T-index statistics. For reliability calculations, the most frequently used correlation statistics are the Pearson product-moment (Pearson r) and intraclass correlation coefficients (ICC). When determining reliability for dichotomous or ordinal data, specific ICC formulas have been developed. These nonparametric ICC statistics have been shown to be the equivalent of the weighted kappa (κw) (55,56,57,58). Consequently, the κw also can be used as an index of reliability for discrete data, and the values obtained can be directly compared with equivalent forms of ICCs (56). The method of choice for reliability and agreement analyses partially depends on the assessment strategy used (44,47,50,59). In addition to agreement and reliability statistics, standard errors of measurement (SEM) provide a clinically relevant index of reliability expressed in test score units. Each statistic is described below.
Frequency Ratio
This agreement statistic is indicated for frequency count data (46). A frequency ratio of the two examiners’ scores is calculated by dividing the smaller total by the larger total and multiplying by 100. This statistic is appealing because of its computational and interpretive simplicity. There are a variety of limitations, however. It only reflects agreement of the total number of behaviors scored by each observer; there is no way to determine whether there is agreement for individual responses using a frequency ratio. The value of this statistic may be inflated if the observed behavior occurs at high rates (59). There is no meaningful lower bound of acceptability (48).
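As a minimal sketch (the counts are hypothetical), the frequency ratio reduces to a single division:

```python
# Total behaviors counted by two independent observers (hypothetical data).
observer_1_total = 47
observer_2_total = 52

# Divide the smaller total by the larger total and multiply by 100.
frequency_ratio = min(observer_1_total, observer_2_total) / max(observer_1_total, observer_2_total) * 100
print(f"Frequency ratio: {frequency_ratio:.1f}%")   # 90.4%

# Note: identical totals can mask disagreement on individual responses,
# which is the central limitation of this statistic.
```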
Point-by-Point Agreement Ratio
This statistic is used to determine if there is agreement on each occurrence of the observed behavior. It is appropriate when there are discrete opportunities for the behavior to occur or for distinct response categories (46,60,61). To calculate this ratio, the number of agreements is totaled by determining the concurrence between observers regarding the presence or absence of observable responses during a given trial, recording interval, or for a particular behavior category. Disagreements are defined as instances in which one observer records a response and the other observer does not. The point-by-point agreement percentage is calculated by dividing the number of agreements by the number of agreements plus disagreements, and multiplying by 100 (61). Agreement generally is considered to be acceptable at a level of 0.80 or above (61).
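A minimal Python sketch of this calculation, using hypothetical interval-by-interval records (1 = behavior observed, 0 = not observed):

```python
# Hypothetical records from two independent observers over ten intervals.
obs_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
obs_2 = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

agreements = sum(a == b for a, b in zip(obs_1, obs_2))
disagreements = len(obs_1) - agreements
ratio = agreements / (agreements + disagreements) * 100
print(f"Point-by-point agreement: {ratio:.0f}%")   # 80%

# Occurrence-only variant: exclude intervals in which neither observer
# recorded the behavior (see the discussion of chance agreement below).
occurrence_pairs = [(a, b) for a, b in zip(obs_1, obs_2) if a or b]
occurrence_ratio = sum(a == b for a, b in occurrence_pairs) / len(occurrence_pairs) * 100
print(f"Occurrence-only agreement: {occurrence_ratio:.0f}%")   # 71%
```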
The extent to which observers are found to agree is partially a function of the frequency of occurrence of the target behavior and of whether occurrence and/or nonoccurrence agreements are counted (60). When the rate of the target behavior is either very high or very low, high levels of interobserver agreement are likely for occurrences or nonoccurrences, respectively. Consequently, if the frequency of either occurrences or nonoccurrences is high, a certain level of agreement is expected simply owing to chance. In such cases, it is often recommended that agreements be included in the calculation only if at least one observer recorded the occurrence of the target behavior. In this case, intervals during which none of the observers records a response are excluded from the analysis. It is important to identify clearly what constitutes an agreement when reporting point-by-point percentage agreement ratios because the level of reliability is affected by this definition.
Kappa Coefficient
The κ coefficient provides an estimate of agreement between observers, corrected for chance agreement. This statistic is preferred for discrete categorical (nominal and ordinal) data because, unlike the two statistics discussed above, it corrects for chance agreements. In addition, percentage agreement ratios often are inflated when there is an unequal distribution of scores between rating categories. This often is the case in rehabilitation medicine, in which the frequency of normal characteristics is much higher than abnormal characteristics (62,63). In contrast, κ coefficients provide accurate estimates of agreement, even when scores are unequally distributed between rating categories (63).
Kappa coefficients are used to summarize observer agreement and accuracy, determine rater consistency, and evaluate scaled consistency among raters (59). Three conditions must be met to use κ:
  • The patients or research subjects must be independent.
  • The raters must independently score the patients or research subjects.
  • The rating categories must be mutually exclusive and exhaustive (62,63).
The general form of κ is a coefficient of agreement for nominal scales in which all disagreements are treated equally (44,47,50,64,65,66,67). The κw statistic was developed for ordinal data (47,50,68,69), in which some disagreements have greater gravity than others (e.g., the manual muscle testing scale, in which the difference between a score of 2 and 5 is of more concern than the difference between a score of 4 and 5). Refer to the references cited above for formulas used to calculate κ and κw.
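Although the formulas are left to the references, the unweighted κ is simply the observed proportion of agreement (po) corrected for the proportion expected by chance (pe): κ = (po − pe)/(1 − pe). The following minimal Python sketch computes κ for hypothetical nominal ratings:

```python
from collections import Counter

# Hypothetical nominal ratings of 10 patients by two independent raters.
rater_1 = ["A", "A", "B", "B", "A", "C", "B", "A", "C", "B"]
rater_2 = ["A", "A", "B", "A", "A", "C", "B", "B", "C", "B"]

n = len(rater_1)
p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n   # observed agreement

# Chance agreement expected from each rater's marginal distribution.
m1, m2 = Counter(rater_1), Counter(rater_2)
p_e = sum((m1[c] / n) * (m2[c] / n) for c in set(rater_1) | set(rater_2))

kappa = (p_o - p_e) / (1 - p_e)
print(f"Observed: {p_o:.2f}, chance: {p_e:.2f}, kappa: {kappa:.2f}")   # 0.80, 0.36, 0.69
```

Here κ = 0.69 exceeds the minimum acceptable level of 0.60 in Table 53-1, even though the uncorrected agreement ratio of 0.80 overstates the concordance. As an implementation note (not part of the chapter), library routines such as scikit-learn's cohen_kappa_score also compute weighted κ for ordinal data via a weights argument.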
Several other variations of κ have been developed for specific applications. The kappa statistic κv provides an overall measure of agreement, as well as separate indices for each subject and rating category (70). This form of κ can be applied in situations in which subjects are not all rated by the same set of examiners. The variation of κ described by Fleiss et al. is useful when there are more than two ratings per patient (57); a computer program is available to calculate this statistic (62). When multiple examiners rate patients and a measure of overall conjoint agreement is desired, the kappa statistic κm is indicated (71). Standard κ statistics treat all raters or units symmetrically (57). When one or more of the ratings is considered to be a standard (e.g., scores from an experienced rater), alternate analysis procedures should be used (71,72,73).
Lawlis and Lu χ2 and T Index
These measures of agreement are recommended for continuous data (44). They permit the option of defining seriousness of disagreements among raters. A statistically significant χ2 indicates that the observed agreement is greater than that expected owing to chance. The T index is used to determine whether agreement is low, moderate, or high. The reader is referred to Tinsley and Weiss (44) for a discussion of the indications for, calculation of, and interpretation of these statistics.
Pearson Product-Moment Correlation Coefficient
Historically, the Pearson r has been used commonly as an index of reliability. It has limited application, however, because it is a parametric statistic intended for use with continuous bivariate data. The generally accepted minimum level of this coefficient is 0.80; however, levels above 0.90 often are considered more desirable (34,51). The Pearson r provides only an index of the strength of the relationship between scores and is insensitive to consistent differences between scores. Consequently, a linear regression equation must be reported in addition to the Pearson r to indicate the nature of the relationship between the scores (18). Because the Pearson r is limited to the analysis of bivariate data, it is preferable to use an ICC to assess reliability because ICC can be used for either bivariate or multivariate data. The Pearson r and ICC will yield the same result for bivariate data (74).
Intraclass Correlation Coefficients
ICCs provide an index of variability resulting from comparing rating score error with other sources of true score variability (42,52,75). As indicated above, it is the coefficient of choice for reliability analyses. The ICC is based on the variance components from an analysis of variance (ANOVA), which includes not only the between-subject variance, as does the Pearson r, but also other situation-specific variance components, such as alternate test forms, maturation of subjects between ratings, and other sources of true mean differences in the obtained ratings (76). The individual sources of error can be analyzed to determine their percentage contribution to the overall error variance using generalizability analysis (53,54). For further information regarding the use of generalizability theory to distinguish between sources of error, the reader is referred to Brennan (77) and Cronbach and associates (78).
There are six different ICC formulas (54). The correct ICC formula is selected based on three factors:
  • The use of a one-way versus two-way ANOVA
  • The importance of differences between examiners’ mean ratings
  • The analysis of an individual rating versus the mean of several ratings (44,54)
Selection of the proper formula is critical and is based on the reliability study design (54,76,79). It is important to report which type of ICC is used to compute reliability because the calculations are not equivalent. Variations of the ICC formulas also exist for calculating ICCs using dichotomous (80) and ordinal (44) nonparametric data. The marginal distributions do not have to be equal, as was originally proposed for nonparametric ICCs (56). These nonparametric ICC formulas have been demonstrated to be equivalent to weighted κ coefficients, provided that the mean difference between raters is included as a component of variability and the rating categories can be ordered (56).
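As a sketch of how an ICC is assembled from ANOVA variance components, the following Python example computes ICC(2,1), the two-way random-effects, absolute-agreement, single-rating form described by Shrout and Fleiss. The ratings are hypothetical, and formula selection in practice must still follow the design considerations above.

```python
import numpy as np

# Hypothetical ratings: rows = 5 subjects, columns = 3 raters.
x = np.array([
    [9, 2, 5],
    [6, 1, 3],
    [8, 4, 6],
    [7, 1, 2],
    [10, 5, 6],
], dtype=float)
n, k = x.shape

grand = x.mean()
ms_rows = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # between subjects
ms_cols = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # between raters
ss_total = np.sum((x - grand) ** 2)
ms_error = (ss_total - (n - 1) * ms_rows - (k - 1) * ms_cols) / ((n - 1) * (k - 1))

# ICC(2,1): two-way random effects, absolute agreement, single rater.
icc_2_1 = (ms_rows - ms_error) / (
    ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
)
print(f"ICC(2,1) = {icc_2_1:.2f}")   # 0.24
```

Because the rater means differ substantially in these data, the absolute-agreement ICC is low (about 0.24) even though the subjects are rank ordered fairly consistently, which again illustrates the distinction between agreement and consistency.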
Standard Error of Measurement
It has been suggested that measurement error estimates are the most desirable index of reliability (18,34,75). The SEM is an estimate, in test score units, of the random variation of a person’s performance across repeated measures. The SEM is an expression of the margin of error between a person’s observed score and his or her true ability (46). The SEM is an important indicator of the sensitivity of the test to detect changes in a person’s performance over time.
The formula for the SEM is

SEM = SD √(1 − r)

where SD is the standard deviation of the test scores and r is the reliability coefficient for the test scores (34,45,75). Correlating scores from two forms of a test is one of several ways to estimate the reliability coefficient (75) and often is used in psychology when parallel forms of a test are available. In rehabilitation medicine, however, equivalent forms of a test often are not available. The test-retest reliability coefficient therefore is the coefficient of choice for calculating the SEM in most rehabilitation applications because the primary interest is in the variation of subject performance. The SEM is a relatively conservative statistic, requiring larger data samples (approximately 300 to 400 observations) in order to not overestimate the error (15).
It is best to report a test score as a range rather than as an absolute score. The SEM is used to calculate the range of scores (i.e., confidence interval) for a given person; that is, the person’s true performance ability is expected to fall within the range of scores defined by the confidence interval. A person’s score must fall outside of this range to indicate with confidence that a true change in performance has occurred. Based on a normal distribution, a 95% confidence interval would be approximately equal to the mean ±2 SEM. A 95% confidence interval is considered best to use when looking for change over time. This rigorous level of confidence minimizes the likelihood of a type I error (i.e., there is only a 5% chance that differences between scores obtained from a given person during different test sessions will not fall within the 95% confidence interval upper and lower values). Consequently, there is less than a 5% chance that differences between scores exceeding the upper end of the confidence interval are due to measurement error (i.e., they have a 95% chance of representing a true change in performance).
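A short worked sketch in Python (hypothetical numbers: SD = 10 and test-retest r = 0.85) illustrates the SEM and the resulting confidence interval around an observed score:

```python
import math

sd = 10.0        # standard deviation of test scores (hypothetical)
r = 0.85         # test-retest reliability coefficient (hypothetical)
observed = 72    # a person's observed score (hypothetical)

sem = sd * math.sqrt(1 - r)   # standard error of measurement
low, high = observed - 2 * sem, observed + 2 * sem   # approximate 95% confidence interval

print(f"SEM = {sem:.2f}")                   # 3.87
print(f"95% CI: {low:.1f} to {high:.1f}")   # 64.3 to 79.7
# A retest score outside this range suggests a true change in performance.
```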
FACTORS AFFECTING RELIABILITY
There are four sources of measurement error for interrater reliability (18,45):
  • Lack of agreement among scorers
  • Lack of consistent performance by the individual tested
  • Failure of the instrument to measure consistently
  • Failure of the examiner to follow the standardized procedures to administer the test
Threats to test-retest reliability similarly are caused by four factors:
  • The instrument
  • The examiner
  • The patient
  • The testing protocol
Sources and prevention of examiner error will be discussed in the section on principles of evaluation, testing, and interpretation.
There are several factors conducive to good reliability of a measure (45). These factors are the power to discriminate among ability groups; sufficient time allotted so that each patient can show his or her best performance without being penalized for an unrepresentative poor trial; test organization to optimize examinee performance; and test administration and scoring instructions that are clear and precise. Additionally, the testing environment should support good performance, and the examiner must be competent in administering the test. For tests designed to be appropriate for a wide age range, reliability should be examined for each age level rather than for the group as a whole (53).
In summary, reliability and agreement are essential components to any objective measurement. Measurements lacking test-retest reliability contain sufficient error as to be useless because the data obtained do not reflect the variable measured (18). Reliability is an important component of validity, but good reliability or agreement does not guarantee that a measure is valid. A reliable measurement is consistent, but not necessarily correct. However, a measurement that is unreliable cannot be valid.
Validity
Validity is defined as the accuracy with which a test measures that which it is intended to measure. Application of the concept of validity refers to the appropriateness, meaningfulness, and usefulness of a test for a particular situation (18). Validity is initially investigated while a test or instrument is being developed and confirmed through subsequent use. Four basic aspects of validity will be discussed: content, construct, criterion-related, and face validity.
CONTENT VALIDITY
Content validity is the systematic examination of the test content to determine if it covers a representative sample of the behavior domain to be measured. It should be reported in the test manual as descriptive information on the skills covered by the test, number of items in each category, and rationale for item selection. Content validity generally is evidenced by the opinion of experts that the domain sampled is adequate. There are two primary methods that the developer of a test can use for obtaining professional opinions about the content validity of an instrument (81). The first is to provide a panel of experts with the items from the test and request a determination of what the battery of items is measuring. The second method requires providing not only the test items but also a list of test objectives so that experts can determine the relationship between the two. For statistical analysis of content validity, the reader is referred to Thorn and Deitz (82).
CONSTRUCT VALIDITY
Construct validity refers to the extent to which a test measures the theoretical construct underlying the test. Construct validity should be obtained whenever a test purports to measure an abstract trait or theoretical characteristics about the nature of human behavior such as intelligence, self-concept, anxiety, school or work readiness, or perceptual organization. The following five areas must be considered with regard to construct validity in test instruments (34,81).
Age Differentiation
Any developmental changes in children or changes in performance due to aging must be addressed as part of the test development.
Factor Analysis
Factor analysis is a statistical procedure that can be performed on data obtained from testing. The purpose of factor analysis is to simplify the description of behavior by reducing an initial multiplicity of variables to a few common underlying factors or traits that may or may not be pertinent to the construct or constructs that the test was originally designed to measure. The reader is referred to Cronbach (75), Wilson et al. (83), Wright and Masters (84), and Wright and Stone (85) for in-depth discussions of factor analysis. The more recent development of confirmatory factor analysis (86,87) overcomes the relative arbitrariness of traditional factor analysis methods. Confirmatory factor analysis differs from traditional factor analysis in that the investigator specifies, before analysis, the measures that are determined by each factor and which factors are correlated. The specified relationships are then statistically tested for goodness of fit of the proposed model compared with the actual data collected. Confirmatory factor analysis is therefore a more direct assessment of construct validity than is traditional factor analysis. Rasch modeling (88) is a further expansion on confirmatory factor analysis methods for the purpose of establishing construct validity of a measurement tool. Rasch models start with a carefully thought-out and systematically implemented analogy that casts the concepts of the measurement tool in concrete terms, and then use a developmental pathway analogy to develop the Rasch concepts of unidimensionality, fit, difficulty/ability estimation and error, locations for item difficulties, and locations for person abilities.
Internal Consistency
In assessing the attributes of a test, it is helpful to examine the relationship of subscales and individual items to the total score. This is especially important when the test instrument has many components. If a subtest or item has a very low correlation with the total score, the test developer must question the subtest’s validity in relation to the total score. This technique is most useful for providing confirmation of the validity of a homogeneous test. A test that measures several constructs would not be expected to have a high degree of internal consistency. For dichotomous data, the Kuder-Richardson statistic is used to calculate internal consistency (34). Cronbach’s coefficient alpha (α) is recommended when the measure has more than two levels of response (34). The minimum acceptable level of α generally is set at 0.70 (89).
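A minimal Python sketch of coefficient α, which compares the sum of the individual item variances with the variance of the total score (the item scores below are hypothetical):

```python
import numpy as np

# Hypothetical scores: rows = 6 respondents, columns = 4 test items.
items = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
], dtype=float)

k = items.shape[1]
sum_item_variances = items.var(axis=0, ddof=1).sum()
total_variance = items.sum(axis=1).var(ddof=1)

alpha = (k / (k - 1)) * (1 - sum_item_variances / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")   # 0.95, well above the 0.70 minimum
```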
Convergent and Divergent Validity
Construct validity is evidenced further by high correlations with other tests that purport to measure the same constructs (i.e., convergent validity) and low correlations with measures that are designed to measure different attributes (i.e., divergent validity). It is desirable to obtain moderate levels of convergent validity, indicating that the two measures are not measuring identical constructs. If the new test correlates too highly with another test, it is questionable whether the new test is necessary because either test would suffice to answer the same questions. Moderately high but significant correlations indicate good convergent validity, but with each test still having unique components. Good divergent validity is demonstrated by low and insignificant correlations between two tests that measure theoretically unrelated parameters, such as an activities of daily living assessment and a test of expressive language ability.
Discriminant Validity
If two groups known to have different characteristics can be identified and assessed by the test, and if a significant difference between the performance of the two groups is found, then evidence of discriminant validity is present.
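This known-groups approach is commonly checked with an independent-samples t test. A minimal Python sketch with hypothetical scores for a clinical group and a control group:

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for two groups known to differ clinically.
patients = np.array([22, 25, 19, 24, 21, 23, 20])
controls = np.array([31, 29, 33, 30, 28, 32, 34])

t_stat, p_value = stats.ttest_ind(patients, controls)
# A significant difference supports discriminant validity.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```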
CRITERION-RELATED VALIDITY
Criterion-related validity includes two subclasses of validity: concurrent validity and predictive validity (34,46). The commonality between these subclasses of validity is that they refer to multiple measurement of the same construct. In other words, the measure in question is compared with other variables or measures that are considered to be accurate measures of the characteristics or behaviors being tested. The purpose is to use the second measure as a criterion to validate the first measure.
Criterion-related validity can be assessed statistically, providing clear guidelines as to whether a measure is valid. Frequently, the paired measurements from the tests under comparison have different values. The nature of the relationship is less important than the strength of the relationship (18). Ottenbacher and Tomchek (90) showed that the limits of agreement technique provided the most accurate estimate of measurement error when comparing test results, relative to other statistics frequently used for such comparisons.
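The limits of agreement technique (commonly associated with Bland and Altman) summarizes paired differences as a systematic bias plus a range within which most differences fall. A minimal Python sketch with hypothetical paired scores:

```python
import numpy as np

# Hypothetical paired scores from a new measure and an established criterion.
new_measure = np.array([12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4, 16.0])
criterion = np.array([11.5, 16.0, 10.5, 13.8, 10.6, 14.5, 11.0, 15.2])

diffs = new_measure - criterion
bias = diffs.mean()             # systematic difference between measures
sd_diff = diffs.std(ddof=1)
loa = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)   # 95% limits of agreement

print(f"Bias = {bias:.2f}, limits of agreement = {loa[0]:.2f} to {loa[1]:.2f}")
```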
Concurrent Validity
Concurrent validity deals with whether an inference is justifiable at the present time. This is typically done by comparing results of one measure against some criterion (e.g., another measure or related phenomenon). If the correlation is high, the measure is said to have good concurrent validity. Concurrent validity is relevant to tests used for diagnosis of existing status, rather than predicting future outcome.
Predictive Validity
Predictive validity involves a measure’s ability to predict or forecast some future criterion. Examples include performance on another measure in the future, prognostic reaction to an intervention program, or performance in some task of daily living. Predictive validity is difficult to establish and often requires collection of data over an extended period of time after the test has been developed. Hence, very few measures used in rehabilitation medicine have established predictive validity. A specific subset of predictive validity that is important to rehabilitation medicine practice is ecological validity. This concept involves the ability to identify impairments, functional limitations, and performance deficits within the context of the person’s own environment. Measures with good concurrent validity sometimes are presumed to have good predictive validity, but this may not be a correct assumption. Unless predictive validity information exists for a test, extreme caution should be exercised in interpreting test results as predictors of future behavior or function.
FACE VALIDITY
Face validity is not considered to be an essential component of the validity of a test or measure. It reflects only whether a test appears to measure what it is supposed to, based on the personal opinions of those either taking or giving the test (91). A test with high face validity has a greater likelihood of being more rigorously and carefully administered by the examiner, and the person being tested is more likely to give his or her best effort. Although it is not essential, in most instances, face validity is still an important component of test development and selection. Exceptions include personality and interest tests when the purpose of testing is concealed to prevent patient responses from being biased.
SUMMARY
The information discussed in this section provides the basis for critically assessing available tests and measures. The scale of the test or instrument should be sufficiently sophisticated to discriminate adequately between different levels of the behavior or function being tested. The purposes for testing must be identified, and the test chosen should have been developed for this purpose. The measure selected should be practical from the standpoint of time, efficiency, budget, equipment, and the population being tested. Above all, the measure must have acceptable reliability, agreement, and validity for the specific application for which it is selected. Reliability, agreement, and validity are important for both clinical and research applications. The power of statistical tests depends on adequate levels of reliability, agreement, and validity of the dependent measures (92). Consequently, it is essential that adequate levels of reliability, agreement, and validity be assessed and reported for dependent measures used in research studies.
For additional information on the test development process, the reader is referred to Miller (93). For information on the principles of tests and measurements, the reader is referred to Anastasi and Urbina (34), Baumgartner and Jackson (45), Cronbach (75), Safrit (51), Rothstein (18), Rothstein and Echternach (15), and Verducci (52).
Identification of the most appropriate test for a given application, based on the psychometric criteria discussed above, does not guarantee that the desired information will be obtained. Principles of evaluation, testing, and interpretation must be followed to optimize objective data acquisition.
PRINCIPLES OF EVALUATION, TESTING, AND INTERPRETATION
Systematic testing using standardized techniques is essential to quantify a patient’s status objectively. Standardized testing is defined as using specified test administration and scoring procedures, under the same environmental conditions, with consistent directions (34,46). Standardized testing is essential to permit comparison of test results for a given person over time and to compare test scores between patients (91). In addition, consistent testing techniques facilitate interdisciplinary interpretation of clinical findings among rehabilitation professionals and minimize duplication of evaluation procedures.
Examiner Qualifications
Assessments using objective instrumentation or standardized tests must be conducted by examiners who have appropriate training and qualifications (14,16,17,34,91,94). The necessary training and expertise varies with the type of instrument or test used. The characteristics common to most rehabilitation medicine applications will be discussed. Examiners must be thoroughly familiar with standardized test administration, scoring, and interpretation procedures. Training guidelines specified in the published test manual must be strictly adhered to. A skilled examiner is aware of factors that might affect test performance and takes the necessary steps to ensure that the effects of these factors are minimized. Interrater reliability needs to be attained at acceptable levels with examiners who are experienced in administering the test to ensure consistency of test administration and scoring.
Examiners also must be knowledgeable about the instruments and standardized tests available to assess parameters of interest. They need to be familiar with relevant research literature, test reviews, and the technical merits of the appropriate tests and measures (14,16,17,34). From this information, examiners should be able to discern the advantages, disadvantages, and limitations of using a particular test or device. Based on the purpose of testing and characteristics of the person being assessed, examiners need to be able to select and justify the most appropriate assessment method from the available options.
When interpreting test results, examiners must be sensitive to factors that may have affected test performance (34). Conclusions and recommendations should be based on a synthesis of the person’s scores, the expected measurement error, any factors that might have influenced test performance, the characteristics of the given person compared with those of the normative population, and the purpose of testing versus the recommended applications of the test or instrument. Written documentation of test results and interpretation should include comments on any potential influence of the above factors.
Examiner Training
Proper training of examiners is critical to attaining an acceptable level of interrater reliability for test administration and scoring (91). Examiners should be trained to minimize later decrements in performance (59). Training methods should be documented carefully so that they can be replicated by future examiners.
TRAINING PROCEDURES
As part of their training, examiners should read the test manual and instructions carefully. Operational definitions and rating criteria need to be memorized verbatim (95). A written examination should be administered to document the examiners’ assimilation of test administration and scoring procedures (59). This information should be reviewed periodically to ensure close adherence to the standardized protocol. It is helpful for examiners to view a videotape of an experienced examiner conducting the test. If test administration and scoring techniques need to be adjusted for the varying abilities of the target population (e.g., children of different age levels), the experienced examiner should be observed testing a representative sample from the target population to demonstrate the various testing, scoring, and interpretation procedures.
Videotapes also are useful to clarify scoring procedures and establish consistency of scoring between and within raters (66,91). Once scoring procedures have been reviewed adequately, interrater reliability can be established by having trainees view several patients on videotape, then compare their scores with those from an experienced examiner. Scoring discrepancies should be discussed, and trainees should continue to score videotaped segments until 100% agreement is established with an experienced examiner (95). Intrarater consistency of scoring also can be established by having an individual examiner score the same videotape on multiple occasions. Sufficient time should elapse between multiple viewings so examiners do not recall previous ratings.
For assessments that involve multiple trials (e.g., strength assessments), intertrial reliability can be calculated to provide a measure of the examiner’s consistency of administering multiple trials within a given test session. As was mentioned previously, intertrial reliability also is influenced by factors such as fatigue, motor learning, motivation, and the stability of performance over a short period of time. Multiple trials administered during a given session generally are highly correlated; thus, intertrial reliability coefficients are expected to be very high. Although this measure provides feedback on consistency in administering multiple trials, it should not be considered a substitute for establishing other types of reliability during the training phase.
ESTABLISHING PROCEDURAL RELIABILITY
Procedural reliability is defined as the reliability with which standardized testing and scoring procedures are applied. As part of training, examiners should be observed administering and scoring the test on a variety of people with characteristics similar to those of the target population (91). Procedural reliability should be established by having an experienced examiner observe trainees to determine if the test is being administered and scored according to the standardized protocol. Establishing procedural reliability greatly increases the likelihood that the observed changes in performance reflect true changes in status and not alterations in examiner testing or scoring methods. Unfortunately, this type of reliability often is neglected. According to Billingsley and associates (96), failure to assess procedural reliability poses a threat to both the internal and external validity of assessments.
Procedural reliability is assessed by having an independent observer check off whether each component of an assessment is completed according to the standardized protocol while viewing a live or videotaped assessment. Specific antecedent conditions, commands, timing of execution, and positioning are monitored, and any deviations are noted. Procedural reliability is calculated as a percentage of correct behaviors (96). Checklists should include all essential components of the standardized protocol. An example of a procedural reliability checklist for selected items on the MAP is provided in Figure 53-1 (91). In this example, the checklist varies for each item administered. A referenced example of procedural reliability assessment involves strength testing with a myometer (97). In this case, the protocol was standardized across muscle groups, including the command sequence, tactile input, myometer placement, start and end positions, and contraction duration.
Figure 53-1. Procedural reliability checklist for selected items on the Miller Assessment for Preschoolers. (Reprinted with permission from Gyurke J, Prifitera A. Standardizing an assessment. Phys Occup Ther Pediatr 1989;9:71.)
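The percentage-of-correct-behaviors calculation is simple enough to script. The following sketch, written in Python, illustrates one way a completed checklist might be tallied; the checklist items are hypothetical examples patterned after the myometer protocol elements listed above, not items from any published instrument.

```python
# Procedural reliability: the percentage of protocol components performed
# correctly, as judged by an independent observer (96). The checklist items
# below are hypothetical examples for a strength-testing protocol.
checklist = {
    "standardized command given verbatim": True,
    "patient positioned per protocol": True,
    "myometer placed correctly": False,
    "contraction held for specified duration": True,
}

correct = sum(checklist.values())
procedural_reliability = 100 * correct / len(checklist)
print(f"Procedural reliability: {procedural_reliability:.0f}%")  # 75%
```

In practice the checklist would enumerate every essential component of the standardized protocol, and the resulting percentage would be compared against the minimum acceptable level discussed below.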
Deviation from the standardized protocol can be minimized by conducting periodic procedural reliability checks (96). Procedural reliability should be assessed on an ongoing basis at random intervals in clinical or research settings, in addition to the training period. Assessments should be conducted at least once per phase during a research study. Examiners should be informed that procedural reliability checks will occur randomly, and ideally should be unaware of when specific assessments are conducted to avoid examiner reactivity. A minimum acceptable level of procedural reliability should be established for clinical or research use (generally, 90% to 100%). During the training phase, a 100% level should be attained. Feedback on procedural reliability assessments should be provided to examiners. If an examiner’s score decreases below the acceptable level, pertinent sections of the standardized protocol should be reviewed.
ESTABLISHING INTERRATER RELIABILITY AND AGREEMENT
Once an examiner has demonstrated consistency in scoring by viewing videotaped assessments and reliability in test administration through procedural reliability checks, interrater reliability and agreement should be established with an experienced examiner (59,91). Both examiners should independently rate people with characteristics similar to those of the target population. Reliability and agreement assessments should be conducted under conditions similar to those of the actual data collection procedures (60). As with procedural reliability, interrater assessments should be conducted periodically in both clinical and research settings. It is essential to establish interrater reliability and agreement at least once per phase in a research study to determine the potential influence of examiner rating differences on the data recorded (59,60). When interrater agreement is calculated with the experienced examiner’s scores treated as a standard, specific statistical procedures are indicated (71,72,73).
For assessments where the person’s performance can be observed directly (e.g., developmental or activities of daily living assessments), it is preferable to establish interrater reliability and agreement with the examiner in training administering the test while the experienced examiner simultaneously observes and independently scores the person, so that pure interrater reliability and agreement can be assessed. When measuring parameters such as range of motion, sensation, or strength, it is imperative that both examiners independently conduct the tests because the measurement error depends to a large extent on the examiner’s skill and body mechanics in administering the test. In addition, direct observation of these parameters by each examiner is required. In these instances, interrater reliability and agreement are confounded by factors of time and variation in patient performance, as discussed above in the section on interrater reliability and agreement.
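For nominal ratings (e.g., pass/fail item scores), chance-corrected agreement is commonly summarized with Cohen’s kappa (65). The sketch below, with invented ratings from a trainee and an experienced examiner, shows the calculation; note that kappa treats the two raters symmetrically, whereas the procedures cited above (71,72,73) are indicated when the experienced examiner’s scores are treated as a standard.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on nominal categories."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Invented pass/fail scores on eight items from two examiners.
trainee     = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
experienced = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass"]
print(f"kappa = {cohens_kappa(trainee, experienced):.2f}")  # kappa = 0.75
```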
If examiners are aware that interrater reliability and agreement is being assessed, the situation is potentially reactive (60). Reactivity refers to the possibility that behavior may change if the examiners realize they are being monitored. Examiners demonstrate higher levels of reliability and agreement when they are aware that they are being observed. It is difficult, however, to conduct reliability and agreement assessments without examiner awareness; consequently, during a research study, it might be best to lead examiners to believe that all of their observations are being monitored throughout the investigation (60). It is important to note that levels of reliability and agreement attained when examiners are aware that they are being monitored are potentially inflated compared with examiner performance in a typical clinic setting where monitoring occurs infrequently.
DETECTING EXAMINER ERRORS
When training examiners in the use of rating scales, interrater reliability and agreement data should be examined to determine if there are any consistent trends indicative of examiner rating errors. These data should be obtained from testing patients who represent a broad-range sample of pertinent characteristics of the population, so that a relatively normal score distribution is expected. In many circumstances, a representative group of patients can be observed efficiently on videotape by multiple examiners. The distribution of examiners’ scores across patients is then compared for error trends (52). If only one examiner is using a given rating scale, so that multiple examiners’ scores cannot be compared for rating errors, rating errors still can be detected by examining the distribution of one examiner’s ratings across multiple patients. Rasch analysis is another useful method for detecting examiner errors on specific items or as an overall trend. Rating errors can be classified into five categories:
  • Error of central tendency
  • Error of standards
  • Halo effect error
  • Logical error
  • Examiner drift error
An error of central tendency is indicated when one rater’s scores cluster around the center of the scale while another rater’s scores spread more evenly over the entire scale. Errors of standards occur when a rater awards either all low or all high scores, indicating that his or her standards are set either too high (i.e., error of severity) or too low (i.e., error of leniency), respectively. Leniency errors are the most common type of rating error (52). Halo effect errors can be detected if several experienced examiners rate a number of people under identical conditions and the score distributions are examined. There should be little variability between well-trained examiners’ scores. If one examiner’s scores fall outside of this limited range of variability, a halo rating error may have occurred as a result of preset examiner impressions or expectations. A logical error occurs when multiple traits are rated and an examiner awards similar ratings to traits that are not necessarily related.
A fifth type of rating error is examiner drift. Examiner drift refers to the tendency of examiners to alter the manner in which they apply rating criteria over time (60). Examiner drift is not easily detected. Interrater agreement may remain high even though examiners are deviating from the standardized rating criteria (59,60). This occurs when examiners who work together discuss rating criteria to clarify rating definitions. They may inadvertently alter the criteria, diminishing rating accuracy, and yet high levels of interrater agreement are maintained. If examiners alter rating criteria over time, data obtained from serial examinations may not be comparable. Examiner drift can be detected by assessing interrater agreement between examiners who have not worked together, or by comparing ratings from examiners who have been conducting assessments for an extended period of time with scores obtained from a newly trained examiner (60). Presumably, recently trained examiners adhere more closely to the original criteria than examiners who have had the opportunity to drift. Comparing videotaped samples of patient performance from selected evaluation sessions with actual examiner ratings obtained over time is another method of detecting examiner drift.
REDUCING EXAMINER ERRORS
Examiner ratings can be improved in several ways (52,59,60). Operational definitions of the behavior or trait must be clearly stated, and examiners must understand the rating criteria. If examiners periodically review rating criteria, receive feedback on their adherence to the test protocol through procedural reliability checks, and are informed of the accuracy of their observations through interrater agreement checks, examiner drift can be minimized. Examiners should be aware of common rating errors and how these errors may influence their scoring. Adequate time needs to be provided to observe and rate behaviors. If the observation period is too brief for the number of behaviors or people to be observed, rating accuracy is adversely affected. The reliability of ratings also can be improved by averaging ratings from multiple observers because the effects of individual rater biases tend to be balanced. Averaging multiple scores obtained from one rater is not advantageous for reducing rating error, however, because a given rater’s errors tend to be relatively constant.
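The contrast between averaging across raters and averaging repeated scores from one rater can be illustrated with a toy simulation; every value in it (the true score, the rater biases, the error magnitude) is an arbitrary assumption chosen for the example.

```python
import random
import statistics

random.seed(1)
TRUE_SCORE = 50.0

def rating(bias, noise_sd=4.0):
    # One observation: the true score plus a rater-specific bias plus random error.
    return TRUE_SCORE + bias + random.gauss(0, noise_sd)

biases = [-3.0, 1.5, 2.5, -1.0]  # hypothetical, roughly offsetting rater biases

# Averaging across four raters: individual biases tend to cancel.
across_raters = [statistics.mean(rating(b) for b in biases) for _ in range(2000)]
# Averaging four repeated scores from one rater: that rater's bias persists.
one_rater = [statistics.mean(rating(biases[0]) for _ in range(4)) for _ in range(2000)]

print(f"mean of four-rater averages: {statistics.mean(across_raters):.1f}")  # near 50
print(f"mean of one-rater averages:  {statistics.mean(one_rater):.1f}")      # near 47
```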
The complexity of observations negatively affects interrater reliability and agreement because observers may have difficulty discriminating between rating criteria (60). With more complex observations, examiners need to attain higher levels of agreement for each behavior during the training phase. These high levels of interrater agreement need to be achieved under the exact conditions that will be used for data collection (60). If multiple behaviors are observed on several patients, it is best to rate all patients on one behavior before rating the next behavior. This practice facilitates more consistent application of operational definitions and rating criteria for the individual behavior. It also tends to reduce the incidence of logical errors.
Another method for improving scoring is to make raters aware of examiner idiosyncrasies or expectations that can affect ratings. According to Verducci (52), there are five patient-rater characteristics that may affect scoring:
  • If an examiner knows the person being evaluated, ratings can be either positively or negatively influenced. The longer the prior relationship has existed, the more likely the ratings will be influenced.
  • Raters tend to rate more leniently if they are required to disclose ratings directly to the person, or if the person confronts the examiner about the ratings.
  • Examiner gender also can influence ratings. In general, male examiners tend to rate more leniently than female examiners.
  • There is a tendency to rate members of one’s own sex higher than those of the opposite sex.
  • Knowledge of previous ratings may bias examiners to rate similarly. Consequently, examiners should remain blind to previous scores until current ratings have been assigned.
Other potential sources of rater bias are the examiner’s expectations about the patient’s outcome and feedback received regarding ratings (59,60). If examiners expect improvement, their ratings are more likely to show improvement. This is especially true when examiners are reinforced for patient improvement. In a research setting, examiner bias can be minimized if the observers remain blind to the purposes and hypotheses of the study. In a clinical setting, the baseline, intervention, and follow-up sessions often can be videotaped. Blind, independent observers can then rate the behaviors when shown the videotaped sessions in a random order.
Test Administration Strategies
Consistency in test administration is essential to permit comparison of test results from one session to another or between people. Multiple factors that might influence performance must be held constant during testing. These factors include test materials and instrumentation, the testing environment, test procedures and scoring, state of the person being assessed, observers present in the room, and time of day. Examiners must be aware of the potential influence of these factors and document any conditions that might affect test performance. Examiners ideally should remain blind to previous test results until after conducting the evaluation to avoid potential bias.
If more than one method is acceptable for testing, it is important to document which protocol is used so that the same method can be used during future evaluations. If it is necessary to alter the method of measurement as a result of a change in status or the development of an improved measurement technique, measurements should be taken using both the new and old methods so there is overlap of at least one evaluation. This overlap permits comparison with previous and future test results so that trends over time can be monitored.
Multiple trials should be administered when assessing traits, such as muscle strength, that require consistent effort on the part of the patient. An average score of multiple trials is more stable over time than a single effort (97). A measure of central tendency and the range of scores should both be reported.
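As a minimal illustration, a measure of central tendency and the range might be computed and reported as follows; the readings are invented for the example.

```python
import statistics

# Invented myometry readings (kg) from five trials of one muscle group.
trials = [12.1, 11.4, 12.8, 11.9, 12.3]

print(f"mean of trials: {statistics.mean(trials):.1f} kg")   # 12.1 kg
print(f"range: {min(trials):.1f} to {max(trials):.1f} kg")   # 11.4 to 12.8 kg
```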
Standardized test positions always should be used unless a medical condition prevents proper positioning (e.g., joint contractures). In this event, the patient should be positioned as closely as possible to the standardized position, and the altered position should be documented. It is important to make sure that patients are posturally secure and comfortable during the evaluation. For patients with neurologic involvement, the head should be positioned in neutral to avoid subtle influences of tonic neck reflexes. An exception occurs when testing is conducted in the prone position. In this case, the head should be turned consistently toward the side being tested.
A key to obtaining reliable and valid test results is providing clear directions and demonstrations to the patient. Standardized instructions always must be provided verbatim and may not be modified or repeated unless specifically permitted in the test manual. Verbal directions often are enhanced by tactile, kinesthetic, and visual cues, if permitted. If confusion about the task is detected, this should be documented. If the examiner believes that a given patient could complete a task successfully with further instructions that are not specified in the standardized protocol, this item can be readministered at the end of the test session. The person’s test score should be based solely on performance exhibited when given standardized instructions. Test performance with augmented instructions can be documented in the clinical note but should not be considered when scoring.
When conducting tests that do not have standardized instructions (e.g., strength testing), it is important to use short, simple, consistent commands. If repetitive or sustained efforts are required, the examiner’s voice volume needs to be consistent and adequate to heighten the arousal state and motivate patients to give their best effort.
Verbal reinforcement and feedback regarding performance can influence performance levels (98). Consequently, they must be provided consistently, according to the procedures specified in the test manual. For tests in which reinforcement and feedback intervals are not specified and are permitted as needed, the frequency and type of feedback provided should be documented.
Test Scoring, Reporting, and Interpretation of Scores
Examiners should be thoroughly familiar with scoring criteria so that scores can be assigned accurately and efficiently during evaluation sessions. It is not appropriate for examiners to look up scoring criteria during or after the evaluation. Uncertainty about the criteria prolongs the evaluation and leads to scoring errors. It is helpful to include abbreviated scoring criteria on the test form to assist the examiner during the evaluation. Test forms should be well organized and clearly written to facilitate efficient and accurate recording of test results. If multiple types of equipment and test positions are required, it is useful if the equipment and position are identified on the score sheet using situation codes for each item. Such a coding system expedites test administration by assisting the examiner in grouping test items with similar positioning and equipment requirements. Examples of well-organized test forms that use situation codes are the Bayley Scales of Infant Development (99), the MAP (32), and the revised version of the Peabody Developmental Motor Scales test forms (40).
If the scoring criteria for a test are not well defined, it may be necessary for examiners within a given center or referral region to clarify the criteria. This was the case for many items on the Peabody Developmental Motor Scales. Interrater reliability levels of highly trained examiners were low for several items, so therapists at the Child Development and Mental Retardation Center in Seattle, Washington, clarified the scoring criteria to improve reliability. Examiners in the surrounding referral area were educated about the clarified criteria by means of in-services and videotapes to ensure that all examining centers in the area would be using identical criteria (40). If scoring criteria are augmented to improve reliability, it is imperative to document that the test was administered with altered criteria. Future results are comparable only if the test is administered using identical scoring criteria. Additionally, if scores are compared with normative data, it is important to document that the test scores obtained may not be directly comparable because altered scoring criteria were used.
Raw scores obtained from testing are meaningless in the absence of additional interpretive data. To compare meaningfully a person’s current test results to previous scores, the SEM of the test must be known. To determine how a person’s performance compares with that of other people, normative data must come from a representative standardized sample of people with similar characteristics. In the latter case, the raw score must be converted into a derived or relative score to permit direct comparison with the normative group’s performance. These concepts are discussed in detail below.
Raw scores may be compared with previous scores obtained from a given person to monitor changes in status. However, the SEM of the test must be known to determine if a change in a score is clinically significant. A change in a test score exceeding the SEM is indicative of a meaningful change in test performance. As was discussed earlier, in the section on reliability and agreement, it is best to report test scores as a range, based on confidence intervals, rather than as an absolute score. This is because a person’s score is expected to vary as a result of random fluctuations in performance. It is only when a score changes beyond the range of random fluctuation that we can be confident that a true change in performance has occurred. This true score range usually is based on the 95% confidence interval. This rigorous level of confidence minimizes the likelihood of a type I error (i.e., believing a change occurred when actually there was no change) and is considered the confidence level of choice when looking for improvement in performance resulting from a specific treatment regimen or improved physical status. A lower level of confidence (e.g., 75%, 50%) may be desirable when monitoring the status of people who are at risk for loss of function over time. For these people, it is important to minimize the likelihood of a type II error (i.e., believing no change occurred when actually there was a change). In such cases, if a person’s score falls outside a true score range that is based on a lower level of confidence, it may indicate the need to conduct further diagnostic tests or to monitor the person more closely over time.
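Under the conventional psychometric definition, the SEM is the normative standard deviation multiplied by the square root of one minus the reliability coefficient, and a true score range is the observed score plus or minus the z value for the chosen confidence level times the SEM. The sketch below applies these formulas to assumed values (an observed score of 64, a normative SD of 10, and a reliability of 0.91) chosen purely for illustration.

```python
import math

def true_score_range(observed, sd, reliability, z=1.96):
    # SEM = SD * sqrt(1 - r); range = observed score +/- z * SEM.
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

# Assumed values: normative SD of 10 and reliability of 0.91, so SEM = 3.
low, high = true_score_range(observed=64, sd=10, reliability=0.91, z=1.96)
print(f"95% true score range: {low:.1f} to {high:.1f}")   # 58.1 to 69.9

low, high = true_score_range(observed=64, sd=10, reliability=0.91, z=0.674)
print(f"50% true score range: {low:.1f} to {high:.1f}")   # 62.0 to 66.0
```

The narrower 50% band flags possible change sooner, at the cost of more false alarms, which is exactly the type I/type II trade-off described above.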
If normative data are available for a given test, a person’s score can be compared directly to the normative group performance by converting the score into a derived or relative score. Normative scores provide relative rather than absolute information (100). Normative data should not be considered as performance standards but rather as a reflection of how the normative group performed. Derived scores are expressed either as a developmental level or as a relative position within a specified group. Derived scores are calculated by transforming the raw score to another unit of measurement that enables comparison with normative values. Most norm-referenced tests provide conversion tables of derived scores that have been calculated for the raw scores so that hand calculations are not required. However, it is important for examiners to understand the derivation, interrelationship, and interpretation of derived scores. Specific calculation of these scores is beyond the scope of this chapter. For computational details and the practical application of these statistical techniques, the reader is referred to textbooks on psychological or educational statistics and measurement theory (91,93,100).
TABLE 53-2. Descriptive and Standard Scores Commonly Reported in Rehabilitation Medicine
Summary Statistic: Definition and Interpretation
Descriptive Scores
   Raw score: Expressed as the number of correct items, time to complete a task, number of errors, or some other objective measure of performance.
   Percentage score: The raw score expressed as percent correct.
   Percentile score: Expressed in terms of the percentage of people in the normative group who scored lower than the client (e.g., a client scoring in the 75th percentile on a norm-referenced test has performed better than 75% of the people in the normative group). Often stratified by age, gender, or other pertinent modifying variables.
   Age-equivalent score: The average score for a given age group.
   Grade-equivalent score: The average score for a given grade level.
   Developmental age: The basal age score, plus credit for all items passed at higher age levels (up to the ceiling level of the test). Also called motor age (MA) for tests of motor development. The basal age level is defined as the highest age at and below which all test items are passed.
   Scaled score: The client’s total score, summed across all sections of the test. Used for comparison with previous and future scores.
Standard Scores
   z score: The client’s raw score minus the mean score of the normative group, divided by the standard deviation of the normative group. The mean of a z score is 0, with a standard deviation of 1. Scores may be positive or negative. Reported to two significant digits.
   T score: The z score times 10, plus 50. The mean of a T score is 50, with a standard deviation of 10.
   Stanine: Standard scores that range from 1 to 9, with a mean of 5 and a standard deviation of 2. Often used to minimize the likelihood of overinterpreting small differences between individual scores.
   DMQ: The ratio of the client’s actual score on the test (expressed as a developmental age) to the client’s chronologic age: DMQ = (DA/CA) × 100. In deviation form, the DMQ equals the z score times 15, plus 100. The mean DMQ is 100, with a standard deviation of 15.
   Deviation IQ: A standard score replacing the ratio between the client’s actual score on the test (expressed as a mental age) and the client’s chronologic age. The mean deviation IQ is 100, with a standard deviation of 15, based on the Wechsler deviation IQ distribution.
CA, chronological age; DA, developmental age; DMQ, developmental motor quotient; IQ, intelligence quotient; MA, motor age.
Selection of the particular type of score to report depends on the purpose of testing, the sophistication of the people reading the reports, and the types of interpretations to be made from the results (100). Table 53-2 summarizes various descriptive and standard scores that are commonly used. Figure 53-2 shows the relationships of these scores to the normal distribution and to one another. Calculation of standard scores (e.g., z scores, T scores, stanines, developmental motor quotients, deviation IQs) is appropriate only with interval or ratio data. Standard scores express where a person’s performance lies with respect to the mean of the normative group, in terms of the variability of the distribution. They are advantageous because they have uniform meaning from test to test; consequently, a person’s performance can be compared across different tests.
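The interrelationships among these standard scores follow directly from the definitions in Table 53-2, as the following sketch shows. The raw score, normative mean, and standard deviation are invented, and the stanine conversion is a common rounding approximation to the banded definition.

```python
def z_score(raw, norm_mean, norm_sd):
    # z = (raw score - normative mean) / normative SD; interval or ratio data only.
    return (raw - norm_mean) / norm_sd

def t_score(z):
    return 10 * z + 50

def deviation_iq(z):
    return 15 * z + 100

def stanine(z):
    # Stanines run 1 to 9 (mean 5, SD 2); rounding and clamping approximate
    # the banded definition.
    return max(1, min(9, round(2 * z + 5)))

# Invented example: raw score 34 against a normative mean of 28 and SD of 5.
z = z_score(34, norm_mean=28, norm_sd=5)
print(f"z = {z:.2f}, T = {t_score(z):.0f}, "
      f"deviation IQ = {deviation_iq(z):.0f}, stanine = {stanine(z)}")
# z = 1.20, T = 62, deviation IQ = 118, stanine = 7
```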
Written Evaluation
Thorough documentation of testing procedures and results is essential in both clinical and research settings to permit comparison of test results between and within individuals. The tests administered should be identified clearly. Any deviations from the standardized procedures, such as altered test positions or modified instructions, should be documented (14). If multiple procedural options are available for a given test item (e.g., measuring for a flexion contracture at the hip), the specific method used should be specified in the report. The patient’s behavior, level of cooperation, alertness, attention, and motivation during the evaluation should be documented. Any potential effect of these factors on test performance should be stated. Other factors that might have influenced the validity of test results also should be noted (e.g., environmental factors, illness, length of test session, activity level before the test session). It should be indicated whether optimal performance was elicited. If a person’s performance is compared with normative data, the degree of similarity of the person’s characteristics to those of the normative group should be stated. It is imperative to distinguish between facts and inferences in the written report.
The use of a standard written evaluation format facilitates communication between and within disciplines. In addition, computerized databases provide standardized formats useful for both clinical and research purposes. Serial examinations of a given person can be reviewed easily, and a patient’s status can be compared directly with that of other people with similar characteristics. Clinical and research applications of computerized databases for documentation in rehabilitation medicine are discussed by Shurtleff (101) and Lehmann and associates (102).
OBJECTIVE MEASUREMENT WHEN A STANDARDIZED TEST IS NOT AVAILABLE
Rationale for Systematically Observing and Recording Behavior
Standardized tests and objective instrumentation are not always available to measure the parameters of clinical and research interest. Consequently, rehabilitation professionals often resort to documentation of subjective impressions (e.g., “head control is improved,” “wheelchair transfers are more independent and efficient”). However, functional status and behaviors can be documented objectively by observing behavior using standardized techniques that have been demonstrated to be reliable.
Systematically observing and recording behavior provides objective documentation of behavior frequency and duration, identifies the timing and conditions for occurrence of a particular behavior, and identifies small changes in behavior. Several of the procedures for objective documentation described below are based on the principles of single-case research designs. These research designs have been suggested to be the most appropriate method of documentation of treatment-induced clinical change in rehabilitation populations, owing to the wide variability in clinical presentation, even within a given diagnostic category (94,103). In addition, such designs have been recommended to evaluate and compare the effects of two different treatments on individual patients (104). Selected single-case research concepts that specifically pertain to objective documentation for either clinical or research purposes are presented in this chapter. The reader is referred to Hayes and colleagues (105), Barlow and Hersen (59), Bloom and colleagues (46), Kazdin (60), and Ottenbacher (94) for more thorough discussions of documentation using single-case research techniques.
Figure 53-2. Relationships among standard scores, percentile ranks, and the normal distribution. (Adapted with permission from Anastasi A. Psychological testing, 6th ed. New York: Macmillan, 1988:97.)
Procedures for Objective Observation and Recording of Behavior
STEP 1: IDENTIFY THE TARGET BEHAVIOR TO BE MONITORED
The target behavior must be identified by specifying the parameters of interest and their associated conditions. The prerequisite conditions required must be defined, such as verbal directions, visual or verbal cues, or physical assistance provided. In addition, environmental conditions must be described because different responses may be observed in the therapy, inpatient ward, or home setting. The duration, frequency, and timing of the observation period also must be specified. Ideally, these conditions should be constant from one observation period to the next for comparison purposes.
STEP 2: OPERATIONALLY DEFINE THE TARGET BEHAVIOR
An operational definition is stated in terms of the observable characteristics of the behavior that is being monitored. The definition must describe an observable or measurable action, activity, or movement that reflects the behavior of interest. The beginning and ending of the behavior must be clearly identified. Objective, distinct, and clearly stated terminology should be used (59,94). The definition should be elaborated to point out how the response differs from other responses. Examples of borderline or difficult responses, along with a rationale for inclusion and exclusion, should be provided. An example of an operational definition used to determine success or failure in drawing a circle is provided in Figure 53-3.
STEP 3: IDENTIFY THE MEASUREMENT STRATEGY
There are five methods of sampling behavior: event recording, rate recording, time sampling, duration recording, and discrete categorization (46,59,94). Each of these methods will be described below, along with indications and contraindications for their use.
Event Recording
The number of occurrences of the behavior is tallied in a given period of time, or per given velocity in the case of mobility activities. Indications for event recording include when the target response is discrete, with a definite beginning and end, or when the target response duration is constant. The target behavior frequency should be low to moderate, and the behavior duration should be short to moderate. It is best to augment the number of occurrences with real-time information to permit sequential, temporal, and reliability and agreement analyses. Contraindications for using event recording techniques include behaviors that have a high incidence of occurrence because of the increased probability of error in counting the high-frequency behavior and behaviors that have an extended duration or that occur infrequently (59,94) (e.g., wheelchair transfers). Duration recording should be used in the latter case. The following is an example of event recording:
A man with hemiplegia successfully fastened 5 of 10 shirt buttons during a 10-minute period of time using his involved hand to hold his shirt and his uninvolved hand to manipulate the buttons. The number of successes, number of trials, and duration of the observation period were recorded.
Figure 53-3. The operational definition of a circle (dashed lines, circle path template; solid line, patient’s drawing of a circle). The patient is instructed to draw a circle inside the two dashed lines. An adequate circle is one in which the two ends meet, and the line of the circle stays within the circle path template. It can touch the edges of the template but cannot extend beyond the edges.
Rate Recording
The number of occurrences of the behavior is divided by the duration of the observation period (e.g., the number of occurrences per minute). This method is indicated when the observation period varies from session to session. Rate recording is advantageous because it reflects changes in either the duration or frequency of response and is sensitive for detecting changes or trends because there is no theoretical upper limit. The following is an example of rate recording:
A child with Down syndrome exhibits five occurrences of undesirable tongue thrusting during a 10-minute observation period the first day and eight times during a 20-minute observation period the second day. The observations were made from videotapes recorded immediately after the child’s oral motor therapy program. An independent observer, who was blind to the child’s intervention program, performed the frequency counts. The rate of responding was 0.5 behaviors per minute (five per 10 minutes) for the first day and 0.4 behaviors per minute (eight per 20 minutes) for the second day.
Time Sampling
This method involves recording the state of a behavior at specific moments or intervals in time. It also has been described in the literature as scan sampling, instantaneous time sampling, discontinuous probe time sampling, and interval sampling. Time sampling is analogous to taking a snapshot and then examining it to see if a particular behavior is occurring. This method often is used in industrial settings to determine exposure to risk factors or compliance with injury prevention techniques.
To monitor behavior using this method, the behavior of interest is observed for a short block of time (e.g., a 5-second observation period) at specified recording intervals (e.g., 5-minute intervals) during a particular activity (e.g., a 30-minute meal period). The recording interval is signaled to the observer by means of a timer, audiotape cue, or a tone generator. The target behavior is scored as either occurring or not occurring during the observation period of each recording period. Fixed (i.e., preset) or random intervals can be used, but it is important to avoid a situation in which the signal coincides with any regular cycle of behavior. The sampling should occur at various times throughout the day and in different settings to obtain a representative picture of the behavior frequency. The recording interval length depends on the behavior duration and frequency, as well as on the observer’s ability to record and attend to the person. The more frequent the behavior, the shorter the interval. For low to medium response rates, 10-second intervals are recommended. For high response rates, shorter intervals should be used (106). An advantage of this type of recording is that several patients can be observed simultaneously by one rater in a group setting (e.g., during meal times or recreational events) by staggering the recording intervals for each patient.
Variations of time sampling include observing the behavior during a single block of time that is divided into short intervals (i.e., interval recording) or during brief intervals that are spread out over an entire day (i.e., time sampling); combining time sampling and event recording, in which the number of responses occurring during a given interval is recorded; and combining time sampling and duration recording, in which the duration of the response during a given interval is recorded. The following are examples of time sampling:
To document a patient’s ability to maintain his head in an upright position, the nursing staff observed him for 15 seconds at 5-minute intervals during one 30-minute meal period, during one 30-minute self-care/dressing period, and during one 30-minute recreation period.
To estimate compliance of 12 industrial workers with suggestions provided in a back school program, the time individual workers spent in appropriate versus inappropriate postures was recorded for 5 minutes each hour during an 8-hour shift.
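Because time sampling yields a series of occurred/did-not-occur observations, the usual summary statistic is the percentage of observation windows in which the behavior was seen. A minimal sketch, with invented data patterned after the head-control example above:

```python
# Invented time-sampling record: was the head upright during each 15-second
# observation window, sampled at 5-minute intervals over a 30-minute meal?
observations = [True, False, True, True, False, True]

pct = 100 * sum(observations) / len(observations)
print(f"Head upright in {pct:.0f}% of sampled intervals")  # 67%
```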
Duration Recording
Either the duration of the response or the length of the latency period is recorded. The duration is reported as the total time if the observation period is constant, or as the percentage of time that a behavior occurred during observation periods of varying length. Indications for this method include continuous target responses, behaviors with high or even response rates, and behaviors with varying durations, such as a wheelchair transfer, for which a frequency count would be less meaningful. The behavior duration is timed with a stopwatch, electromechanical event recorder, or electronic keyboard. Variations of duration recording include timing the response latency (i.e., the time that elapses between a cue and a response); measuring the time required to complete a particular task; or monitoring the time spent performing a particular activity. The following are examples of duration recording:
  • The amount of time that it takes an adult with a spinal cord injury to dress in the morning.
  • The length of time that a child is able to stand independently with and without orthotics before losing his or her balance.
Discrete Categorization
With this method of behavior measurement, several different behaviors of interest are listed and checked off as being performed or not performed. This method is useful in determining whether certain behaviors have occurred. It is indicated when behavioral responses can be classified into discrete categories (e.g., correct/incorrect, performed/not performed). An example of this method is a checklist of the different steps for performing a wheelchair transfer, such as positioning the wheelchair, locking the brakes, removing feet from footrests, and so forth. The observer checks off whether each of these steps was performed during a given transfer.
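A discrete categorization checklist maps naturally onto a performed/not-performed record, as in the following sketch; the step names follow the wheelchair-transfer example above, and the scores are invented.

```python
# Discrete categorization: each step of the task is checked off as
# performed (True) or not performed (False).
transfer_steps = {
    "positions wheelchair": True,
    "locks brakes": True,
    "removes feet from footrests": False,
    "completes transfer": True,
}

performed = sum(transfer_steps.values())
print(f"{performed} of {len(transfer_steps)} steps performed")
for step, done in transfer_steps.items():
    print(f"  [{'x' if done else ' '}] {step}")
```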
STEP 4: ESTABLISH INTERRATER RELIABILITY
There are four reasons for assessing interrater (i.e., interobserver) reliability and agreement:
  • To establish how consistently two observers can measure a given behavior
  • To minimize individual observer bias by establishing interrater reliability and then retraining observers if the level of reliability is unacceptable
  • To reduce the chances of an examiner altering or “drifting” from the standard method of rating, by implementing periodic interrater reliability or agreement checks to ensure that observers are consistent over time
  • To examine the adequacy of operational definitions, rating criteria, and scoring procedures; items that have poor agreement should be revised
Before the onset of data collection, two people should independently observe and score pilot subjects who have characteristics that are similar to those of the clinical or study population. Behaviors of interest are rated according to predetermined operational definitions. Interrater reliability and agreement are then calculated using an appropriate statistic (see section on reliability and agreement). The minimum acceptable level of agreement depends on the type of statistic calculated (see Table 53-1).
If interrater reliability or agreement is below the target level, improvement may occur by discussing operational definitions of the behaviors. If problems with reliability or agreement continue, it may be necessary to redefine behaviors, improve observation and recording conditions, reduce the number of behaviors being recorded, provide additional training, and, if necessary, further standardize the data collection environment (64,84). Interrater reliability or agreement should be reestablished once remedial steps have been taken. As stated previously, periodic checks of interrater reliability or agreement should be conducted in the clinic and at least once during each phase of a research study (64,65). Reliability and agreement data should be plotted along with clinical or research data to show the level of consistency in measurements.
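For interval-by-interval records, a simple point-by-point agreement percentage is often computed alongside a chance-corrected statistic (see the section on reliability and agreement). A sketch with invented records from two observers:

```python
def percent_agreement(obs_a, obs_b):
    # Point-by-point agreement: agreements / total paired observations * 100.
    agreements = sum(a == b for a, b in zip(obs_a, obs_b))
    return 100 * agreements / len(obs_a)

# Invented interval records (1 = behavior occurred) from two observers.
observer_1 = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
observer_2 = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
print(f"{percent_agreement(observer_1, observer_2):.0f}% agreement")  # 80%
```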
STEP 5: REPORT SCORES AND GRAPH DATA
Baseline, intervention, and follow-up data should be plotted on a graph or chart to provide a pictorial presentation of the results. Graphing strategies include using standard graph paper or a standard behavior chart (i.e., six-cycle graph paper). Advantages of the latter are that it permits systematic, standardized recording on a semilog scale that allows estimation of linear trends, and that extremely high and low rates can be recorded on the chart. Behavior rates that range from once per 24 hours to 1,000 per minute can be accommodated; therefore, data are not lost as a result of floor or ceiling effects. In addition, continuous recording of data for up to 20 weeks is permitted. For further information on graphing strategies, including use of the standard behavior chart in clinical settings, the reader is referred to White and Haring (107) and Carr and Williams (108).
The time period of data collection is plotted on the horizontal axis (e.g., hours, days, weeks) and changes in the target behavior on the vertical axis. Appropriate scaling should be used to accommodate the highest expected response frequency and the longest anticipated documentation period duration. The measurement interval on both axes should be large enough to permit visual detection of any changes in behavior. Interrater reliability data from each phase should be plotted on the same graph, along with the study results, as discussed previously.
Considerations When Reporting Scores
The percentage of correct scores often is reported because of the ease of calculation and interpretation. However, usefulness of this summary statistic is limited because it does not provide information on the number of times a patient has performed correctly (94). Consequently, it can be misleading if the total number of opportunities varies from day to day. For example, three successes of six trials on day 1 versus three successes out of four trials on day 2 would yield percentages of 50% and 75%, respectively. Based on percentage scores, it would appear that the patient’s performance was improved, and yet the absolute number of successes has not changed. Additionally, if an odd number of trials is administered on some days and an even number of trials on other days, performance changes may occur based on percentage scores simply because it is not possible to receive half-credit for a trial on days when an odd number of trials are given (e.g., five successes out of 10 trials versus three successes out of five trials).
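The pitfall is easy to see when the raw counts are reported next to the percentages, as in this sketch of the example just given:

```python
# Day-by-day results reported both as successes/trials and as percent correct.
results = [(3, 6), (3, 4)]  # (successes, trials) for days 1 and 2

for day, (hits, trials) in enumerate(results, start=1):
    print(f"Day {day}: {hits}/{trials} = {100 * hits / trials:.0f}% correct")
# Day 1: 3/6 = 50% correct
# Day 2: 3/4 = 75% correct (percent rises although successes are unchanged)
```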
SUMMARY
Practitioners and researchers in rehabilitation medicine increasingly are using objective tests and measurements as a scientific basis for communication, to establish credibility with other professionals, and to document treatment effectiveness. The increased use of such measures places greater responsibility on the user for appropriate implementation and interpretation of tests and measures. Rehabilitation professionals must be familiar with the principles of objective measurement to use these tools properly.
The initial section of this chapter described the psychometric parameters used to evaluate the state of development and quality of available objective measures. The four basic levels of measurement—nominal, ordinal, interval, and ratio scales—were defined. The purposes for testing were discussed, including screening tests, in-depth assessment tests, and criterion-referenced tests. Several issues of practicality for selection and use of tests also were identified. The various forms of reliability, agreement, and validity described are of great importance for using measurements effectively. A test that does not provide reproducible results, or does not measure what it purports to measure, is of no value and is potentially harmful because it gives a false implication of meaningfulness. Consequently, caution must be exercised in the use and interpretation of test results when information on the reliability or validity of a measure is not available or when reported values are below accepted levels.
The second section of this chapter discussed the principles of evaluation, testing, and interpretation that help to ensure that adequate reliability and validity are obtained from test administration. The issues of standardization, interrater reliability, and procedural reliability are of particular importance. Care must be taken during test administration to control for the potential rater errors of central tendency, standards, halo effect, logical errors, and examiner drift.
For many applications in rehabilitation medicine practice and research, standardized measures have not yet been developed. Methods derived from single-subject research paradigms provide guidelines for objective measurement when a standardized test is not available. These guidelines, which are discussed in the third section of this chapter, include identifying the behavior to be monitored; operationally defining the behavior; identifying the measurement strategy (e.g., event recording, rate recording, time sampling); establishing interrater reliability; and properly reporting scores and graphing the data.
Specific tests and objective measurement instruments are not discussed because of the number and broad spectrum of measures used by rehabilitation professionals. Rather, a detailed table of references (see Appendix A) describing measures is provided, categorizing measures by the domains assessed.
The principles discussed in this chapter provide a framework for readers to critically assess the measures available for their specific application needs. Such critical analysis will further emphasize the need for ongoing development and improvement of the objective measures at the disposal of rehabilitation professionals.
REFERENCES
1. Frese E, Brown M, Norton BJ. Clinical reliability of manual muscle testing: middle trapezius and gluteus medius muscles. Phys Ther 1987;67:1072–1076.
2. Harris SR, Smith LH, Krukowski L. Goniometric reliability for a child with spastic quadriplegia. J Pediatr Orthop 1985;5:348–351.
3. Hinderer KA, Hinderer SR. Muscle strength development and assessment in children and adolescents. In: Harms-Ringdahl K, ed. Muscle strength series: international perspectives in physical therapy: muscle strength. Edinburgh: Churchill-Livingstone, 1993.
4. Hinderer SR, Nanna M, Dijkers MP. The reliability and correlations of clinical and research measures of spasticity [Abstract]. J Spinal Cord Med 1996;19:138.
5. Iddings DM, Smith LK, Spencer WA. Muscle testing, part 2: reliability in clinical use. Phys Ther Rev 1961;41:249–256.
6. Lilienfeld AM, Jacobs M, Willis M. A study of the reproducibility of muscle testing and certain other aspects of muscle scoring. Phys Ther Rev 1954;34:279–289.
7. Sackett DL. Clinical epidemiology: a basic science for clinical medicine. Boston: Little, Brown, 1991.
8. Bartlett MD, Wolf LS, Shurtleff DB, Staheli LT. Hip flexion contractures: a comparison of measurement methods. Arch Phys Med Rehabil 1985;66:620–625.
9. Hinderer KA, Gutierrez T. Myometry measurements of children using isometric and eccentric methods of muscle testing [Abstract]. Phys Ther 1988;68:817.
10. Hinderer KA, Hinderer SR. Stabilized vs. unstabilized myometry strength test positions: a reliability comparison [Abstract]. Arch Phys Med Rehabil 1990;71:771–772.
11. Hinderer KA, Hinderer SR, Deitz JL. Reliability of manual muscle testing using the hand-held dynamometer and the myometer: a comparison study. Paper presented at: American Physical Therapy Association Midwinter Sections Meeting; February 11, 1988; Washington, DC.
12. Gowland C, King G, King S, et al. Review of selected measures in neurodevelopmental rehabilitation: a rational approach for selecting clinical measures. Research report no. 91–2. Hamilton, Ontario: McMaster University, Neurodevelopmental Clinical Research Unit, 1991.
13. Kirshner B, Guyatt G. A methodological framework for assessing health indices. J Chronic Dis 1985;38:27–36.
14. American Physical Therapy Association. Standards for tests and measurements in physical therapy practice. Phys Ther 1991;71:589–622.
15. Rothstein JM, Echternach JL. Primer on measurement: an introductory guide to measurement issues. Alexandria, VA: American Physical Therapy Association, 1993.
16. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for educational and psychological testing. Washington, DC: American Psychological Association, 1985.
17. Johnston MV, Keith RA, Hinderer SR. Measurement standards for interdisciplinary medical rehabilitation. Arch Phys Med Rehabil 1992;73[suppl 12S]:S3–S23.
18. Rothstein JM. Measurement and clinical practice: theory and application. In: Rothstein JM, ed. Measurement in physical therapy: clinics in physical therapy, vol. 7. New York: Churchill-Livingstone, 1985:1–46.
19. Krebs DE. Measurement theory. Phys Ther 1987;67:1834–1839.
20. Hislop HJ, Montgomery J. Daniels and Worthingham’s muscle testing: techniques of manual examination, 7th ed. Philadelphia: WB Saunders, 2002.
21. Janda V. Muscle function testing. Boston: Butterworths, 1983.
22. Cutter NC, Kevorkian CG. Handbook of manual muscle testing. New York: McGraw-Hill, 1999.
23. Clarkson HM. Musculoskeletal assessment, 2nd ed. Philadelphia: Lippincott Williams & Wilkins, 2000.
24. Kendall FP, McCreary EK, Geise PG. Muscles, testing and function, 4th ed. Baltimore: Williams & Wilkins, 1993.
25. Granger CV, Gresham GE, eds. Functional assessment in rehabilitation medicine. Baltimore: Williams & Wilkins, 1984.
26. Merbitz C, Morris J, Grip JC. Ordinal scales and foundations of misinference. Arch Phys Med Rehabil 1989;70:308–312.
27. Wright BD, Linacre JM. Observations are always ordinal; measurements, however, must be interval. Arch Phys Med Rehabil 1989;70:857–860.
28. Deitz JC, Beeman C, Thorn DW. Test of orientation for rehabilitation patients (TORP). Tucson, AZ: Therapy Skill Builders, 1993.
29. Deitz JC, Tovar VS, Beeman C, et al. The test of orientation for rehabilitation patients: test-retest reliability. Occup Ther J Res 1992;12:172–185.
30. Deitz JC, Tovar VS, Thorn DW, Beeman C. The test of orientation for rehabilitation patients: interrater reliability. Am J Occup Ther 1990;44:784–790.
31. Thorn DW, Deitz JC. A content validity study of the Test of Orientation for Rehabilitation Patients. Occup Ther J Res 1990;10:27–40.
32. Miller LJ. Miller assessment for preschoolers, 2nd ed. San Antonio, TX: Psychological Corporation, 1999.
33. Peterson HA, Marquardt TP. Appraisal and diagnosis of speech and language disorders, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall, 1994.
34. Anastasi A, Urbina S. Psychological testing, 7th ed. Upper Saddle River, NJ: Prentice-Hall, 1997.
35. Gans BM, Haley SM, Hallenborg SC, et al. Description and inter-observer reliability of the Tufts Assessment of Motor Performance. Am J Phys Med Rehabil 1988;67:202–210.
36. Haley SM, Ludlow LH, Gans BM, et al. Tufts Assessment of Motor Performance: an empirical approach to identifying motor performance categories. Am J Phys Med Rehabil 1991;72:359–366.
37. Ludlow LH, Haley SM. Polytomous Rasch models for behavioral assessment: the Tufts Assessment of Motor Performance. In: Wilson M, ed. Objective measurement: theory into practice, vol. 1. Norwood, NJ: Ablex Publishing, 1992:121–137.
38. Haley SM, Ludlow LH. Applicability of the hierarchical scales of the Tufts Assessment of Motor Performance for school-aged children and adults with disabilities. Phys Ther 1992;72:191–206.
39. Ludlow LH, Haley SM, Gans BM. A hierarchical model of functional performance in rehabilitation medicine: the Tufts Assessment of Motor Performance. Evaluation Health Prof 1992;15:59–74.
40. Hinderer KA, Richardson PK, Atwater SW. Clinical implications of the Peabody Developmental Motor Scales: a constructive review. Phys Occup Ther Pediatr 1989;9:81–106.
41. Chaffin DB, Andersson GBJ, Martin BJ. Occupational biomechanics, 3rd ed. New York: Wiley-Interscience, 1999.
42. Granger CV, Kelly-Hayes M, Johnston M, et al. Quality and outcome measures for medical rehabilitation. In: Braddom RL, ed. Physical medicine and rehabilitation, 2nd ed. Philadelphia: WB Saunders, 2000.
43. Lawlis GF, Lu E. Judgment of counseling process: reliability, agreement, and error. Psychol Bull 1972;78:17–20.
44. Tinsley HE, Weiss DJ. Interrater reliability and agreement of subjective judgments. J Counsel Psychol 1975;22:358–376.
45. Baumgartner TA, Jackson AS. Measurement for evaluation in physical education and exercise science, 7th ed. Boston: McGraw-Hill, 2003.
46. Bloom M, Fischer J, Orme JG. Evaluating practice: guidelines for the accountable professional, 4th ed. Boston: Allyn and Bacon, 2003.
47. Bartko JJ, Carpenter WT. On the methods and theory of reliability. J Nerv Ment Dis 1976;163:307–317.
48. Hartmann DP. Considerations in the choice of interobserver reliability estimates. J Appl Behav Anal 1977;10:103–116.
49. Hollenbeck AR. Problems of reliability in observational research. In: Sackett GP, ed. Observing behavior: data collection and analysis methods, vol. 2. Baltimore: University Park Press, 1978:79–98.
50. Liebetrau AM. Measures of association. Sage University paper series on quantitative applications in the social sciences. Series no. 07–032. Newbury Park, CA: Sage, 1983.
51. Safrit MJ. Introduction to measurement in physical education and exercise science, 2nd ed. St. Louis: Times Mirror/Mosby College Publishing, 1990.
52. Verducci FM. Measurement concepts in physical education. St. Louis: CV Mosby, 1980.
53. Deitz JC. Reliability. Phys Occup Ther Pediatr 1989;9:125–147.
54. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–428.
55. Fleiss JL. Measuring agreement between two judges on the presence or absence of a trait. Biometrics 1975;31:651–659.
56. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Measure 1973;33:613–619.
57. Fleiss JL, Harvey B, Park MC. The measurement of interrater agreement. In: Fleiss JL, ed. Statistical methods for rates and proportions, 3rd ed. Chichester: Wiley, 2002.
58. Krippendorff K. Bivariate agreement coefficients for reliability of data. In: Borgatta EF, ed. Sociological methodology. San Francisco: Jossey-Bass, 1970: 139–150.
59. Barlow DH, Hersen M. Single case experimental designs: strategies for studying behavior change, 2nd ed. New York: Pergamon, 1984.
60. Kazdin AE. Single-case research designs. New York: Oxford University Press, 1982.
61. Harris FC, Lahey BB. A method for combining occurrence and nonoccurrence interobserver agreement scores. J Appl Behav Anal 1978;11:523–527.
62. Haley SM, Osberg JS. Kappa coefficient calculation using multiple ratings per subject: a special communication. Phys Ther 1989;69:90–94.
63. Plewis I, Bax M. The uses and abuses of reliability measures in developmental medicine. Dev Med Child Neurol 1982;24:388–390.
64. Cicchetti DV, Aivano SL, Vitale J. Computer programs for assessing rater agreement and rater bias for qualitative data. Educ Psychol Measure 1977;37:195–201.
65. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measure 1960;20:37–46.
66. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.
67. Hubert L. Kappa revisited. Psychol Bull 1977;84:289–297.
68. Cicchetti DV, Lee C, Fontana AF, Dowds BN. A computer program for assessing specific category rater agreement and rater bias for qualitative data. Educ Psychol Measure 1978;38:805–813.
69. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968;70:213–220.
70. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull 1971;76:378–382.
71. Light RJ. Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol Bull 1971;76:365–377.
72. Wackerly DD, McClave JT, Rao PV. Measuring nominal scale agreement between a judge and a known standard. Psychometrika 1978;43:213–223.
73. Williams GW. Comparing the joint agreement of several raters with another rater. Biometrics 1976;32:619–627.
74. Fleiss JL. The design and analysis of clinical experiments. New York: John Wiley & Sons, 1986:1–32.
75. Cronbach LJ. Essentials of psychological testing, 5th ed. New York: Harper & Row, 1990.
76. Krebs DE. Computer communication. Phys Ther 1984;64:1581–1589.
77. Brennan RL. Elements of generalizability theory. Iowa City, IA: ACT Publications, 1983.
78. Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The dependability of behavioral measurements. New York: Wiley, 1972.
79. Lahey MA, Downey RG, Saal FE. Intraclass correlations: there’s more there than meets the eye. Psychol Bull 1983;93:586–595.
80. Fleiss JL. Estimating the accuracy of dichotomous judgments. Psychometrika 1965;30:469–479.
81. Dunn WW. Validity. Phys Occup Ther Pediatr 1989;9:149–168.
82. Thorn DW, Deitz JC. Examining content validity through the use of content experts. Occup Ther J Res 1989;9:334–346.
83. Wilson M, Engelhard G, Draney K, eds. Objective measurement: theory into practice. Norwood, NJ: Ablex, 1992.
84. Wright BD, Masters GN. Rating scale analysis. Chicago: Mesa Press, 1982.
85. Wright BD, Stone MH. Best test design: Rasch measurement. Chicago: Mesa Press, 1979.
86. Francis DJ. An introduction to structural equation models. J Clin Exp Neuropsychol 1988;10:623–639.
87. Long JS. Confirmatory factor analysis. Sage University paper series on quantitative applications in the social sciences. Series no. 07–033. Newbury Park, CA: Sage Publications, 1983.
88. Bond TG, Fox CM. Applying the Rasch model: fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum Associates, 2001.
89. Law M. Measurement in occupational therapy: scientific criteria for evaluation. Can J Occup Ther 1987;54:133–138.
90. Ottenbacher KJ, Tomchek SD. Measurement variation in method comparison studies: an empirical examination. Arch Phys Med Rehabil 1994;75:505–512.
91. Gyurke J, Prifitera A. Standardizing an assessment. Phys Occup Ther Pediatr 1989;9:63–90.
92. Cleary TA, Linn RL, Walster GW. Effect of reliability and validity on power of statistical tests. In: Borgatta EF, ed. Sociological methodology. San Francisco: Jossey-Bass, 1970:130–138.
93. Miller LJ, ed. Developing norm-referenced standardized tests. Phys Occup Ther Pediatr 1989;9:1–205.
94. Ottenbacher KJ. Evaluating clinical change: strategies for occupational and physical therapists. Baltimore: Williams & Wilkins, 1986.
95. Paul GL, Lentz RJ. Psychosocial treatment of chronic mental patients: milieu versus social-learning programs. Cambridge, MA: Harvard University Press, 1977.
96. Billingsley F, White OR, Munson R. Procedural reliability: a rationale and an example. Behav Assess 1980;2:229–241.
97. Hinderer KA. Reliability of the myometer in muscle testing children and adolescents with myelodysplasia. Unpublished master’s thesis, University of Washington, Seattle, WA, 1988.
98. Schmidt RA. Feedback and knowledge of results. In: Schmidt RA, Lee TD, eds. Motor control and learning, 3rd ed. Champaign, IL: Human Kinetics Publishers, 1999.
99. Bayley N. The Bayley scales of infant development, 2nd ed. New York: Psychological Corporation, 1993.
100. Cermak S. Norms and scores. Phys Occup Ther Pediatr 1989;9:91–123.
101. Shurtleff DB. Computer data bases for pediatric disability: clinical and research applications. Phys Med Rehabil Clin N Am 1991;2:665–687.
102. Lehmann JF, Warren CG, Smith W, Larson J. Computerized data management as an aid to clinical decision making in rehabilitation medicine. Arch Phys Med Rehabil 1984;65:260–262.
103. Martin JE, Epstein L. Evaluating treatment effectiveness in cerebral palsy. Phys Ther 1976;56:285–294.
104. Guyatt G, Sackett D, Taylor W, et al. Determining optimal therapy. N Engl J Med 1986;314:889–892.
105. Hayes SC, Barlow DH, Nelson-Gray RO. The scientist practitioner: research and accountability in the age of managed care, 2nd ed. Boston: Allyn & Bacon, 1999.
106. Repp AC, Roberts DM, Slack DJ, et al. A comparison of frequency, interval, and time-sample methods of data collection. J Appl Behav Anal 1976;9:501–508.
107. White OR, Haring NG. Exceptional teaching: a multimedia training package. Columbus, OH: Charles E Merrill, 1976.
108. Carr BS, Williams M. Analysis of therapeutic techniques through the use of the Standard Behavior Chart. Phys Ther 1982;62:177–183.
109. Hughes CJ, Weimar WH, Sheth PN, Brubaker CE. Biomechanics of wheelchair propulsion as a function of seat position and user-to-chair interface. Arch Phys Med Rehabil 1992;73:263–269.
110. Fife SE, Roxborough LA, Armstrong RW, et al. Development of a clinical measure of postural control for assessment of adaptive seating in children with neuromotor disabilities. Phys Ther 1991;71:981–993.
111. McClenaghan BA, Thombs L, Milner M. Effects of seat-surface inclination on postural stability and function of the upper extremities of children with cerebral palsy. Dev Med Child Neurol 1992;34:40–48.
112. Myhr U, von Wendt L. Improvement of functional sitting position for children with cerebral palsy. Dev Med Child Neurol 1991;33:246–256.
113. Deitz JC, Jaffe KM, Wolf LS, et al. Pediatric power wheelchairs: evaluation of function in the home and school environments. Assist Technol 1991;3:24–31.
114. Bader DL, ed. Pressure sores: clinical practice and scientific approach. London: Macmillan, 1990.
115. Webster JG, ed. Prevention of pressure sores. Bristol, England: Adam Hilger Publishers, 1991.
116. Harris GF. A method for the display of balance platform center of pressure data. J Biomech 1982;15:741–745.
117. Shumway-Cook A, Horak FB. Assessing the influence of sensory interaction on balance. Phys Ther 1986;66:1548–1554.
118. Nashner LM. Adapting reflexes controlling the human posture. Exp Brain Res 1976;26:59–72.
119. Winter DA. Biomechanics and motor control of human movement, 2nd ed. New York: Wiley, 1990.
120. Schmidt RA. Methodology for studying motor behavior. In: Schmidt RA, Lee TD, eds. Motor control and learning, 3rd ed. Champaign, IL: Human Kinetics Publishers, 1999.
121. Bernstein N, Wilberg RB, Woltring HJ. The techniques of the study of movement. In: Whiting HTA, ed. Human motor actions: Bernstein reassessed. Amsterdam: Elsevier, 1984:1–73.
122. University of Pittsburgh. Health instruments file database. Pittsburgh, PA: University of Pittsburgh, 1992.
123. Educational Testing Service. Medical/health science bibliographies. Princeton, NJ: Educational Testing Service, 1992.
124. Medical Device Register, Inc. Medical device register: United States and Canada. Montvale, NJ: Medical Economics Co., 1996–2001.
125. Institute for Scientific Information, Inc. Science citation index. Philadelphia: Institute for Scientific Information, Inc., 1945–present.
126. Siu AL, Reuben DB, Moore AA. Comprehensive geriatric assessment. In: Hazzard WR, ed. Principles of geriatric medicine and gerontology, 4th ed. New York: McGraw-Hill, 1999.
127. Dumitru D. Electrodiagnostic medicine, 2nd ed. Philadelphia: Hanley & Belfus, 2002.
128. DeLisa JA, Lee HJ, Baran EM, et al. Manual of nerve conduction velocity and clinical neurophysiology, 3rd ed. New York: Raven, 1994.
129. Bruett BS, Overs RP. A critical review of 12 ADL scales. Phys Ther 1969;49:857–862.
130. Law M, Letts L. A critical review of scales of activities of daily living. Am J Occup Ther 1989;43:522–528.
131. Halpern AS, Fuhrer MJ, eds. Functional assessment in rehabilitation. Baltimore: Paul H Brookes, 1984.
132. Barer D, Nouri F. Measurement of activities of daily living. Clin Rehabil 1989; 3:179–187.
133. Jebsen RH, Taylor N, Trieschmann RB, Trotter MH. Measurement of time in a standardized test of patient mobility. Arch Phys Med Rehabil 1970;51:170–175.
134. Shores M. Footprint analysis in gait documentation. Phys Ther 1980;60:1163–1167.
135. Lerner-Frankiel MB, Vargas S, Brown M, et al. Functional community ambulation: what are your criteria? Clin Manage 1986;6:12–15.
136. Perry J. Gait analysis: normal and pathological function. Thorofare, NJ: Slack, 1992.
137. Eastlack ME, Arvidson J, Snyder-Mackler L, et al. Interrater reliability of videotaped observational gait-analysis assessments. Phys Ther 1991;71:465–472.
138. Krebs DE, Edelstein JE, Fishman S. Reliability of observational kinematic gait analysis. Phys Ther 1985;65:1027–1033.
139. Rose SA, Ounpuu S, DeLuca PA. Strategies for the assessment of pediatric gait in the clinical setting. Phys Ther 1991;71:961–980.
140. Winter DA. The biomechanics and motor control of human gait, 2nd ed. Waterloo, Ontario: University of Waterloo Press, 1991.
141. Rondinelli RD, Katz RT, eds. Impairment rating and disability evaluation. Philadelphia: WB Saunders, 2000.
142. Lister MJ, Currier DP. Clinical measurement. Phys Ther 1987;67:1829–1897.
143. Rothstein JM, ed. Measurement in physical therapy. New York: Churchill-Livingstone, 1985.
144. Johnston MV, Findley TW, DeLuca J, Katz RT. Research in physical medicine and rehabilitation. XII: measurement tools with application to brain injury. Am J Phys Med Rehabil 1991;70[Suppl]:114–130.
145. McDowell I, Newell C. Measuring health: a guide to rating scales and questionnaires, 2nd ed. New York: Oxford University Press, 1996.
146. Spilker B, ed. Quality of life assessments in clinical trials. New York: Raven, 1990.
147. Salek S, ed. Compendium of quality of life instruments. New York: John Wiley & Sons, 1998.
148. Tulsky DS, Rosenthal M, eds. Quality of life measurement: applications in health and rehabilitation populations, part I. Arch Phys Med Rehabil 2002;83 [Suppl]:S1–S54.
149. Plake BS, Impara JC, eds. The fourteenth mental measurements yearbook. Lincoln, NE: University of Nebraska, Buros Institute of Mental Measurements, 2001.
150. Amundsen LR, ed. Muscle strength testing: instrumented and noninstrumented systems. New York: Churchill-Livingstone, 1990.
151. Kellor M, Kondrasuk R, Iversen I, et al. Technical manual: hand strength and dexterity tests. Minneapolis, MN: Sister Kenny Institute, 1977.
152. Bohannon RW, Smith RB. Interrater reliability of a modified Ashworth scale of muscle spasticity. Phys Ther 1987;67:206.
153. Lee KC, Carson L, Kinnin E. The Ashworth scale: a reliable and reproducible method of measuring spasticity. J Neurol Rehabil 1989;3:205.
154. Meythaler JM, Guin-Renfroe S, Grabb P. Long-term continuous infused intrathecal baclofen for spastic-dystonic hypertonia in traumatic brain injury: 1-year experience. Arch Phys Med Rehabil 1999;80:13.
155. Frollo I, Kneppo P, Krizik M, Rosik V. Microprocessor-based instrument for Achilles tendon reflex measurements. Med Biol Eng Comput 1981;19:695–700.
156. Lehmann JF, Price R, de Lateur BJ, et al. Spasticity: quantitative measurements as a basis for assessing effectiveness of therapeutic intervention. Arch Phys Med Rehabil 1989;70:6–15.
157. Katz RT, Rymer WZ. Spastic hypertonia: mechanisms and measurement. Arch Phys Med Rehabil 1989;70:144–155.
158. Rodgers SH. Ergonomic design for people at work, vols. 1 and 2. Rochester, NY: Eastman Kodak, 1983, 1986.
159. Asher IE. An annotated index of occupational therapy evaluation tools. Rockville, MD: American Occupational Therapy Association, Inc., 1989.
160. Wittmeyer M, Barrett JE. Housing accessibility checklist. Seattle: University of Washington, Health Sciences Learning Resources Center, 1980.
161. Hemphill BJ, ed. Mental health assessment in occupational therapy. Thorofare, NJ: Slack, 1988.
162. Crepeau EB, Cohn ES, Schell BAB, eds. Willard and Spackman’s occupational therapy, 10th ed. Baltimore: Lippincott, 2003.
163. Haley SM, Coster WJ, Ludlow LH. Pediatric functional outcome measures. Phys Med Rehabil Clin N Am 1991;2:689–723.
164. Gledhill N. Discussion: assessment of fitness. In: Bouchard C, Shephard RJ, Stephens T, et al., eds. Exercise, fitness, and health. Champaign, IL: Human Kinetics Books, 1990:121–126.
165. Skinner JS, Baldini FD, Gardner AW. Assessment of fitness. In: Bouchard C, Shephard RJ, Stephens T, et al., eds. Exercise, fitness, and health. Champaign, IL: Human Kinetics Books, 1990:109–119.
166. Edwards RHT. Human muscle function and fatigue. In: Ciba Foundation symposium 82 on human muscle fatigue: physiological mechanisms. London: Pitman Medical, 1981:1–18.
167. Hashimoto K, Kogi K, Grandjean E. Methodology in human fatigue assessment. London: Taylor & Francis, 1971.
168. Minor MAD, Minor SD. Patient evaluation methods for the health professional. Reston, VA: Reston, 1985.
169. Comrey AL, Backer TE, Glaser EM. A sourcebook for mental health measures. Los Angeles: Human Interaction Research Institute, 1973.
170. Lezak MD. Neuropsychological assessment, 3rd ed. New York: Oxford University Press, 1995.
171. Bellack AS, Hersen M. Behavioral assessment: a practical handbook, 4th ed. Boston: Allyn and Bacon, 1998.
172. Nicol AC. Measurement of joint motion. Clin Rehabil 1989;3:1–9.
173. Norkin CC, White DJ. Measurement of joint motion: a guide to goniometry, 2nd ed. Philadelphia: FA Davis, 1995.
174. Battié MC, Bigos SJ, Fisher LD, et al. The role of spinal flexibility in back pain complaints within industry. Spine 1990;15:768–773.
175. Burton AK. Regional lumbar sagittal mobility: measurement by flexicurves. Clin Biomech 1986;1:20–26.
176. Domjan L, Nemes T, Balint GP, et al. A simple method for measuring lateral flexion of the dorsolumbar spine. J Rheumatol 1990;17:663–665.
177. Hart DL, Rose SJ. Reliability of a noninvasive method for measuring the lumbar curve. J Orthop Sports Phys Ther 1986;8:180–184.
178. Lovell FW, Rothstein JM, Personius WJ. Reliability of clinical measurements of lumbar lordosis taken with a flexible rule. Phys Ther 1989;69:96–105.
179. Mellin GP. Physical therapy for chronic low back pain: correlations between spinal mobility and treatment outcome. Scand J Rehabil Med 1985;17:163–166.
180. Mellin GP. Accuracy of measuring lateral flexion of the spine with a tape. Clin Biomech 1986;1:85–89.
181. Merritt JL, McLean TJ, Erickson RP, Offord KP. Measurement of trunk flexibility in normal subjects: reproducibility of three clinical methods. Mayo Clin Proc 1986;61:192–197.
182. Rose MJ. The statistical analysis of the intra-observer repeatability of four clinical measurement techniques. Physiotherapy 1991;77:89–91.
183. Frattali CM, ed. Measuring outcomes in speech-language pathology. New York: Thieme, 1998.
184. Johnson AF, Jacobsen BH, eds. Medical speech-language pathology: a practitioner’s guide. New York: Thieme, 1998.
Appendix
Appendix A: Measurement Scales and Test Methods Used in Physical Medicine and Rehabilitation: Critiques and References
Adaptive Equipment Assessments for Positioning and Function
  • Biomechanics of wheelchair propulsion as a function of seat position and user-to-chair interface (109). Describes an experimental protocol for determining three-dimensional wheelchair propulsion kinematics with varied hand placements (push-levers versus hand rims) and seat positions.
  • Development of a clinical measure of postural control for assessment of adaptive seating in children with neuromotor disabilities (110). Reviews the literature on seating assessment, including measures that require complex instrumentation and clinical evaluation scales. Describes the development of a clinical evaluation scale, the Seated Postural Control Measure (SPCM) for use with children requiring adaptive seating systems. The SPCM consists of postural alignment and functional movement items, scored on a four-point scale. A modified version of the Level of Sitting Ability Scale (LSAS) also is described. The LSAS is used to rate sitting ability based on the amount of support required to maintain sitting and the degree of sitting stability. Interrater and test-retest reliability data are reported for both scales.
  • Effects of seat-surface inclination on postural stability and function of the upper extremities of children with cerebral palsy (111). Describes methods of evaluating optimal seating surface inclination through postural, center of pressure, and upper-extremity function data. Postural data were obtained by means of videotape analysis. Center of pressure data were acquired using a force platform. Upper-extremity performance was assessed through six motor control tasks.
  • Improvement of functional sitting position for children with cerebral palsy (112). Describes a method of determining the most optimal functional sitting position by using videotapes and photographs. The Sitting Assessment Scale was used to rate head control, trunk control, foot control, arm function, and hand function. Interrater reliability of this scale is reported.
  • Pediatric power wheelchairs: evaluation of function in the home and school environments (113). Describes a standardized functional task assessment for use in evaluating indoor function in a wheelchair, both at home and at school. The tasks assessed are classified into three categories: positioning, reaching, and driving.
  • Pressure sores: clinical practice and scientific approach (114). Describes pressure distribution measurements, movement studies during sleep, remote monitoring of wheelchair sitting behavior, mechanical force measurements, wound-healing measurements, tissue-distortion measurements, and compressive loading regimens.
  • Prevention of pressure sores: engineering and clinical aspects (115). Describes skin blood flow measurement, seat cushion evaluation techniques, pressure measurement using bladder pressure sensors and conventional pressure sensors, interface pressure distribution visualization, and sheer measurement techniques.
Balance Measurement Techniques
  • Method for the display of balance platform center of pressure data (116). Describes the measurement of center of pressure.
  • Assessing the influence of sensory interaction on balance (117). Describes procedures for assessing balance under six conditions in the typical clinic setting. Three visual conditions (i.e., normal, blindfolded, visual-conflict dome) are tested with two surface inputs (i.e., normal, standing on foam). Suggestions for quantifying postural sway under each condition are provided; a brief computational sketch of the center-of-pressure and sway calculations follows this list.
  • Adapting reflexes controlling the human posture (118). Describes a method of assessing balance using a displacement platform and a visual surround, which is used to assess the influence of various sensory conditions on balance.
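The center-of-pressure and sway computations mentioned above are simple enough to state concretely. The following Python sketch is ours rather than anything prescribed by references 116 to 118; it assumes the force-plate moments are expressed about an origin at the plate surface (so horizontal shear contributions drop out), and the function names are illustrative only.

```python
import numpy as np

def center_of_pressure(fz, mx, my):
    """Center-of-pressure coordinates (m) from force-plate channels.

    fz: vertical force (N); mx, my: moments (N*m) about the plate's
    x- and y-axes. Assumes the moment origin lies at the plate surface,
    so horizontal shear contributions to the moments are negligible.
    """
    fz = np.asarray(fz, dtype=float)
    cop_x = -np.asarray(my, dtype=float) / fz
    cop_y = np.asarray(mx, dtype=float) / fz
    return cop_x, cop_y

def rms_sway(cop):
    """Root-mean-square excursion of one COP coordinate about its mean,
    a common scalar summary of postural sway for a test condition."""
    cop = np.asarray(cop, dtype=float)
    return float(np.sqrt(np.mean((cop - np.mean(cop)) ** 2)))
```

Comparing rms_sway across the six sensory conditions of reference 117 yields a simple quantitative profile of how a patient’s stability degrades as visual and surface inputs are altered.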
Biomechanics and Motor Control Assessment Techniques
  • Biomechanics and motor control of human movement (119). Describes measurement of kinematic data (e.g., by using goniometers, accelerometers, and imaging techniques); anthropometric data (e.g., density, mass, center of mass, moment of inertia, joint centers of rotation, muscle anthropometry); kinetic data (e.g., joint reaction forces, bone-on-bone forces, force transducers, force plate data, muscle force estimates); mechanical work, energy, and power measurements; muscle mechanics; and electromyography.
  • Methodology for studying motor behavior (120). Describes methods of measuring movement kinematics, electromyography, movement errors, tracking, balance, coordination, reaction time, movement time, and motor skills.
  • Techniques of the study of movement (121). Describes methods of studying movements, including cinematography, stroboscopic photography, cyclography, stereoscopic recording, determining masses and centers of gravity, electrogoniometry, ultrasound, optoelectronics, accelerometry, photogrammetry, rigid-body kinematics, derivative estimation, state-space modeling, force plates, body segment description, kinetic modeling, and data processing. Includes descriptions of historical techniques and compares these with contemporary methods.
Computerized Assessment Data Bases and Techniques
  • Computer databases for pediatric disability: clinical and research applications (101). Reviews computer-based medical records and evaluation systems for individuals with disabilities. These systems have applications for clinical and research settings.
  • Computerized data management as an aid to clinical decision making in rehabilitation (102). Describes a computerized database that has been developed for clinical decision making in rehabilitation. Multidisciplinary patient performance data can be stored and accessed by all team members.
Computerized Medical Instrument Databases and Citation Indexes
  • Health instruments file database (122). A computerized database that contains information on instruments (e.g., questionnaires, interview protocols, observation checklists, index measures, rating scales, projective techniques, tests) in health, health-related, and behavioral sciences. Designed to identify measures needed for research studies, clinical assessments, and program evaluation. The database contains information on selected measurement instruments, instruments constructed for a particular study, and modifications of existing instruments.
  • Medical/health science bibliographies (123). A computerized database of annotated test bibliographies. The database includes tests of personality, sensory-motor function, vocation/occupation, behavior, developmental scales, family interaction, environmental influences, manual dexterity, learning, social skills, and social perception and judgment.
  • Medical device register: United States and Canada (124). Cross-references lists of medical instruments and devices.
  • Science citation index (125). References that have cited specific instruments are indexed according to the specific name of the instrument.
Elderly Assessment Instruments
  • Assessing the elderly (126). Reviews selected instruments for measuring physical health, physical functioning, activities of daily living, cognitive functioning, affective functioning, general mental health, social interactions and resources, person-environment compatibility, and multidimensional measures.
Electrodiagnostic Assessment Techniques
  • Electrodiagnostic evaluation of the peripheral nervous system (127). Describes electrodiagnostic procedures, including sensory nerve conduction studies, motor nerve conduction studies, single-fiber electromyography, needle electrode examination, and findings for specific diagnostic categories.
  • Manual of nerve conduction velocity and somatosensory evoked potentials (128). Describes techniques and normal value ranges for nerve conduction studies and somatosensory evoked potentials.
Functional Assessment Instruments
  • A critical review of 12 activities of daily living scales (129). Discusses parameters measured, type of scoring, scaling of scores, and the advantages and disadvantages of each scale.
  • A critical review of scales of activities of daily living (130). Reviews scales of basic self-care according to standard criteria. The evaluation criteria include purpose, clinical utility, test construction, standardization, reliability, and validity. Specific recommendations are made regarding which activities of daily living scales are most suitable for describing, predicting, or evaluating activities of daily living function.
  • Functional assessment in rehabilitation (131). Reviews functional assessments for people with physical disabilities, mental retardation, and psychiatric impairments, functional communication assessments, quantitative muscle function testing, upper-extremity functional capabilities, job-related social competence, learning potential for people with mental retardation, environmental influences on behavior, rehabilitation indicators, self-observation and report techniques, and vocational rehabilitation assessments.
  • Functional assessment in rehabilitation medicine (25). Reviews functional assessment instruments used in outcome measurement, rehabilitation nursing, and in assessing the elderly, the arthritic patient, and people with mental retardation. Also reviews functional measurement of verbal impairments, assessments of support systems for the elderly, assessment of family functioning, and functional assessments used in primary care.
  • Measurement of activities of daily living (132). Reviews the characteristics of several commonly used standardized activities of daily living assessments. The characteristics reviewed include number of test items, target population, parameters assessed, method of administration, and reliability.
  • Measurement of time in a standardized test of patient mobility (133). Describes a standardized assessment for evaluating the efficiency of bed mobility, wheelchair activities, transfer activities, and ambulation. Normative values are provided for 20- to 69-year-old people.
Gait Assessment Techniques
  • Footprint analysis in gait documentation (134). Provides instructions for obtaining footprint data in the typical clinic setting. Instructions for measuring velocity, cadence, foot progression angle, base of support, stride length, and step length are provided. Observations of toe drag and symmetry of pressure also are suggested. A short computational sketch of the time-distance calculations follows this list.
  • Functional community ambulation: what are your criteria? (135). Criteria are provided for evaluating functional community ambulation. Distances required for independent community ambulation at the post office, bank, doctor’s office, supermarket, department store, drugstore, and to cross intersections are provided. Typical curb heights and crosswalk times also are presented.
  • Gait analysis: normal and pathological function (136). Discusses observational gait analysis, oxygen consumption measures, ground reaction force measurements, dynamic electromyography, and gait assessment using motion analysis systems. Normal and pathologic gait patterns are described. Applications of assessment techniques to specific patient populations are discussed.
  • Interrater reliability of videotaped observational gait analysis assessments (137). Interrater reliability of 54 therapists observing videotapes of patients exhibiting abnormal gait was determined. The parameters assessed included knee flexion, genu valgum, cadence, step length, stride length, stance time, and step width. The therapists received no special training in preparation for this study, beyond their physical therapy education. The results indicate that observational gait analysis, in the absence of common rater training, has low to moderate interrater reliability.
  • Reliability of observational kinematic gait analysis (138). Methods of observational analysis are discussed. Descriptions are provided of the procedures used to develop a reliable observational gait analysis format and the protocol used to train raters. Interrater and test-retest reliability data were obtained by having raters observe gait videotapes. The results indicate that observational kinematic gait analysis is a convenient but only moderately reliable technique.
  • Strategies for the assessment of pediatric gait in the clinical setting (139). Describes observational and video gait analysis, measurement of time-distance parameters, electromyography, kinematics, kinetics, and energy expenditure. The pros and cons of each method are discussed, and instrumentation required is described. The use of gait analysis measurements for surgical and orthotic decision making also is presented.
  • Biomechanics and motor control of human gait (140). Discusses gait terminology, temporal and stride measures, kinematics, kinetics, and electromyography. Selected normal values are provided.
  • Impairment rating and disability evaluation (141). Section two of this text provides assessment tools for rating musculoskeletal impairment and work disability; functional capacity evaluation; psychological, social, and behavioral assessment tools; and physician assessment of work capacity.
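As a concrete illustration of the footprint-derived time-distance parameters described in reference 134, the sketch below computes velocity, cadence, and mean step and stride lengths from alternating heel-strike positions and times. The helper function and its input conventions are hypothetical, not part of the published protocol.

```python
import numpy as np

def time_distance_parameters(heel_x_m, heel_t_s):
    """Basic time-distance gait parameters from footprint data.

    heel_x_m: heel-strike positions (m) along the line of progression,
    alternating left/right feet; heel_t_s: the corresponding times (s).
    Illustrative conventions only; clinical protocols vary.
    """
    x = np.asarray(heel_x_m, dtype=float)
    t = np.asarray(heel_t_s, dtype=float)
    duration = t[-1] - t[0]
    return {
        "velocity_m_s": (x[-1] - x[0]) / duration,               # distance / time
        "cadence_steps_min": (len(x) - 1) / duration * 60.0,     # steps per minute
        "mean_step_length_m": float(np.mean(np.diff(x))),        # contralateral strikes
        "mean_stride_length_m": float(np.mean(x[2:] - x[:-2])),  # same-foot strikes
    }

# Example: five heel strikes spanning 6.0 m in 5.0 s give a velocity
# of 1.2 m/s and a cadence of 48 steps per minute.
```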
Multifactorial Rehabilitation Assessment References
  • Clinical measurement (142). Reviews measures of isokinetic strength, clinical measures, functional disability, sensorimotor performance, range of motion, developmental parameters, infant movement, postural control, and cardiopulmonary function.
  • Measurement in physical therapy (143). Reviews measures of strength testing (e.g., manual muscle testing, instrumented muscle performance measures), joint motion, functional assessment, gait assessment, children with central nervous system dysfunction, pulmonary function testing, cardiovascular function, nerve conduction velocity, and electromyographic testing.
  • Measurement tools with application to brain injury (144). Reviews measures of coma and global function, disability measures, communicative function, cognitive function, degree of handicap, general outcome measures, environmental measures, preinjury history, and sensory impairments.
  • Measuring health: a guide to rating scales and questionnaires (145). Reviews measures of functional disability and handicap, activities of daily living, psychological well-being, social health, quality of life and life satisfaction, pain measurements, and general health measurements.
  • Quality of life assessments in clinical trials (146). Reviews economic scales and tests, quality of life assessments, social interaction tests and scales, psychological tests and scales, and functional disability scales. Applications of these scales in rehabilitation and for specific patient populations are discussed.
  • Compendium of quality of life instruments (147). Contains more than 150 questionnaires and translations covering a wide range of disorders. It is divided into four parts to ensure easy access to the required instruments: (a) general section containing nondisease specific quality of life measures; (b) disease- or disorder-specific questionnaires; (c) section devoted to caregivers, children, elderly, and women; and (d) economic specific quality of life indices.
  • Quality of life measurement: applications in health and rehabilitation populations, part I (148). Special journal edition sponsored by the American Congress of Rehabilitation Medicine addressing an agenda for future QOL test development, SF-36, and other health-related QOL measures to assess persons with disabilities, measuring QOL in chronic illness, QOL issues in individuals with spinal cord injury, activity-related QOL in rehabilitation and traumatic brain injury, measuring health outcomes in stroke survivors, QOL outcomes perspective, and abstracts covering QOL issues.
  • Fourteenth Mental Measurements Yearbook (149, http://www.unl.edu/buros). Reviews standardized tests in the areas of achievement, aptitude, development, education, intelligence, neuropsychology, personality, sensory-motor, speech and hearing, and vocation. Bibliographies of references for specific tests, related to the construction, validity, or use of the tests in various settings, also are included. The tests are indexed by periodical, author, publisher, test or book title, and test classification. Reviews, descriptions, and references associated with older tests are contained in previous editions of the Yearbook.
Muscle Strength Assessment Techniques
  • Manual muscle strength assessment methods (20–24). Describes standard test positions and grading criteria for manual assessment of strength.
  • Muscle strength development and assessment in children and adolescents (3). Reviews the literature pertaining to the reliability and validity of strength testing using manual muscle testing and objective techniques. Describes principles of strength testing with both traditional manual methods and objective myometry techniques. Suggestions for testing infants, children, and adolescents are provided.
  • Muscle strength testing: instrumented and noninstrumented systems (150). Discusses strength assessment techniques, including skeletal muscle strength testing with instrumented and noninstrumented systems, isometric testing with fixed-load cells, dynamic strength testing, trunk strength testing, and grip and pinch strength measurements.
  • Technical manual: hand strength and dexterity tests (151). Describes tests of grip strength, pinch strength, and finger-hand coordination, and provides normative values.
Muscle Tone Assessment Techniques
  • Clinical measures of spasticity: are they reliable? (4). Reports intertrial, interrater, and test-retest reliability results for clinical measures of spasticity, including clonus and tendon tap reflexes, obtained from a group of people with traumatic spinal cord injuries.
  • Modified Ashworth Scale (152–154). Describes reliability testing results for the Modified Ashworth Scale in stroke, multiple sclerosis, and traumatic brain injury subjects.
  • Microprocessor-based instrument for Achilles tendon reflex measurements (155). Describes quantification of reflex responses by means of tendon tapping with measured forces.
  • Spasticity: quantitative measurements as a basis for assessing effectiveness of therapeutic intervention (156). Detailed description of a method for measuring mechanical output from spastic reflex muscle response to sinusoidal ankle motion at varying frequencies of oscillation and confirming spastic muscle response with surface electromyographic monitoring.
  • Spastic hypertonia: mechanisms and measurement (157). Describes a device with a servo-controlled motor that applies ramp and hold movements to the elbow. Surface electromyographic activity of the biceps, brachioradialis, and lateral triceps muscles is measured in response to the stretch stimulus.
Occupational Biomechanics, Ergonomics, and Work Capacity Evaluation Techniques
  • Ergonomic design for people at work (158). Volume 1 discusses design issues for the workplace, equipment, hand tools, and the environment. Volume 2 describes evaluation of job demands, lifting, manual materials handling by means of surveys, timed activity analysis, biomechanical analysis, energy expenditure measurements, and motion analysis techniques.
  • Occupational biomechanics (41). Reviews measurement of anthropometry, joint motion, muscle strength, motion analysis, postural analysis, force platform data, work capacity, vibration exposure, manual materials handling, hand tool analysis, preemployment screening, job analysis, and ergonomic assessments in clerical and industrial settings. Manual work evaluation techniques also are discussed, including motion time measurement methods, physical demands analysis, manual lifting analysis, job static strength analysis, and job postural analysis.
Occupational Therapy Evaluation Techniques
  • An annotated index of occupational therapy evaluation tools (159). Reviews the purposes, advantages, and limitations of standardized and nonstandardized tests of activities of daily living, adaptive skills, cognitive skills, developmental skills, oral function, person-environment interactions, play skills, psychosocial skills, roles and habits, sensory integration, visual-perceptual skills, and vocational skills.
  • Housing accessibility checklist (160). Specific criteria are described for determining housing accessibility. Recommended minimum standards are provided for parking areas, walks and ramps, curbs, stairs, doorways, elevators, and interior rooms.
  • Mental health assessment in occupational therapy (161). Reviews selected assessments of human function pertaining to mental health, including checklists, interest inventories, assessment of older adults, prevocational assessments, work tolerance screening, research analysis of evaluation tools used to assess mental health patients, and the Milwaukee Evaluation of Daily Living Skills and Kohlman Evaluation of Living Skills assessment scales.
  • Willard and Spackman’s occupational therapy (162). Reviews tests of manual dexterity, motor function, development, sensory integration, and intelligence, as well as psychological tests.
Pediatric Assessment Instruments
  • Pediatric functional outcome measures (163). Reviews the technical and clinical merits of selected functional outcome measures used in pediatric rehabilitation practice and neurodevelopmental rehabilitation. Reviews measures of gross and fine motor function, activities of daily living, general cognitive abilities, speech and language, and child and parent adjustment.
Physical Function Assessment Techniques
  • Assessment of fitness (164,165). Discusses objective assessment methods and interpretation of test results for the evaluation of flexibility, body composition (e.g., body density, anthropometry, total body water, muscle mass estimation), muscle strength and endurance, anaerobic abilities, aerobic abilities, leisure time and occupational activity, and physiologic fitness (e.g., blood pressure, blood lipids and lipoproteins, glucose intolerance). References for specific tests are provided. A worked example of the body-density calculation follows this list.
  • Human muscle function and fatigue (166). Describes the mechanism of muscle fatigue, distinguishing between central and peripheral factors. Describes tests of contractile function and electromyographic changes with fatigue.
  • Introduction to measurement in physical education and exercise science (51). Reviews measures of physical fitness, including body composition (e.g., hydrostatic weighing, skinfold thickness), aerobic fitness tests, performance-based measures, muscle strength and endurance, balance, flexibility, posture, and motor ability.
  • Measurement for evaluation in physical education and exercise sciences (45). Reviews measures of physical abilities (e.g., muscle strength, power, endurance, flexibility, balance, kinesthetic perception), youth fitness, aerobic fitness, body composition (e.g., hydrostatic weighing, skinfold thickness), and skill achievement.
  • Methodology in human fatigue assessment (167). Describes methods of assessing fatigue, including psychological ratings, the blink method, urinary metabolite measurements, assessment of fatigue at work, direct estimation of circulatory fatigue using bicycle ergometry, determination of muscular work performed with different muscle groups, increasing workloads under different environmental conditions, mental fatigue and stress, and fatigue assessments of specific worker populations.
  • Patient evaluation methods for the health professional (168). Describes standardized techniques for measuring limb girth, limb length, limb volume, joint range of motion, muscle length, activities of daily living, motor control, and neurologic parameters.
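To make the hydrostatic body-composition calculation referenced above concrete, the sketch below applies the standard two-compartment conversion (the Siri equation) to a whole-body density obtained from underwater weighing. This is our illustration, not a procedure taken from references 164 or 165; the default water density and gas-volume corrections are conventional assumptions.

```python
def body_density(mass_air_kg, mass_water_kg,
                 water_density_kg_l=0.9951, residual_volume_l=1.2,
                 gi_gas_l=0.1):
    """Whole-body density (kg/L) from hydrostatic weighing.

    Body volume is the displaced water volume, corrected for residual
    lung volume and a conventional ~0.1 L of gastrointestinal gas.
    Water density should match the measured tank temperature.
    """
    volume_l = ((mass_air_kg - mass_water_kg) / water_density_kg_l
                - residual_volume_l - gi_gas_l)
    return mass_air_kg / volume_l

def percent_fat_siri(density_kg_l):
    """Siri two-compartment estimate of percent body fat."""
    return (4.95 / density_kg_l - 4.50) * 100.0

# Example: a body density of 1.05 kg/L corresponds to roughly 21% fat.
```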
Psychosocial Assessment Instruments
  • A sourcebook for mental health measures (169). Contains 1,100 abstracts of mental health-related psychological measures that describe questionnaires, scales, inventories, tests, and other types of measuring devices. The emphasis is on instruments that have been developed for research or clinical purposes and are less well known than commercially published tests. Abstracts are grouped into 45 categories. These categories include alcoholism, cognitive tests, counseling and guidance, crime and juvenile delinquency, differential psychological diagnosis, drugs, educational adjustment, environments, family interaction, generational differences, geriatrics, marriage and divorce, mental health attitudes, mental retardation, mental status and level of psychological functioning, occupational adjustment, parent behavior and viewpoints, personal history and demographic data, personality, physical handicap, racial attitudes, psychiatric rehabilitation, service delivery, sex, social issues, student and teacher attitudes, suicide and death, therapeutic outcomes, therapeutic processes, and vocational tests. A description of each instrument is provided, along with the source. In addition, the sourcebook references several other sources of mental health measures.
  • Evaluating practice: guidelines for the accountable professional (46). Reviews a group of nine instruments that measure generalized contentment, self-esteem, marital satisfaction, sexual satisfaction, parental attitudes, child’s attitudes, family relations, and peer relations. Also provides references and briefly discusses reviews of various psychological measures, including mental health measures, psychotherapy change measures, behavioral assessment questionnaires, behavior checklists, psychological assessment, social attitudes, social functioning, adult assessment, rapid assessment instruments for practice, and rating scales that are useful for evaluating patient performance using an interview or observation format.
  • Neuropsychological assessment (170). Reviews measures of intellectual abilities, verbal functions, perceptual functions, constructional functions, memory functions, conceptual functions, executive functions, motor performance, orientation, attention, tests for brain injury, observational methods, rating scales and inventories, and tests of personal adjustment and functional disorders.
  • Psychological testing (34). Reviews intelligence and developmental tests for the general population and special populations, educational achievement and competency tests, creativity and reasoning tests, projective testing techniques, environmental attitudes tests, vocational aptitude tests, occupational cognitive screening, psychomotor tests, aptitude tests, personality tests, behavioral assessments, measures of interests, values, and personal orientation, and tests for learning disabilities and neuropsychological dysfunctions.
  • Self-report inventories in behavioral assessment (171). Reviews instruments that measure fears, anxiety, assertiveness, social skills, and depression.
Range-of-Motion and Muscle Extensibility Assessment Techniques
  • Measurement of joint motion (23,172). Reviews static and dynamic methods of measuring joint motion.
  • Measurement of joint motion: a guide to goniometry (173). Describes standardized procedures for measuring range of motion of the extremities, spine, and temporomandibular joint. Photos show each test position. Normative values are provided for ranges of motion of each joint.
  • Measurement of trunk motion and flexibility (174–182). This series of references describes measurement techniques and the reliability of trunk lateral flexion (176,179,180,182), forward flexion (174,181), and extension (175,177,178,182).
Speech Assessment Techniques
  • Appraisal and diagnosis of speech and language disorders (33). Reviews measures of articulation, speech-sound discrimination, language, developmental skills, motor skills, nonverbal intelligence, speech production, structural disorders, fluency, and neurologic disorders.
  • Measuring outcomes in speech-language pathology (183). Comprehensive text covering multiple content domains, including definitions, dimensions, perspectives, and requirements of measurement; measuring modality-specific behaviors, functional abilities, and quality of life; measuring consumer satisfaction; collecting, analyzing, and reporting financial outcomes; treatment efficacy research; program evaluation; quality improvement; outcomes measurement in culturally and linguistically diverse populations; outcomes measurement in aphasia; outcomes measurement in cognitive communication disorders (traumatic brain injury, right hemisphere brain damage, dementia); efficacy outcomes and cost-effectiveness in dysphagia; outcomes in motor speech disorders; outcomes measurement in voice disorders; outcomes measurement in fluency disorders; outcomes measurement in child language and phonologic disorders; outcomes measurement in specific settings (schools, health-care facilities, universities, private practice).
  • Medical speech-language pathology: a practitioner’s guide (184). Clinically oriented text with practical measurement methods for selected issues such as measures of swallowing useful in defining the efficacy of treatment during radiographic study of oropharyngeal swallow; neurocommunicative monitoring tools; indicators of malnutrition; ASHA Functional Assessment of Communication Skills; electrodiagnostic testing for neuromuscular disorders; language and communication assessment measures to use with dementia patients.