Test-retest reliability of the GAPP functional capacity evaluation in healthy adults

Key words. Intraclass correlation coefficient; Intrarater reliability; Stability; Work capacity evaluation.

Abstract

Background. Functional capacity evaluations are commonly used in work rehabilitation practice to assess a person's capacity to perform work-related activities.

Purpose. This study examined the test-retest reliability of participants' performance and administrator ratings using the Gibson Approach to Functional Capacity Evaluation (GAPP FCE).

Methods. Forty-eight healthy adults were evaluated twice on 12 recommended core items of the GAPP FCE and rated for overall performance.

Findings. The intraclass correlation coefficients (ICCs) and 95% confidence intervals (CIs) for the Physical Level of Work and Alternative Physical Level of Work Ratings were 0.93 (0.87-0.96) and 0.86 (0.72-0.93), respectively. The ICCs for the core item-level ratings ranged from 0.15 to 0.94, and the ICCs for the actual loads handled in the manual handling items ranged from 0.88 to 0.95.

Implications. The stability of an overall physical level of work rating shows potential for use in functional capacity evaluation practice and research. Further research is needed to investigate other measurement properties of the GAPP FCE using populations with injury or disability.

Publication: The Canadian Journal of Occupational Therapy
Author: Gibson, Libby A
Date published: February 1, 2010

Functional capacity evaluations (FCEs) are commonly used in work rehabilitation practice to assess a person's capacity to perform work-related activities (Deen, Gibson, & Strong, 2002; Jundt & King, 1999). Information gained from such evaluations is used to make recommendations about whether a person can return to work and the levels of work the person can safely perform (Gouttebarge, Wind, Kuijer, & Frings-Dresen, 2004; Gross & Battie, 2002). Employers and insurers often rely on FCEs to help make important return-to-work decisions (King, Tuckwell, & Barrett, 1998). However, such use has occurred despite limited, albeit increasing, evidence of the measurement properties of FCEs (Innes, 2006; Innes & Straker, 1999a, 1999b; King et al.).

The need for FCEs and other work-related assessments to demonstrate sound properties of reliability and validity is widely recognised in the literature (Innes, 2006; Innes & Straker, 1999a, 1999b). Over the past decade, there has been an increase in the examination of such measurement properties in both existing and newly developed assessments (Innes). Although some commercially available FCEs have been studied for their reliability and validity, other assessments still lack sufficient investigation (Gouttebarge et al., 2004; Innes).

Reliability refers to the consistency of measurements (Innes & Straker, 1999a; Portney & Watkins, 2009) and is an essential characteristic for FCEs and assessments in general (Innes & Straker; King et al., 1998). Without evidence of reliability, confidence in FCE measurements is diminished (Gardener & McKenna, 1999; Innes & Straker). The three types of reliability commonly associated with work-related assessments are test-retest, intra-rater, and inter-rater reliability (Innes & Straker). Some authors have noted that when a rater is involved in scoring the evaluation, intra-rater reliability is equivalent to test-retest reliability because the accuracy of the FCE is influenced by the skill of the rater (Gouttebarge et al., 2004; Gouttebarge, Wind, Kuijer, Sluiter, & Frings-Dresen, 2005; Portney & Watkins). For the remainder of this paper, the term test-retest reliability is used to refer to the stability of ratings made across two testing occasions. In a systematic review of the reliability and validity of FCEs, Gouttebarge et al. (2004) indicated the need for more rigorous procedures to demonstrate the reliability of FCE methods, particularly on a retest basis. They proposed guidelines for FCE research, such as required time intervals between test administrations, use of clearly defined statistics that are appropriate to the objective of the study, and an adequate description of the study population (Gouttebarge et al., 2004).

A new approach to FCE, called the Gibson Approach to Functional Capacity Evaluation (GAPP FCE), has been developed to provide a standard framework for evaluating people with chronic back pain (Gibson & Strong, 2002; Gibson, Strong, & Wallace, 2005). As with some other existing approaches (Brouwer et al., 2003; Legge & Burgess-Limerick, 2007; Tuckwell, Straker, & Barrett, 2002), the content of this approach is based on the physical demands of work from the Dictionary of Occupational Titles ([DOT] United States Department of Labor, 1991). The GAPP FCE adopts a conceptual framework that parallels the World Health Organization's International Classification of Functioning, Disability, and Health (ICF) (World Health Organization [WHO], 2001) by measuring an individual's activity limitations based on their performance of the physical demands (Gibson & Strong, 2003) and by making recommendations about participation in work based on this performance.

The GAPP FCE has undergone some examination of its measurement properties (Gibson et al., 2005). The research and test development process of the GAPP FCE has included expert review (Gibson & Strong, 2002), pilot testing with participants without injury and clients with chronic back pain (Gibson et al.), examination of inter-rater reliability (Gibson et al.), examination of safety aspects with rehabilitation clients with chronic back pain (Gibson & Strong, 2005), and assessment of item validity (Kersnovske, Gibson, & Strong, 2005).

Previous studies evaluating the test-retest reliability of FCEs have used both injured and noninjured participants (Brouwer et al., 2003; Gouttebarge, Wind, Kuijer, Sluiter, & Frings-Dresen, 2006; Gouttebarge et al., 2005; Gross & Battie, 2002; Horneij, Holmstrom, Hemborg, Isberg, & Ekdahl, 2002; Legge & Burgess-Limerick, 2007; Lygren, Dragesund, Joensen, Ask, & Moe-Nilssen, 2005; Reneman et al., 2004; Reneman, Dijkstra, Westmaas, & Goeken, 2002; Smeets, Hijdra, Kester, Hitters, & Knottnerus, 2006; Soer, Gerrits, & Reneman, 2006; Tuckwell et al., 2002). In the studies that have been conducted with injured participants, fluctuation in pain levels, motivation or self-limiting behaviours were found to affect performance on the FCEs (Gross & Battie; Lygren et al.; Reneman et al., 2002; Tuckwell et al.). Therefore, the current study examined the test-retest reliability of participants' performance and administrator ratings using the GAPP FCE in a healthy, noninjured adult population to provide an element of stability in the measurements, in that the participants' health status should not have substantially varied between tests nor greatly affected their performance on the FCE.

Methods

Participants

Participants were 52 healthy adult volunteers who were recruited by convenience and consented to participate in the study. The study aimed for a sample size of 55 participants, as required for adequate power and an intra-class correlation of 0.8 (Bonett, 2002); however, we were unable to achieve this due to financial and time constraints. The participants were undergraduate and graduate entry master's students from programs in occupational therapy, physiotherapy, and pharmacy who met the inclusion criteria. Clinical educators personally invited students to participate in the study during fieldwork experience, and additional students were recruited via bulk e-mail to student year groups.
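The sample size reasoning above can be illustrated with a short sketch. It assumes the closed-form approximation usually attributed to Bonett (2002) for the number of participants needed to estimate an ICC with a confidence interval of a chosen total width. The function name, the target width of 0.2, and the small-sample adjustment for two occasions are assumptions made for illustration; the source states only the target of 55 participants and the planning ICC of 0.8.

```python
import math
from statistics import NormalDist


def bonett_sample_size(rho, width, k=2, conf_level=0.95):
    """Approximate n needed so that the CI for an ICC has the given total width.

    rho        : planning value of the ICC (0.8 in the text above)
    width      : desired total CI width (ASSUMED here, not stated in the source)
    k          : number of measurement occasions per participant
    conf_level : confidence level of the interval
    """
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    n = (8 * z**2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
         / (k * (k - 1) * width**2)) + 1
    # Assumed adjustment for two occasions and a high planning ICC.
    if k == 2 and rho >= 0.7:
        n += 5 * rho
    return math.ceil(n)


# Hypothetical planning inputs: ICC of 0.8, two occasions, CI width of 0.2.
print(bonett_sample_size(rho=0.8, width=0.2))
```

With these assumed inputs the sketch returns 55, which coincides with the target reported above, although the planning inputs actually used in the study are not given.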

Data Collection

Ethical approval was granted by the relevant ethics committee at The University of Queensland. A repeated-measures design was used for this study. Two FCE sessions were held within the recommended time interval of 7-14 days to ensure that the participants had recovered from the first FCE session and to minimise the effect of learning (Gouttebarge et al., 2004). Participants were retested whenever possible on the same day of the week at the same time. Testing took place within the Division of Occupational Therapy at The University of Queensland using the standard equipment available.

An independent and experienced occupational therapist administered and evaluated all the FCEs using the GAPP FCE Procedures and User's Manual, which is described in detail in Gibson et al. (2005). This occupational therapist screened participants using eligibility criteria and obtained written informed consent from participants before testing.

Before carrying out the evaluations, the therapist undertook training in the GAPP FCE (total of 10 hours) with the developer and the first author (LG). Procedural reliability was assessed between the therapist and the first author (LG) using a healthy volunteer. The therapist achieved 93% procedural reliability on the first practice administration and scoring of the test, which is above the minimum 90% level required, as suggested by Hinderer and Hinderer (1993).

The participants were given brief instructions and a demonstration of the required performance of each item before testing. A total of 12 items were evaluated, and these are listed in Table 1. The administering therapist completed the standard screening procedures as required in the standard GAPP FCE process, including completion of a Physical Activity Readiness Questionnaire (PAR-Q) (Canadian Society for Exercise Physiology, 2002). At the retest, the administering therapist also checked the participants' overall health status and the effects of the first FCE through brief discussion with the participant before retesting.

For each of the recommended core items, the therapist noted elements about the nature and quality of the person's performance using the GAPP FCE scoresheets (such as the load handled, observed muscular effort, perceived effort, heart rate) and, based on these, rated the amount of difficulty the person had performing the items on a five-point scale ranging from "no difficulty" to "complete difficulty." This is called the Physical Demand Performance Rating (PDPR) and is compatible with the ICF view of activity limitation (WHO, 2001). The items that require handling of loads are divided into categories of loads ranging from "very light loads" to "very heavy loads," based on the definitions of loads in the DOT (United States Department of Labor, 1991). The therapist also rated the person's performance in each of these load levels.

After completing the evaluation, the therapist made an overall rating about the physical level of work the person can perform on a 10-point scale ranging from "sedentary with restrictions" to "very heavy." This rating, called the Physical Level of Work Rating (PLW), is based on the different strength levels of work defined in the DOT (United States Department of Labor, 1991) and is determined by the person's overall performance on the GAPP FCE using an algorithm.

Recommendations were also made about the frequency with which the person can perform each of the physical demands in the workplace. A five-point scale, called the Physical Demand Frequency Rating (PDFR), which ranged from "never" to "constantly," was used. This scale is based on frequency definitions from the DOT (United States Department of Labor, 1991). In the GAPP FCE, other ratings are made for return-to-work recommendations (Gibson et al., 2005). However, as the participants in this study were healthy and not involved in returning to work, these were not applicable for investigation of test-retest reliability.

In the course of the study, an alternative scale for rating the recommended physical level of work, called the Alternative Physical Level of Work Rating (APLW), was introduced for investigation of its test-retest reliability for potential use in FCE. The difference between the alternative rating scale and the physical level of work rating scale is the classification of the "heaviness" of load handling in jobs (Gibson & Strong, 2005). It is a more conservative measure that aligns with the recommended U.S. National Institute for Occupational Safety and Health (NIOSH) lifting guidelines (Gibson & Strong; National Institute for Occupational Safety and Health, 1994).

The GAPP FCE also recommends the use of instruments to measure key psychosocial variables during the FCE. The participants in this study completed the Spinal Function Sort (SFS) (Gibson & Strong, 1996; Matheson & Matheson, 1989; Matheson, Matheson, & Grant, 1993), but not the complete battery of recommended psychosocial questionnaires as these were not indicated for a healthy sample. The SFS provides a rating of perceived capacity (RPC), ranging from 0 to 200, for the performance of work-related tasks.

Data Analysis

The test-retest reliability of the Physical Demand Performance Ratings (PDPRs), the Physical Level of Work Rating (PLW), the Alternative Physical Level of Work Rating (APLW), and the Physical Demand Frequency Ratings (PDFRs) was assessed using intraclass correlation coefficients (ICCs), Krippendorff's alpha (α) and 95% confidence intervals (CIs). For the manual handling items (Lifting Waist to Waist, Lifting Floor to Waist, Carrying Bilateral) and the Spinal Function Sort Rating of Perceived Capacity (RPC) scores, means, standard deviations, ranges, ICCs, Krippendorff's α, and 95% CI were also calculated. Paired t-tests with a 0.05 significance level were performed to determine if the mean maximum safe loads handled in the first FCE differed significantly from the retest and whether there were significant differences in mean RPC scores between testing occasions.
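As a minimal sketch of the paired comparison described above, the following computes the paired t statistic and its degrees of freedom from two sets of loads. The function name and the loads are hypothetical and are not data from this study; the p-value would then be read from a t distribution (for example in SPSS, as used here).

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for paired observations."""
    if len(first) != len(second):
        raise ValueError("both occasions need the same participants")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # Standard error of the mean difference (sample SD with n - 1).
    se = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se, n - 1


# Hypothetical maximum safe loads in kg on the first FCE and the retest.
test_1 = [20.0, 25.0, 30.0, 22.5, 27.5]
test_2 = [22.5, 25.0, 32.5, 25.0, 30.0]
t_stat, df = paired_t(test_1, test_2)
print(t_stat, df)
```

A positive t value here means that loads were higher on retest. The sketch fails with a division by zero if every participant changes by exactly the same amount, because the differences then have no variance.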

ICCs are a commonly preferred statistical method for analysis of reliability (Portney & Watkins, 2009) that have been used by researchers in previous studies examining the test-retest reliability of FCEs (Brouwer et al., 2003; Gouttebarge et al., 2006; Gouttebarge et al., 2005; Gross & Battie, 2002; Horneij et al., 2002; Legge & Burgess-Limerick, 2007; Lygren et al., 2005; Reneman et al., 2004; Reneman et al., 2002; Smeets et al., 2006; Soer et al., 2006; Tuckwell et al., 2002). The model ICC (3,1), which is based on a two-way, mixed-effect analysis of variance model, was used to assess the degree of agreement among ratings made from the first FCE and the retest (Physical Demand Performance Rating, Physical Demand Frequency Rating, Physical Level of Work Rating, Alternative Physical Level of Work Rating). This model has been proposed for measuring intra-rater reliability with multiple scores from the same rater (Portney & Watkins). The model ICC (1,1), which is based on a one-way, random analysis of variance model, was used to calculate the test-retest reliability of continuous data, including the maximum safe loads handled, and the RPC scores (Portney & Watkins). Other studies have used this model when examining test-retest reliability of actual loads handled (Brouwer et al.; Gross & Battie; Lygren et al.; Reneman et al., 2004; Reneman et al., 2002; Smeets et al.; Soer et al.).
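The two ICC models named above can be sketched from the analysis of variance mean squares, assuming the standard Shrout and Fleiss definitions of ICC (1,1) and ICC (3,1). This is an illustration only, not the SPSS procedure used in the study; the function name and the example scores are hypothetical.

```python
def icc_models(scores):
    """Return (ICC(1,1), ICC(3,1)) for a participants x occasions table.

    scores: list of rows, one per participant, each holding k scores.
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_between = k * sum((m - grand) ** 2 for m in row_means)
    ss_occasion = n * sum((m - grand) ** 2 for m in col_means)
    ss_within = ss_total - ss_between      # one-way model: all within-subject variation
    ss_error = ss_within - ss_occasion     # two-way model: occasion effect removed

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    ms_error = ss_error / ((n - 1) * (k - 1))

    icc_1_1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_3_1 = (ms_between - ms_error) / (ms_between + (k - 1) * ms_error)
    return icc_1_1, icc_3_1


# Hypothetical scores: every participant scores one unit higher on retest.
shifted = [[1, 2], [2, 3], [3, 4]]
print(icc_models(shifted))
```

In this hypothetical example the constant one-unit increase on retest lowers ICC (1,1) but leaves ICC (3,1) unchanged, because the two-way model removes the occasion effect from the error term. When all participants receive the same score on both occasions, both mean squares are zero and the ratio is undefined.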

An alternative statistic, called Krippendorff's alpha (α), has been proposed by Hayes and Krippendorff (2007) as a useful standard measure of reliability. It measures the observed agreement and the agreement that is expected by chance (Axelrod & Hone, 2006; Lombard, Snyder-Duch, & Bracken, 2002), and can be calculated for different levels of measurement (nominal, ordinal, interval, ratio) and for any number of observers (Hayes & Krippendorff). As there is no agreement amongst researchers regarding the most appropriate statistical approach to use in reliability studies (Hayes & Krippendorff; Portney & Watkins, 2009; Rankin & Stokes, 1998), and the general consensus is that no single estimate of reliability is sufficient, it is preferable that more than one type of analysis is applied (Bruton, Conway, & Holgate, 2000; Lombard et al.). Therefore, Krippendorff's α was also used to assess the reliability of all GAPP FCE measures. Where ICCs and Krippendorff's α could not be calculated due to lack of variance between participants (i.e., when all or most were rated as having "no difficulty" as they achieved the ceiling or criterion required by the GAPP FCE procedure), the ratings were described qualitatively.
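To make the comparison of observed and expected disagreement concrete, the following is a minimal sketch of Krippendorff's α for complete data using the interval (squared difference) metric. It is not the KALPHA macro used in this study: the function name is hypothetical, the choice of the interval metric is an assumption, and the bootstrapped confidence intervals that KALPHA provides are not reproduced.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric.

    units: list of lists, one inner list of scores per participant
           (for test-retest data, two scores each).
    """
    # Only participants with at least two scores are pairable.
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: squared differences within each participant.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences between all pairs of scores,
    # regardless of which participant produced them.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1 - d_o / d_e


# Hypothetical scores: every participant scores one unit higher on retest.
print(krippendorff_alpha_interval([[1, 2], [2, 3], [3, 4]]))
```

The explicit error for identical scores mirrors the situation described above, in which coefficients could not be calculated for items where all participants met the ceiling.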

All analyses were performed using the statistical software package SPSS Version 15.0 (SPSS Inc., Chicago, IL, USA). Krippendorff's α was calculated using a macro written for SPSS called KALPHA (Hayes & Krippendorff, 2007). ICCs were interpreted as follows: (1) ICC ≥ 0.90 were considered excellent and sufficient for clinical testing; (2) 0.75 < ICC < 0.90 were considered good; and (3) ICC ≤ 0.75 were considered poor to moderate (Innes & Straker, 1999a; Portney & Watkins, 2009). There is no well-established standard for interpreting Krippendorff's α (Krippendorff, 2004; Lombard et al., 2002). Some authors have proposed that α coefficients of 0.70 or greater are an acceptable level of reliability for exploratory research (Lombard et al.; Lombard, Snyder-Duch, & Bracken, 2005). Others have suggested α coefficients of 0.80 or greater as a reliability standard (Krippendorff). Where low coefficient values were obtained, raw data were examined in an effort to offer explanations for the findings.
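The interpretation bands for ICCs can be expressed as a small helper. The function name is hypothetical, and the treatment of values exactly at 0.75 and 0.90 is an assumption, since the comparison symbols at those boundaries are not fully legible in the text.

```python
def interpret_icc(icc):
    """Map an ICC to the descriptive bands used in this paper.

    Boundary handling (0.90 counted as excellent, 0.75 counted as
    poor to moderate) is an assumption.
    """
    if icc >= 0.90:
        return "excellent"
    if icc > 0.75:
        return "good"
    return "poor to moderate"


# ICC values reported in the abstract of this paper.
for value in (0.93, 0.86, 0.15):
    print(value, interpret_icc(value))
```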

Findings

Participants

Of the 52 participants, 4 did not complete the retest, due to scheduling difficulties and unavailability in the required retest interval. Therefore, a total of 48 participants completed both FCE sessions. There were 9 men and 39 women. The majority of participants (69.2%) were working whilst undertaking their university studies, primarily in sales or personal service. Most were employed in medium-level occupations as per the DOT definitions of work (United States Department of Labor, 1991). Their mean age was 21.90 years (SD = 2.01, range 19-30), mean weight was 61.33 kg (SD = 10.45, range 43-85) and mean height was 168.28 cm (SD = 7.96, range 152-187). One participant did not perform the Kneeling Repetitively item; one participant did not perform the Carrying Bilateral item; and one participant did not perform the Lifting Floor to Waist, Kneeling Repetitively, and Carrying Bilateral items due to a pre-existing, but stable, musculoskeletal condition.

FCE Administration

The average duration of FCE sessions was 1 hour and 36 minutes (SD = 17 minutes), ranging from 1 hour and 15 minutes to 3 hours and 8 minutes. This time included collecting demographic information and completing the questionnaires. The average time interval between retests was 8.81 days (SD = 3.15, range 6-15).

Test-Retest Reliability

Physical Level of Work Rating.

The results for the reliability of the Physical Level of Work Ratings are presented in Table 2. The ICC was 0.93 and α coefficient was 0.87. A sub-sample of 29 participants was rated using the alternative rating scale. The ICC for the Alternative Physical Level of Work Rating was 0.86 and α coefficient was 0.84.

Physical Demand Performance Rating.

The results of the Physical Demand Performance Ratings are presented in Table 3. Analyses could not be performed on six core items and three sub-items due to lack of variance among the ratings on the first and second FCE sessions (i.e., many were rated as having "no difficulty"). All participants completed the required procedure with "no difficulty" (i.e., they completed the required distance or repetitions) for Walking, Lifting Waist to Waist (very light), Lifting Floor to Waist (very light), Climbing Stairs, and Carrying Bilateral (very light). All but one participant completed the required procedure with "no difficulty" for Sitting, Standing, and Crouching Repetitively. For Reaching Overhead, only two participants were not rated as having "no difficulty" by the therapist. For these participants, there was a change of one level in the activity limitation rating scale between testing occasions.

Lower ICC and α values were obtained for Lifting Floor to Waist (light and very heavy), Carrying Bilateral (light), and Stooping (Sustained Semi-squatting). Examination of the actual values indicated that most participants completed these items with "no difficulty," except for Lifting Floor to Waist (very heavy); most participants did not reach this load level and were rated as having "complete difficulty." For Stooping (Sustained Semi-squatting), all but two participants completed the required procedure with "no difficulty" on both testing occasions. These two participants had changed ratings on retest, one increasing by two levels on the rating scale, the other decreasing by two levels.

For the other items, the ICC values ranged from 0.64 to 0.94 and α coefficients ranged from 0.58 to 0.88.

Physical Demand Frequency Rating.

As for the Physical Demand Performance Ratings, analyses could not be performed on three core items and three sub-items due to insufficient variation among the ratings. For the items that could be analysed, the ICC values ranged from 0.37 to 0.95, and α coefficients ranged from 0.37 to 0.93. See Table 4 for the results of the Physical Demand Frequency Ratings.

Heterogeneity of sample.

For some of the GAPP FCE items, low reliability coefficients and confidence intervals that included negative values were obtained. Negative values can neither be relied upon (Krippendorff, 2004) nor considered valid (Portney & Watkins, 2009). For ICC values, the significance of the between-subjects variance in the analysis of variance was examined to determine if the participants in the study were different from each other (Portney & Watkins). If this effect is significant, it is an indication of sufficient heterogeneity within the sample (Portney & Watkins). The between-subjects variance was significant (p < 0.05) for the Physical Level of Work Rating, the Alternative Physical Level of Work Rating, all items in the Physical Demand Frequency Ratings, and all items in the Physical Demand Performance Ratings except for one, Lifting Floor to Waist (light) (p = 0.149).

Manual handling items.

The mean maximum safe weights handled during the first and second FCE and the reliability results of the manual handling items are shown in Table 5. Paired t-tests showed that, on average, the participants lifted and carried more on retest, and these differences were statistically significant for two manual handling items: Lifting Waist to Waist (p < 0.0001) and Lifting Floor to Waist (p < 0.0001). ICC values ranged from 0.88 to 0.95 and α coefficients ranged from 0.88 to 0.92.

Spinal Function Sort (Rating of Perceived Capacity).

The results of the Rating of Perceived Capacity scores are also presented in Table 5. No significant differences in mean Rating of Perceived Capacity scores for the whole sample were found between testing occasions. The ICC for test-retest reliability was 0.82 and α coefficient was 0.80.

Discussion

In this study, despite poor to moderate test-retest reliability for some of the individual GAPP FCE items, excellent test-retest reliability for the overall Physical Level of Work Rating and good test-retest reliability for the Alternative Physical Level of Work Rating were achieved. The Physical Level of Work Rating is an overall rating of a person's performance in the GAPP FCE, and is made by the therapist with consideration of the person's performance in individual GAPP FCE items. These findings suggest that an overall rating scale may provide more useful information than individual rating scales in terms of their stability for making return-to-work recommendations. The GAPP FCE appears to be one of the few evaluations in which an overall rating is used to determine the physical level of work a person can perform. The only other published results for an FCE with this feature appear to be from a study on the test-retest reliability of the Physical Work Performance Evaluation (PWPE) (Brassard, Durand, Loisel, & Lemaire, 2006). In this study, published in French, Brassard et al. investigated the test-retest reliability of the PWPE, including the final PWPE score, which achieved a kappa value of 0.43. Although the PWPE uses a different scale and different procedures from the GAPP FCE, and kappa statistics cannot be compared directly with ICCs, a kappa of 0.43, which indicates moderate reliability at best, does not appear as favourable as the ICC of 0.93 that we found for the overall rating made in the GAPP FCE. Other FCEs reportedly use pass or fail ratings for the scoring of the performance of the physical demands. Oliveri, Jansen, Oesch, and Kool (2005) questioned the value of such pass or fail ratings for determining return-to-work capacity.

Reliability coefficients for the individual Physical Demand Performance Ratings and the Physical Demand Frequency Ratings ranged from poor to excellent. One possible explanation for the lower reliability coefficients is the similarity of performance among the healthy participants (e.g., the majority were rated as having "no difficulty") (Portney & Watkins, 2009). However, the between-subjects variance was non-significant for only one item of the GAPP FCE, indicating sufficient variation for reliability analysis in all other items (Portney & Watkins). For this item, Lifting Floor to Waist (light), the confidence interval for the ICC included negative values. A negative ICC value indicates that the within-subjects variance is greater than the between-subjects variance, and therefore cannot be relied upon (Portney & Watkins).

The lower values might also be explained by the limitations of using ICCs in reliability analysis, as commonly noted in the literature (Geisser, Alschuler, Theisen-Goodvich, & Haig, 2008; Legge & Burgess-Limerick, 2007; Rankin & Stokes, 1998; Smeets, 2008; Smeets et al., 2006). The ICC is a ratio of the between-subject variance to the total variance, that is, the between-subject variance plus the within-subject and error variance (Brouwer et al., 2003; Bruton et al., 2000; Rankin & Stokes). It provides no indication of the magnitude of the disagreement between two scores (Brouwer et al.; Rankin & Stokes). The greater the performance variability between participants, the higher the ICCs will tend to be (Bruton et al.; Gouttebarge et al., 2006; Horneij et al., 2002; Rankin & Stokes; Smeets; Smeets et al.). Recent publications have suggested using alternative methods of analysis, such as Bland and Altman plots and limits of agreement (LOA), in conjunction with ICCs (Rankin & Stokes; Smeets; Smeets et al.). However, there is still no consensus on the most appropriate statistical approach for reliability analysis in rehabilitation research (Portney & Watkins, 2009; Rankin & Stokes). Moreover, calculating LOA on ratings made from a five-point scale (e.g., Physical Demand Performance Ratings, Physical Demand Frequency Ratings) would not be very useful.
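For continuous measures such as the loads handled, the limits of agreement mentioned above can be sketched as follows, assuming the usual Bland and Altman definition (mean difference plus or minus 1.96 standard deviations of the differences). LOA were not calculated in this study; the function name and the loads are hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(first, second):
    """Return (lower limit, mean difference, upper limit) for retest minus first test."""
    diffs = [b - a for a, b in zip(first, second)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias - spread, bias, bias + spread


# Hypothetical maximum safe loads in kg on the first FCE and the retest.
test_1 = [20.0, 25.0, 30.0, 22.5, 27.5]
test_2 = [22.5, 25.0, 32.5, 25.0, 30.0]
print(limits_of_agreement(test_1, test_2))
```

Unlike the ICC, the result is expressed in the units of the measure (here kilograms), so it indicates how large a change between occasions could be expected from measurement variation alone.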

Furthermore, upon inspection of the actual values, for example, for Stooping (Sustained Semi-squatting), it appears that changes in performance between tests in a small number of participants (e.g., some improved, some worsened) greatly affected the ICCs despite the majority of participants meeting the ceiling or criterion required by the GAPP FCE procedure. The low values may therefore be due to low variance, to the small number of participants who changed scores between tests, or to both.
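This effect can be illustrated with hypothetical ratings loosely modelled on the pattern described above: 46 participants at the ceiling on both occasions and two participants whose ratings changed by two levels in opposite directions. The numeric coding of the scale (0 for "no difficulty"), the function name, and the data are assumptions for illustration; they are not the study data.

```python
def icc_3_1(scores):
    """ICC(3,1) from a two-way analysis of variance (participants x occasions)."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    ss_subjects = k * sum((sum(r) / k - grand) ** 2 for r in scores)
    ss_occasions = n * sum(
        (sum(r[j] for r in scores) / n - grand) ** 2 for j in range(k)
    )
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ms_subjects = ss_subjects / (n - 1)
    ms_error = (ss_total - ss_subjects - ss_occasions) / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)


# Hypothetical ratings: 0 = "no difficulty" (ceiling), higher = more difficulty.
ratings = [[0, 0]] * 46 + [[0, 2], [2, 0]]

exact_agreement = sum(a == b for a, b in ratings) / len(ratings)
print(round(exact_agreement, 3), round(icc_3_1(ratings), 3))
```

In this constructed example almost all participants receive identical ratings on both occasions, yet the ICC falls close to zero, because nearly all of the variance in the data comes from the two participants who changed.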

Interestingly, the Physical Demand Frequency Ratings demonstrated higher reliability than the Physical Demand Performance Ratings. The Physical Demand Frequency Ratings are ratings about how frequently the person should perform the physical demand in the workplace and so are specifically defined according to the definitions of frequency from the DOT. The Physical Demand Performance Ratings, on the other hand, are ratings of difficulty in actual performance, so this may indicate that participants performed the physical demands more variably than how they are recommended to perform them in the workplace. It may be that the Physical Demand Frequency Ratings, which are based on the DOT definitions of frequency for performance in the workplace, are more clearly operationally defined than the Physical Demand Performance Ratings, which depend on the therapist's judgment of the person's performance on that day. These findings suggest that there may be problems associated with using a difficulty rating scale to rate performance in GAPP FCE items, and further consideration and investigation by the developer is warranted.

Due to the different procedures, criteria, scoring methods, and statistics used in this and other FCE studies, it is difficult to compare our findings with other FCE studies. However, to allow for comparison and for test-retest purposes, we analysed the actual loads handled during the manual handling items in the GAPP FCE. The results showed excellent test-retest reliability for two of the manual handling items (Lifting Floor to Waist and Carrying Bilateral) and good test-retest reliability for Lifting Waist to Waist. In the GAPP FCE, the level to which the therapist takes the person in terms of kilograms handled is not prescribed in the procedures and is at the discretion of the administering therapist, within broad safety guidelines. The reliability of the loads handled therefore gives some support to this discretionary approach. These findings are comparable with other studies that demonstrated good to excellent test-retest reliability for lifting and carrying tests using ICCs (Brouwer et al., 2003; Gouttebarge et al., 2006; Gouttebarge et al., 2005; Gross & Battie, 2002; Legge & Burgess-Limerick, 2007; Lygren et al., 2005; Reneman et al., 2004; Reneman et al., 2002; Smeets et al., 2006). As expected with any test requiring repeated strength measurements, a practice or carryover effect was evident (Bruton et al., 2000; Innes & Straker, 1999a; Portney & Watkins, 2009). Participants generally performed better on retest, and this has also been found in other FCE studies (Brouwer et al.; Legge & Burgess-Limerick; Lygren et al.; Reneman et al., 2002; Reneman et al., 2004).

In addition, the test-retest reliability of the Spinal Function Sort (SFS) Rating of Perceived Capacity scores (ICC = 0.82) demonstrates the stability of the measure for rating perceived capacity in the performance of work-related tasks. This confirms previous findings from Gibson and Strong's (1996) study, which found support for the test-retest reliability of the SFS (ICC = 0.89) with a sample of rehabilitation clients with chronic back pain.

The present study has met many of the guidelines proposed by Gouttebarge et al. (2004) for a high methodological quality study examining the test-retest reliability of the GAPP FCE. The sample size is greater than those used in other FCE studies (Brouwer et al., 2003; Gouttebarge et al., 2005; Gross & Battie, 2002; Lygren et al., 2005; Reneman et al., 2004; Tuckwell et al., 2002), with only two other studies having achieved sample sizes of about 50 participants (Reneman et al., 2002; Smeets et al., 2006). A new measure of reliability, called Krippendorff's α, was introduced for further verification of the findings. To the best of our knowledge, this is the first study in FCE research to date to use an alternative statistic for reliability analysis. Generally Krippendorff's α values were lower than ICC values, which is not surprising in that Krippendorff's α is a more conservative statistic and, like the ICC, it does not account for point differences in values (Axelrod & Hone, 2006; Lombard et al., 2002). Also, the model ICC (1,1) was used so that comparison of our reliability results with other studies was possible.
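As a rough illustration of the ICC (1,1) model mentioned above, the following Python sketch computes the one-way random effects, single-measure ICC from repeated scores using the usual one-way analysis of variance mean squares. The function name `icc_1_1` and the example loads are hypothetical and are not data from this study; the sketch is not the analysis code used by the authors.

```python
from statistics import mean


def icc_1_1(scores):
    """One-way random effects, single-measure ICC, i.e. ICC (1,1).

    scores: one list per participant, each holding k repeated
    measurements (k = 2 for a test-retest design).
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 participants with the same k >= 2 scores each")

    grand_mean = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]

    # Between-participants mean square (df = n - 1).
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-participants mean square (df = n * (k - 1)).
    msw = sum(
        (v - m) ** 2 for row, m in zip(scores, row_means) for v in row
    ) / (n * (k - 1))

    denominator = msb + (k - 1) * msw
    if denominator == 0:
        raise ValueError("all scores are identical; the ICC is undefined")
    return (msb - msw) / denominator


# Hypothetical loads in kg (test, retest) for three participants.
# Every retest value is 2 kg higher, mimicking a practice effect.
hypothetical_loads = [[10, 12], [15, 17], [20, 22]]
print(round(icc_1_1(hypothetical_loads), 3))
```

In this hypothetical example the ranking of participants is perfectly preserved, yet the coefficient is about 0.92 rather than 1, because the one-way model counts the systematic 2 kg shift between sessions as within-participant error. This is one way in which a practice effect can lower an ICC (1,1) value.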

Limitations

One of the limitations of the study was that participants were predominantly female, decreasing the generalisability of the findings. The results of this study are only applicable to female student cohorts. In a study conducted by Lygren et al. (2005) on the test-retest reliability of the Progressive Isoinertial Lifting Evaluation (PILE), ICC values for women were lower than ICC values for men. The authors reported that both the between-subject and within-subject variances were larger in men than in women, which resulted in a higher ICC value for men (Lygren et al.). In our study, it was not possible to analyse the data according to gender due to the limited number of men involved.
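The dependence of the ICC on both variance components can be made explicit with the population form of the one-way model. The variance values below are hypothetical, chosen only to illustrate the arithmetic, and are not taken from Lygren et al. (2005) or from this study.

```latex
\[
\mathrm{ICC}(1,1) = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}
\]
% sigma_b^2: between-subject variance; sigma_w^2: within-subject variance.
% Hypothetical group A: sigma_b^2 = 16, sigma_w^2 = 2
\[
\mathrm{ICC}_A = \frac{16}{16 + 2} \approx 0.89
\]
% Hypothetical group B: sigma_b^2 = 4, sigma_w^2 = 1
\[
\mathrm{ICC}_B = \frac{4}{4 + 1} = 0.80
\]
```

Under these assumed values, group A has the larger within-subject variance but still the higher ICC, because its between-subject variance is proportionally larger still. A more homogeneous sample can therefore yield a lower ICC even when its measurements are no less stable.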

Another limitation of the study was that no formal measure was used to detect possible changes in participants' health status between FCE sessions, such as the one used by Gouttebarge et al. (2006). However, in our study, no significant differences in Ratings of Perceived Capacity scores, a measure of perceived capacity, were found between testing occasions, providing some indication of the stability of the participants. Moreover, the recommended test-retest interval proposed by Gouttebarge et al. (2004) was used to minimise changes in participants' health status (Innes & Straker, 1999a; Portney & Watkins, 2009).

Also, it is acknowledged as a possible limitation of the study that the research was conducted by the developers of the GAPP FCE, which may have biased the results.

Conclusion

It is important that FCEs used in occupational rehabilitation practice have demonstrated test-retest reliability, particularly when evaluating return-to-work programs, as changes in FCE measurements over time need to be indicative of changes in a person's physical abilities and not due to instability of the measurement tool. The present study has investigated the test-retest reliability of ratings made on 12 recommended core items from the GAPP FCE and has found support for the overall rating of the Physical Level of Work made from the GAPP FCE in a sample of healthy young adults. Further examination of the GAPP FCE on an individual core item level is required. Future research should be conducted on heterogeneous samples, such as participants with stable musculoskeletal conditions, chronic low-back pain, and other samples with injury or disability.

The findings provide support for the potential use of the Physical Level of Work scale in future occupational rehabilitation and FCE practice and research. The alternative Physical Level of Work scale also shows promise, especially for use with people with injury, as it is a more conservative scale and aligns with the prevailing lifting guidelines. However, further investigation of the inter-rater reliability of the Physical Level of Work ratings is needed, with a larger sample of therapists and of participants with injury than has been used to date. Such examination is required as a foundation for examination of the validity of the GAPP FCE for practice. Of particular interest, and a challenge for future research, would be whether therapists' ratings of the recommended Physical Level of Work are predictive of the actual level of physical work on return to work. Such research would be a challenge given the many factors affecting return to work after injury that would need to be controlled; however, it would make a major contribution to FCE research and practice.

References

Axelrod, L., & Hone, K. S. (2006). Affectemes and all affects: A novel approach to coding user emotional expression during interactive experiences. Behaviour & Information Technology, 25, 159-173. doi: 10.1080/01449290500331164

Bonett, D. G. (2002). Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine, 21, 1331-1335. doi: 10.1002/sim.1108

Brassard, B., Durand, M. J., Loisel, P., & Lemaire, J. (2006). Étude de fidélité test-retest de l'Évaluation des Capacités Physiques reliées au Travail [Test-retest reliability of the Work-related Physical Capacity Evaluation]. Canadian Journal of Occupational Therapy, 73, 206-214.

Brouwer, S., Reneman, M. F., Dijkstra, P. U., Groothoff, J. W., Schellekens, J. M. H., & Goeken, L. N. H. (2003). Test-retest reliability of the Isernhagen work systems functional capacity evaluation in patients with chronic low back pain. Journal of Occupational Rehabilitation, 13, 207-218.

Bruton, A., Conway, J. H., & Holgate, S. T. (2000). Reliability: What is it, and how is it measured? Physiotherapy, 86, 94-99.

Canadian Society for Exercise Physiology. (2002). Physical Activity Readiness Questionnaire (Par-Q). Retrieved from http://www.csep.ca/main.cfm?cid=574

Deen, M., Gibson, L., & Strong, J. (2002). A survey of occupational therapy in Australian work practice. Work, 19, 219-230.

Gardener, L., & McKenna, K. (1999). Reliability of occupational therapists in determining safe, maximal lifting capacity. Australian Occupational Therapy Journal, 46, 110-119. doi: 10.1046/j.1440-1630.1999.00184.x

Geisser, M. E., Alschuler, K. N., Theisen-Goodvich, M. E., & Haig, A. J. (2008). A comparison of the relationship between depression, perceived disability, and physical performance in people with chronic back pain (Author's response to letter to the editor). European Journal of Pain, 13, 111-112. doi: 10.1016/j.ejpain.2007.11.003

Gibson, L., & Strong, J. (1996). The reliability and validity of a measure of perceived capacity for work in chronic back pain. Journal of Occupational Rehabilitation, 6, 159-175.

Gibson, L., & Strong, J. (2002). Expert review of an approach to functional capacity evaluation. Work, 19, 231-242.

Gibson, L., & Strong, J. (2003). A conceptual framework of functional capacity evaluation for occupational therapy in work rehabilitation. Australian Occupational Therapy Journal, 50, 64-71. doi: 10.1046/j.1440-1630.2003.00323.x

Gibson, L., & Strong, J. (2005). Safety issues in functional capacity evaluation: Findings from a trial of a new approach for evaluating clients with chronic back pain. Journal of Occupational Rehabilitation, 15, 237-251. doi: 10.1007/s10926-005-1222-z

Gibson, L., Strong, J., & Wallace, A. (2005). Functional capacity evaluation as a performance measure: Evidence for a new approach for clients with chronic back pain. Clinical Journal of Pain, 21, 207-215.

Gouttebarge, V., Wind, H., Kuijer, P., Sluiter, J. K., & Frings-Dresen, M. H. (2006). Reliability and agreement of 5 Ergo-kit functional capacity evaluation lifting tests in subjects with low back pain. Archives of Physical Medicine and Rehabilitation, 87, 1365-1370. doi: 10.1016/j.apmr.2006.05.028

Gouttebarge, V., Wind, H., Kuijer, P. P., & Frings-Dresen, M. H. (2004). Reliability and validity of Functional Capacity Evaluation methods: A systematic review with reference to Blankenship system, Ergos work simulator, Ergo-Kit and Isernhagen work system. International Archives of Occupational and Environmental Health, 77, 527-537. doi: 10.1007/s00420-004-0549-7

Gouttebarge, V., Wind, H., Kuijer, P. P., Sluiter, J. K., & Frings-Dresen, M. H. (2005). Intra- and interrater reliability of the Ergo-Kit functional capacity evaluation method in adults without musculoskeletal complaints. Archives of Physical Medicine and Rehabilitation, 86, 2354-2360. doi: 10.1016/j.apmr.2005.06.004

Gross, D. P., & Battie, M. C. (2002). Reliability of safe maximum lifting determinations of a functional capacity evaluation. Physical Therapy, 82, 364-371.

Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1, 77-89.

Hinderer, S. R., & Hinderer, K. A. (1993). Quantitative methods of evaluation. In J. D. Lisa (Ed.), Rehabilitation medicine: Principles and clinical practice (pp. 96-121). Philadelphia: Lippincott Company.

Horneij, E., Holmstrom, E., Hemborg, B., Isberg, P., & Ekdahl, C. (2002). Inter-rater reliability and between-days repeatability of eight physical performance tests. Advances in Physiotherapy, 4, 146-160. doi: 10.1080/14038190260501596

Innes, E. (2006). Reliability and validity of functional capacity evaluations: An update. International Journal of Disability Management Research, 1, 135-148.

Innes, E., & Straker, L. (1999a). Reliability of work-related assessments. Work, 13, 107-124.

Innes, E., & Straker, L. (1999b). Validity of work-related assessments. Work, 13, 125-152.

Jundt, J., & King, P. M. (1999). Work rehabilitation programs: A 1997 survey. Work, 12, 139-144.

Kersnovske, S., Gibson, L., & Strong, J. (2005). Item validity of the physical demands from the dictionary of occupational titles for functional capacity evaluation of clients with chronic back pain. Work, 24, 157-169.

King, P. M., Tuckwell, N., & Barrett, T. E. (1998). A critical review of functional capacity evaluations. Physical Therapy, 78, 852-866.

Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage Publications.

Legge, J., & Burgess-Limerick, R. (2007). Reliability of the JobFit system pre-employment functional assessment tool. Work, 28, 299-312.

Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2002). Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28, 587-604. doi: 10.1111/j.1468-2958.2002.tb00826.x

Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2005). Practical resources for assessing and reporting of intercoder reliability. Retrieved from www.temple.edu/mmc/reliability

Lygren, H., Dragesund, T., Joensen, J., Ask, T., & Moe-Nilssen, R. (2005). Test-retest reliability of the progressive isoinertial lifting evaluation (PILE). Spine, 30, 1070-1074. doi: 10.1097/01.brs.0000160850.51550.55

Matheson, L. N., & Matheson, M. (1989). Spinal function sort. Rancho Santa Margarita, CA: Performance Assessment and Capacity Testing.

Matheson, L. N., Matheson, M. L., & Grant, J. (1993). Development of a measure of perceived functional ability. Journal of Occupational Rehabilitation, 3, 15-30.

National Institute for Occupational Safety and Health. (1994). Applications manual for the revised NIOSH lifting equation. Retrieved from http://www.cdc.gov.niosh/94-110.htm

Oliveri, M., Jansen, T., Oesch, P., & Kool, J. (2005). The prognostic value of functional capacity evaluation in patients with chronic low back pain: Part 1: Timely return to work. And part 2: Sustained recovery (Letter to the editor). Spine, 30, 1232-1233. doi: 10.1097/01.brs.0000162273.35526.eb

Portney, L. G., & Watkins, M. P. (2009). Foundations of clinical research: Applications to practice (3rd ed.). Upper Saddle River, NJ: Pearson Education, Inc.

Rankin, G., & Stokes, M. (1998). Reliability of assessment tools in rehabilitation: An illustration of appropriate statistical analyses. Clinical Rehabilitation, 12, 187-199.

Reneman, M. F., Brouwer, S., Meinema, A., Dijkstra, P. U., Geertzen, J. H. B., & Groothoff, J. W. (2004). Test-retest reliability of the Isernhagen work systems functional capacity evaluation in healthy adults. Journal of Occupational Rehabilitation, 14, 295-305.

Reneman, M. F., Dijkstra, P. U., Westmaas, M., & Goeken, L. N. H. (2002). Test-retest reliability of lifting and carrying in a 2-day functional capacity evaluation. Journal of Occupational Rehabilitation, 12, 269-275.

Smeets, R. (2008). A comparison of the relationship between depression, perceived disability, and physical performance in persons with chronic back pain: A comment on Alschuler et al. (Letter to the editor). European Journal of Pain, 13, 109-110. doi: 10.1016/j.ejpain.2008.08.005

Smeets, R. J. E. M., Hijdra, H. J. M., Kester, A. D. M., Hitters, M. W. G. C., & Knottnerus, J. A. (2006). The usability of six physical performance tasks in a rehabilitation population with chronic low back pain. Clinical Rehabilitation, 20, 989-998. doi: 10.1177/0269215506070698

Soer, S., Gerrits, E. H. J., & Reneman, M. F. (2006). Test-retest reliability of a WRULD functional capacity evaluation in healthy adults. Work, 26, 273-280.

Tuckwell, N. L., Straker, L., & Barrett, T. E. (2002). Test-retest reliability on nine tasks of the physical work performance evaluation. Work, 19, 243-253.

United States Department of Labor. (1991). Dictionary of occupational titles (4th ed.). Washington, DC: U.S. Government Printing Office.

World Health Organization. (2001). International classification of functioning, disability and health (ICF). Geneva, Switzerland: Author.

Author affiliation:

Libby A. Gibson, PhD, is Lecturer, The University of Queensland, School of Health and Rehabilitation Sciences, Division of Occupational Therapy, Brisbane, Qld, 4072, Australia. Phone: (+ 61) 7-3365-3004. Fax: (+ 61) 7-3365-1622. E-mail: libby@uq.edu.au.

Monica Dang, BOccThy(Hons), is Research Assistant, The University of Queensland, School of Health and Rehabilitation Sciences, Division of Occupational Therapy, Brisbane, Qld, 4072, Australia.

Jenny Strong, PhD, is Professor, The University of Queensland, School of Health and Rehabilitation Sciences, Division of Occupational Therapy, Brisbane, Qld, 4072, Australia.

Asad Khan, PhD, is Senior Lecturer, The University of Queensland, School of Health and Rehabilitation Sciences, Brisbane, Qld, 4072, Australia.
