Results from an exploratory study to test the performance of EQ-5D-3L valuation subsets based on orthogonal designs, and an investigation into some modeling and transformation alternatives for the utility function
© Bailey et al.; licensee Springer. 2014
Received: 21 July 2014
Accepted: 27 October 2014
Published: 8 November 2014
EQ-5D-3L valuation studies continue to employ the MVH protocol or variants of MVH. One issue that has received attention is the selection of the states for direct valuation by respondents. Changes in the valuation subset have been found to change the coefficients of the utility function. The purpose of this study was to test the performance of valuation subsets based on orthogonal experiment designs. The design of the study also allowed a comparison of models based on raw or untransformed VAS values with values transformed at the level of the respondent and at the aggregate level.
Two different valuation subsets were developed based on orthogonal arrays. A VAS elicitation was undertaken with two groups of similar respondents, and the utility functions derived from the two valuation subsets were compared using mean absolute errors (MAEs) between model and observed values, and by correlation with values in and out of sample. The impact of using untransformed VAS values versus values transformed at the level of the individual and at the aggregate level, and the effect of including a constant term in the utility functions, were also investigated.
The utility functions obtained from the two valuation subsets were very similar. The models that included a constant and were based on raw VAS values from the two valuation studies returned rank correlation coefficients of 0.994 and 0.995 when compared with the respective observed values. MAEs of model values versus observed values were 2.4% or lower for all models that included a constant term. Several models were developed and evaluated for the combined data (from both valuation subsets). The model that included the N3 term performed best.
The finding that two very different valuation subsets can produce strikingly similar utility functions suggests that orthogonal designs should be given some attention in further studies. Rescaling VAS values at the level of the individual versus at the aggregate level had minimal impact on the performance of the models when compared to models based on the raw VAS values.
Health states defined by EQ-5D and other health status classification systems such as HUI and SF-6D are typically represented by a summary index score computed once the values of the different dimensions, and of the levels within dimensions, have been established. Studies that generate such value sets for these instruments often adopt a similar approach in order to overcome the respondent burden involved in assessing large numbers of health states. In the case of EQ-5D-3L, a total of 243 health states are defined by the descriptive classification: there are 5 dimensions (mobility, self care, usual activities, pain/discomfort and anxiety/depression), each with 3 possible response levels (3^5 = 243). The response level for each dimension is used to create a numeric code that acts as a nominal descriptor for each state. The logically best health state is coded as 11111 (no problem on any of the 5 dimensions); the logically worst health state is coded as 33333 (an extreme problem on all 5 dimensions). It is usually the case that a smaller number of selected health states are presented for direct evaluation in any valuation study. These directly observed values are then used to construct a statistical model from which to estimate the value decrements associated with each dimension/level. These derived values are then applied to compute index scores for the full set of health states defined by the classification system.
Valuation studies have taken several approaches when selecting a subset of EQ-5D states for direct assessment. The first large EQ-5D valuation study was the Measurement and Valuation of Health (MVH) study, carried out in the United Kingdom in 1993.
The MVH protocol used a subset of 43 of the 243 EQ-5D states plus unconscious and immediate death (a total of 45 states). Valuation of all states in this subset was considered too great a respondent burden, so a block design was used in which each respondent evaluated a total of 15 states. The reduction of respondent burden in this way necessitated an increase in the size of the study sample. Subsequent interest in identifying efficient subsets for EQ-5D valuation studies has yielded a number of alternative designs.
The valuation subset used in the MVH study comprised 43 states that were selected to cover a wide range of severity, to maintain consistency with an earlier study that had been conducted in Finland, and to include only states that would be considered by the researchers to be ‘plausible’ to the average respondent. As an example of the ‘plausibility’ criterion, any states that combined level 3 on Mobility (confined to bed) with level 1 on ability to perform Usual Activities or Self Care (that is, all states of the form 3X1XX or 31XXX) were excluded.
One study that specifically set out to investigate the performance of valuation sets for EQ-5D evaluated several subsets of the states used in the MVH study by testing the performance of models in terms of correlation between observed and predicted values and Mean Absolute Error (MAE). This study used a backward sequential elimination algorithm that removed, at each step, the state with the smallest effect on the regression models. A final subset of 17 states, referred to as the Macran and Kind set, performed best on the correlation criteria. These analyses were conducted on the MVH data.
In another study that was based on MVH data, Lamers et al. used simulated sampling strategies to model the performance of various subsets of the MVH valuation set. The resulting models were compared in terms of correlation and MAE with observed values. This approach was not able to identify a valuation subset that outperformed the Macran and Kind set.
Zarate and Kind attempted to identify smaller valuation subsets in other countries that had conducted EQ-5D-3L valuation studies using the MVH valuation set. This approach was taken for data from the USA, Chile and the UK. In all three cases, the minimum number of states in the valuation subset that could be kept while avoiding a ‘large’ increase in MAE versus observed values was 17. Removal of further states from the valuation set resulted in the MAE of the model values versus observed values moving from 0.05 to over 0.1 (on a 0–1 scale) in all three cases. The problem was that the identity of the 17 states differed between the three countries in the study. This suggests that a single small (e.g. 17-state) valuation subset that can be applied to all countries may not exist. However, a common set of 31 states was found which may perform reasonably well when applied to the data for the three countries in the study. These studies suggest that the states comprising the valuation subset affect the model that is obtained.
A large valuation subset comprising 101 states was used in South Korea. Using large valuation subsets may improve precision, since this leaves fewer states that must be valued based on modeling. However, using large valuation subsets also increases the number of respondents required in valuation studies, since blocking methods must be used to break the subset into smaller components for valuation by individual respondents.
1. Plausibility: by examining large empirical data sets to find states that are observed in the population for which the value set is being developed.
2. Relevance: the states selected for direct valuation should be those most frequently reported by the population for which the value set is being developed.
3. Coverage over severity range: this is related to the ‘code score’ of an EQ-5D state, which is obtained by adding the value of the level of each dimension in the state. Thus state 11111 has a code score of 5 × 1 = 5, and state 12223 would have a code score of 1 + (3 × 2) + 3 = 10. The state that lies furthest from 11111 is 33333, which has a code score of 5 × 3 = 15. This measure gives a general indication of severity, so a valuation set based on this approach would include states covering all possible levels of code score from 5 to 15.
4. Simple severity increments: valuation subsets should comprise states that represent single ‘adjacent steps’ (i.e. states having a difference in code score of 1) in progressing from 11111 to 33333. It is argued that this would allow direct measurement of the lowest level of differentiation that can be obtained from the EQ-5D-3L system.
This approach produces a set of 55 states in 5 blocks, so that each respondent values a subset of 11 states. The study developed a valuation subset based on MVH data, but there is no application or empirical data regarding the performance of this valuation set. The approach would require the self-reported EQ-5D states of thousands of citizens of the country for which a value set is being developed in order to identify the states that meet criteria 1 and 2. Most developed countries now have EQ-5D value sets. Moving forward, it is expected that ‘new’ countries for which EQ-5D-3L value sets are to be developed will be middle income or developing countries for which self-reported health data for such large numbers of citizens will not be available. This approach also requires 5 respondents per replicate.
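The ‘code score’ used in the severity-coverage criterion above is simple arithmetic on the state label. A minimal sketch follows; the function names are illustrative and do not come from any of the cited studies.

```python
def code_score(state: str) -> int:
    """Sum of the five dimension levels of an EQ-5D-3L state label such as '12223'."""
    if len(state) != 5 or any(ch not in "123" for ch in state):
        raise ValueError(f"not a valid EQ-5D-3L state: {state!r}")
    return sum(int(ch) for ch in state)


def covered_scores(states):
    """Sorted list of the distinct code scores present in a candidate valuation subset."""
    return sorted({code_score(s) for s in states})


# The worked examples from the text.
print(code_score("11111"))  # 5
print(code_score("12223"))  # 10
print(code_score("33333"))  # 15
```

A subset meeting the coverage criterion would return every integer from 5 to 15 from `covered_scores`.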
The purpose of our study was to test the performance of valuation subsets based on orthogonal experiment designs. An orthogonal design is one in which the columns of the independent variables are orthogonal to each other. For the design of an EQ-5D valuation subset, this would mean that in each replicate, each level of every dimension would appear an equal number of times. Historically, orthogonal experiment designs have been used extensively in many fields. For this study, a Visual Analogue Scale (VAS) was used to capture the observed values.
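The balance property described above can be checked mechanically. The sketch below is an illustration only: it builds a hypothetical 27-row, five-column, three-level orthogonal array from linear combinations over GF(3), which is not the 18-row design used in this study, and then verifies level balance and pairwise orthogonality. The generator vectors and function names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations, product

# Hypothetical generator vectors over GF(3). Any two are linearly independent,
# which gives a strength-2 orthogonal array with 27 rows (not the study's design).
GENERATORS = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 1, 2)]


def build_design():
    """Return 27 EQ-5D-3L state labels forming an orthogonal array."""
    rows = []
    for x in product(range(3), repeat=3):
        levels = [sum(g * xi for g, xi in zip(gen, x)) % 3 + 1 for gen in GENERATORS]
        rows.append("".join(str(level) for level in levels))
    return rows


def is_balanced(states):
    """Each of the 3 levels appears equally often in every dimension."""
    for col in range(5):
        counts = Counter(s[col] for s in states)
        if len(counts) != 3 or len(set(counts.values())) != 1:
            return False
    return True


def is_orthogonal(states):
    """Every pair of levels appears equally often in every pair of dimensions."""
    for i, j in combinations(range(5), 2):
        counts = Counter((s[i], s[j]) for s in states)
        if len(counts) != 9 or len(set(counts.values())) != 1:
            return False
    return True


design = build_design()
print(len(design), is_balanced(design), is_orthogonal(design))
```

A set such as 11111, 22222, 33333 is balanced in each dimension but fails the pairwise check, which is why both conditions are tested separately.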
The design of this study also allowed the opportunity to test two further issues involved in modeling EQ-5D valuation data. These are (1) the question of whether to transform the VAS values (on to a 0 to 1 scale) at the level of the individual or at the aggregate level, and (2) the effect of the inclusion of a constant term and additional dummy variables in the regression model.
The same form of transformation could be applied to aggregated observed data taking the mean value, for example, as the measure of central tendency. The advantage of this approach is that it effectively dampens the effect of variability within an individual respondent’s data thereby introducing a degree of smoothing and potentially giving rise to a simplified estimation model. On the other hand, this approach could be criticized for losing some of the individual data. To compare these two approaches, regression analyses were run on VAS values transformed at the level of the individual and at aggregate level. The regression analyses were also run on the raw or untransformed VAS values to allow for comparison with the models based on transformed VAS values.
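The difference between the two levels of transformation can be made concrete with a small sketch. It assumes a linear rescaling between two anchor states so that the upper anchor maps to 1 and the lower anchor to 0; the choice of 11111 and 33333 as anchors, the toy ratings and the function names are assumptions for illustration and are not taken from the study.

```python
from statistics import mean


def rescale(value, low, high):
    """Linear map sending `low` to 0 and `high` to 1."""
    return (value - low) / (high - low)


def rescale_individual(ratings, low_state, high_state):
    """Rescale within each respondent first, then average across respondents."""
    rescaled = []
    for r in ratings:
        lo, hi = r[low_state], r[high_state]
        if hi == lo:
            continue  # this respondent's anchors coincide, so no rescaling is possible
        rescaled.append({s: rescale(v, lo, hi) for s, v in r.items()})
    return {s: mean(r[s] for r in rescaled) for s in ratings[0]}


def rescale_aggregate(ratings, low_state, high_state):
    """Average the raw values across respondents first, then rescale the means."""
    means = {s: mean(r[s] for r in ratings) for s in ratings[0]}
    lo, hi = means[low_state], means[high_state]
    return {s: rescale(m, lo, hi) for s, m in means.items()}


# Toy raw VAS ratings (0-100) from two hypothetical respondents.
ratings = [
    {"11111": 100, "21232": 60, "33333": 0},
    {"11111": 90, "21232": 50, "33333": 10},
]
individual = rescale_individual(ratings, "33333", "11111")
aggregate = rescale_aggregate(ratings, "33333", "11111")
print(individual["21232"], aggregate["21232"])
```

In this toy case the two approaches give slightly different values for state 21232 (0.55 versus about 0.556), because averaging and rescaling do not commute when respondents use the scale differently.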
A further consideration in developing estimation models was the use of a constant term. Many EQ-5D valuation studies include a constant term, which is interpreted behaviorally as representing the value decrement accounted for by any departure from full health. However, the impact of including such a constant term has not been the subject of any systematic investigation. Its inclusion seems to be a consequence of adopting previous custom and practice rather than being a deliberate choice. The use of a constant term may mask imperfections in the specification of the model and/or the volume of information under investigation, and it may simply act as a proxy for unobserved variance not otherwise specified. Alternate models were developed in this study in which the regression lines were forced through the origin. These models were considered and evaluated as counterfactuals to the models in which the constant term was permitted.
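The contrast between a model with a constant and one forced through the origin can be sketched with a small main-effects regression. Everything below is illustrative: the data are synthetic and noise-free, the ‘true’ constant and decrements are invented, the dummy coding (level 2 and level 3 indicators per dimension) is a common convention assumed here, and the solver is a plain normal-equations routine rather than the Stata procedure used in the study.

```python
from itertools import product


def main_effect_dummies(state):
    """Ten indicators: level 2 and level 3 on each of the five dimensions."""
    row = []
    for level in state:
        row += [float(level == "2"), float(level == "3")]
    return row


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(x_rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(x_rows, y)) for i in range(k)]
    return solve(xtx, xty)


def mean_abs_error(x_rows, y, coefs):
    fitted = [sum(c * x for c, x in zip(coefs, r)) for r in x_rows]
    return sum(abs(f - v) for f, v in zip(fitted, y)) / len(y)


# Synthetic mean VAS values: an invented constant of 8 points for any departure
# from full health plus invented additive decrements per dimension and level.
TRUE_CONSTANT = 8.0
TRUE_DECREMENTS = [4.0, 12.0, 5.0, 14.0, 3.0, 9.0, 6.0, 18.0, 5.0, 15.0]
states = ["".join(s) for s in product("123", repeat=5) if "".join(s) != "11111"]
dummies = [main_effect_dummies(s) for s in states]
disutility = [TRUE_CONSTANT + sum(d * x for d, x in zip(TRUE_DECREMENTS, row))
              for row in dummies]
vas = [100.0 - d for d in disutility]

# With a constant: dependent variable is the raw VAS value.
x_const = [[1.0] + row for row in dummies]
coefs_const = ols(x_const, vas)
mae_const = mean_abs_error(x_const, vas, coefs_const)

# Through the origin: dependent variable is 100 - raw VAS, no constant column.
coefs_origin = ols(dummies, disutility)
mae_origin = mean_abs_error(dummies, disutility, coefs_origin)
print(round(coefs_const[0], 6), mae_const < 1e-6, mae_origin > mae_const)
```

Because the synthetic data contain a constant by construction, the through-origin model cannot fit them exactly and has to spread the constant across the dimension coefficients. This shows the mechanism only and says nothing about which specification suits real valuation data.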
A sample of 230 university students took part in a valuation exercise conducted at the St. Augustine campus of the University of the West Indies. All of the elicitations were conducted in a 1:1 office setting and respondents received TT$50.00 (equivalent to US$8) at the start of the interview. Respondents were randomly assigned to the Green and Blue valuation sets.
Each card had a two letter code printed in the bottom right corner so that the interviewer could record the rank data and VAS valuations. It was explained to the respondents that these codes were generated randomly and had no significance.
The cards were ranked from best to worst along the edge of a desk with respondents first being handed two randomly selected cards and instructed to place one card on the desk and decide whether the second was better or worse than the first, placing it above or below accordingly. A third card was then introduced and the respondent was asked to decide whether this should go above, below or in between the other two. This process was repeated until all 23 cards for that respondent were ranked. Tied ranks were permitted.
Once this ranking task was complete the interviewer noted the order of health states and then placed a 1-metre version of the VAS alongside the ranked cards. Respondents adjusted the location of each card so that the rhomboid edge pointed to the VAS rating corresponding to their assessment of the value of each state on the 0 – 100 scale. This allowed the respondent to see all of the cards on the VAS at the same time and to adjust their positions and values. Respondents were reminded that ties were permitted and that they had the freedom to change the order of states if they so chose. Interviewers had been instructed that if a respondent raised the issue of an implausible state, they were to respond with a statement explaining that some people do find that some of the states are difficult to imagine and to encourage respondents to carry out the valuation (or ranking task) for the state to the best of their ability. Once the VAS task was finished the interviewer recorded the rating scores for all health states.
Although panel regression methods would generally be appropriate, the analysis of pooled data from an orthogonal experiment design using ordinary least squares (OLS) regression produces identical coefficients to fixed and random effects models. Given that the valuation sets used in this study were based on orthogonal arrays, the models were produced using OLS. All regression analyses were carried out using Stata Statistical Software 12.0.
In the absence of having access to a ‘true’ underlying utility function, the models obtained in EQ-5D valuation studies are usually evaluated on such criteria as internal validity, Mean Absolute Error (MAE) versus observed values, R-Squared etc. In addition to these criteria, this study design allowed for a comparison between the utility functions based on valuations by two groups of similar respondents, using two completely different valuation subsets (with no states in common) that were both developed from orthogonal arrays.
Model 1: with the dependent variable as 100 – raw VAS value with no constant term in the model.
Model 2: with the dependent variable as the raw VAS value with a constant term in the model.
Model 3: with the dependent variable as 1 – VAS value rescaled at the level of the individual respondent with no constant term in the model.
Model 4: with the dependent variable as the VAS value rescaled at the level of the individual respondent with a constant term in the model.
Model 5: with the dependent variable as 1 – VAS value rescaled at aggregate level using mean values with no constant term in the model.
Model 6: with the dependent variable as the VAS value rescaled at aggregate level using mean values with a constant term in the model.
Testing models 3 through 6 allowed a comparison of the performance of models with and without the constant term, as well as with VAS values transformed at the level of the respondent and at aggregate level. Including models 1 and 2 allowed a comparison of these models with equivalent analyses based on untransformed VAS values. Each model was evaluated using the following criteria: adjusted R-squared, within-sample correlation, correlation with out-of-sample values, MAE of estimated versus observed values, and the percentage of model versus observed residuals that were above 5% (i.e. residuals above 5 VAS points for the raw VAS models and residuals above 0.05 for the rescaled models).
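Several of the evaluation criteria listed above reduce to short calculations on paired model and observed values. The following is a minimal sketch with illustrative function names and toy numbers. Rank correlation is computed here as a Pearson correlation on average ranks, the usual way of handling ties, which is an assumption about implementation rather than a description of the study's Stata commands.

```python
def mean_absolute_error(model, observed):
    return sum(abs(m - o) for m, o in zip(model, observed)) / len(observed)


def percent_large_residuals(model, observed, threshold):
    """Percentage of absolute residuals above the threshold (5 for raw VAS, 0.05 for rescaled)."""
    large = sum(1 for m, o in zip(model, observed) if abs(m - o) > threshold)
    return 100.0 * large / len(observed)


def average_ranks(values):
    """1-based ranks in ascending order, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Toy rescaled values for four states.
model = [0.90, 0.60, 0.30, 0.10]
observed = [0.88, 0.68, 0.31, 0.02]
print(mean_absolute_error(model, observed))
print(percent_large_residuals(model, observed, 0.05))
print(spearman(model, observed))
```

The same functions apply unchanged to out-of-sample comparisons by passing one set's model values for the states observed in the other set.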
In the second stage of the analysis, the best performing model using rescaled data based on these criteria was used to develop a model for the pooled data (combining the Blue and Green observed values). The performance of this model was compared with variants of this baseline model that included dummy variables which had previously been specified in other valuation studies. These dummy variables indicated the presence of any 1’s, 2’s and 3’s in a state (N1, N2 and N3 respectively). In addition to these, regressions were also run with variables giving the numbers of 1’s, 2’s and 3’s in a state (C1, C2 and C3 respectively) and the squares of these counts (C1Sq, C2Sq, C3Sq). These regressions were run to test whether these additional variables would improve performance over the baseline model.
These models were compared based on adjusted R-squared, correlation of model values with observed values, and MAE of model values versus the values observed for the sample and holdout states.
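The additional regressors described above are simple functions of the state label. A minimal sketch follows, using the variable names from the text; the function name and the dictionary layout are illustrative choices.

```python
def extra_terms(state: str) -> dict:
    """N, C and CSq terms for an EQ-5D-3L state label such as '12223'."""
    terms = {}
    for level in (1, 2, 3):
        count = state.count(str(level))
        terms[f"N{level}"] = int(count > 0)   # presence of any dimension at this level
        terms[f"C{level}"] = count            # number of dimensions at this level
        terms[f"C{level}Sq"] = count * count  # square of that count
    return terms


print(extra_terms("12223"))
```

For state 12223 this gives N1 = N2 = N3 = 1, C1 = 1, C2 = 3, C3 = 1 and C2Sq = 9.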
Table 1 Problem rates reported by the respondents (column: % of respondents)
Table 2 Mean observed and rescaled VAS values (columns: VAS rescaled individual; VAS rescaled aggregate)
Table 3 Results of the first stage analysis for the blue and green sets (rows, for each of the Green and Blue models: Model 4 versus Model 6; % MAE versus own-set observed values; % residuals >5%, i.e. 5.0 for raw and 0.05 for rescaled; % MAE versus other-set observed values; % MAE versus own-set holdouts; correlation within sample; correlation out of sample)
Table 4 Comparison of the 5 models based on the combined set (rows: adjusted R-squared; correlation of model versus observed values; % MAE model versus observed; % MAE model versus holdouts; % residuals >0.05)
The challenges associated with the orthogonal design (inclusion of implausible states and the concentration of states in the moderate range) would have contributed to the relatively low Spearman’s rank correlation coefficients between the results of the ranking task and the ranks of the VAS scores: 0.8880 and 0.8884 for the orthogonal valuation subsets (versus 0.94-0.96 for most of the studies that use the MVH protocol). Only 4 respondents in our study (1.7%) preserved the rank order of the states in moving from the ranking task to the VAS task (versus 19% in the MVH study). If it is accepted that the ranking task produces the ordinal preferences of the respondent, then the transfer of cards to the VAS allows the respondent an opportunity to correct any mistakes made during the initial ranking task. Such errors would be most likely among states that the respondent perceives to be very similar in terms of preference level.
This is an exploratory study. It was not designed to produce a value set that can be used in resource allocation decision making, but to test the performance of orthogonal valuation subsets and to investigate the impact of modeling and transformation strategies on the utility function. Thus, the respondents used were students because this allowed the convenient creation of two similar respondent groups. Their demographic characteristics and problem rates in Table 1 would not reflect the general population of Trinidad and Tobago. The sample size was also small relative to the sample sizes of published VAS valuation studies. Despite the small size of the sample, the models in Tables 3 and 4 were all internally valid. Further research could be undertaken using similar studies with larger respondent samples and smaller orthogonal valuation set designs.
This study also demonstrates the performance of VAS as a valuation method for EQ-5D studies, and adds to the literature in support of the VAS as an elicitation instrument. Over the last 5–10 years the use of VAS as a means of eliciting health state valuations in EQ-5D studies has declined, due partly to a preference for other methods such as Time Trade Off (TTO) and Discrete Choice Experiments (DCE), but also reflecting criticism of some aspects of VAS methods. One criticism of the VAS is that it is not ‘choice-based’. This criticism has led many researchers away from the method towards choice-based approaches such as TTO and DCE. By beginning the VAS valuation with a ranking exercise in which respondents are given the cards one at a time and asked to place each new card in a position based on its level of disutility relative to the other cards in the series, this protocol brings ‘choice’ directly into the valuation process. In a cognitive debriefing study of this VAS protocol, respondents described the decision making process in the ranking and ranking-to-VAS stages using terms that were virtually identical to their descriptions of their approaches in performing paired comparisons for a DCE. These and other theoretical issues concerning the VAS have been partially dealt with, but there is still resistance to accepting VAS-based valuations in economic evaluation, as can be seen in the technical guidance published by national regulatory agencies. Nonetheless, VAS methods are widely used to record consumer preferences in a variety of non-health settings, and the VAS remains a legitimate method for obtaining the value of self-reported health status, notably as part of the EQ-5D instrument.
The studies by Lamers, and by Zarate and Kind, suggest that the states that are included in the valuation set have an influence on the model that is obtained in the analysis. In this study, 230 similar subjects (students) were divided into two groups, and each group gave VAS valuations of a different set of EQ-5D states (with no common states between the sets). When the two data sets were analyzed using the same regression methods, they produced strikingly similar models that performed creditably. This is despite the disadvantages that the orthogonal valuation sets would present (the inclusion of implausible states, and the concentration of states in the moderate range). These encouraging results suggest that further research should be undertaken into using orthogonal array based approaches to developing valuation sets for EQ-5D valuation studies.
This study employed orthogonal arrays with 18 rows (producing valuation sets of 18 states). Further research should be undertaken to test smaller orthogonal designs that can be used to produce main effects models. This would allow for smaller samples, thus reducing the cost of conducting valuation studies in developing countries. Small orthogonal designs may also permit valuation subsets for TTO studies that do not require blocking, such that each respondent can provide one replicate.
This study found small differences in performance of the models based on data transformed at the level of the individual and at the aggregate level. Differences in performance between the models based on raw VAS data and the models based on transformed data were also very small. The inclusion of the constant term improved the performance of all of the models.
- Health Utilities Index [homepage on the internet]. Hamilton ON, Canada: HU Inc; c1998-2014 [updated 2014 March 13; cited 2014 March 30]. Available from: http://www.healthutilities.com/
- Brazier J, Roberts J, Deverill M: The estimation of a preference-based measure of health from the SF-36. J Health Econ 2002, 21: 271–292. 10.1016/S0167-6296(01)00130-8
- Group MVH: The Measurement and Valuation of Health: First report on the main survey. 1994.
- Dolan P, Kind P, Williams A: The time trade-off method: results from a general population study. Health Econ 1996, 15: 209–231. 10.1016/0167-6296(95)00038-0
- Macran S, Kind P: Valuing EQ-5D health states using a modified MVH protocol: preliminary results. In Proceedings of the 16th Plenary Meeting of the EuroQol Group. Edited by Badia X, Herdman M, Roset M. Barcelona, Spain: EuroQol Research Group 2000; 1999.
- Lamers L, McDonnell J, Stalmeier P, Krabbe P, Bussbach J: The Dutch tariff: results and arguments for an effective design for national EQ-5D valuation studies. Health Econ 2006, 15: 1121–1132. 10.1002/hec.1124
- Zarate V, Kind P: Efficient Survey Design for EQ-5D Valuation Studies: Revising the 17 Macran-Kind Set. 29th Plenary Meeting of the EuroQol Group, Rotterdam, Holland; 2012.
- Chevalier J, Pourvoville G: Valuing EQ-5D using time trade-off in France. Eur J Health Econ 2013, 14(1):57–66. 10.1007/s10198-011-0351-x
- Scalone L, Cortesi P, Ciampichini R, Belisari A, D'Angilella L, Cesana G, Mantovani L: Italian population-based values of EQ-5D health states. Value Health 2013, 16: 814–822. 10.1016/j.jval.2013.04.008
- Lee Y, Nam H, Chuang L, Kim K, Yang H, Kwon I, Kind P, Kweon S, Kim Y: South Korean time trade off values for EQ-5D health states: modeling with observed values for 101 health states. Value Health 2009, 12: 1187–1193. 10.1111/j.1524-4733.2009.00579.x
- Bagust A: Improving valuation sampling of EQ-5D health states. Health Qual Life Outcomes 2013, 11: 14. 10.1186/1477-7525-11-14
- Devlin N, Krabbe P: The development of new research methods for the valuation of EQ-5D-5L. Eur J Health Econ 2013, 14: 1–3. 10.1007/s10198-013-0502-3
- Montgomery D: Design and Analysis of Experiments. Wiley, New York; 2012.
- EQ-5D value sets: inventory, comparative review and user guide. Springer, Dordrecht; 2007.
- N. Sloane’s library of orthogonal arrays. http://neilsloane.com/oadir/index.html
- Thurstone LL: A law of comparative judgement. Psych Rev 1927, 34: 273–286. 10.1037/h0070288
- Oaxaca R, Dickinson D: The equivalence of panel data estimators under orthogonal experimental design. 2005.
- Parkin D, Devlin N: Is there a case for using visual analogue scale valuations in cost utility analysis? Health Econ 2006, 15: 653–664. 10.1002/hec.1086
- Chuang L, Kind P: Ordinal or cardinal? the VAS strikes back. Value Health 2007, 10: A454-A455. 10.1016/S1098-3015(10)65568-4
- Brazier J, Ratcliffe J, Tsuchiya A, Salomon J: Measuring and Valuing Health Benefits for Economic Evaluation. Oxford University Press, New York; 2007.
- Bailey H, Kind P, Lascelles K: What are we asking? What are they thinking? Preliminary results from a cognitive debriefing study of EQ-5D elicitation exercises. In Proceedings of the 28th Plenary meeting of the EuroQol Group. Edited by: Yfantopoulos J. EuroQol Group 2011, Athens, Greece; 2010.
- Torrance G, Feeny D, Furlong W: Visual analogue scales: do they have a role in the measurement of preferences for health states? Med Decis Making 2001, 21: 329–334. 10.1177/02729890122062622
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.