I work with questionnaire data and am fairly sensitive to good practice in questionnaire construction. Usually I generate my own, but when I draw on existing literature to use reliably reproduced measurement scales, sometimes the items are poorly worded or vague enough that I would not consider them appropriate or valid. Consider the item:
"Weeds should be eradicated because they inhibit the full development of useful and ornamental plants."
A participant can say that they disagree with this item. But what does that mean: do they think weeds shouldn't be eradicated? Or that they should be eradicated, but not for the reason given? Maybe they think that weeds do inhibit the full development of useful plants, but that this is not sufficient cause for eradicating them. The question is vague enough that it doesn't convey any useful information about the participant's broader view on weeds except in the one fringe case where they agree with it as stated. Clearly it should be split into two items, and the scale would already be much improved by that alone.
There are other examples that are much more shockingly bad, yet have been reproduced enough times that they form reliable scales. Is it better to use them as-is, with all their flaws; to use them as faithfully as possible while fixing any glaring errors; or not to use them at all?
If you change the items in a questionnaire, you have a different questionnaire. All the psychometric properties, such as reliability and validity, and the normative data no longer apply. You need to be able to compare your measurement to a normative sample for it to be meaningful. If you change a questionnaire, you have to test it against your construct, calculate its reliability, etc., and norm it on a large representative sample. Otherwise any conclusions you draw from your experiments will be at best tentative and potentially false.
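Even a basic reliability check must be redone from scratch on the modified scale. Here's a minimal sketch in Python; the `responses` array is a hypothetical stand-in for pilot data on the new item set:

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of ratings."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1).sum()
    total_variance = responses.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical usage: ratings from a pilot of the *modified* scale.
# The original scale's published alpha no longer applies to this item set.
rng = np.random.default_rng(0)
responses = rng.integers(1, 8, size=(200, 10))  # 200 respondents, 10 items, 1-7 ratings
print(cronbach_alpha(responses))
```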
The meaning of a questionnaire item is not the literal meaning you make of the sentence, but the relation that the behavior of answering this item has with other behavior that you know represents your construct. For example, the item "Weeds should be eradicated because they inhibit the full development of useful and ornamental plants" is from a questionnaire on environmental attitudes (Milfont & Duckitt, 2004) and measures "necessity of development" (NoD). If a significant number of your sample agree with the weeds item, and they also rate high in NoD on a valid NoD task, then this item can be used to measure NoD. If you change the words, this correlation might no longer hold.
- Milfont, T. L., & Duckitt, J. (2004). The structure of environmental attitudes: A first- and second-order confirmatory factor analysis. Journal of Environmental Psychology, 24, 289-303. Retrieved from http://gip.uniovi.es/gdiyad/docume/arti_jep01.pdf.
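To make the logic of that criterion concrete, here's a minimal sketch in Python; the data are simulated stand-ins for real item ratings and scores from an independently validated NoD measure:

```python
import numpy as np

rng = np.random.default_rng(0)
nod = rng.normal(size=200)               # hypothetical validated NoD scores
item = 0.4 * nod + rng.normal(size=200)  # hypothetical agreement with the weeds item

# The item "measures" NoD only to the extent that this correlation holds;
# rewording the item can change it.
print(np.corrcoef(item, nod)[0, 1])
```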
Here's a slightly more compromising (yet arguably more demanding) answer than @what's, which is probably a good representation of how the idea of changing items will generally be received by reviewers, supervisors, or substantive theorists. From the perspective of a methodology nerd (not that I mean to deny @what that dubious distinction too, BTW), there are some reasons to consider reworking this questionnaire, and certainly very many other measures in common usage. It's true that this changes the latent construct and threatens its comparability to that estimated in previous research, but changes may be justified, and comparability may be worth sacrificing, or even entirely undesirable, if the measurement model in previous research was just that badly invalid.
It's also true that there is a lot of sophisticated, challenging, expensive work to do in the construct validation process, but partly for that reason, much of this work never gets done. For both theoretical and practical reasons, as much of this work as is feasible should be completed, reported, and kept up to date, and it's not necessarily any less important to do when one isn't motivated to change the measure. Building on a long line of previous research by using old methods mostly protects comparability of results within that line of research; it doesn't protect the line of research as a whole from its systemic flaws. Enough long lines of heralded theory have fallen over time to discourage complacency based on unquestioning deference to seemingly established methods.
Scrutiny of Milfont & Duckitt (2004)
The question in question is questionable.
To demonstrate the methodological vulnerabilities of even relatively modern research, I'll nitpick your specific example. First, as you've noted, the focal question itself is double- or even triple-barreled:
- Should weeds be eradicated?
- Do weeds inhibit the full development of useful plants?
- Do weeds inhibit the full development of ornamental plants?
This decreases interpretability of results somewhat needlessly (the measure is long enough already that one extra question wouldn't hurt noticeably… and what good does the distinction between useful and ornamental plants do?) and probably increases measurement error, as respondents are likely to rate their agreement somewhat haphazardly based on whichever part of the question elicits the strongest attitudinal response.
Second, this particular item loads rather weakly ($\lambda = .38$) on a factor (the tenth and least reliable factor, $\alpha = .58$) with no negatively-weighted items. Milfont and Duckitt (2004) critique another measure for possessing unbalanced subscales, which by their own admission are "open to acquiescence bias". Yet they haven't fully balanced their own subscales, including those for factors 2, 4, and 6 - just one shy of the number of unbalanced subscales in the measure they critique!
Regardless, "balancing" subscale items doesn't eliminate acquiescence bias; it only "balances" its effects on latent factor scores to the extent that they are calculated based on an equal number of equally-and-oppositely weighted items. In the authors' own usage of the measure (the structural equation model in Figure 1, p. 299), it appears that factor scores were based on the items weighted by their factor loadings, which are not equal. Hence even the factors that do have equal numbers of oppositely-weighted items (of which there are only two: factors 3 and 8) do not truly balance the effects of acquiescence bias.
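A toy simulation (all numbers hypothetical) can make this concrete: with two positively and two negatively keyed items, a unit-weighted sum score cancels a person-level acquiescence term exactly, but a loading-weighted score does not when the loadings are unequal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
trait = rng.normal(size=n)  # the latent construct
acq = rng.normal(size=n)    # person-level acquiescence (yea-saying) tendency

signs = np.array([1, -1, 1, -1])        # two positively, two negatively keyed items
loads = np.array([0.8, 0.3, 0.7, 0.4])  # unequal factor loadings
noise = 0.5 * rng.normal(size=(n, 4))

# Observed ratings: keyed content plus acquiescence, which pushes ALL items up
items = signs * loads * trait[:, None] + acq[:, None] + noise

# Reverse-score the negatively keyed items, then score the scale both ways
scored = items * signs
unit_score = scored.sum(axis=1)                # unit-weighted (CTT) sum score
weighted_score = (scored * loads).sum(axis=1)  # loading-weighted factor-style score

print(np.corrcoef(unit_score, acq)[0, 1])      # ~0: balanced keying cancels out
print(np.corrcoef(weighted_score, acq)[0, 1])  # clearly positive: bias remains
```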
Furthermore, even the false ideal of a set of items with equal, opposite, and balanced weighting would not truly balance acquiescence effects, as these also depend on item content and respondents' item-specific expertise (McClendon, 1991a), both of which of course vary across items. In the case of our double/triple-barreled question, the second/third barrel particularly concerns horticultural knowledge, so this item seems likely to suffer especially from acquiescence bias due to that second/third barrel, which arguably does more harm than good. Personally, I'm somewhat surprised to find theoretical basis for suspecting that this expertise issue with one (or two) particular barrel(s) causes more acquiescence problems than the ambiguity of the question overall (McClendon, 1991b), such as arises from its compound nature. Acquiescence may be measurable and controllable (Billiet & McClendon, 2000), but it's not always a big threat, it's not the only response-style-based threat (Hinz, Michalski, Schwarz, & Herzberg, 2007), and balancing oppositely-weighted items halfheartedly is surely an incomplete amelioration (Schriesheim & Hill, 1981).
The model is not a model model.
This leads to further concerns with the overall measurement model. It seems to have been developed with methods that are somewhat inappropriate for Likert rating data, which are technically ordinal (i.e., discrete and ranked). The authors don't specify how exactly they handled item scoring, so they probably gave it the simple, conventional treatment as continuous data. This is not altogether indefensible with a seven-point Likert scale (see this answer on Cross Validated), but it's not ideal. The particular item in question appears on a scale with only four items, which is probably insufficient for producing a scale score by simple summation or averaging according to classical test theory (CTT); this may further explain the scale's relatively dismal $\alpha$ reliability. Nonetheless, the authors probably used some such CTT method to calculate the scale scores they tested for gender differences in their Table 4, as they haven't referred to these as factor scores, much less specified which kind of factor scores they might be (see DiStefano, Zhu, & Mîndrilă, 2009, for a review of six popular options).
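The length issue alone goes some way here: by the standard classical-test-theory relation between alpha, the number of items $k$, and the mean inter-item correlation $\bar{r}$, the reported $\alpha = .58$ on four items implies $\bar{r} \approx .26$, at which even doubling the item count would already reach about .73. A quick check:

```python
def alpha_from_rbar(k: int, rbar: float) -> float:
    """Cronbach's alpha for k items with mean inter-item correlation rbar."""
    return k * rbar / (1 + (k - 1) * rbar)

rbar = 0.257                               # implied by alpha = .58 with k = 4
print(round(alpha_from_rbar(4, rbar), 2))  # ~0.58, the reported value
print(round(alpha_from_rbar(8, rbar), 2))  # ~0.73 with twice as many such items
```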
Moreover, the scale scoring system appears to have been developed through application of principal components and principal axis factor extraction from a simple covariance matrix that probably treated the data as continuous, as well as maximum likelihood estimation of a confirmatory factor analytic (CFA) model's fit to that same matrix. See this question on Cross Validated, "Factor analysis of questionnaires composed of Likert items" for discussion of preferable methods and other unaddressed issues, including:
- Other estimation algorithms that are less biased by ordinal data than maximum likelihood
- Fitting latent factor models to a polychoric correlation matrix, not a covariance matrix (see the sketch after this list)
- Exploration of bifactor structure in the case of a strong general factor amidst multiple dimensions
  - This may be a particularly important alternative to consider alongside a second-order model.
- Preemptive control of various biases, including extreme response style, as via unfolding models
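Regarding the polychoric point flagged above, here's a minimal sketch of a polychoric correlation estimator in Python, assuming each observed rating discretizes an underlying standard normal variable; all names and the optimization approach are mine, not the authors':

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal, norm

def thresholds(x, n_cats):
    """Category thresholds from cumulative proportions (assumes codes 0..n_cats-1, all observed)."""
    cum = np.array([np.mean(x < c) for c in range(1, n_cats)])
    return norm.ppf(cum)

def bvn_cdf(a, b, rho):
    """P(X <= a, Y <= b) for a standard bivariate normal, handling infinite limits."""
    if np.isneginf(a) or np.isneginf(b):
        return 0.0
    if np.isposinf(a):
        return 1.0 if np.isposinf(b) else norm.cdf(b)
    if np.isposinf(b):
        return norm.cdf(a)
    return multivariate_normal.cdf([a, b], cov=[[1.0, rho], [rho, 1.0]])

def polychoric(x, y, n_cats=7):
    """MLE of the polychoric correlation between two ordinal items."""
    tx = np.concatenate(([-np.inf], thresholds(x, n_cats), [np.inf]))
    ty = np.concatenate(([-np.inf], thresholds(y, n_cats), [np.inf]))
    counts, _, _ = np.histogram2d(x, y, bins=[np.arange(n_cats + 1)] * 2)

    def neg_loglik(rho):
        ll = 0.0
        for i in range(n_cats):
            for j in range(n_cats):
                if counts[i, j] == 0:
                    continue
                # Probability of the (i, j) cell: a rectangle under the BVN surface
                p = (bvn_cdf(tx[i + 1], ty[j + 1], rho)
                     - bvn_cdf(tx[i], ty[j + 1], rho)
                     - bvn_cdf(tx[i + 1], ty[j], rho)
                     + bvn_cdf(tx[i], ty[j], rho))
                ll += counts[i, j] * np.log(max(p, 1e-12))
        return -ll

    return minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded").x
```

In practice one would compute this for every item pair to assemble the polychoric matrix and fit the factor model to that, though dedicated SEM software handles this far more efficiently.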
It's also a bit fishy to perform exploratory factor analysis on the same sample as a CFA (see yet another question on Cross Validated, "Regression testing after dimension reduction"), though at least Milfont and Duckitt (2004) did this in the less objectionable order to confirm and explore somewhat distinct latent structures. In sum, reasons such as these may explain the authors' failure to achieve good model fit by the conventional standards they cited themselves. When model fit is inadequate, path coefficients may be biased, which may pose several further problems for the proposed latent factor structure.
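One common remedy is a split-sample design; here's a sketch, assuming the third-party `factor_analyzer` package for the EFA step (the `data` array is a placeholder for the real respondents-by-items ratings):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from factor_analyzer import FactorAnalyzer  # assumed third-party dependency

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 30))  # placeholder for the real item responses

# Split so that exploration and confirmation never see the same respondents
explore, confirm = train_test_split(data, test_size=0.5, random_state=0)

fa = FactorAnalyzer(n_factors=10, rotation="oblimin")
fa.fit(explore)      # EFA on the exploration half only
print(fa.loadings_)  # these loadings inform the CFA specification

# The CFA implied by these loadings would then be fit to `confirm`,
# the untouched holdout, using a dedicated SEM package.
```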
General reflection on alternative priorities
In the end, problems such as these vary widely in consequentiality depending on the egregiousness of particular instances. Older measures probably suffer more for their age in general, as many of the methodological issues I've raised here have only been articulated by relatively modern research. Unfortunately, this may threaten the validity of retrospective comparisons most where they might otherwise be most richly informative - where lines of research reach back across several decades and even generations - but other sociological factors (e.g., cohort effects) may limit the equivalence of old research to new research anyway. Reasons such as these to give up on traditional methods are at least as numerous and important as reasons to adhere loyally.
In more modern instances such as Milfont and Duckitt (2004), there may be less to lose by parting ways with less established and replicated methods (though I see their article has been cited at least 180 times already), but there may also be less to gain. Some of these issues I've raised may be fairly inconsequential, or if resolved, might even absolve their theory by improving model fit and demonstrating a lack of cause for further concern. One simply can't be sure without trying.
How much of this effort is worthwhile is an open question, a balancing act, a fundamental career choice, and even a bit of a gamble. Deep, unresolved, and even unresolvable epistemic problems underlie every application of empirical scientific methodology - this may be especially true of the social sciences - but time spent on basic research is not generally time spent directly on application of theory to contemporarily exigent problems. Thus one confronts the fundamental dilemma of addressing "real-world problems" now vs. theoretical problems for future applications, and through it the career choice of applied vs. theoretical and methodological research.
One way to strike a balance is to prognosticate the societal impact of your options, and in any given choice to go with the option most likely to matter. If your methodological problem seems more pragmatically consequential than your applied problem, solve it first, or instead if necessary! (This I've stated as an idealized, no-brainer platitude, but intended to reflect a subtler judgment call.) Another is to contribute in the ways you value most - the ways most motivating for you, be they immediate career advancement or long-term legacy - as service to your own values is service you're most likely to perform best and most valuably to all beneficiaries (yourself included). Another is to consider your implicit motives and core strengths: what do you like to do, and what do you do best? These factors will affect the value of your work too, and should inform your choices of basic priorities and specific endeavors.
Hopefully you can find ways to make these priorities align in your work, but everyone's bound to face conflicts. I'm working through some of these conflicts presently myself - hence the indulgence in the preceding exegesis - but I'll spare you further autobiography. Ponder instead the wisdom of Peppy Hare, and let me know if you've got some data and want to collaborate on updating your measurement models!
- Billiet, J. B., & McClendon, M. J. (2000). Modeling acquiescence in measurement models for two balanced sets of items. Structural Equation Modeling, 7(4), 608-628.
- DiStefano, C., Zhu, M., & Mîndrilă, D. (2009). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research & Evaluation, 14(20), 1-11. Retrieved from http://pareonline.net/pdf/v14n20.pdf.
- Hinz, A., Michalski, D., Schwarz, R., & Herzberg, P. Y. (2007). The acquiescence effect in responding to a questionnaire. GMS Psycho-Social Medicine, 4. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2736523/.
- McClendon, M. J. (1991a). Acquiescence and recency response-order effects in interview surveys. Sociological Methods & Research, 20(1), 60-103.
- McClendon, M. J. (1991b). Acquiescence: Tests of the cognitive limitations and question ambiguity hypotheses. Journal of Official Statistics, 7(2), 153-166.
- Milfont, T. L., & Duckitt, J. (2004). The structure of environmental attitudes: A first- and second-order confirmatory factor analysis. Journal of Environmental Psychology, 24, 289-303. Retrieved from http://gip.uniovi.es/gdiyad/docume/arti_jep01.pdf.
- Schriesheim, C. A., & Hill, K. D. (1981). Controlling acquiescence response bias by item reversals: The effect on questionnaire validity. Educational and Psychological Measurement, 41(4), 1101-1114.
To reiterate @what's answer, here are a few simple rules.
Change wording when…
- You want to improve the scale and are prepared to engage in a thorough validation process for the new scale, e.g., comparing the new with the old version, and re-running factor analyses, reliability analyses, and validation studies. Naturally you'd only do this if you felt that you could make a substantial improvement to the scale.
- You need to substantially adapt a scale to your context. E.g., the scale measured attitudes to X and you want to look at attitudes to Y, or you only have room in your survey for a subset of the items. In such cases, you are only loosely drawing on the existing validity information for the scale. And adapting a scale to your context should not be your first choice if a scale already suited to your context exists.
Don't change wording when…
- You think that minor changes would make the scale better (e.g., by adding/removing an item, or tweaking the wording). Such changes prevent you from comparing your data to other normative studies. While minor changes to wording are likely to produce only minor changes to the psychometrics, this is ultimately an empirical question, and you are limited in drawing on the psychometric evidence of other studies once you have made changes. Thus, it is rarely worth it.
Thoughts on changes
When using an unmodified scale, you can just refer to it. However, if you do make changes, you need to supply a copy of the full modified scale somewhere when you publish.