Is psychological shock a valid construct?

Rather than shock as a physiological reaction to bodily trauma, does anyone know of research on the validity of psychological shock?

I'm not referring to shell shock, the original name for PTSD. The shock I'm referring to would occur immediately after a trauma, as an acute response rather than a syndrome that persists indefinitely afterward.


Measurement Matters

After a long and cold journey of 286 days, the Mars Climate Orbiter reached its destination on 23 September 1999. Rather than beginning its mission, however, the satellite disintegrated upon entering the atmosphere because one software module made calculations in US customary units and fed them into a second module that assumed metric units. Four years later, two halves of a large bridge being constructed across the Rhine came together to connect Germany and Switzerland. To the surprise of the engineers, there was a height difference of 54 cm (21 in) between the two sides: Different measurements of sea level had been used (the North Sea vs. the Mediterranean Sea).

Measurement problems can (and do) occur — sometimes with disastrous consequences — as part of even the most remarkable scientific endeavors, such as sending a satellite into space. We are in no different a situation in psychology as we navigate the shifts in our research culture toward a more open and rigorous science. So far, these shifts have largely ignored the topic of measurement, an unfortunate situation because the quality of measurement is even more foundational than statistical practice. A high-powered, perfectly parsimonious statistical model cannot save us from poor measurement.

In psychology, measurement is especially difficult because what we want to measure often does not permit direct observation. We can directly observe the height of a person next to us on the bus, but we often have little insight into latent, psychological attributes such as intelligence, extraversion, or depression. Construct validation — showing that an instrument meant to measure a construct actually measures the construct in question — is no easy task. Not only are psychological constructs difficult to observe, they are also complex. It is relatively easy to settle on which sea should be the benchmark for calculating height above sea level, but clearly defining intelligence, extraversion, or depression is challenging. There are different ways to understand and measure these constructs because they encompass different behaviors, perceptions, subjective experiences, environmental influences, and biological predispositions.

This article highlights the neglect of psychological measurement, explains why this poses a serious and underrecognized threat to the recent replicability efforts in psychological science, and concludes with some suggestions on how to move forward.

The Problem: Neglected Measurement

To measure a psychological construct such as extraversion, psychologists often use questionnaires with multiple items. Items are added up to a score, and it is assumed that this score represents a person’s position on the construct. From “Paul has a high score on an extraversion scale,” we infer that Paul is very extroverted. This inference is not a free psychometric lunch: evidence of validity [1] is needed to support the claim. You want to have (1) a good theory supporting the items you include in your scale, (2) a scale showing acceptable psychometric properties (e.g., reliability and dimensionality), and (3) a scale that relates to other constructs in the ways hypothesized (e.g., convergent and discriminant validity) and that captures group differences or causal processes expected to exist. Only if your scale meets these criteria can substantive inferences follow.
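As a concrete illustration, here is a minimal, hypothetical sketch in Python of the scoring step and two of the checks listed above; the data, column names, and comparison variables are invented and stand in for a real validation design.

    import numpy as np
    import pandas as pd

    # Hypothetical responses: 200 people answering 8 Likert-type extraversion items (1-5).
    rng = np.random.default_rng(0)
    items = pd.DataFrame(rng.integers(1, 6, size=(200, 8)),
                         columns=[f"extra_{i}" for i in range(1, 9)])

    # The sum score assumed to represent each person's position on the construct.
    sum_score = items.sum(axis=1)

    # (2) One basic psychometric property: Cronbach's alpha for internal consistency.
    k = items.shape[1]
    alpha = k / (k - 1) * (1 - items.var(ddof=1).sum() / sum_score.var(ddof=1))

    # (3) Convergent / discriminant checks against other (placeholder) measures:
    # for a valid scale, the first correlation should be sizeable, the second near zero.
    sociability = rng.normal(size=200)   # stands in for a theoretically related measure
    shoe_size = rng.normal(size=200)     # stands in for an unrelated variable
    convergent_r = np.corrcoef(sum_score, sociability)[0, 1]
    discriminant_r = np.corrcoef(sum_score, shoe_size)[0, 1]
    print(round(alpha, 2), round(convergent_r, 2), round(discriminant_r, 2))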

Unfortunately, evidence of validity is lacking in many areas of psychological research. As an example, depression is assessed in more than 1,000 research studies per year and is used as an outcome, predictor, moderator, or covariate across numerous disciplines (e.g., psychology, psychiatry, epidemiology). More than 280 different scales for assessing depression severity have been developed and used in research in the last century. Commonly used depression scales feature more than 50 different symptoms, and content overlap among scales is low. For example, one third of the symptoms in the most cited scale — the 20-item Center for Epidemiologic Studies Depression scale (Radloff, 1977; approximately 41,300 citations) — do not appear in any of the other most commonly used instruments. The result is that different scales can lead to different conclusions, which has been documented many times in clinical trials. For instance, a recent clinical trial queried patients on four different scales to examine whether whole-body hyperthermia was an efficacious depression treatment. The hyperthermia group showed significant improvements over placebo on only one of the four scales. Unfortunately, the authors reported the three null findings in the supplementary materials without mention in the paper. This is an important lesson: Although comparing results across multiple measures offers more robust insights, it also opens the door to p-hacking, fishing, and other questionable research practices.

There is more. Major depression had one of the lowest interrater reliabilities of all mental disorders assessed in the DSM-5 field trials, with a coefficient of 0.28, and depression scales in general are often modeled without taking into account their multidimensionality and lack of temporal measurement invariance. Similar to the case of the Orbiter, these theoretical and statistical measurement issues can have drastic consequences, biasing conclusions of research studies and introducing error into inferences — inferences that influence the real-world behavior of scientists and resource allocation in science.
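For readers unfamiliar with chance-corrected agreement, the sketch below computes Cohen's kappa for two hypothetical raters making a present/absent judgment; the DSM-5 field trials used a pooled intraclass kappa, so this is only the simplest version of the idea.

    import numpy as np

    # Invented present(1)/absent(0) diagnoses from two raters for ten patients.
    rater_a = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])
    rater_b = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1, 1])

    p_observed = np.mean(rater_a == rater_b)                    # raw agreement
    p_chance = (rater_a.mean() * rater_b.mean()                 # both say "present" by chance
                + (1 - rater_a.mean()) * (1 - rater_b.mean()))  # both say "absent" by chance
    kappa = (p_observed - p_chance) / (1 - p_chance)            # chance-corrected agreement
    print(round(kappa, 2))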

Depression is not an isolated example of poor measurement practices in psychological research. Reviews within specific domains cite similar issues (e.g., emotion; Weidman, Steckler, & Tracy, 2017), and our recent work suggests that poor practices span topics and subdisciplines. In a systematic review of a representative sample of 35 empirical articles published in the Journal of Personality and Social Psychology in 2014, we identified 433 scales intended to measure psychological constructs. Of these, about half contained no citation to any validation study. For many scales, Cronbach’s alpha was the sole psychometric property reported, and for one in five scales, no psychometric information whatsoever was reported. Simplified, evidence of validity, in practice, forms a hierarchy: (1) none, (2) alpha only, (3) a citation, presumably to another paper that contains validity evidence, and (4) more evidence, which takes a variety of forms. Further, we saw signs of researcher degrees of freedom, similar to the depression literature: Authors used multiple scales to measure one construct without justifying their use of a particular scale. We also noted that scale modification (adding or removing items) was common, as was combining multiple scales into a single index without a transparent rationale.

Poor Measurement Complicates Replications

Taking the results of these studies together, it is difficult to ignore the connection between poor measurement practices and current discussions about replicability. For example, Monin, Sawyer, and Marquez (2008) used a variety of scales in their study, which were also administered in the replication study as a part of the “Reproducibility Project: Psychology.” However, the replication study identified different factor solutions in the primary measures, indicating that different items formed different factors. How are we to interpret the result of this study? Is it a theory failure, a replication failure, or a measurement failure? Again, these questions hold broadly. For depression, for instance, the factor structure of a given scale often differs across samples, across time in the same sample, and even in large subsets of the same sample.

If a scale lacks validity or measures different constructs across samples, there is little benefit in conducting replication studies. We must take a step back and discern how to define and measure the variables of interest in the first place. In such cases, what we need are validity studies, not replication studies. Our work to promote replicability in psychology will be stymied unless we improve our measurement practices. Making replications mainstream must go hand in hand with making measurement theory mainstream.

Ways Forward

Norms are changing in psychology, and recent articles and publisher policies push psychological scientists toward more rigorous and open practices. However, contributions focusing on the connection between measurement and replicability remain scant. We therefore close with some nontechnical suggestions that we hope will be relevant to researchers from all subdisciplines of psychology.

Clearly communicate the construct you aim to measure, how you define the construct, how you measure it, and the source of the measure.

Provide a rationale when using a specific scale over others or when modifying a scale. If possible, use multiple measures to demonstrate either robust evidence for a finding or the sensitivity of a finding to particular scales.

Preregister your study. This counters selective reporting of favorable outcomes, exploratory modifications of measures to obtain desired results, and overinterpretation of inconclusive findings across measures.

Consider the measures you use in your research. What category of validity evidence (none, alpha, citation, or more) would characterize them? If your measures fall into the first two categories, consider conducting a validation study (examples are provided below). If you cannot do so, acknowledge measurement as a limitation of your research.

Stop using Cronbach’s alpha as a sole source of validity evidence. Alpha’s considerable limitations have been acknowledged and clearly described many times (e.g., Sijtsma, 2009). Alpha cannot stand alone in describing a scale’s validity.
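For reference, the standard textbook formulas for coefficient alpha (raw and standardized) make one of these limitations visible: alpha rises mechanically with the number of items k, so a long scale can show a high alpha even when the average inter-item correlation is modest.

    \[
      \alpha \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right),
      \qquad
      \alpha_{\text{standardized}} \;=\; \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}},
    \]

where \(\sigma_i^{2}\) are the item variances, \(\sigma_X^{2}\) is the variance of the sum score, and \(\bar{r}\) is the average inter-item correlation. Neither form says anything about dimensionality or about what the items actually measure.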

Take the above points into consideration when reviewing manuscripts for journals or when serving as an editor. Ensure authors report the necessary information regarding the measurement so that readers can evaluate and replicate the measurement in follow-up studies, and help change the measurement standards of journals you work for.

We recognize that measurement research is difficult. Measurement requires both theoretical and methodological expertise. Good psychometric practice cannot make up for a poorly defined construct, and a well-defined construct cannot make up for poor psychometrics. For those reasons, it is hard to come up with a few quick fixes to improve measurement. Instead, we recognize that many psychologists may not have had training in validity theory or psychometrics and provide a list of resources for those interested in learning more. These include a collection of seminal materials on measurement and validation, as well as some accessible examples.

In closing, we want to share a screenshot of the Wikipedia article on Psychological Measurement (see Figure 1), which redirects to the page for Psychological Evaluation.

We couldn’t agree more: Measurement deserves more attention.

Figure 1. Screenshot showing that the Wikipedia article on Psychological Measurement redirects to the page for Psychological Evaluation.

The authors would like to thank Jolynn Pek, Ian Davidson, and Octavia Wong for their ongoing work in forming some of the ideas presented here.

1 We acknowledge the old and ongoing philosophical debate about how to best define validity and measurement in psychology. A detailed discussion of validity theory is beyond the scope of this article and is described at length elsewhere (e.g., American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014; Borsboom, Mellenbergh, & van Heerden, 2004; Kane, 2013). Here, we discuss validity consistent with Loevinger’s (1957) seminal work on construct validation.

References and Further Reading

Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno’s (1990) survey of PhD programs in North America. American Psychologist, 63, 32–50. doi:10.1037/0003-066X.63.1.32

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: Joint Committee on Standards for Educational and Psychological Testing.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071. doi:10.1037/0033-295X.111.4.1061

Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8, 370–378.

Fried, E. I. (2017). The 52 symptoms of major depression. Journal of Affective Disorders, 208, 191–197. doi:10.1016/j.jad.2016.10.019

Fried, E. I., & Nesse, R. M. (2015). Depression is not a consistent syndrome: An investigation of unique symptom patterns in the STAR*D study. Journal of Affective Disorders, 172, 96–102. doi:10.1016/j.jad.2014.10.010

Fried, E. I., van Borkulo, C. D., Epskamp, S., Schoevers, R. A., Tuerlinckx, F., & Borsboom, D. (2016). Measuring depression over time . . . or not? Lack of unidimensionality and longitudinal measurement invariance in four common rating scales of depression. Psychological Assessment, 28, 1354–1367. doi:10.1037/pas0000275

Janssen, C. W., Lowry, C. A., Mehl, M. R., Allen, J. J. B., Kelly, K. L., Gartner, D. E., … Raison, C. L. (2016). Whole-body hyperthermia for the treatment of Major Depressive Disorder. JAMA Psychiatry, 53706, 1–7. doi:10.1001/jamapsychiatry.2016.1031

Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73. doi:10.1111/jedm.12000

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635–694.

Monin, B., Sawyer, P. J., & Marquez, M. J. (2008). The rejection of moral rebels: Resenting those who do the right thing. Journal of Personality and Social Psychology, 95, 76–93. doi:10.1037/0022-3514.95.1.76

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. doi:10.1126/science.aac4716

Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. doi:10.1177/014662167700100306

Regier, D. A., Narrow, W. E., Clarke, D. E., Kraemer, H. C., Kuramoto, S. J., Kuhl, E. A., & Kupfer, D. J. (2013). DSM-5 field trials in the United States and Canada, part II: Test-retest reliability of selected categorical diagnoses. The American Journal of Psychiatry, 170, 59–70. doi:10.1176/appi.ajp.2012.12070999

Santor, D. A., Gregus, M., & Welch, A. (2006). Eight decades of measurement in depression. Measurement, 4, 135–155. doi:10.1207/s15366359mea0403

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach. Psychometrika, 74, 107–120. doi:10.1007/s11336-008-9101-0

Weidman, A. C., Steckler, C. M., & Tracy, J. L. (2017). The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion, 17, 267–295.

Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Making replication mainstream. Behavioral and Brain Sciences. Advance online publication. doi:10.1017/S0140525X17001972


Empirical Evaluation of Construct Validity

Campbell and Fiske's (1959) multitrait-multimethod (MTMM) matrix methodology presented a logic for evaluating construct validity through simultaneous evaluation of convergent and discriminant validity and of the contribution of method variance to observed relationships. Wothke (1995) nicely summarized the central idea of MTMM matrix methodology:

The crossed-measurement design in the MTMM matrix derives from a simple rationale: Traits are universal, manifest over a variety of situations, and detectable with a variety of methods. Most importantly, the magnitude of a trait should not change just because different assessment methods are used. (p. 125)

Traits are latent variables, inferred constructs. The term trait, as used here, is not limited to enduring characteristics; it applies as well to more transitory phenomena such as moods and emotions, and to other individual-differences constructs such as attitudes and psychophysical measurements. Methods, for Campbell and Fiske, are the procedures through which responses are obtained: the operationalization of the assessment procedures that produce the responses, the quantitative summary of which is the measure itself (Wothke 1995).

As Campbell & Fiske (1959) emphasized, measurement methods (method variance) are sources of irrelevant, though reliable, variance. When the same method is used across measures, the presence of reliable method variance can lead to an overestimation of the magnitude of relations among constructs. This can lead to overestimating convergent validity and underestimating discriminant validity, which is why multiple assessment methods are critical in the development of construct validity evidence. Their distinction of validity (the correlation between dissimilar measures of a characteristic) from reliability (the correlation between similar measures of a characteristic) hinged on the differences between construct assessment methods.
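A small simulation (ours, not Campbell & Fiske's) makes the inflation concrete: two traits that are truly uncorrelated, both measured with the same method, yield a clearly positive observed correlation driven entirely by the shared method component.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000
    trait_a = rng.normal(size=n)
    trait_b = rng.normal(size=n)       # independent of trait_a by construction
    method = rng.normal(size=n)        # shared method component, e.g., a response style

    measure_a = trait_a + 0.8 * method + rng.normal(scale=0.5, size=n)
    measure_b = trait_b + 0.8 * method + rng.normal(scale=0.5, size=n)

    r_true = np.corrcoef(trait_a, trait_b)[0, 1]          # approximately 0
    r_observed = np.corrcoef(measure_a, measure_b)[0, 1]  # roughly .3 in this setup
    print(round(r_true, 2), round(r_observed, 2))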

Campbell & Fiske's (1959) observation remains important today: much clinical psychology research relies on the same method for both predictor and criterion measurement, typically self-report questionnaire or interview. Their call for attention to method variance is as relevant today as it was 50 years ago; examination of constructs with different methods is a crucial part of the construct validation process. Of course, the degree to which two methods are independent is not always clear. For example, how different are the methods of interview and questionnaire? Both rely on self-report, so are they independent sources of information? Perhaps not, but they do differ operationally. For example, questionnaire responses are often anonymous, whereas interview responses require disclosure to another person. Questionnaire responses are based on the perceptions of the respondent, whereas interview ratings are based, in part, on the perceptions of the interviewer. A conceptually based definition of “method variance” has not been easy to achieve, as Sechrest et al.'s (2000) analysis of this issue demonstrates. Certainly, method differences lie on a continuum on which, for example, self-report and interview are closer to each other than self-report is to informant report or behavioral observation.

The guidance provided for evaluating construct validity in 1959 was qualitative: it involved the rule-based examination of patterns of correlations against the expectations of convergent and discriminant validity (Campbell & Fiske 1959). Developments in psychometric theory, multivariate statistics, and the analysis of latent traits in the decades since the Campbell & Fiske (1959) paper have made available a number of quantitative methods for modeling convergent and discriminant validity across different assessment methods.

Bryant (2000) provides a particularly accessible description of using ANOVA (and a nonparametric variant) and confirmatory factor analysis (CFA) in the analysis of MTMM matrices. A major advantage of CFA in construct validity research is the possibility of directly comparing alternative models of relationships among constructs, a critical component of theory testing (see Whitely 1983). Covariance component analysis of the MTMM matrix has also been developed (Wothke 1995). Both covariance component analysis and CFA are variants of structural equation models (SEM). With these advances, eyeball examinations of MTMM matrices are no longer sufficient for evaluating the trait validity of a measure in modern assessment research.

Perhaps the first CFA approach was one that followed very straightforwardly from Campbell & Fiske (1959): it involved specifying a CFA model in which responses to any item can be understood as reflecting additive effects of trait variance, method variance, and measurement error (Marsh & Grayson 1995; Reichardt & Coleman 1995; Widaman 1985). So if traits A, B, and C are each measured with methods X, Y, and Z, there are six latent variables: three for the traits and three for the methods. Thus, if indicator i reflects method X for evaluating trait A, that part of the variance of i that is shared with other indicators of trait A is assigned to the trait A factor, that part of the variance of i that is shared with indicators of other constructs measured by method X is assigned to the method X factor, and the remainder is assigned to an error term (Eid et al. 2003; Kenny & Kashy 1992). The association of each type of factor with other measures can be examined, so, for example, one can test explicitly the role of a certain trait or a certain type of method variance on responses to a criterion measure. This approach can be expanded to include interactions between traits and methods (Campbell & O'Connell 1967, 1982), and therefore to test multiplicative models (Browne 1984; Cudeck 1988).
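In equation form (our notation, not the cited authors'), the additive model for the indicator of trait t measured by method m is

    \[
      y_{tm} \;=\; \lambda^{T}_{tm}\,T_{t} \;+\; \lambda^{M}_{tm}\,M_{m} \;+\; \varepsilon_{tm},
      \qquad
      \operatorname{Var}(y_{tm}) \;=\; \bigl(\lambda^{T}_{tm}\bigr)^{2}\operatorname{Var}(T_{t})
        + \bigl(\lambda^{M}_{tm}\bigr)^{2}\operatorname{Var}(M_{m})
        + \operatorname{Var}(\varepsilon_{tm}),
    \]

where the variance decomposition holds under the usual assumption that trait factors, method factors, and errors are mutually uncorrelated; the trait loadings carry the construct-relevant variance and the method loadings carry the method variance.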

Although the potential advantages of this approach are obvious, it has generally not proven feasible. As noted by Kenny & Kashy (1992), this approach often results in modeling more factors than there is information to identify them; the result, often, is a statistical failure to converge on a factor solution. That reality has led some researchers to turn away from multivariate statistical methods for evaluating MTMM results. In recent years, however, two alternative CFA modeling approaches have been developed that appear to work well.

The first is referred to as the “correlated uniquenesses” approach (Marsh & Grayson 1995). In this approach, one does not model method factors as in the approach previously described. Instead, one identifies the presence of method variance by allowing the residual variances of trait indicators that share the same method to correlate, after accounting for trait variation and covariation. To the degree there are substantial correlations between these residual terms, method variance is considered present and is accounted for statistically (although other forms of reliable specificity may be represented in those correlations as well). As a result, the latent variables reflecting trait variation do not include that method variance: one can test the relation between method-free trait scores and other variables of interest. And, since this approach models only trait factors, it avoids the over-factoring problem of the earlier approach. There is, however, an important limitation to the correlated uniquenesses approach. Without a representation of method variance as a factor, one cannot examine the association of method variance with other constructs, which may be important to do (Cronbach 1995).

The second alternative approach provides a way to model some method variance while avoiding the over-factoring problem (Eid et al. 2003). One constructs latent variables to represent all trait factors and all but one method factor. Since there are fewer factors than in the original approach, the resulting solution is mathematically identified: one has not over-factored. The idea is that one method is chosen as the baseline method and is not represented by a latent variable. One evaluates other methods for how they influence results compared to the baseline method. Suppose, for example, that one had interview and collateral report data for a series of traits. One might specify the interview method as the baseline method, so an interview method factor is not modeled as separate from trait variance, and trait scores are really trait-as-measured-by-interview scores. One then models a method factor for collateral report. If the collateral report method leads to higher estimates of trait presence than does the interview, one would find that the collateral report method factor correlated positively with the trait-as-measured-by-interview. That would imply that collaterals report higher levels of the trait than do individuals during interviews.

Interestingly, one can assess whether this process works differently for different traits. Perhaps collaterals perceive higher levels of some traits than are reported by interview (unflattering traits?) and lower levels of other traits than are reported by interview (flattering traits?). This possibility can be examined empirically using this method. In this way, the Eid et al. (2003) approach makes it possible to identify the contribution of method to measure scores. The limitation of this method, of course, is that the choice of “baseline method” influences the results and may be arbitrary (Eid et al. 2003).

Most recently, Courvoisier et al. (2008) have combined this approach with latent state-trait analysis; the latter method allows one to estimate variance due to stable traits, occasion-specific states, and error (Steyer et al. 1999). The result is a single analytic method to estimate variance due to trait, method, state, and error. Among the possibilities offered by this approach is that one can investigate the degree to which method effects are stable or variable over time.

We wish to emphasize three points concerning these advances in methods for the empirical evaluation of construct validity. First, the concern that MTMM data could not successfully be analyzed using CFA/SEM approaches is no longer justified; there are now analytic tools that have proven successful (Eid et al. 2003). Second, statistical tools are available that enable one to quantitatively estimate multiple sources of variance that are important to the construct validation enterprise (Eid et al. 2003; Marsh & Grayson 1995). One need not guess at the degree to which method variance is present, or the degree to which it is common across traits, or the degree to which it is stable: one can investigate these sources of variance directly. Third, these analytic techniques are increasingly accessible to researchers (see Kline 2005, for a useful introduction to SEM). Clinical researchers have a validity concern beyond successful demonstration of convergent and discriminant validity. Success at the level of MTMM validity does not assure that the measured traits have utility. Typically, one also needs to investigate whether the traits enhance prediction of some criterion of clinical importance.

To this end, clinical researchers can rely on a classic contribution by Hammond et al. (1986). They offered a creative, integrative analytic approach for combining the results of MTMM designs with the evaluation of differential prediction of external criteria. In the best tradition of applying basic science advances to practical prediction, their design integrated the convergent/discriminant validity perspective of Campbell & Fiske (1959) with Brunswik's (1952, 1956) emphasis on representative design in research, which in part concerned the need to conduct investigations that yield findings one can apply to practical problems. They presented the concept of a performance validity matrix, which adds criterion variables for each trait to the MTMM design. By adding clinical outcome variables to one's MTMM design, one can provide evidence of convergent validity, discriminant validity, and differential clinical prediction in a single study.

Such analyses are critical clinically, because this sophisticated treatment of validity is likely to improve the usefulness of measures for clinicians. For many measures, validation research that considers practical prediction improves measures’ “three Ps”: predicting important criteria, prescribing treatments, and understanding the processes underlying personality and psychopathology (Youngstrom 2008), thereby improving clinical assessment. Such practical efforts in assessment must rely on observed scores, confounded as they may be with method variance. Construct validity research provides the clinician with an appreciation of the many factors entering into an observed score and, thus, of the mix of construct-relevant variance, reliable construct-irrelevant variance, and method variance in any score (see Richters 1992).


Definition of a Construct

A concept is a mental image of something concrete that we create in our minds to represent something we have seen. When someone uses a term like "athlete," we can create a mental picture of what an athlete looks like because we have directly observed different athletes. Specific individuals may come to mind, or at least an image that includes common qualities of athletes.

Constructs, however, are more complex. They are "higher-level abstractions" that people create by piecing together simpler concepts into purposeful patterns. Constructs are useful in interpreting empirical data and building theories. They are used to account for observed regularities and relationships. They summarize observations and provide explanations.


Psychology: Construct Validity in Psychological Measurement

Title: Construct Validity in Psychological Measurement
Discipline(s) or Field(s): Psychology
Authors: Carmen Wilson, Bill Cerbin, Melanie Cary, Rob Dixon, University of Wisconsin – La Crosse
Submission Date: January 15, 2007

Executive Summary

The goal of the lesson is to develop students’ understanding of construct validity as measured by their ability to: 1) explain methods used to determine construct validity for psychological measures and 2) design a study to determine the construct validity of a given measure.

Prior to the lesson: In the two class days before the lesson, the instructor presented information on content, criterion, and construct validity. Each type of validity was presented in terms of a question it answers and how it might be assessed. Content validity answers the question, “Do the items represent the domain of interest?” It can be assessed by having an expert in the topic review the test. Criterion validity answers the question, “Do scores on the test predict some non-test behavior?” It can be assessed by correlating scores on the test with some other measure of the behavior (e.g., behavioral observation). Construct validity answers the question, “Does the test measure what it claims to measure?” The lecture highlighted several processes for assessing construct validity. The answer to the construct validity question depends upon what is known about the construct being measured. For example, if the theory about the construct suggests that two groups of people should have different levels of the construct, and the test actually measures the construct, then the groups’ scores should differ.
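A minimal sketch, with invented scores, of the known-groups logic described above: if theory says a diagnosed group should score higher on a depression measure than a community comparison group, a valid measure should show that separation.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    clinical = rng.normal(loc=32, scale=8, size=50)    # hypothetical diagnosed group
    community = rng.normal(loc=18, scale=8, size=50)   # hypothetical comparison group

    res = stats.ttest_ind(clinical, community)         # do the group means differ?
    cohens_d = (clinical.mean() - community.mean()) / np.sqrt(
        (clinical.var(ddof=1) + community.var(ddof=1)) / 2)
    print(round(res.statistic, 2), round(res.pvalue, 4), round(cohens_d, 2))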

We evaluated three versions of the lesson across three semesters. In Version 1 (lesson, no lecture – A), students developed a 5-item measure of depression and then designed three research studies to evaluate the validity of their measure prior to receiving any instruction about construct validity. In Version 2 (lesson, no lecture – B) we made minor modifications, but the lesson essentially remained the same. In Version 3 (lesson after lecture), we made significant modifications. The instructor lectured about construct validity first, and in a subsequent class, students analyzed three validity studies and then designed a validity study based on information provided by the instructor.

In Versions 1 and 2, students became bogged down in the details of their proposed research studies and missed the more important goal of predicting results that would support the validity of their measure. The team therefore restructured the lesson so that in Version 3 students first heard a lecture, then read summaries of real validity studies and predicted the results of those studies, assuming the tests were valid. In the last part of the lesson, students designed a study to determine whether a given test was valid and predicted the results of that study. Interestingly, students who participated in Version 1 of the lesson (lesson, no lecture – A) generally performed better than students who participated in Versions 2 and 3.


“GENERAL STEPS OF TEST CONSTRUCTION”

The development of a good psychological test requires thoughtful and sound application of established principles of test construction. Before the real work of test construction begins, the test constructor makes some broad decisions about the major objectives of the test, the population for whom the test is intended, the conditions under which the test can be used, and its important uses.

These preliminary decisions have far-reaching consequences. For example, a test constructor may decide to construct an intelligence test for tenth-grade students, broadly aimed at diagnosing the pupils' manipulative and organizational abilities. Having decided these preliminary matters, the test constructor proceeds through the following steps:

  1. Planning.
  2. Writing items for the test.
  3. Preliminary administration of the test.
  4. Reliability of the final test.
  5. The validity of the final test.
  6. Preparation of norms for the final test.
  7. Preparation of manual and reproduction of the test.

PLANNING:

The first step in test construction is careful planning. At this stage, the test constructor addresses the following issues:

Definition of the construct to be measured by the proposed test.

The author has to spell out the broad and specific objectives of the test in clear terms: the prospective users (for example, vocational counselors, clinical psychologists, educationists) and the purpose or purposes for which they will use the test.

What will be the appropriate age range, educational level, and cultural background of the examinees who would find it desirable to take the test?

What will be the content of the test? Is this content coverage different from that of existing tests developed for the same or similar purposes? Is it culture-specific?

The author has to decide on the nature of the items, that is, whether the test will be in multiple-choice, true-false, inventive-response, or some other form.

What will be the type of instructions, i.e., written or to be delivered orally?

Will the test be administered individually or in groups? Will the test be designed or modified for computer administration? Detailed arrangements for the preliminary and final administrations should also be considered.

What special training or qualifications will be necessary for administering or interpreting the test?

The test constructor must also decide on the probable length of the test and the time required for its completion.

What will be the method of sampling, i.e., random or selective?

Is there any potential harm for the examinees resulting from the administration of this test? Are there any safeguards built into the recommended testing procedure to prevent any sort of harm to anyone involved in the use of this test?

How will the scores be interpreted? Will the scores of an examinee be compared to those of others in a criterion group, or will they be used to assess mastery of a specific content area? To answer this question, the author has to decide whether the proposed test will be criterion-referenced or norm-referenced.

Planning also includes deciding on the total number of copies to be reproduced and on the preparation of the manual.

ITEM WRITING:

An item is "a single question or task that is not often broken down into any smaller units" (Bean, 1953, p. 15).

EXAMPLE: An arithmetical mean may be an item, a manipulative task may be an item, a mechanical puzzle may be an item, and likewise sleeplessness may also be an item of a test.

Items in a test are just like atoms in matter, that is, they are indivisible.

The second step in test construction is the preparation of the items of the test. Item writing builds on the planning done earlier. If the test constructor decides to prepare an essay test, then essay items are written down.

However, if he decides to construct an objective test, he writes down objective items such as alternative-response, matching, multiple-choice, completion, short-answer, or pictorial items. Depending upon the purpose, he decides which of these objective item types to write.

PREREQUISITES FOR ITEM WRITING:

Item writing is essentially a creative art. There are no set rules to guide and guarantee the writing of good items. A lot depends upon the item writer’s intuition, imagination, experience, practice, and ingenuity. However, there are some essential prerequisites that must be met if the item writer wants to write good and appropriate items. These requirements are briefly discussed as follows:

The item writer must have a thorough knowledge and complete mastery of the subject matter. In other words, he must be fully acquainted with all the facts, principles, misconceptions, and fallacies in a particular field so that he may be able to write good and appropriate items.

The item writer must be fully aware of those persons for whom the test is meant. He must also be aware of the intelligence level of those persons so that he may manipulate the difficulty level of the items for proper adjustment with their ability level. He must also be able to avoid irrelevant clues to correct responses.

The item writer must be familiar with different types of items along with their advantages and disadvantages. He must also be aware of the characteristics of good items and the common probable errors in writing items.

The item writer must have a large vocabulary. He must know the different meanings of a word so that confusion in writing the items may be avoided. He must be able to convey the meaning of the items in the simplest possible language.

After the items have been written down, they must be submitted to a group of subject experts for criticism and suggestions, in light of which the items must be duly modified.

The item writer must also cultivate a rich source of ideas for items. This is because ideas are not produced in the mind automatically but rather require certain factors or stimuli. The common sources of such ideas are textbooks, journals, discussions, interview questions, course outlines, and other instructional materials.

CHARACTERISTICS OF A GOOD ITEM:

An item must have the following characteristics:

An item should be phrased in such a manner that there is no ambiguity regarding its meaning for both the item writer and the examinees who take the test.

The item should not be too easy or too difficult.

It must have discriminating power, that is, it must clearly distinguish between those who possess the trait and those who do not.

It should not be concerned with the trivial aspects of the subject matter, that is, it must only measure the significant aspects of knowledge or understanding.

As far as possible, it should not encourage guesswork by the subjects.

It should not present any difficulty in reading.

It should not be such that its meaning is dependent upon another item and/or it can be answered by referring to another item.

GENERAL GUIDELINES FOR ITEM WRITING:

Writing items is a matter of precision. It is perhaps more like computer programming than writing prose. The task of the item writer is to focus the attention of a large group of examinees, varying in background, experience, environmental exposure, and ability level, on a single idea. Such a situation requires extreme care in the choice of words. The item writer must keep in view some general guidelines that are essential for writing good items. These are listed below:

CLARITY OF THE ITEM:

The clarity in writing test items is one of the main requirements for an item to be considered good. Items must not be written as “verbal puzzles”. They must be able to discriminate between those who are competent and those who are not. This is possible only when the items have been written in simple and clear language. The items must not be a test of the examinee’s ability to understand the language.

The item writer should be especially cautious in writing objective items, because each such item provides a more or less isolated bit of knowledge and the problem of clarity is therefore more serious. If an objective item is vague, it will create difficulty in understanding, and the validity of the item will be adversely affected. Vagueness in items may arise for several reasons, such as unclear thinking or incompetence on the part of the item writer.

NON-FUNCTIONAL WORDS SHOULD BE AVOIDED:

Non-functional words must not be included in the items as they tend to lower the validity of the item. Non-functional words refer to those words which make no contribution towards the appropriate and correct choice of a response by the examinees. Such words are often included by the item writer in an attempt to make the correct answer less obvious or to provide a good distractor.

AVOID IRRELEVANT ACCURACIES:

The item writer must make sure that irrelevant accuracies are not unintentionally incorporated in the items. Such irrelevant accuracies reflect poor critical thinking on the part of the item writer. They may also lead the examinees to assume that the statement is true.

DIFFICULTY LEVEL SHOULD BE ADAPTABLE:

The item must not be too easy or too difficult for the examinees. The level of difficulty of the item should be adapted to the level of understanding of the examinees. Although exact decisions regarding the difficulty value of an item can be made only after statistical techniques have been employed, an experienced item writer is capable of roughly controlling the difficulty value beforehand and adapting it to the examinees.

In certain forms of objective items, such as multiple-choice and matching items, it is easy to increase or decrease the difficulty value of the item. In general, when the response alternatives are made homogeneous, the difficulty of the item is increased; when the alternatives other than the correct one are made heterogeneous, the examinee is likely to identify the correct answer quickly, and the level of difficulty is decreased.

The item writer must keep in view the characteristics of both the ideal examinees and the typical examinees. If he keeps only the ideal examinees (who are fewer in number) in view and ignores the typical examinees, the test items are likely to be unreasonably difficult.

STEREOTYPED WORDS SHOULD BE AVOIDED:

The use of stereotyped words, either in the stem or in the alternative responses, must be avoided because such words help rote learners guess the correct answer. Moreover, stereotyped words fail to discriminate between those who really know and understand the subject and those who do not; thus, they do not provide an adequate and discriminating measure. The most obvious way of getting rid of such words is to paraphrase the material in a different manner so that those who really know the answer can pick up the meaning.

IRRELEVANT CLUES MUST BE AVOIDED:

Irrelevant clues must be avoided. These are sometimes provided in several forms, such as clang associations, verbal associations, the length of the answer, a foil that stands out among otherwise homogeneous foils, placing the correct answer in the same position every time, etc. In general, such clues tend to decrease the difficulty level of the item because they provide an easy route to the correct answer.

The common observation is that examinees who do not know the correct answer pick up on these irrelevant clues and answer on that basis. The item writer must, therefore, take special care to avoid such irrelevant clues. Specific determiners like never, always, all, and none must also be avoided because they, too, are irrelevant clues to the correct answer, especially in two-alternative items.

INTERLOCKING ITEMS MUST BE AVOIDED:

Interlocking items must be avoided. Interlocking items, also known as interdependent items, are items that can be answered only by referring to other items. In other words, when responding correctly to an item is dependent upon the correct response to another item, the item constitutes an example of an interlocking or interdependent item. For example:

  • Sociometry is a technique used to study the effect structure of groups. True/false
  • It is a kind of projective technique. True/false
  • It was developed by Moreno et al. True/false

The above examples illustrate interlocking items. Items 2 and 3 can be answered correctly only when the examinee knows the correct answer to item 1. Such items should be avoided because they do not give examinees an equal chance to answer each item.

NUMBER OF ITEMS:

The item writer is also frequently faced with the problem of determining the exact number of items. As a matter of fact, there is no hard and fast rule regarding this. Previous studies have shown that the number of items is usually linked with the desired level of the test's reliability coefficient. Studies have revealed that usually 25-30 dichotomous items are needed to reach a reliability coefficient as high as 0.80, whereas 15-20 items are needed to reach the same level of reliability when multipoint items are used.

These are the minimum numbers of items that should be retained after item analysis. An item writer should always write roughly TWICE the number of items to be retained in the final test. Thus, if he wants 30 items in the final test, he should write about 60 items.
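A hedged illustration of the length-reliability link mentioned above, using the Spearman-Brown formula for a test of k parallel items; the assumed average inter-item correlation of .12 is purely illustrative.

    def spearman_brown(k, r_bar):
        """Reliability of a k-item test of parallel items with average inter-item correlation r_bar."""
        return k * r_bar / (1 + (k - 1) * r_bar)

    # With r_bar = 0.12, roughly 30 items are needed to approach a reliability of .80.
    for k in (10, 20, 30, 60):
        print(k, round(spearman_brown(k, r_bar=0.12), 2))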

In a speed test, the number of items to be written depends entirely upon the intuitive judgment of the test constructor. On the basis of his previous experience, he decides how many items can be answered within the given time limit.

ARRANGEMENT OF ITEMS:

After the items have been written down, they are reviewed by some experts or by the item writer himself and then arranged in the order in which they are to appear in the test. Generally, items are arranged in increasing order of difficulty; those having the same form (say, alternative-response, matching, multiple-choice, etc.) and dealing with the same content are placed together.

Before proceeding to administration, the test should be reviewed by at least three experts. When the items have been written down and modified in light of the experts' suggestions and criticisms, the test is said to be ready for experimental tryout.

THE EXPERIMENTAL TRYOUT / PRE-TRYOUT:

The first administration of the test is called the EXPERIMENTAL TRYOUT or PRE-TRYOUT. The sample size for the experimental tryout should be about 100.

The purpose of the experimental tryout is manifold. According to Conrad (1951), the main purposes of the experimental tryout of any psychological or educational test are as follows:

Finding out the major weaknesses, omissions, ambiguities, and inadequacies of the items.

Determining the difficulty level of each item, which in turn helps in their proper distribution in the final form.

Determining a reasonable time limit for the test.

Determining the appropriate length of the test, that is, the number of items to be included in the final form.

Identifying any weaknesses or vagueness in the directions or instructions of the test.

PROPER TRYOUT:

The second preliminary administration is called the PROPER TRYOUT. At this stage the test is administered to a sample of about 400 examinees, who must be similar to those for whom the test is intended.

The proper tryout is carried out for item analysis. ITEM ANALYSIS is the technique of selecting discriminating items for the final composition of the test. It aims at obtaining three kinds of information regarding the items, namely:

Item difficulty is the proportion or percentage of the examinees or individuals who answer the item correctly.

The discriminatory power of an item refers to the extent to which the item discriminates successfully between those who possess the trait to a greater degree and those who possess it to a lesser degree.

Identification of non-functional distractors.
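The sketch below, on made-up dichotomous (0/1) responses, computes the first two of these quantities: item difficulty as the proportion answering correctly, and discrimination as the corrected item-total (rest-score) correlation.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(4)
    responses = pd.DataFrame(rng.integers(0, 2, size=(100, 6)),
                             columns=[f"item{i}" for i in range(1, 7)])

    difficulty = responses.mean()              # proportion of correct answers per item
    total = responses.sum(axis=1)
    discrimination = {                         # correlation of each item with the rest-score
        col: np.corrcoef(responses[col], total - responses[col])[0, 1]
        for col in responses.columns
    }
    print(difficulty.round(2))
    print({k: round(v, 2) for k, v in discrimination.items()})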

FINAL TRYOUT:

The third preliminary administration is called the FINAL TRYOUT. The sample for the final tryout should be at least 100. At this stage, the items selected after item analysis constitute the test in its final form. This administration is carried out to detect minor defects that may not have been caught in the first two preliminary administrations.

The final tryout indicates how effective the test will be when it is administered to the population for which it is really intended. Thus, this preliminary administration serves as a kind of "DRESS REHEARSAL," providing a final check on the test's administration procedure and time limit.

After the final tryout, expert opinion should be considered again.

On the basis of the experimental or empirical tryout, the test is finally composed of the selected items, and the final form is administered again to a fresh sample. At this stage, we check the reliability of the test, which indicates the consistency of scores.

In simple words, reliability is the degree to which a measurement is consistent: if findings are replicated consistently, they are reliable.

Reliability also refers to the self-correlation of a test. A correlation coefficient can be used to assess the degree of reliability; if a test is reliable, it should show a high positive correlation.
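As a minimal example of reliability as self-correlation, the sketch below simulates the same (hypothetical) test administered twice to the same people and correlates the two sets of scores.

    import numpy as np

    rng = np.random.default_rng(5)
    true_score = rng.normal(size=80)
    time1 = true_score + rng.normal(scale=0.4, size=80)   # first administration
    time2 = true_score + rng.normal(scale=0.4, size=80)   # second administration
    test_retest_r = np.corrcoef(time1, time2)[0, 1]       # high for a reliable test
    print(round(test_retest_r, 2))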


How to Conduct a Psychology Experiment


Conducting your first psychology experiment can be a long, complicated, and sometimes intimidating process. It can be especially confusing if you are not quite sure where to begin or which steps to take.

Like other sciences, psychology utilizes the scientific method and bases conclusions upon empirical evidence. When conducting an experiment, it is important to follow the seven basic steps of the scientific method:  

  1. Ask a question or find a research problem to solve.
  2. Determine what you will test to answer this question.
  3. Review current knowledge on the subject.
  4. Design an experiment.
  5. Perform the experiment.
  6. Analyze results using statistical methods.
  7. Draw your conclusion and share the results with the scientific community.

What Are Some Examples of Psychological Construct?

Examples of psychological constructs are abilities, attitudes, personality traits and emotional states. A psychological construct can refer to a person's self-esteem, athletic ability or political and religious views. Psychological constructs refer to the traits and qualities of a person that cannot be concretely identified by observation.

Unlike easily identifiable physical features, such as hair color or eye color, psychological constructs refer to a person's tendencies to behave or feel a certain way. If someone is described as an introvert, it does not necessarily mean that the person behaves like an introvert at all times. Someone who is generally introverted might enjoy the occasional social gathering. A psychological construct simply refers to a general characteristic that cannot be observed directly.

A psychological construct is assessed through psychological measurement. Psychological measurements are numbers assigned to represent and define the characteristics of a person. A construct cannot be measured like a concrete characteristic, such as weight or height. Psychological measurements are not performed with tools or devices; instead, they are determined through a series of questions and evaluations conducted by a qualified therapist or psychiatrist. There are different types of evaluations used to determine things such as a person's memory capacity or levels of depression or anxiety.




Validity (accuracy)

Validity [5] of an assessment is the degree to which it measures what it is supposed to measure. This is not the same as reliability, which is the extent to which a measurement gives results that are very consistent. Within validity, the measurement does not always have to be similar, as it does in reliability. However, just because a measure is reliable, it is not necessarily valid. For example, a scale that is consistently 5 pounds off is reliable but not valid. A test cannot be valid unless it is reliable. Validity is also dependent on the measurement measuring what it was designed to measure, and not something else instead. [6] Validity (similar to reliability) is a relative concept; validity is not an all-or-nothing idea. There are many different types of validity.

Construct validity

Construct validity refers to the extent to which operationalizations of a construct (e.g., practical tests developed from a theory) measure a construct as defined by a theory. It subsumes all other types of validity. For example, the extent to which a test measures intelligence is a question of construct validity. A measure of intelligence presumes, among other things, that the measure is associated with things it should be associated with (convergent validity), not associated with things it should not be associated with (discriminant validity). [7]

Construct validity evidence involves the empirical and theoretical support for the interpretation of the construct. Such lines of evidence include statistical analyses of the internal structure of the test, including the relationships between responses to different test items. They also include relationships between the test and measures of other constructs. As currently understood, construct validity is not distinct from the support for the substantive theory of the construct that the test is designed to measure. As such, experiments designed to reveal aspects of the causal role of the construct also contribute to construct validity evidence. [7]

Content validity

Content validity is a non-statistical type of validity that involves "the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured" (Anastasi & Urbina, 1997, p. 114). For example, does an IQ questionnaire have items covering all areas of intelligence discussed in the scientific literature?

Content validity evidence involves the degree to which the content of the test matches a content domain associated with the construct. For example, a test of the ability to add two numbers should include a range of combinations of digits. A test with only one-digit numbers, or only even numbers, would not have good coverage of the content domain. Content-related evidence typically involves a subject matter expert (SME) evaluating test items against the test specifications. Before the final administration of questionnaires, the researcher should check the validity of the items against each of the constructs or variables and modify the measurement instruments accordingly on the basis of the SMEs' opinions.

A test has content validity built into it by careful selection of which items to include (Anastasi & Urbina, 1997). Items are chosen so that they comply with the test specification, which is drawn up through a thorough examination of the subject domain. Foxcroft, Paterson, le Roux & Herbst (2004, p. 49) [8] note that by using a panel of experts to review the test specifications and the selection of items, the content validity of a test can be improved. The experts will be able to review the items and comment on whether the items cover a representative sample of the behavior domain.

Face validity

Face validity is an estimate of whether a test appears to measure a certain criterion; it does not guarantee that the test actually measures phenomena in that domain. A measure may have high validity, but if it does not appear to measure what it actually does, it has low face validity. Indeed, when a test is subject to faking (malingering), low face validity can make the test more useful: because respondents may give more honest answers when the purpose of a measure is not obvious, it is sometimes desirable for a test to have low apparent face validity while it is administered.

Face validity is very closely related to content validity. Content validity depends on a theoretical basis for judging whether a test assesses all domains of a certain criterion (e.g., does assessing addition skills yield a good measure of mathematical skills? To answer this, you have to know what different kinds of arithmetic skills mathematical skills include). Face validity, by contrast, concerns only whether a test appears to be a good measure. This judgment is made on the "face" of the test, so it can also be made by a layperson.

Face validity is a starting point, but a test should never be assumed to be valid for a given purpose on that basis alone, as the "experts" have been wrong before: the Malleus Maleficarum (Hammer of Witches) had no support for its conclusions other than the self-imagined competence of two "experts" in "witchcraft detection," yet it was used as a "test" to condemn and burn at the stake tens of thousands of men and women as "witches." [9]

Criterion validity

Criterion validity evidence involves the correlation between the test and a criterion variable (or variables) taken as representative of the construct. In other words, it compares the test with other measures or outcomes (the criteria) already held to be valid. For example, employee selection tests are often validated against measures of job performance (the criterion), and IQ tests are often validated against measures of academic performance (the criterion).

If the test data and criterion data are collected at the same time, this is referred to as concurrent validity evidence. If the test data are collected first in order to predict criterion data collected at a later point in time, then this is referred to as predictive validity evidence.

Concurrent validity

Concurrent validity refers to the degree to which the operationalization correlates with other measures of the same construct that are measured at the same time. If the measure and another measure of the same construct are both valid, they should be related (correlated). Returning to the selection test example, this would mean that the tests are administered to current employees and then correlated with their scores on performance reviews.

Predictive validity

Predictive validity refers to the degree to which the operationalization can predict (or correlate with) other measures of the same construct that are measured at some time in the future. Again, with the selection test example, this would mean that the tests are administered to applicants, all applicants are hired, their performance is reviewed at a later time, and then their scores on the two measures are correlated.

Put differently, predictive validity concerns whether a measurement taken now can predict whether something else will happen in the future. A high correlation between ex-ante predictions and ex-post actual outcomes is among the strongest evidence of validity.
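
As a concrete, hypothetical illustration of the selection-test example above, the short Python sketch below estimates a concurrent validity coefficient (test scores correlated with performance ratings collected at the same time) and a predictive validity coefficient (test scores correlated with ratings collected later). All variable names, effect sizes, and sample sizes are invented for illustration.

```python
# Hypothetical sketch: concurrent vs. predictive validity coefficients for a
# selection test. All data are simulated; names and effect sizes are invented.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 200

# Concurrent design: test and performance reviews collected at the same time
# from current employees.
test_current = rng.normal(50, 10, n)
perf_now = 0.5 * test_current + rng.normal(0, 10, n)

# Predictive design: applicants tested now, performance reviewed later.
test_applicants = rng.normal(50, 10, n)
perf_later = 0.4 * test_applicants + rng.normal(0, 12, n)

r_conc, p_conc = pearsonr(test_current, perf_now)
r_pred, p_pred = pearsonr(test_applicants, perf_later)
print(f"Concurrent validity: r = {r_conc:.2f} (p = {p_conc:.3g})")
print(f"Predictive validity: r = {r_pred:.2f} (p = {p_pred:.3g})")
```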

The validity of the design of experimental research studies is a fundamental part of the scientific method, [10] and a concern of research ethics. Without a valid design, valid scientific conclusions cannot be drawn.

Statistical conclusion validity

Statistical conclusion validity is the degree to which conclusions about the relationship among variables based on the data are correct or ‘reasonable’. This began as being solely about whether the statistical conclusion about the relationship between the variables was correct, but there has since been a movement toward accepting ‘reasonable’ conclusions drawn from quantitative, statistical, and qualitative data. [11]

Statistical conclusion validity involves ensuring the use of adequate sampling procedures, appropriate statistical tests, and reliable measurement procedures. [12] As this type of validity is concerned solely with the relationship that is found among variables, the relationship may be solely a correlation.

Internal validity

Internal validity is an inductive estimate of the degree to which conclusions about causal relationships can be made (e.g. cause and effect), based on the measures used, the research setting, and the whole research design. Good experimental techniques, in which the effect of an independent variable on a dependent variable is studied under highly controlled conditions, usually allow for higher degrees of internal validity than, for example, single-case designs.

Eight kinds of confounding variable can interfere with internal validity (i.e. with the attempt to isolate causal relationships):

  1. History, the specific events occurring between the first and second measurements in addition to the experimental variables
  2. Maturation, processes within the participants as a function of the passage of time (not specific to particular events), e.g., growing older, hungrier, more tired, and so on.
  3. Testing, the effects of taking a test upon the scores of a second testing.
  4. Instrumentation, changes in calibration of a measurement tool or changes in the observers or scorers may produce changes in the obtained measurements.
  5. Statistical regression, operating where groups have been selected on the basis of their extreme scores.
  6. Selection, biases resulting from differential selection of respondents for the comparison groups.
  7. Experimental mortality, or differential loss of respondents from the comparison groups.
  8. Selection-maturation interaction and similar interactions, e.g., in multiple-group quasi-experimental designs

External validity

External validity concerns the extent to which the (internally valid) results of a study can be held to be true for other cases, for example to different people, places or times. In other words, it is about whether findings can be validly generalized. If the same research study was conducted in those other cases, would it get the same results?

A major factor in this is whether the study sample (e.g., the research participants) is representative of the general population along relevant dimensions. Other factors jeopardizing external validity are:

  1. Reactive or interaction effect of testing, a pretest might increase the scores on a posttest
  2. Interaction effects of selection biases and the experimental variable.
  3. Reactive effects of experimental arrangements, which would preclude generalization about the effect of the experimental variable upon persons being exposed to it in non-experimental settings
  4. Multiple-treatment interference, where effects of earlier treatments are not erasable.

Ecological validity

Ecological validity is the extent to which research results can be applied to real-life situations outside of research settings. This issue is closely related to external validity but covers the question of to what degree experimental findings mirror what can be observed in the real world (ecology = the science of interaction between organism and its environment). To be ecologically valid, the methods, materials and setting of a study must approximate the real-life situation that is under investigation.

Ecological validity is partly related to the issue of experiment versus observation. Typically in science, there are two domains of research: observational (passive) and experimental (active). The purpose of experimental designs is to test causality, so that you can infer A causes B or B causes A. But sometimes, ethical and/or methodological restrictions prevent you from conducting an experiment (e.g., how does isolation influence a child's cognitive functioning?). Then you can still do research, but it is not causal; it is correlational. You can only conclude that A occurs together with B. Both techniques have their strengths and weaknesses.

Relationship to internal validity

At first glance, internal and external validity seem to contradict each other: to run a true experiment, you have to control for all interfering variables, which is why experiments are often conducted in a laboratory setting. While gaining internal validity (excluding interfering variables by keeping them constant), you lose ecological or external validity because you establish an artificial laboratory setting. With observational research, on the other hand, you cannot control for interfering variables (low internal validity), but you can measure in the natural (ecological) environment, at the place where behavior normally occurs.

The apparent contradiction of internal validity and external validity is, however, only superficial. The question of whether results from a particular study generalize to other people, places or times arises only when one follows an inductivist research strategy. If the goal of a study is to deductively test a theory, one is only concerned with factors which might undermine the rigor of the study, i.e. threats to internal validity. In other words, the relevance of external and internal validity to a research study depends on the goals of the study. Furthermore, conflating research goals with validity concerns can lead to the mutual-internal-validity problem, where theories are able to explain only phenomena in artificial laboratory settings but not the real world. [13] [14]

In psychiatry there is a particular issue with assessing the validity of the diagnostic categories themselves. In this context: [15]

  • content validity may refer to symptoms and diagnostic criteria
  • concurrent validity may be defined by various correlates or markers, and perhaps also treatment response
  • predictive validity may refer mainly to diagnostic stability over time
  • discriminant validity may involve delimitation from other disorders.

Robins and Guze proposed in 1970 what were to become influential formal criteria for establishing the validity of psychiatric diagnoses. They listed five criteria: [15]

  • distinct clinical description (including symptom profiles, demographic characteristics, and typical precipitants)
  • laboratory studies (including psychological tests, radiology and postmortem findings)
  • delimitation from other disorders (by means of exclusion criteria)
  • follow-up studies showing a characteristic course (including evidence of diagnostic stability)
  • family studies showing familial clustering

These were incorporated into the Feighner Criteria and Research Diagnostic Criteria that have since formed the basis of the DSM and ICD classification systems.

Kendler in 1980 distinguished between: [15]

  • antecedent validators (familial aggregation, premorbid personality, and precipitating factors)
  • concurrent validators (including psychological tests)
  • predictive validators (diagnostic consistency over time, rates of relapse and recovery, and response to treatment)

Nancy Andreasen (1995) listed several additional validators – molecular genetics and molecular biology, neurochemistry, neuroanatomy, neurophysiology, and cognitive neuroscience – that are all potentially capable of linking symptoms and diagnoses to their neural substrates. [15]

Kendell and Jablensky (2003) emphasized the importance of distinguishing between validity and utility, and argued that diagnostic categories defined by their syndromes should be regarded as valid only if they have been shown to be discrete entities with natural boundaries that separate them from other disorders. [15]

Kendler (2006) emphasized that to be useful, a validating criterion must be sensitive enough to validate most syndromes that are true disorders, while also being specific enough to invalidate most syndromes that are not true disorders. On this basis, he argues that a Robins and Guze criterion of "runs in the family" is inadequately specific because most human psychological and physical traits would qualify - for example, an arbitrary syndrome comprising a mixture of "height over 6 ft, red hair, and a large nose" will be found to "run in families" and be "hereditary", but this should not be considered evidence that it is a disorder. Kendler has further suggested that "essentialist" gene models of psychiatric disorders, and the hope that we will be able to validate categorical psychiatric diagnoses by "carving nature at its joints" solely as a result of gene discovery, are implausible. [16]

In the United States Federal Court System, the validity and reliability of evidence are evaluated using the Daubert Standard: see Daubert v. Merrell Dow Pharmaceuticals. Perri and Lichtenwald (2010) provide a starting point for a discussion about a wide range of reliability and validity topics in their analysis of a wrongful murder conviction. [17]


What Are Some Examples of Psychological Construct?

Examples of psychological constructs are abilities, attitudes, personality traits and emotional states. A psychological construct can refer to a person's self-esteem, athletic ability or political and religious views. Psychological constructs refer to the traits and qualities of a person that cannot be concretely identified by observation.

Unlike easily identifiable physical features, such as hair color or eye color, psychological constructs refer to a person's tendencies to behave or feel a certain way. If someone is described as an introvert, it does not necessarily mean that the person behaves like an introvert at all times. Someone who is generally introverted might enjoy the occasional social gathering. Psychological construct simply refers to a general characteristic that cannot be observed directly.

A psychological construct is assessed through psychological measurement: numbers assigned to represent and define the characteristics of a person. A construct cannot be measured like a concrete characteristic, such as weight or height. Psychological measurements are not performed with physical tools or devices; instead, they are obtained through a series of questions and evaluations, often conducted by a qualified therapist or psychiatrist. Different types of evaluations are used to determine, for example, a person's memory capacity or level of depression or anxiety.


2016 Amendment

3.04 Avoiding Harm
(a) Psychologists take reasonable steps to avoid harming their clients/patients, students, supervisees, research participants, organizational clients, and others with whom they work, and to minimize harm where it is foreseeable and unavoidable.

(b) Psychologists do not participate in, facilitate, assist, or otherwise engage in torture, defined as any act by which severe pain or suffering, whether physical or mental, is intentionally inflicted on a person, or in any other cruel, inhuman, or degrading behavior that violates 3.04(a).


“GENERAL STEPS OF TEST CONSTRUCTION”

The development of a good psychological test requires thoughtful and sound application of established principles of test construction. Before the real work of test construction begins, the test constructor makes some broad decisions about the major objectives of the test, the population for whom the test is intended, the possible conditions under which the test can be used, and its most important uses.

These preliminary decisions have far-reaching consequences. For example, a test constructor may decide to construct an intelligence test meant for tenth-grade students, broadly aimed at diagnosing the pupils' manipulative and organizational abilities. Having made these preliminary decisions, the test constructor proceeds with the following steps:

  1. Planning
  2. Writing items for the test.
  3. Preliminary administration of the test.
  4. Reliability of the final test.
  5. The validity of the final test.
  6. Preparation of norms for the final test.
  7. Preparation of manual and reproduction of the test.

PLANNING:

The first step in test construction is careful planning. At this stage, the test constructor addresses the following issues:

Definition of the construct to be measured by the proposed test.

The author has to spell out the broad and specific objectives of the test in clear terms: who the prospective users are (for example, vocational counselors, clinical psychologists, educationalists) and the purpose or purposes for which they will use the test.

What will be the appropriate age range, educational level and cultural background of the examinees, who would find it desirable to take the test?

What will be the content of the test? Is this content coverage different from that of existing tests developed for the same or similar purposes? Is it culture-specific?

The author has to decide on the nature of the items, that is, whether the test will use multiple-choice, true-false, inventive-response, or some other item format.

What will be the type of instructions, i.e., written or delivered orally?

Will the test be administered individually or in groups? Will the test be designed or modified for computer administration? Detailed arrangements for the preliminary and final administrations should also be considered.

What special training or qualifications will be necessary for administering or interpreting the test?

The test constructor must also decide on the probable length of the test and the time allowed for its completion.

What will be the method of sampling, i.e., random or selective?

Is there any potential harm to the examinees resulting from the administration of this test? Are there any safeguards built into the recommended testing procedure to prevent any sort of harm to anyone involved in the use of this test?

How will the scores be interpreted? Will the scores of an examinee be compared to others in the criteria group or will they be used to assess mastery of a specific content area? To answer this question, the author has to decide whether the proposed test will be criterion-referenced or norm-referenced.

Planning also includes deciding on the total number of test copies to be reproduced and on the preparation of the manual.

An item is a single question or task that is usually not broken down into any smaller units (Bean, 1953, p. 15).

EXAMPLE: An arithmetical mean may be an item, a manipulative task may be an item, a mechanical puzzle may be an item and likewise sleeplessness may also be an item of a test.

Items in a test are like atoms in matter; that is, they are indivisible.

The second step in test construction is the preparation of the items of the test. Item writing starts from the planning done earlier. If the test constructor decides to prepare an essay test, then essay items are written down.

However, if he decides to construct an objective test, he writes down the objective items such as the alternative response item, matching item, multiple-choice item, completion item, short answer item, a pictorial form of item, etc. Depending upon the purpose, he decides to write any of these objective types of items.

PREREQUISITES FOR ITEM WRITING:

Item writing is essentially a creative art. There are no set rules to guide and guarantee the writing of good items. A lot depends upon the item writer’s intuition, imagination, experience, practice, and ingenuity. However, there are some essential prerequisites that must be met if the item writer wants to write good and appropriate items. These requirements are briefly discussed as follows

The item writer must have thorough knowledge and complete mastery of the subject matter. In other words, he must be fully acquainted with all the facts, principles, misconceptions, and fallacies in the particular field so that he may be able to write good and appropriate items.

The item writer must be fully aware of the persons for whom the test is meant. He must also be aware of their intelligence level so that he may adjust the difficulty level of the items to their ability level. He must also be able to avoid irrelevant clues to the correct responses.

The item writer must be familiar with different types of items along with their advantages and disadvantages. He must also be aware of the characteristics of good items and the common probable errors in writing items.

The item writer must have a large vocabulary. He must know the different meanings of a word so that confusion in writing the items may be avoided. He must be able to convey the meaning of the items in the simplest possible language.

After the items have been written, they must be submitted to a group of subject experts for criticism and suggestions, and then duly modified in light of that feedback.

The item writer must also cultivate a rich source of ideas for items, because ideas are not produced in the mind automatically but require certain stimuli. Common sources of such ideas are textbooks, journals, discussions, interview questions, course outlines, and other instructional materials.

CHARACTERISTICS OF A GOOD ITEM:

An item must have the following characteristics

An item should be phrased in such a manner that there is no ambiguity regarding its meaning for both the item writer as well as the examinees who take the test.

The item should not be too easy or too difficult.

It must have discriminating power, that is, it must clearly distinguish between those who possess the trait and those who do not.

It should not be concerned with the trivial aspects of the subject matter, that is, it must only measure the significant aspects of knowledge or understanding.

As far as possible, it should not encourage guesswork by the subjects.

It should not present any difficulty in reading.

It should not be such that its meaning is dependent upon another item and/or it can be answered by referring to another item.

GENERAL GUIDELINES FOR ITEM WRITING:

Writing items is a matter of precision. It is perhaps more like computer programming than writing prose. The task of the item writer is to focus the attention of a large group of examinees, varying in background, experience, environmental exposure, and ability level, on a single idea. Such a situation requires extreme care in the choice of words. The item writer must keep in view some general guidelines that are essential for writing good items. These are listed below.

CLARITY OF THE ITEM:

The clarity in writing test items is one of the main requirements for an item to be considered good. Items must not be written as “verbal puzzles”. They must be able to discriminate between those who are competent and those who are not. This is possible only when the items have been written in simple and clear language. The items must not be a test of the examinee’s ability to understand the language.

The item writer should be particularly cautious in writing objective items, because each such item provides a more or less isolated bit of knowledge and the problem of clarity is therefore more serious. If an objective item is vague, it will be difficult to understand and its validity will be adversely affected. Vagueness in items can arise for several reasons, such as unclear thinking or inexperience on the part of the item writer.

NON-FUNCTIONAL WORDS SHOULD BE AVOIDED:

Non-functional words must not be included in the items as they tend to lower the validity of the item. Non-functional words refer to those words which make no contribution towards the appropriate and correct choice of a response by the examinees. Such words are often included by the item writer in an attempt to make the correct answer less obvious or to provide a good distractor.

AVOID IRRELEVANT ACCURACIES:

The item writer must make sure that irrelevant accuracies are not unintentionally incorporated into the items. Such irrelevant accuracies reflect poor critical thinking on the part of the item writer, and they may also lead examinees to conclude that the statement is true.

DIFFICULTY LEVEL SHOULD BE ADAPTABLE:

The item must not be too easy or too difficult for the examinees. The level of difficulty of the item should be adapted to the level of understanding of the examinees. Although exact decisions regarding the difficulty value of an item can be made only after statistical item analysis, an experienced item writer is capable of controlling the difficulty value beforehand and making it suitable for the examinees.

In certain objective item formats, such as multiple-choice and matching items, it is easy to increase or decrease the difficulty of an item. In general, when the response alternatives are made homogeneous, the difficulty of the item increases; when the alternatives other than the correct one are heterogeneous, the examinee is likely to pick out the correct answer quickly, and the level of difficulty decreases.

The item writer must keep in view the characteristics of both the ideal examinees and the typical examinees. If he keeps only the ideal examinees (who are fewer in number) in view and ignores the typical examinees, the test items are likely to be unreasonably difficult.

STEREOTYPED WORDS SHOULD BE AVOIDED:

The use of stereotyped words, either in the stem or in the alternative responses, must be avoided because they help rote learners guess the correct answer. Moreover, stereotyped words fail to discriminate between those who really know and understand the subject and those who do not, and thus do not provide an adequate and discriminating measure. The most obvious way of getting rid of such words is to paraphrase them so that only those who really know the answer can pick up the meaning.

IRRELEVANT CLUES MUST BE AVOIDED:

Irrelevant clues must be avoided. They can appear in several forms, such as clang associations, verbal associations, the length of the correct answer, a single dissimilar foil among homogeneous foils, the correct answer always appearing in the same position, etc. In general, such clues decrease the difficulty level of the item because they provide an easy route to the correct answer.

The common observation is that the examinees who do not know the correct answer, choose any of these irrelevant clues and answer on that basis. The item writer must, therefore, take special care to avoid such irrelevant clues. Specific determiners like never, always, all, none must also be avoided because they are also irrelevant clues to the correct answer, especially in the two-alternative items.

INTERLOCKING ITEMS MUST BE AVOIDED:

Interlocking items must be avoided. Interlocking items, also known as interdependent items, are items that can be answered only by referring to other items. In other words, when responding correctly to an item depends on responding correctly to another item, the item is an interlocking (interdependent) item. For example:

  • Sociometry is a technique used to study the effect structure of groups. True/false
  • It is a kind of projective technique. True/false
  • It was developed by Moreno et al. True/false

The above examples illustrate interlocking items. Items 2 and 3 can be answered only when the examinee knows the correct answer to item 1. Such items should be avoided because they do not give every examinee an equal chance to answer.

NUMBER OF ITEMS:

The item writer is also frequently faced with the problem of determining the exact number of items. As a matter of fact, there is no hard and fast rule regarding this. Previous studies have shown that the number of items is usually linked with the desired level of the reliability coefficient of the test: usually 25-30 dichotomous items are needed to reach a reliability coefficient as high as 0.80, whereas 15-20 items are needed to reach the same level of reliability when multipoint items are used (see the sketch below).

These are the minimum number of items which should be retained after item analysis. An item writer should always write almost TWICE the number of items to be retained finally. Thus, if he wants 30 items in the final test, he should write 60 items.
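
The rough link between test length and reliability described above can be illustrated with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened (or shortened) by a factor k from its current reliability. This is a minimal sketch; the starting reliability and item counts are illustrative assumptions, not prescriptions.

```python
# Spearman-Brown prophecy formula: predicted reliability of a test whose
# length is changed by a factor k, given its current reliability r.
def spearman_brown(r: float, k: float) -> float:
    return k * r / (1 + (k - 1) * r)

# Factor by which a test must be lengthened to reach a target reliability.
def length_factor(r: float, r_target: float) -> float:
    return (r_target * (1 - r)) / (r * (1 - r_target))

# Illustrative numbers only: a 15-item scale with reliability 0.70.
print(round(spearman_brown(0.70, 2.0), 2))   # reliability if doubled to 30 items
k = length_factor(0.70, 0.80)
print(f"Roughly {k:.1f}x the length, i.e. about {round(15 * k)} items, for 0.80")
```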

In the speed test, the number of items to be written is entirely dependent upon the intuitive judgment of the test constructor. On the basis of his previous experiences, he decides that a certain number of items can be answered with the given time limit.

ARRANGEMENT OF ITEMS:
After the items have been written down, they are reviewed by some experts or by the item writer himself and then arranged in the order in which they are to appear in the final test. Generally, items are arranged in increasing order of difficulty, and those having the same form (say, alternative-response, matching, multiple-choice, etc.) and dealing with the same content are placed together.

Before proceeding to the administration of the test, it should be reviewed by at least three experts. When the items have been written and modified in the light of the suggestions and criticisms given by the experts, the test is said to be ready for experimental try-out.

THE EXPERIMENTAL TRYOUT / PRE-TRY-OUT:

The first administration of the test is called the EXPERIMENTAL TRY-OUT or PRE-TRY-OUT. The sample size for the experimental try-out should be about 100.

The purpose of the experimental try-out is manifold. According to Conrad (1951), the main purposes of the experimental tryout of any psychological or educational test are as follows:

Finding out the major weaknesses, omissions, ambiguities, and inadequacies of the items.

Experimental tryout helps in determining the difficulty level of each item, which in turn helps in their proper distribution in the final form.

Helps in determining a reasonable time limit for the test.

Determining the appropriate length of the tests. In other words, it helps in determining the number of items to be included in the final form.

Identifying any weaknesses and vagueness in directions or instructions of the test.

PROPER TRYOUT:

The second preliminary administration is called the PROPER TRYOUT. At this stage, the test is administered to a sample of about 400 examinees, who must be similar to those for whom the test is intended.

The proper try-out is carried out for the item analysis. ITEM ANALYSIS is the technique of selecting discriminating items for the final composition of the test. It aims at obtaining three kinds of information regarding the items. That is

Item difficulty is the proportion or percentage of the examinees or individuals who answer the item correctly.

The discriminatory power of the items refers to the extent to which any given item discriminates successfully between those who possess the trait in larger amounts and those who possess the same trait in the least amount.

Identification of the non-functional distractors, i.e., incorrect alternatives that few or no examinees choose.
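
To make these three kinds of item-analysis information concrete, the sketch below computes an item-difficulty index (proportion correct) and a simple upper-lower discrimination index from a hypothetical matrix of dichotomous responses. The simulated data and the 27% upper/lower cut-off are assumptions chosen only for illustration.

```python
# Hypothetical item analysis for 400 examinees x 10 dichotomous items.
# Difficulty = proportion correct; discrimination = upper 27% minus lower 27%
# proportion correct. Data and cut-offs are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(0, 1, 400)
item_difficulty = np.linspace(-1.5, 1.5, 10)
responses = (ability[:, None] + rng.normal(0, 1, (400, 10))
             > item_difficulty).astype(int)

p = responses.mean(axis=0)                      # difficulty index per item

totals = responses.sum(axis=1)
cut = int(0.27 * len(totals))
order = np.argsort(totals)
low, high = order[:cut], order[-cut:]
d = responses[high].mean(axis=0) - responses[low].mean(axis=0)   # discrimination

for i, (pi, di) in enumerate(zip(p, d), start=1):
    print(f"Item {i:2d}: difficulty p = {pi:.2f}, discrimination D = {di:.2f}")
```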

FINAL TRYOUT:

The third preliminary administration is called the Final tryout. The sample for final administration should be at least 100. At this stage, the items are selected after item analysis and constitute the test in the final form. It is carried out to determine the minor defects that may not have been detected by the first two preliminary administrations.

The final administration indicates how effective the test will be when it would be administered on the sample for which it is really intended. Thus, the preliminary administration would be a kind of “DRESS REHEARSAL” providing a sort of final check on the procedure of administration of the test and its time limit.

After final tryout, expert opinion should be considered again.

On the basis of the experimental or empirical tryout, the test is finally composed of the selected items, and this final form is administered once more to a fresh sample. At this stage, we check the reliability of the test, which indicates the consistency of its scores.

In simple words, it is defined as the degree to which a measurement is consistent. If findings from the research are replicated consistently then they are reliable.

Reliability also refers to the self-correlation of a test. A correlation coefficient can be used to assess the degree of reliability; if a test is reliable, it should show a high positive correlation.
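
A minimal sketch of reliability as self-correlation, under the assumption that the same test is administered twice to the same (simulated) examinees: the correlation between the two sets of scores serves as a test-retest reliability estimate. All numbers are invented.

```python
# Minimal sketch: reliability as the correlation between two administrations
# of the same test. True scores and error variances are invented.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
true_score = rng.normal(100, 15, 300)
time1 = true_score + rng.normal(0, 5, 300)   # administration 1 = true score + error
time2 = true_score + rng.normal(0, 5, 300)   # administration 2, independent error

r, _ = pearsonr(time1, time2)
print(f"Test-retest reliability estimate: r = {r:.2f}")
```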


Psychology: Construct Validity in Psychological Measurement

Title: Construct Validity in Psychological Measurement
Discipline(s) or Field(s): Psychology
Authors: Carmen Wilson, Bill Cerbin, Melanie Cary, Rob Dixon, University of Wisconsin – La Crosse
Submission Date: January 15, 2007

Executive Summary

The goal of the lesson is to develop students’ understanding of construct validity as measured by their ability to: 1) explain methods used to determine construct validity for psychological measures and 2) design a study to determine the construct validity of a given measure.

Prior to the lesson. In the two class days prior to the lesson, the instructor presented information on content, criterion, and construct validity. Each type of validity was presented in terms of a question it answered and how it might be assessed. Content validity answers the question, “Do the items represent the domain of interest?” It can be assessed by having an expert in the topic review the test. Criterion validity answers the question, “Do scores on the test predict some non-test behavior?” It can be assessed by correlating scores on the test with some other measure of the behavior (e.g. behavioral observation). Construct validity answers the question, “Does the test measure what it claims to measure?” The lecture highlights several processes to assess construct validity. The answer to the construct validity question is dependent upon what is known about the construct being measured. For example, if the theory about the construct suggests that two groups of people should have different levels of a construct, and the test actually measures the construct, then the groups’ scores should be different.
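
To make the group-difference logic at the end of the preceding paragraph concrete, here is a minimal, hypothetical sketch: if theory says a clinical group should score higher than a non-clinical group on the construct, a valid measure should show that difference. Group means, spreads, and sample sizes are invented.

```python
# Hypothetical known-groups check: a valid depression measure should separate
# a clinical group from a non-clinical group in the theoretically expected
# direction. All values are invented.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
clinical = rng.normal(14, 4, 60)      # sum scores on the hypothetical 5-item measure
nonclinical = rng.normal(9, 4, 60)

t_stat, p_value = ttest_ind(clinical, nonclinical)
print(f"Known-groups comparison: t = {t_stat:.2f}, p = {p_value:.3g}")
# A clear difference in the predicted direction is one piece of construct
# validity evidence; its absence would count against the measure.
```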

We evaluated three versions of the lesson across three semesters. In Version 1 (lesson, no lecture – A), students developed a 5-item measure of depression and then designed three research studies to evaluate the validity of their measure prior to receiving any instruction about construct validity. In Version 2 (lesson, no lecture – B) we made minor modifications, but the lesson essentially remained the same. In Version 3 (lesson after lecture), we made significant modifications. The instructor lectured about construct validity first, and in a subsequent class, students analyzed three validity studies and then designed a validity study based on information provided by the instructor.

In Versions 1 and 2, students became bogged down in details of their proposed research studies and missed the more important goal of predicting results that would support the validity of their measure. The team decided to restructure the lesson so that in Version 3 they first heard a lecture, and then read summaries of real validity studies and predicted the results of those studies given the tests were valid. In the last part of the lesson students designed a study to determine if a given test was valid and predicted the results of that study. Interestingly, students who participated in Version 1 of the lesson (lesson, no lecture – A) generally performed better than students who participated in Versions 2 and 3.


How to Conduct a Psychology Experiment


Conducting your first psychology experiment can be a long, complicated, and sometimes intimidating process. It can be especially confusing if you are not quite sure where to begin or which steps to take.

Like other sciences, psychology utilizes the scientific method and bases conclusions upon empirical evidence. When conducting an experiment, it is important to follow the seven basic steps of the scientific method:  

  1. Ask a question or find a research problem to solve.
  2. Determine what you will test to answer this question.
  3. Review current knowledge on the subject.
  4. Design an experiment.
  5. Perform the experiment.
  6. Analyze results using statistical methods.
  7. Draw your conclusion and share the results with the scientific community.

Empirical Evaluation of Construct Validity

Campbell and Fiske's (1959) multitrait-multimethod (MTMM) matrix methodology presented a logic for evaluating construct validity through simultaneous evaluation of convergent and discriminant validity and of the contribution of method variance to observed relationships. Wothke (1995) nicely summarized the central idea of MTMM matrix methodology:

The crossed-measurement design in the MTMM matrix derives from a simple rationale: Traits are universal, manifest over a variety of situations, and detectable with a variety of methods. Most importantly, the magnitude of a trait should not change just because different assessment methods are used (p. 125)

Traits are latent variables, inferred constructs. The term trait, as used here, is not limited to enduring characteristics; it applies as well to more transitory phenomena such as moods and emotions, and to all other individual-differences constructs, e.g., attitudes and psychophysical measurements. Methods, for Campbell and Fiske, are the procedures through which responses are obtained: the operationalization of the assessment procedures that produce the responses, the quantitative summary of which is the measure itself (Wothke 1995).

As Campbell & Fiske (1959) emphasized, measurement methods (method variance) are sources of irrelevant, though reliable, variance. When the same method is used across measures, the presence of reliable method variance can lead to an overestimation of the magnitude of relations among constructs. This can lead to overestimating convergent validity and underestimating discriminant validity. This is why multiple assessment methods are critical in the development of construct validity. Their distinction of validity (the correlation between dissimilar measures of a characteristic) from reliability (the correlation between similar measures of a characteristic) hinged on the differences between construct assessment methods.

Campbell & Fiske's (1959) observation remains important today: much clinical psychology research relies on the same method for both predictor and criterion measurement, typically self-report questionnaire or interview. Their call for attention to method variance is as relevant today as it was 50 years ago: examination of constructs with different methods is a crucial part of the construct validation process. Of course, the degree to which two methods are independent is not always clear. For example, how different are the methods of interview and questionnaire? Both rely on self-report, so are they independent sources of information? Perhaps not, but they do differ operationally. For example, questionnaire responses are often anonymous, whereas interview responses require disclosure to another. Questionnaire responses are based on the perceptions of the respondent, whereas interview ratings are based, in part, on the perceptions of the interviewer. A conceptually based definition of "method variance" has not been easy to achieve, as Sechrest et al.'s (2000) analysis of this issue demonstrates. Certainly, method differences lie on a continuum where, for example, self-report and interview are closer to each other than are self-report and informant report or behavioral observation.

The guidance provided for evaluating construct validity in 1959 was qualitative: it involved the rule-based examination of patterns of correlations against the expectations of convergent and discriminant validity (Campbell & Fiske 1959). Developments in psychometric theory, multivariate statistics, and the analysis of latent traits in the decades since the Campbell & Fiske (1959) paper have made available a number of quantitative methods for modeling convergent and discriminant validity across different assessment methods.
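
Before turning to the model-based approaches discussed next, the sketch below makes the classic rule-based examination concrete: it simulates scores for three traits each measured by three methods, assembles the MTMM correlation matrix, and compares the average monotrait-heteromethod (convergent), heterotrait-monomethod, and heterotrait-heteromethod correlations. The data-generating values are assumptions chosen only so that the expected Campbell-Fiske pattern appears.

```python
# Sketch: simulate a 3-trait x 3-method MTMM design and inspect the
# correlation pattern the Campbell & Fiske (1959) rules look for.
# All generating values (loadings, sample size) are invented.
import numpy as np

rng = np.random.default_rng(4)
n, n_traits, n_methods = 500, 3, 3

T = rng.normal(0, 1, (n, n_traits))    # latent trait scores
M = rng.normal(0, 1, (n, n_methods))   # latent method effects

Y = np.empty((n, n_traits * n_methods))
labels = []
for t in range(n_traits):
    for m in range(n_methods):
        Y[:, t * n_methods + m] = (0.7 * T[:, t] + 0.4 * M[:, m]
                                   + 0.6 * rng.normal(0, 1, n))
        labels.append((t, m))

R = np.corrcoef(Y, rowvar=False)       # the MTMM correlation matrix

convergent, same_method, different_all = [], [], []
for i, (t1, m1) in enumerate(labels):
    for j in range(i + 1, len(labels)):
        t2, m2 = labels[j]
        if t1 == t2:                   # monotrait-heteromethod (validity diagonal)
            convergent.append(R[i, j])
        elif m1 == m2:                 # heterotrait-monomethod (method variance)
            same_method.append(R[i, j])
        else:                          # heterotrait-heteromethod
            different_all.append(R[i, j])

print("mean convergent (monotrait-heteromethod):", round(np.mean(convergent), 2))
print("mean heterotrait-monomethod:             ", round(np.mean(same_method), 2))
print("mean heterotrait-heteromethod:           ", round(np.mean(different_all), 2))
# Campbell & Fiske expect the first value to exceed the other two.
```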

Bryant (2000) provides a particularly accessible description of using ANOVA (and a nonparametric variant) and confirmatory factor analysis (CFA) in the analysis of MTMM matrices. A major advantage of CFA in construct validity research is the possibility of directly comparing alternative models of relationships among constructs, a critical component of theory testing (see Whitely 1983). Covariance component analysis of the MTMM matrix has also been developed (Wothke 1995). Both covariance component analysis and CFA are variants of structural equation models (SEM). With these advances eye-ball examinations of MTMM matrices are no longer sufficient for the evaluation of the trait validity of a measure in modern assessment research.

Perhaps the first CFA approach was one that followed very straightforwardly from Campbell & Fiske (1959): it involved specifying a CFA model in which responses to any item can be understood as reflecting additive effects of trait variance, method variance, and measurement error (Marsh & Grayson 1995; Reichardt & Coleman 1995; Widaman 1985). So if traits A, B, and C are each measured with methods X, Y, and Z, there are six latent variables: three for the traits and three for the methods. Thus, if indicator i reflects method X for evaluating trait A, that part of the variance of i that is shared with other indicators of trait A is assigned to the trait A factor, that part of the variance of i that is shared with indicators of other constructs measured by method X is assigned to the method X factor, and the remainder is assigned to an error term (Eid et al. 2003; Kenny & Kashy 1992). The association of each type of factor with other measures can be examined, so, for example, one can test explicitly the role of a certain trait or a certain type of method variance on responses to a criterion measure. This approach can be expanded to include interactions between traits and methods (Campbell & O'Connell 1967, 1982), and therefore test multiplicative models (Browne 1984; Cudeck 1988).

Although the potential advantages of this approach are obvious, it has generally not proven feasible. As noted by Kenny & Kashy (1992), this approach often results in modeling more factors than there is information to identify them; the result, often, is a statistical failure to converge on a factor solution. That reality has led some researchers to turn away from multivariate statistical methods to evaluate MTMM results. In recent years, however, two alternative CFA modeling approaches have been developed that appear to work well.

The first is referred to as the "correlated uniquenesses" approach (Marsh & Grayson 1995). In this approach, one does not model method factors as in the approach previously described. Instead, one identifies the presence of method variance by allowing the residual variances of trait indicators that share the same method to correlate, after accounting for trait variation and covariation. To the degree there are substantial correlations between these residual terms, method variance is considered present and is accounted for statistically (although other forms of reliable specificity may be represented in those correlations as well). As a result, the latent variables reflecting trait variation do not include that method variance: one can test the relation between method-free trait scores and other variables of interest. And, since this approach models only trait factors, it avoids the over-factoring problem of the earlier approach. There is, however, an important limitation to the correlated uniquenesses approach. Without a representation of method variance as a factor, one cannot examine the association of method variance with other constructs, which may be important to do (Cronbach 1995).
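
As a sketch of how the correlated-uniquenesses specification just described might be written down, the example below uses lavaan-style model syntax and assumes the third-party Python package semopy for estimation; the variable names (t1_m1 for trait 1 measured by method 1, and so on), loadings, and simulated data are placeholders, and the same model string could equally be fit in other SEM software.

```python
# Sketch of a "correlated uniquenesses" MTMM specification.
# Assumes the third-party semopy package; data, names, and loadings are
# placeholders generated so that the model has something sensible to fit.
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(5)
n = 500
T = rng.normal(0, 1, (n, 3))   # three latent traits
M = rng.normal(0, 1, (n, 3))   # three method effects

data = {}
for t in range(3):
    for m in range(3):
        data[f"t{t+1}_m{m+1}"] = (0.7 * T[:, t] + 0.4 * M[:, m]
                                  + 0.6 * rng.normal(0, 1, n))
df = pd.DataFrame(data)

# Trait factors only; residuals of indicators sharing a method are allowed
# to covary (the "correlated uniquenesses").
model_desc = """
T1 =~ t1_m1 + t1_m2 + t1_m3
T2 =~ t2_m1 + t2_m2 + t2_m3
T3 =~ t3_m1 + t3_m2 + t3_m3
t1_m1 ~~ t2_m1
t1_m1 ~~ t3_m1
t2_m1 ~~ t3_m1
t1_m2 ~~ t2_m2
t1_m2 ~~ t3_m2
t2_m2 ~~ t3_m2
t1_m3 ~~ t2_m3
t1_m3 ~~ t3_m3
t2_m3 ~~ t3_m3
"""

model = semopy.Model(model_desc)
model.fit(df)
print(model.inspect())   # loadings, trait covariances, residual covariances
```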

The second alternative approach provides a way to model some method variance while avoiding the over-factoring problem (Eid et al. 2003). One constructs latent variables to represent all trait factors and all but one method factor. Since there are fewer factors than in the original approach, the resulting solution is mathematically identified: one has not over-factored. The idea is that one method is chosen as the baseline method and is not represented by a latent variable. One evaluates other methods for how they influence results compared to the baseline method. Suppose, for example, that one had interview and collateral report data for a series of traits. One might specify the interview method as the baseline method, so an interview method factor is not modeled as separate from trait variance, and trait scores are really trait-as-measured-by-interview scores. One then models a method factor for collateral report. If the collateral report method leads to higher estimates of trait presence than does the interview, one would find that the collateral report method factor correlated positively with the trait-as-measured-by-interview. That would imply that collaterals report higher levels of the trait than do individuals during interviews.

Interestingly, one can assess whether this process works differently for different traits. Perhaps collaterals perceive higher levels of some traits than are reported by interview (unflattering traits?) and lower levels of other traits as reported by interview (flattering traits?). This possibility can be examined empirically using this method. In this way, the Eid et al. (2003) approach makes it possible to identify the contribution of method to measure scores. The limitation of this method, of course, is that the choice of "baseline method" influences the results and may be arbitrary (Eid et al. 2003).

Most recently, Courvoisier et al. (2008) have combined this approach with latent state-trait analysis; the latter method allows one to estimate variance due to stable traits, occasion-specific states, and error (Steyer et al. 1999). The result is a single analytic method to estimate variance due to trait, method, state, and error. Among the possibilities offered by this approach is that one can investigate the degree to which method effects are stable or variable over time.

We wish to emphasize three points concerning these advances in methods for the empirical evaluation of construct validity. First, the concern that MTMM data could not successfully be analyzed using CFA/SEM approaches is no longer correct. There are now analytic tools that have proven successful (Eid et al. 2003). Second, statistical tools are available that enable one to quantitatively estimate multiple sources of variance that are important to the construct validation enterprise (Eid et al. 2003; Marsh & Grayson 1995). One need not guess at the degree to which method variance is present, or the degree to which it is common across traits, or the degree to which it is stable: one can investigate these sources of variance directly. Third, these analytic techniques are increasingly accessible to researchers (see Kline 2005, for a useful introduction to SEM). Clinical researchers have a validity concern beyond successful demonstration of convergent and discriminant validity. Success at the level of MTMM validity does not assure the measured traits have utility. Typically, one also needs to investigate whether the traits enhance prediction of some criterion of clinical importance.

To this end, clinical researchers can rely on a classic contribution by Hammond et al. (1986). They offered a creative, integrative analytic approach for combining the results of MTMM designs with the evaluation of differential prediction of external criteria. In the best tradition of applying basic science advances to practical prediction, their design integrated the convergent/discriminant validity perspective of Campbell & Fiske (1959) with Brunswik's (1952, 1956) emphasis on representative design in research, which in part concerned the need to conduct investigations that yield findings one can apply to practical problems. They presented the concept of a performance validity matrix, which adds criterion variables for each trait to the MTMM design. By adding clinical outcome variables to one's MTMM design, one can provide evidence of convergent validity, discriminant validity, and differential clinical prediction in a single study.

Such analyses are critical clinically, because this sophisticated treatment of validity is likely to improve the usefulness of measures for clinicians. For many measures, validation research that considers practical prediction improves measures’ “three Ps”: predicting important criteria, prescribing treatments, and understanding the processes underlying personality and psychopathology (Youngstrom 2008), thereby improving clinical assessment. Such practical efforts in assessment must rely on observed scores, confounded as they may be with method variance. Construct validity research provides the clinician with an appreciation of the many factors entering into an observed score and, thus, of the mix of construct-relevant variance, reliable construct-irrelevant variance, and method variance in any score (see Richters 1992).


Unfortunately, evidence of validity is lacking in many areas of psychological research. As an example, depression is assessed in more than 1,000 research studies per year and is used as an outcome, predictor, moderator, or covariate across numerous disciplines (e.g., psychology, psychiatry, epidemiology). More than 280 different scales for assessing depression severity have been developed and used in research in the last century. Commonly used depression scales feature more than 50 different symptoms, and content overlap among scales is low. For example, one third of the symptoms in the most cited scale, the 20-item Center for Epidemiologic Studies Depression scale (Radloff, 1977; approximately 41,300 citations), do not appear in any of the other most commonly used instruments. The result is that different scales can lead to different conclusions, which has been documented many times in clinical trials. For instance, a recent clinical trial queried patients on four different scales to examine whether full-body hyperthermia was an efficacious depression treatment. The hyperthermia group showed significant improvements over placebo on only one of the four scales. Unfortunately, the authors reported the three null findings in the supplementary materials without mention in the paper. This is an important lesson: Although comparing results of multiple measures offers more robust insights, it also opens the door to p-hacking, fishing, and other questionable research practices.
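
The low content overlap mentioned above can be quantified in a simple way, for example as the Jaccard index of two scales' symptom sets. The sketch below uses abbreviated, hypothetical symptom lists as stand-ins; it does not reproduce the content of any actual instrument.

```python
# Hypothetical sketch: content overlap between two depression scales as the
# Jaccard index of their symptom sets. Symptom lists are invented stand-ins,
# not the contents of any real instrument.
scale_a = {"sad mood", "anhedonia", "sleep problems", "fatigue", "guilt",
           "concentration problems", "appetite change"}
scale_b = {"sad mood", "anhedonia", "crying", "irritability", "hopelessness",
           "loneliness", "appetite change"}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

print(f"Jaccard overlap: {jaccard(scale_a, scale_b):.2f}")
# Low overlap means two 'depression' scales partly measure different symptom
# pools, which can push studies toward different conclusions.
```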

There is more. Major depression had one of the lowest interrater reliabilities of all mental disorders assessed in the DSM-5 field trials, with a coefficient of 0.28, and depression scales in general are often modeled without taking into account their multidimensionality and lack of temporal measurement invariance. Similar to the case of the Orbiter, these theoretical and statistical measurement issues can have drastic consequences, biasing conclusions of research studies and introducing error into inferences — inferences that influence the real-world behavior of scientists and resource allocation in science.

Depression is not an isolated example of poor measurement practices in psychological research. Reviews within specific domains cite similar issues (e.g., emotion; Weidman, Steckler, & Tracy, 2016), and our recent work suggests that poor practices span topics and subdisciplines. In a systematic review of a representative sample of 35 empirical articles published in the Journal of Personality and Social Psychology in 2014, we identified 433 scales aimed at measuring psychological constructs. Of these, about half contained no citation to any validation study. For many scales, Cronbach’s alpha was the sole psychometric property reported, and for one in five scales, no psychometric information whatsoever was reported. Simplified, evidence of validity, in practice, forms a hierarchy: (1) none, (2) alpha only, (3) a citation, presumably to another paper that contains validity evidence, and (4) more evidence, which takes a variety of forms. Further, we saw signs of researcher degrees of freedom, similar to the depression literature: Authors used multiple scales to measure one construct without justifying their use of a particular scale. We also noted that scale modification (adding or removing items) was common, as was combining multiple scales into a single index without a transparent rationale.

Poor Measurement Complicates Replications

Taking the results of these studies together, it is difficult to ignore the connection between poor measurement practices and current discussions about replicability. For example, Monin, Sawyer, and Marquez (2008) used a variety of scales in their study, which were also administered in the replication study as a part of the “Reproducibility Project: Psychology.” However, the replication study identified different factor solutions in the primary measures, indicating that different items formed different factors. How are we to interpret the result of this study? Is it a theory failure, a replication failure, or a measurement failure? Again, these questions hold broadly. For depression, for instance, the factor structure of a given scale often differs across samples, across time in the same sample, and even in large subsets of the same sample.

If a scale lacks validity or measures different constructs across samples, there is little to gain from conducting replication studies. We must take a step back and determine how to define and measure the variables of interest in the first place. In such cases, what we need are validity studies, not replication studies. Our work to promote replicability in psychology will be stymied unless we also improve our measurement practices. Making replications mainstream must go hand in hand with making measurement theory mainstream.

Ways Forward

Norms are changing in psychology, and recent articles and publisher policies push psychological scientists toward more rigorous and open practices. However, contributions focusing on the connection between measurement and replicability remain scant. We therefore close with some nontechnical suggestions that we hope will be relevant to researchers from all subdisciplines of psychology.

Clearly communicate the construct you aim to measure, how you define the construct, how you measure it, and the source of the measure.

Provide a rationale when using a specific scale over others or when modifying a scale. If possible, use multiple measures to show either that a finding is robust across scales or that it is sensitive to the particular scale used.

Preregister your study. This counters selective reporting of favorable outcomes, exploratory modifications of measures to obtain desired results, and overinterpretation of inconclusive findings across measures.

Consider the measures you use in your research. What category of validity evidence (none, alpha, citation, or more) would characterize them? If your measures fall into the first two categories, consider conducting a validation study (examples are provided below). If you cannot do so, acknowledge measurement as a limitation of your research.

Stop using Cronbach’s alpha as the sole source of validity evidence. Alpha’s considerable limitations have been acknowledged and clearly described many times (e.g., Sijtsma, 2009), and alpha cannot stand alone in describing a scale’s validity (see the sketch following these recommendations).

Take the above points into consideration when reviewing manuscripts for journals or when serving as an editor. Ensure that authors report the necessary measurement information so that readers can evaluate the measures and replicate them in follow-up studies, and help change the measurement standards of the journals you work for.
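As a concrete illustration of the point about Cronbach’s alpha above, the coefficient is easy to compute from an item-score matrix, but a high value does not establish that a set of items measures a single construct, let alone the intended one. A minimal sketch with simulated responses, not data from any published scale:

```python
# Minimal sketch: Cronbach's alpha computed from an item-score matrix.
# Responses are simulated for illustration only.
import numpy as np

def cronbach_alpha(items):
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
# Two uncorrelated traits, three items each: the six-item "scale" is clearly
# multidimensional, yet alpha still lands near the conventional .70-.80 range.
trait_a, trait_b = rng.normal(size=(2, 500, 1))
items = np.hstack([
    trait_a + rng.normal(scale=0.3, size=(500, 3)),
    trait_b + rng.normal(scale=0.3, size=(500, 3)),
])
print(f"alpha = {cronbach_alpha(items):.2f}")
# A seemingly acceptable alpha says nothing about whether the items measure
# one construct, which is why alpha alone cannot serve as validity evidence.
```

Reporting a dimensionality check or a model-based reliability coefficient alongside (or instead of) alpha already tells readers considerably more.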

We recognize that measurement research is difficult. Measurement requires both theoretical and methodological expertise. Good psychometric practice cannot make up for a poorly defined construct, and a well-defined construct cannot make up for poor psychometrics. For those reasons, it is hard to come up with a few quick fixes to improve measurement. Instead, we recognize that many psychologists may not have had training in validity theory or psychometrics and provide a list of resources for those interested in learning more. These include a collection of seminal materials on measurement and validation, as well as some accessible examples.

In closing, we want to share a screenshot of the Wikipedia article on Psychological Measurement (see Figure 1), which redirects to the page for Psychological Evaluation.

We couldn’t agree more: Measurement deserves more attention.

Figure 1. A screenshot of the Wikipedia article on Psychological Measurement, which redirects to the page for Psychological Evaluation.

The authors would like to thank Jolynn Pek, Ian Davidson, and Octavia Wong for their ongoing work in forming some of the ideas presented here.

1 We acknowledge the old and ongoing philosophical debate about how best to define validity and measurement in psychology. A detailed discussion of validity theory is beyond the scope of this article and is provided at length elsewhere (e.g., American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014; Borsboom, Mellenbergh, & van Heerden, 2004; Kane, 2013). Here, we discuss validity consistent with Loevinger’s (1957) seminal work on construct validation.

References and Further Reading

Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno’s (1990) survey of PhD programs in North America. American Psychologist, 63, 32–50. doi:10.1037/0003-066X.63.1.32

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: Joint Committee on Standards for Educational and Psychological Testing.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071. doi:10.1037/0033-295X.111.4.1061

Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8, 370–378.

Fried, E. I. (2017). The 52 symptoms of major depression. Journal of Affective Disorders, 208, 191–197. doi:10.1016/j.jad.2016.10.019

Fried, E. I., & Nesse, R. M. (2015). Depression is not a consistent syndrome: An investigation of unique symptom patterns in the STAR*D study. Journal of Affective Disorders, 172, 96–102. doi:10.1016/j.jad.2014.10.010

Fried, E. I., van Borkulo, C. D., Epskamp, S., Schoevers, R. A., Tuerlinckx, F., & Borsboom, D. (2016). Measuring depression over time . . . or not? Lack of unidimensionality and longitudinal measurement invariance in four common rating scales of depression. Psychological Assessment, 28, 1354–1367. doi:10.1037/pas0000275

Janssen, C. W., Lowry, C. A., Mehl, M. R., Allen, J. J. B., Kelly, K. L., Gartner, D. E., … Raison, C. L. (2016). Whole-body hyperthermia for the treatment of major depressive disorder. JAMA Psychiatry, 73, 789–795. doi:10.1001/jamapsychiatry.2016.1031

Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73. doi:10.1111/jedm.12000

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635–694.

Monin, B., Sawyer, P. J., & Marquez, M. J. (2008). The rejection of moral rebels: Resenting those who do the right thing. Journal of Personality and Social Psychology, 95, 76–93. doi:10.1037/0022-3514.95.1.76

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. doi:10.1126/science.aac4716

Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. doi:10.1177/014662167700100306

Regier, D. A., Narrow, W. E., Clarke, D. E., Kraemer, H. C., Kuramoto, S. J., Kuhl, E. A., & Kupfer, D. J. (2013). DSM-5 field trials in the United States and Canada, part II: Test-retest reliability of selected categorical diagnoses. The American Journal of Psychiatry, 170, 59–70. doi:10.1176/appi.ajp.2012.12070999

Santor, D. A., Gregus, M., & Welch, A. (2006). Eight decades of measurement in depression. Measurement, 4, 135–155. doi:10.1207/s15366359mea0403

Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74, 107–120. doi:10.1007/s11336-008-9101-0

Weidman, A. C., Steckler, C. M., & Tracy, J. L. (2017). The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion, 17, 267–295.

Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Making replication mainstream. Behavioral and Brain Sciences. Advance online publication. doi:10.1017/S0140525X17001972