Configural
(Assumed) Dimensionality
Does the configural structure assumed in the first phase (“prototypic”) arise? Can it be supported?
PCA, EFA/ESEM, CFA. Preliminary eigenvalues, followed by the number of emerging factors in factor analyses. |
One expects that the dimensionality proposed in the previous phases will be corroborated; otherwise, it is worth exploring alternative dimensional structures. From a preliminary PCA perspective, this can be observed through the number of eigenvalues greater than 1.0. When the ratio between the first and second eigenvalue is greater than four, some authors suggest the possibility of unidimensionality.47 Going further with CFA, the number of dimensions is evaluated through internal diagnostics suggesting poor configural specification (e.g., using Modification Indices and Expected Parameter Changes via Lagrange Multiplier tests17,19). In an analysis with ESEM48, it is possible to directly observe alternative structures beyond those theoretically assumed.
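For illustration only, a minimal Python sketch of this preliminary eigenvalue screening, assuming a numeric item-response matrix X (rows = respondents, columns = items) is available; the data below are placeholders, not from the original study:

```python
# Preliminary PCA-style dimensionality screening: count eigenvalues > 1.0 and
# inspect the first/second eigenvalue ratio (a ratio > 4 hints at unidimensionality).
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(300, 10)).astype(float)  # placeholder Likert-type data

R = np.corrcoef(X, rowvar=False)                # inter-item correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

n_kaiser = int(np.sum(eigenvalues > 1.0))       # eigenvalues > 1.0 rule
ratio_1_2 = eigenvalues[0] / eigenvalues[1]     # first/second eigenvalue ratio

print("eigenvalues:", np.round(eigenvalues, 2))
print("components with eigenvalue > 1.0:", n_kaiser)
print("first/second eigenvalue ratio:", round(ratio_1_2, 2))
```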
Theoretical relevance of items (theoretical-empirical congruence) |
Do the items really belong in their respective dimensions, based on the results of the analysis? |
EFA/ESEM and/or CFA. Positioning or location of items in factors. |
The items should express their respective factors, each distinct from the others, as planned in the instrument development or adaptation process. If an item manifests a dimension other than the one theoretically predicted, it must be revised.
Factor specificity |
Is each item linked to only one dimension? Is there ambiguity? |
EFA/ESEM and/or CFA. Cross-loading items. |
For an item to have factorial specificity, its factor loadings must not be ambiguous: the item is expected to be a unique expression of the factor it supposedly represents. Items violating this property should be identified and, depending on the situation, modified or even replaced.
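As an illustration of the two preceding checks (theoretical placement of items and factorial specificity), a minimal sketch that screens a loading matrix for misplaced and cross-loading items; the loadings, the theoretical assignments, and the 0.30 salience threshold are assumed values:

```python
# Screen an EFA/ESEM loading matrix for (a) items whose dominant loading is not on
# the theoretically assigned factor and (b) cross-loading (ambiguous) items.
import numpy as np

loadings = np.array([            # rows = items, columns = factors (illustrative)
    [0.62, 0.10],
    [0.55, 0.34],                # loads saliently on both factors -> ambiguous
    [0.08, 0.71],
    [0.12, 0.58],
])
assigned = np.array([0, 0, 1, 1])  # theoretically expected factor per item
threshold = 0.30                   # conventional salience cut-off (assumed)

for i, (row, factor) in enumerate(zip(loadings, assigned)):
    dominant = int(np.argmax(np.abs(row)))
    salient = np.flatnonzero(np.abs(row) >= threshold)
    if dominant != factor:
        print(f"item {i}: dominant loading on factor {dominant}, expected factor {factor}")
    if salient.size > 1:
        print(f"item {i}: cross-loads on factors {salient.tolist()}")
```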
Metric |
Reliability/discrimination of items |
What is the magnitude of the relationship between the items and the factors that underlie them? |
EFA/ESEM and/or CFA/IRT. Item loadings and residuals. |
For an item to be considered reliable, its factor loading should exceed a pre-specified demarcation. The literature does not stipulate a single value; conventionally, 0.3017,49, 0.3550, or 0.4051 are considered acceptable cut-off points to admit an item as reliable. Reliability is also tied to the notion of discriminability, since factor loadings are related to the IRT parameters that express an item's discrimination. By plotting the curves implied by different ai (corresponding to λi), it is possible to visualize them in the Item Characteristic Curve and then make a decision.
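A hedged sketch of how such a screening might look, assuming standardized loadings are available and using the usual normal-ogive relation a_i = λ_i/√(1 − λ_i²), scaled by D ≈ 1.702 for the logistic metric, to obtain approximate discriminations; the loadings and the 0.30 cut-off are illustrative:

```python
# Flag items with standardized loadings below a chosen cut-off and convert the
# loadings into approximate IRT discriminations (a_i).
import numpy as np

lambdas = np.array([0.25, 0.48, 0.62, 0.81])     # standardized factor loadings (placeholder)
cutoff = 0.30                                     # e.g., 0.30, 0.35, or 0.40

low = np.flatnonzero(lambdas < cutoff)
a = 1.702 * lambdas / np.sqrt(1.0 - lambdas**2)   # approximate discriminations

print("items below cut-off:", low.tolist())
print("approximate a_i:", np.round(a, 2))
```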
Absence of redundancy of item content |
Do items overlap in such a way that they do not map the construct independently? |
ESEM, CFA/IRT. Residual correlation (implying violation of conditional/local independence). |
In principle, it is expected that items of a given factor show no residual correlations; they are expected to be independent once conditioned on the factor they supposedly reflect. Violation of independence implies that the variability of the items has another common source in addition to the factor they represent. The magnitude of a residual correlation—from which a conditional independence violation can be inferred—is somewhat arbitrary. One possibility is to choose a theoretically sustainable value or level (for example, 0.20 or 0.25) and statistically compare models with and without the estimated residual correlation. Another possibility is to follow recommendations from authors to guide the decision-making process. Reeve et al.53 suggest the simple demarcation of ≥ 0.3 to admit the existence of residual correlation. Some demarcations are based on formal statistics. One is the Chi-square-based local dependence statistic (LD χ2), proposed by Chen and Thissen54,55, which uses a ≥ 10 cut-off point to indicate dependence. Another is the Q3 statistic (and variants), as suggested by Yen56–58. Several situations lead to correlation between item residuals (errors)59, but a common process in instrument development (or adaptation) is the presence of (partial) content redundancy between items (in general, pairs). Theoretical evaluation—observing the semantics and the denotative and connotative meanings of the respective contents—should follow whenever a statistical violation is observed.
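A minimal sketch of one such check (Q3-style residual correlations), assuming 2PL item parameters, person estimates, and a dichotomous response matrix are already available from a fitted IRT model; all values below are placeholders, and the 0.2 demarcation is an assumed example:

```python
# Compute residual correlations between items after removing the model-implied
# probabilities; large absolute values suggest local dependence.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=500)                   # estimated person locations (placeholder)
a = np.array([1.2, 0.9, 1.5, 1.1])             # discriminations (placeholder)
b = np.array([-0.5, 0.0, 0.4, 1.0])            # difficulties (placeholder)

P = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # model-implied probabilities
X = (rng.random(P.shape) < P).astype(float)            # placeholder 0/1 responses

residuals = X - P
Q3 = np.corrcoef(residuals, rowvar=False)               # residual correlation matrix
np.fill_diagonal(Q3, 0.0)

# Flag item pairs whose residual correlation exceeds an illustrative 0.2 demarcation.
pairs = np.argwhere(np.triu(np.abs(Q3) > 0.2, k=1))
print("flagged pairs:", pairs.tolist())
```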
Convergent factorial validity (CFV)
Do the items convergently reflect the corresponding factor? |
CFA. Average Variance Extracted (AVE).
CFV refers to each factor, as its name implies. It is understood that CFV occurs if the AVE of the items—i.e., the variance that the items have in common—is greater than the joint variance of the respective errors, which express item variability due to other sources. Thus, quantitatively, CFV is endorsed if AVE ≥ 0.517,60. From an interpretative perspective, endorsing CFV means accepting that the dimension (factor) in question is “well attended” by the respective set of items, since they contain more factor information than error (from sampling and/or measurement/process and/or inherent to the components61). A related indicator—√AVE—summarizes the reliability of the construct (dimension). Thus, values ≥ 0.7 also indicate convergence and, strictly, internal consistency (i.e., consistency of/between the items internal to the factor to which they belong)60.
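A minimal sketch of the AVE computation for a single factor from standardized loadings; the loadings are illustrative:

```python
# AVE = mean of the squared standardized loadings of the factor's items;
# its square root is the related convergence/reliability indicator noted above.
import numpy as np

lambdas = np.array([0.70, 0.65, 0.72, 0.58])   # standardized loadings (placeholder)

ave = np.mean(lambdas**2)
sqrt_ave = np.sqrt(ave)

print("AVE:", round(ave, 3), "(CFV endorsed)" if ave >= 0.5 else "(below the 0.5 demarcation)")
print("sqrt(AVE):", round(sqrt_ave, 3), "(values >= 0.7 also indicate convergence)")
```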
Discriminating factorial validity (DFV) |
Is the amount of information captured by the set of items in their respective factors greater than that shared among the component factors (discriminant)? |
CFA. Contrast of the average variance extracted (by the items) of a given factor with the square of the correlations of this factor with the others of the system. |
This property only applies to multidimensional constructs. If there is DFV, a larger information “flow” is expected from the factors to the items than between the factors themselves. Demarcation of a DFV violation may follow a generic rule of thumb or a more formal evaluation. Some authors suggest factorial correlations of 0.80 to < 0.85 as indicative of violation and ≥ 0.85 as violation per se17. A more rigorous strategy is to formally test the statistical significance of the difference between the AVE of the factor and the square of its correlations with the others60. A positive and statistically significant difference would endorse DFV, while a statistically significant negative difference would favor its rejection, indicating violation. A nonsignificant difference, positive or negative, may be read as evidence either for or against a violation; from a more conservative stance, a violation would be concluded only from a statistically significant (negative) difference.
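A minimal sketch of the rule-of-thumb contrast described above (each factor's AVE versus its squared correlations with the other factors), with illustrative AVEs and factor correlations; the formal significance test mentioned in the text is not implemented here:

```python
# Fornell-Larcker style contrast for discriminant factorial validity.
import numpy as np

ave = np.array([0.56, 0.61, 0.49])             # AVE per factor (placeholder)
phi = np.array([[1.00, 0.62, 0.55],            # factor correlation matrix (placeholder)
                [0.62, 1.00, 0.48],
                [0.55, 0.48, 1.00]])

phi_sq = phi**2
for f in range(len(ave)):
    others = np.delete(phi_sq[f], f)            # squared correlations with the other factors
    if np.any(others >= ave[f]):
        print(f"factor {f}: a squared correlation >= AVE -> possible DFV violation")
    else:
        print(f"factor {f}: AVE exceeds all squared correlations -> DFV supported")
```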
Scalar |
Coverage of latent trait information (by each item and the set of items)
Does the item set cover most of the latent trait or are there “unmapped” regions? In the latent trait regions effectively mapped, are the items evenly distributed or are there clusters indicating redundancy? |
Parametric IRT. Eyeballing via the Wright Map, which combines the construct map with the item location estimates obtained in the IRT analyses, complemented by chart inspection.
It is expected that the items can properly position individuals (or any other unit of analysis) along the construct map, and that the spectrum of variation predicted by the construct map is appropriately covered. One way to evaluate these two aspects is to critically assess the position of the items against the proposed Wright Map13,27, considering the correspondence between item positioning along the latent spectrum—for example, via the bi parameters obtained in IRT analyses—and the increasing intensity presented in the construct map13. This eyeballing procedure should be followed by an analysis of information coverage21,62. Specific charts indicate whether the set of items covers most of the latent trait or whether there are regions with gaps (without items). These graphs also help detect whether all latent trait regions are effectively covered, whether items are distributed evenly, or whether there are clusters indicating overlap and positioning/mapping redundancy. Additional graphical evaluations complement the assessment of item behavior, especially regarding latent trait coverage. Obtained by parametric IRT, these graphs include the Item Information Functions and Item Characteristic Curves. When items are polytomous, Category Characteristic Curves are obtained; they also serve to evaluate the items “internally,” observing the coverage areas of each level and whether the levels are ordered according to the theoretical assumption of the construct map. Examples of these graphs can be found in the references cited at the end of this Table or in Internet searches (https://www.stata.com/manuals/irt.pdf).
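For illustration, a minimal sketch of 2PL Item Information Functions and the test information function, which can be inspected for gaps or clusters along the latent trait; the item parameters and the 0.5 information threshold are assumed values:

```python
# Item and test information functions for a 2PL model: I(theta) = a^2 * P * (1 - P).
import numpy as np

theta = np.linspace(-4, 4, 161)
a = np.array([1.3, 0.8, 1.6])        # discriminations (placeholder)
b = np.array([-1.0, 0.2, 1.5])       # difficulties (placeholder)

P = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # item characteristic curves
info = a**2 * P * (1.0 - P)                            # item information functions
test_info = info.sum(axis=1)                           # test information function

# Rough coverage check: share of the latent-trait grid where total information is low.
print("share of theta grid with test information < 0.5:",
      round(float(np.mean(test_info < 0.5)), 2))
```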
Ordering according to item stability or monotonicity
Do items mapping regions of the construct map do so in the theoretically expected order of intensity or are there regions of the construct wherein less severe (lighter/milder) items supplant other items that, in principle, should be capturing more intense areas of the latent trait? |
Nonparametric and parametric IRT. Loevinger’s H, Mokken criterion and graphic assessments. |
The items should clearly separate the regions of the latent trait (content) that they supposedly cover, avoiding overlap as much as possible. Two strategies allow checking this property: ordering according to scalability, and monotonicity. Ordering items according to scalability refers to the coherence between the frequencies with which the items are endorsed and the part of the construct map that they should cover. In an ideal scenario, a respondent with low intensity of a given latent trait of the construct (dimension) is expected to endorse a representative item (mapper) of this region of “lower” intensity, while not endorsing another item that reflects a more intense degree of the construct. This aspect can be analyzed by item and for the whole set of the instrument. Loevinger’s H coefficient reflects this63–65. With 1.0 as the upper limit of adequacy, an estimate of at least 0.3 is recommended for the set of items64,66. An H below this value indicates an instrument with poor scalability. According to Mokken66, values of 0.3 to < 0.4 indicate weak scalability; 0.4 to < 0.5, average; and ≥ 0.5, strong scalability. In an acceptable instrument, most of the H estimates of each item should also follow these references. The assumption of monotonicity is another related property to be appreciated during the evaluation of the scalar behavior of each item and, by extension, of the set formed by them64,65. Monotonicity is supported when the probability of positively endorsing an item increases with the intensity of the latent trait. Visually, simple monotonicity is violated when the probability of endorsement declines as the total (latent) score grows; additionally, double monotonicity is violated if the item curves obtained in an IRT analysis cross. Whether single or double, monotonicity is present when the criterion suggested by Mokken is < 4066, understanding that some item crossings can be attributed to sampling variability. Values between 40 and 80 serve as a warning, demanding a more detailed evaluation by the researchers; a criterion higher than 80 raises doubts about the monotonicity hypothesis of an item, as well as of the scale as a whole63,64.
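A minimal sketch of Loevinger’s scale H for dichotomous items (one minus the ratio of observed to expected Guttman errors, pooled over item pairs); the simulated responses are placeholders, and dedicated Mokken-scaling software would normally be used for this and for the monotonicity criteria:

```python
# Loevinger's H for a set of 0/1 items: for each item pair, a Guttman error is
# endorsing the less popular item while rejecting the more popular one; expected
# errors are computed under marginal independence.
import numpy as np

rng = np.random.default_rng(2)
theta = rng.normal(size=(400, 1))                        # latent trait (placeholder)
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])                # item "difficulties" (placeholder)
X = (rng.random((400, 5)) < 1 / (1 + np.exp(-(theta - b)))).astype(int)

n, k = X.shape
p = X.mean(axis=0)                                        # item popularities
obs, exp_ = 0.0, 0.0
for i in range(k):
    for j in range(k):
        if i == j or p[i] <= p[j]:
            continue                                      # item i is the more popular of the pair
        obs += np.sum((X[:, i] == 0) & (X[:, j] == 1))    # observed Guttman errors
        exp_ += n * (1.0 - p[i]) * p[j]                   # expected errors under independence

H = 1.0 - obs / exp_
print("Loevinger's H:", round(H, 3))   # >= 0.3 weak, >= 0.4 average, >= 0.5 strong scalability
```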