Configural
(Assumed) Dimensionality
Does the configural structure assumed in the first phase (“prototypic”) arise? Can it be supported?
PCA, EFA/ESEM, CFA. Preliminary eigenvalues, followed by the number of emerging factors in factor analyses. |
One expects that the dimensionality proposed in the previous phases will be corroborated; otherwise, it is worth exploring alternative dimensional structures. From a preliminary PCA perspective, this can be observed through the number of eigenvalues greater than 1.0. When the ratio between the first and second eigenvalue is greater than four, some authors suggest the possibility of unidimensionality.47 Going further with CFA, the number of dimensions is evaluated through internal diagnostics suggesting poor configural specification (e.g., using Modification Indices and Expected Parameter Changes via Lagrange Multiplier tests17,19). In an analysis with ESEM48, it is possible to directly observe alternative structures beyond those theoretically assumed.
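For illustration only, a minimal Python sketch of this preliminary eigenvalue screening, assuming a numeric item-response matrix X (rows = respondents, columns = items) is available; the data below are placeholders, not from the original study:

```python
# Preliminary PCA-style dimensionality screening: count eigenvalues > 1.0 and
# inspect the first/second eigenvalue ratio (a ratio > 4 hints at unidimensionality).
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(300, 10)).astype(float)  # placeholder Likert-type data

R = np.corrcoef(X, rowvar=False)                # inter-item correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

n_kaiser = int(np.sum(eigenvalues > 1.0))       # eigenvalues > 1.0 rule
ratio_1_2 = eigenvalues[0] / eigenvalues[1]     # first/second eigenvalue ratio

print("eigenvalues:", np.round(eigenvalues, 2))
print("components with eigenvalue > 1.0:", n_kaiser)
print("first/second eigenvalue ratio:", round(ratio_1_2, 2))
```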
Theoretical relevance of items (theoretical-empirical congruence) |
Do the items really belong in their respective dimensions, based on the results of the analysis? |
EFA/ESEM and/or CFA. Positioning or location of items in factors. |
The items should express their respective factors, each distinct from the others, as planned in the instrument development or adaptation process. If an item manifests a dimension other than the one theoretically predicted, it must be revised.
Factor specificity |
Is each item linked to only one dimension? Is there ambiguity? |
EFA/ESEM and/or CFA. Cross-loading items. |
For an item to have factorial specificity, its factor loadings must not be ambiguous: the item is expected to be a unique expression of the factor it supposedly represents. Items violating this property should be identified and, depending on the situation, modified or even replaced.
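As an illustration of the two preceding checks (theoretical placement of items and factorial specificity), a minimal sketch that screens a loading matrix for misplaced and cross-loading items; the loadings, the theoretical assignments, and the 0.30 salience threshold are assumed values:

```python
# Screen an EFA/ESEM loading matrix for (a) items whose dominant loading is not on
# the theoretically assigned factor and (b) cross-loading (ambiguous) items.
import numpy as np

loadings = np.array([            # rows = items, columns = factors (illustrative)
    [0.62, 0.10],
    [0.55, 0.34],                # loads saliently on both factors -> ambiguous
    [0.08, 0.71],
    [0.12, 0.58],
])
assigned = np.array([0, 0, 1, 1])  # theoretically expected factor per item
threshold = 0.30                   # conventional salience cut-off (assumed)

for i, (row, factor) in enumerate(zip(loadings, assigned)):
    dominant = int(np.argmax(np.abs(row)))
    salient = np.flatnonzero(np.abs(row) >= threshold)
    if dominant != factor:
        print(f"item {i}: dominant loading on factor {dominant}, expected factor {factor}")
    if salient.size > 1:
        print(f"item {i}: cross-loads on factors {salient.tolist()}")
```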
Metric |
Reliability/discrimination of items |
What is the magnitude of the relationship between the items and the factors that underlie them? |
EFA/ESEM and/or CFA/IRT. Item loadings and residuals. |
For an item to be considered reliable, its factor loading should exceed a pre-specified demarcation. The literature does not stipulate a single value; conventionally, 0.3017,49, 0.3550, or 0.4051 are considered acceptable cut-off points to admit an item as reliable. Reliability is also tied to the notion of discriminability, since factor loadings are related to the IRT parameters that express an item's discrimination. By plotting the curves implied by different ai (corresponding to λi), it is possible to visualize them in the Item Characteristic Curve and then make a decision.
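A hedged sketch of how such a screening might look, assuming standardized loadings are available and using the usual normal-ogive relation a_i = λ_i/√(1 − λ_i²), scaled by D ≈ 1.702 for the logistic metric, to obtain approximate discriminations; the loadings and the 0.30 cut-off are illustrative:

```python
# Flag items with standardized loadings below a chosen cut-off and convert the
# loadings into approximate IRT discriminations (a_i).
import numpy as np

lambdas = np.array([0.25, 0.48, 0.62, 0.81])     # standardized factor loadings (placeholder)
cutoff = 0.30                                     # e.g., 0.30, 0.35, or 0.40

low = np.flatnonzero(lambdas < cutoff)
a = 1.702 * lambdas / np.sqrt(1.0 - lambdas**2)   # approximate discriminations

print("items below cut-off:", low.tolist())
print("approximate a_i:", np.round(a, 2))
```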
Absence of redundancy of item content |
Do items overlap in such a way that they do not map the construct independently? |
ESEM, CFA/IRT. Residual correlation (implying violation of conditional/local independence). |
In principle, it is expected that items of a given factor show no residual correlations; they are expected to be independent once conditioned on the factor they supposedly reflect. Violation of independence implies that the variability of the items has another common source in addition to the factor they represent. The magnitude of a residual correlation—from which a conditional independence violation can be inferred—is somewhat arbitrary. One possibility is to choose a theoretically sustainable value or level (for example, 0.20 or 0.25) and statistically compare models with and without the estimated residual correlation. Another possibility is to follow recommendations from authors to guide the decision-making process. Reeve et al.53 suggest the simple demarcation of ≥ 0.3 to admit the existence of residual correlation. Some demarcations are based on formal statistics. One is the Chi-square-based local dependence statistic (LD χ2), proposed by Chen and Thissen54,55, which uses a ≥ 10 cut-off point to indicate dependence. Another is the Q3 statistic (and variants), as suggested by Yen56–58. Several situations lead to correlation between item residuals (errors)59, but a common process in instrument development (or adaptation) is the presence of (partial) content redundancy between items (in general, pairs). Theoretical evaluation—observing the semantics and the denotative and connotative meanings of the respective contents—should follow whenever a statistical violation is observed.
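A minimal sketch of one such check (Q3-style residual correlations), assuming 2PL item parameters, person estimates, and a dichotomous response matrix are already available from a fitted IRT model; all values below are placeholders, and the 0.2 demarcation is an assumed example:

```python
# Compute residual correlations between items after removing the model-implied
# probabilities; large absolute values suggest local dependence.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=500)                   # estimated person locations (placeholder)
a = np.array([1.2, 0.9, 1.5, 1.1])             # discriminations (placeholder)
b = np.array([-0.5, 0.0, 0.4, 1.0])            # difficulties (placeholder)

P = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # model-implied probabilities
X = (rng.random(P.shape) < P).astype(float)            # placeholder 0/1 responses

residuals = X - P
Q3 = np.corrcoef(residuals, rowvar=False)               # residual correlation matrix
np.fill_diagonal(Q3, 0.0)

# Flag item pairs whose residual correlation exceeds an illustrative 0.2 demarcation.
pairs = np.argwhere(np.triu(np.abs(Q3) > 0.2, k=1))
print("flagged pairs:", pairs.tolist())
```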
Convergent factorial validity (CFV)
Do the items convergently reflect the corresponding factor? |
CFA. Average Variance Extracted (AVE).
CFV refers to each factor, as its name implies. It is understood that CFV occurs if the AVE of the items—i.e., the variance that the items have in common—is greater than the joint variance of the respective errors, which express item variability due to other sources. Thus, quantitatively, CFV is endorsed if AVE ≥ 0.517,60. From an interpretative perspective, endorsing CFV means accepting that the dimension (factor) in question is “well attended” by the respective set of items, since they contain more factor information than error (from sampling and/or measurement/process and/or inherent to the components61). A related indicator—√AVE—summarizes the reliability of the construct (dimension). Thus, values ≥ 0.7 also indicate convergence and, strictly, internal consistency (i.e., consistency of/between the items internal to the factor to which they belong)60.
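A minimal sketch of the AVE computation for a single factor from standardized loadings; the loadings are illustrative:

```python
# AVE = mean of the squared standardized loadings of the factor's items;
# its square root is the related convergence/reliability indicator noted above.
import numpy as np

lambdas = np.array([0.70, 0.65, 0.72, 0.58])   # standardized loadings (placeholder)

ave = np.mean(lambdas**2)
sqrt_ave = np.sqrt(ave)

print("AVE:", round(ave, 3), "(CFV endorsed)" if ave >= 0.5 else "(below the 0.5 demarcation)")
print("sqrt(AVE):", round(sqrt_ave, 3), "(values >= 0.7 also indicate convergence)")
```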
Discriminating factorial validity (DFV) |
Is the amount of information captured by the set of items in their respective factors greater than that shared among the component factors (discriminant)? |
CFA. Contrast of the average variance extracted (by the items) of a given factor with the square of the correlations of this factor with the others of the system. |
This property only applies to multidimensional constructs. If there is DFV, a larger information “flow” is expected from the factors to the items than between the factors themselves. Demarcation of a DFV violation may follow a generic rule of thumb or a more formal evaluation. Some authors suggest factorial correlations of 0.80 to < 0.85 as indicative of violation and ≥ 0.85 as violation per se17. A more rigorous strategy is to formally test the statistical significance of the difference between the AVE of the factor and the square of its correlations with the others60. A positive and statistically significant difference would endorse DFV, while a statistically significant negative difference would favor its rejection, indicating violation. A nonsignificant difference, positive or negative, may be read as evidence either for or against a violation; from a more conservative stance, a violation would be concluded only from a statistically significant (negative) difference.
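A minimal sketch of the rule-of-thumb contrast described above (each factor's AVE versus its squared correlations with the other factors), with illustrative AVEs and factor correlations; the formal significance test mentioned in the text is not implemented here:

```python
# Fornell-Larcker style contrast for discriminant factorial validity.
import numpy as np

ave = np.array([0.56, 0.61, 0.49])             # AVE per factor (placeholder)
phi = np.array([[1.00, 0.62, 0.55],            # factor correlation matrix (placeholder)
                [0.62, 1.00, 0.48],
                [0.55, 0.48, 1.00]])

phi_sq = phi**2
for f in range(len(ave)):
    others = np.delete(phi_sq[f], f)            # squared correlations with the other factors
    if np.any(others >= ave[f]):
        print(f"factor {f}: a squared correlation >= AVE -> possible DFV violation")
    else:
        print(f"factor {f}: AVE exceeds all squared correlations -> DFV supported")
```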
Scalar |
Coverage of latent trait information (by each item and the set of items)
Does the item set cover most of the latent trait or are there “unmapped” regions? In the latent trait regions effectively mapped, are the items evenly distributed or are there clusters indicating redundancy? |
Parametric IRT. Eyeballing via the Wright Map, which combines the construct map with the item location estimates obtained in the IRT analyses, complemented by chart inspection.
It is expected that the items can properly position individuals (or any other unit of analysis) along the construct map, and that the spectrum of variation predicted by the construct map is appropriately covered. One way to evaluate these two aspects is to critically assess the position of the items against the proposed Wright Map13,27, considering the correspondence between item positioning along the latent spectrum—for example, via the bi parameters obtained in IRT analyses—and the increasing intensity presented in the construct map13. This eyeballing procedure should be followed by an analysis of information coverage21,62. Specific charts indicate whether the set of items covers most of the latent trait or whether there are regions with gaps (without items). These graphs also help detect whether all latent trait regions are effectively covered, whether items are distributed evenly, or whether there are clusters indicating overlap and positioning/mapping redundancy. Additional graphical evaluations complement the assessment of item behavior, especially regarding latent trait coverage. Obtained by parametric IRT, these graphs include the Item Information Functions and Item Characteristic Curves. When items are polytomous, Category Characteristic Curves are obtained; they also serve to evaluate the items “internally,” observing the coverage areas of each level and whether the levels are ordered according to the theoretical assumption of the construct map. Examples of these graphs can be found in the references cited at the end of this Table or in Internet searches (https://www.stata.com/manuals/irt.pdf).
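For illustration, a minimal sketch of 2PL Item Information Functions and the test information function, which can be inspected for gaps or clusters along the latent trait; the item parameters and the 0.5 information threshold are assumed values:

```python
# Item and test information functions for a 2PL model: I(theta) = a^2 * P * (1 - P).
import numpy as np

theta = np.linspace(-4, 4, 161)
a = np.array([1.3, 0.8, 1.6])        # discriminations (placeholder)
b = np.array([-1.0, 0.2, 1.5])       # difficulties (placeholder)

P = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # item characteristic curves
info = a**2 * P * (1.0 - P)                            # item information functions
test_info = info.sum(axis=1)                           # test information function

# Rough coverage check: share of the latent-trait grid where total information is low.
print("share of theta grid with test information < 0.5:",
      round(float(np.mean(test_info < 0.5)), 2))
```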
Ordering according to item stability or monotonicity
Do items mapping regions of the construct map do so in the theoretically expected order of intensity or are there regions of the construct wherein less severe (lighter/milder) items supplant other items that, in principle, should be capturing more intense areas of the latent trait? |
Nonparametric and parametric IRT. Loevinger’s H, Mokken criterion and graphic assessments. |
The items should clearly separate the regions of the latent trait (content) that they supposedly cover, avoiding overlap as much as possible. Two strategies allow checking this property: ordering according to scalability, and monotonicity. Ordering items according to scalability refers to the coherence between the frequencies with which the items are endorsed and the part of the construct map that they should cover. In an ideal scenario, a respondent with low intensity of a given latent trait of the construct (dimension) is expected to endorse a representative item (mapper) of this region of “lower” intensity, while not endorsing another item that reflects a more intense degree of the construct. This aspect can be analyzed by item and for the whole set of the instrument. Loevinger’s H coefficient reflects this63–65. With 1.0 as the upper limit of adequacy, an estimate of at least 0.3 is recommended for the set of items64,66. An H below this value indicates an instrument with poor scalability. According to Mokken66, values of 0.3 to < 0.4 indicate weak scalability; 0.4 to < 0.5, average; and ≥ 0.5, strong scalability. In an acceptable instrument, most of the H estimates of each item should also follow these references. The assumption of monotonicity is another related property to be appreciated during the evaluation of the scalar behavior of each item and, by extension, of the set formed by them64,65. Monotonicity is supported when the probability of positively endorsing an item increases with the intensity of the latent trait. Visually, simple monotonicity is violated when the probability of endorsement declines as the total (latent) score grows; additionally, double monotonicity is violated if the item curves obtained in an IRT analysis cross. Whether single or double, monotonicity is present when the criterion suggested by Mokken is < 4066, understanding that some item crossings can be attributed to sampling variability. Values between 40 and 80 serve as a warning, demanding a more detailed evaluation by the researchers; a criterion higher than 80 raises doubts about the monotonicity hypothesis of an item, as well as of the scale as a whole63,64.
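A minimal sketch of Loevinger’s scale H for dichotomous items (one minus the ratio of observed to expected Guttman errors, pooled over item pairs); the simulated responses are placeholders, and dedicated Mokken-scaling software would normally be used for this and for the monotonicity criteria:

```python
# Loevinger's H for a set of 0/1 items: for each item pair, a Guttman error is
# endorsing the less popular item while rejecting the more popular one; expected
# errors are computed under marginal independence.
import numpy as np

rng = np.random.default_rng(2)
theta = rng.normal(size=(400, 1))                        # latent trait (placeholder)
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])                # item "difficulties" (placeholder)
X = (rng.random((400, 5)) < 1 / (1 + np.exp(-(theta - b)))).astype(int)

n, k = X.shape
p = X.mean(axis=0)                                        # item popularities
obs, exp_ = 0.0, 0.0
for i in range(k):
    for j in range(k):
        if i == j or p[i] <= p[j]:
            continue                                      # item i is the more popular of the pair
        obs += np.sum((X[:, i] == 0) & (X[:, j] == 1))    # observed Guttman errors
        exp_ += n * (1.0 - p[i]) * p[j]                   # expected errors under independence

H = 1.0 - obs / exp_
print("Loevinger's H:", round(H, 3))   # >= 0.3 weak, >= 0.4 average, >= 0.5 strong scalability
```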