Older imputation techniques, such as mean substitution and regression imputation, result in two kinds of bias. The estimates themselves are biased, and the significance tests associated with those estimates are also biased because the standard errors (SEs) shrink when artificial certainty is introduced into the estimates. The mean of the observed data is not an arbitrarily made-up value, but replacing every missing value with the mean expresses the assumption that one can be absolutely certain that this single value is exactly what would have been observed had the observation not been missing.

Mean substitution and regression imputation introduce no noise, no margin of error, and no variability around the plausible estimate of a missing value. Consequently, although the estimated mean of the observed data is unchanged by mean substitution, the SE artificially shrinks because the substituted values contribute no deviations from the mean.
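This shrinkage is easy to demonstrate numerically. The following sketch, using simulated data (all values are arbitrary), compares the SE of the mean computed from the observed cases with the SE after mean substitution:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fully observed sample from a normal population.
complete = rng.normal(loc=50, scale=10, size=200)

# Make roughly 40% of the values missing completely at random.
missing = rng.random(200) < 0.4
observed = complete[~missing]

# Mean-substitute: every missing value becomes the observed mean.
imputed = complete.copy()
imputed[missing] = observed.mean()

def standard_error(x):
    return x.std(ddof=1) / np.sqrt(len(x))

se_observed = standard_error(observed)  # SE from observed cases only
se_imputed = standard_error(imputed)    # SE after mean substitution

# The substituted values add no deviations from the mean, so the
# imputed sample's SE is artificially smaller.
print(se_observed, se_imputed)
```

The estimated mean is untouched, but the SE after substitution is smaller than the honest SE from the observed cases, which is exactly the artificial certainty described above.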

Deviations from the mean are the foundation of estimating variability in the population (i.e., the variance and, by extension, the SE). A technique called stochastic regression imputation gets around this by adding random noise to each regression-predicted value. This makes the estimate of the missing value much more plausible. But how certain can one be in a single estimate of what a missing value might truly have been? MI essentially solves this problem the same way stochastic regression imputation does, but goes further by calculating several plausible estimates of each missing value instead of a single estimate.
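A minimal sketch of the difference between deterministic and stochastic regression imputation, using simulated data with a hypothetical predictor `x`:

```python
import numpy as np

rng = np.random.default_rng(1)

# x is fully observed; y is missing for some cases.
n = 500
x = rng.normal(size=n)
y = x + rng.normal(scale=2.0, size=n)
miss = rng.random(n) < 0.3

# Fit the regression of y on x using the complete cases.
b, a = np.polyfit(x[~miss], y[~miss], deg=1)  # slope, intercept
resid = y[~miss] - (a + b * x[~miss])
sigma = resid.std(ddof=2)                     # residual SD

# Deterministic regression imputation: predicted values only.
y_det = y.copy()
y_det[miss] = a + b * x[miss]

# Stochastic regression imputation: add random noise drawn from the
# residual distribution, restoring variability around the predictions.
y_stoch = y.copy()
y_stoch[miss] = a + b * x[miss] + rng.normal(scale=sigma, size=miss.sum())

print(y_det[miss].std(ddof=1), y_stoch[miss].std(ddof=1))
```

The deterministic imputations fall exactly on the regression line and so understate the spread of the data; the stochastic imputations recover variability comparable to that of the observed values.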

The rationale behind this multiplicity is similar to that for large samples in general.

## Missing Data: Listwise vs. Pairwise

Suppose a researcher is interested in the degree of belief among high-school children in a certain school district that condom use prevents the spread of HIV. Lack of time and funding prevents interviewing every child in the district, so a sample is gathered from within the school district to draw an inference about the population from which it was drawn. How much confidence should we have that the sample average is the average degree of support among all children in the district?

The principle that larger random samples yield more certainty about estimates is discussed extensively even in introductory texts on research methods and statistics. MI operates on the same principle. Any substitution for a missing value is only one among many plausible substitutions, and our estimate of the missing information is more robust when many plausible values are sampled.
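The MI logic can be sketched for something as simple as estimating a mean. In this illustrative example (the data and the choice of m = 20 imputations are arbitrary), each imputed data set yields an estimate and its squared SE, and the results are pooled with Rubin's rules:

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed data with values missing completely at random.
complete = rng.normal(loc=10, scale=3, size=300)
miss = rng.random(300) < 0.25
obs = complete[~miss]

m = 20  # number of imputed data sets
estimates, variances = [], []
for _ in range(m):
    # Simple stochastic imputation: draw each missing value from a
    # normal distribution estimated from the observed cases.
    filled = complete.copy()
    filled[miss] = rng.normal(obs.mean(), obs.std(ddof=1), size=miss.sum())
    estimates.append(filled.mean())                     # per-data-set estimate
    variances.append(filled.var(ddof=1) / len(filled))  # its squared SE

# Rubin's rules: pool the m estimates and their uncertainties.
q_bar = np.mean(estimates)            # pooled point estimate
w = np.mean(variances)                # within-imputation variance
b = np.var(estimates, ddof=1)         # between-imputation variance
t = w + (1 + 1 / m) * b               # total variance
pooled_se = np.sqrt(t)

print(q_bar, pooled_se)
```

The between-imputation variance `b` is what mean substitution and single regression imputation discard: it inflates the pooled SE to reflect honest uncertainty about the missing values.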


In our previous example, suppose the population of female students shows greater support for condom use than male students, and students from high-SES families show greater support than students from low-SES families. Knowing the gender and SES of Subject 42 allows us to narrow down the range of plausible values even further (e.g., a female student from a high-SES family would plausibly report relatively high support). Suppose, however, that we would like to forgo the complications that MI introduces into our data analysis. Is there a way to make unbiased inferences without fitting multiple replicates of our analysis model and pooling the results?


Well, if the statistical technique to be used can accommodate maximum likelihood estimation (e.g., structural equation modeling), the answer is yes: full information maximum likelihood (FIML) estimation. Though the underlying mathematical principles exceed the scope of this article, the conceptual framework underlying FIML estimation is relatively simple. Consider the case of many participants filling out the same survey.

Even if some of those participants do not complete the survey, we would still like to fit a model that allows us to draw accurate conclusions about the entire sample. FIML estimation can help accomplish this goal by using the observed responses to supplement the loss of information due to the missing responses. The image on a computer monitor is made up of many rows of pixels, just as a data set is made up of many rows of responses. If a pixel dies in your monitor, you are still able to understand the image on the screen because you can use the information from the surrounding pixels to infer what the undamaged image would be.

Similarly, FIML estimation uses what are known as casewise log-likelihoods to achieve an analogous effect when used to fit a statistical model to incomplete data. By using only what is known from the observed data, FIML can infer what the whole model should look like without needing to know what the missing responses would truly be.
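The casewise idea can be sketched directly. The following is a toy implementation (not how SEM software is actually written): each case contributes the log-likelihood of only its observed variables, computed from the matching sub-vector of the mean and sub-matrix of the covariance matrix.

```python
import numpy as np

def mvn_logpdf(x, mu, sigma):
    """Log-density of a multivariate normal, written out with numpy."""
    k = len(mu)
    diff = x - mu
    inv = np.linalg.inv(sigma)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + diff @ inv @ diff)

def fiml_loglik(data, mu, sigma):
    """Casewise FIML log-likelihood for data with NaN marking missing.

    Each case contributes the log-density of its observed variables
    only, using the corresponding subset of mu and sigma.
    """
    total = 0.0
    for row in data:
        obs = ~np.isnan(row)
        if not obs.any():
            continue  # a fully missing case carries no information
        total += mvn_logpdf(row[obs], mu[obs], sigma[np.ix_(obs, obs)])
    return total

rng = np.random.default_rng(3)
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
data = rng.multivariate_normal(mu, sigma, size=100)
data[rng.random(100) < 0.3, 1] = np.nan  # second variable partly missing

ll = fiml_loglik(data, mu, sigma)
print(ll)
```

In practice the software searches for the `mu` and `sigma` (or model parameters implying them) that maximize this sum, so every case contributes whatever information it has, and no missing response ever needs to be filled in.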

In this way, just as your eye can look at a damaged computer monitor and still understand how the complete image would appear, FIML can be applied to an incomplete data set to produce estimates that correctly describe the entire sample.

## On the Joys of Missing Data (Journal of Pediatric Psychology, Oxford Academic)

Regardless of the mechanism of missingness, researchers should use the most principled technique available for their question of interest. When missingness is MAR, for example, ad hoc approaches that attempt to treat missing data without considering their underlying structure (e.g., listwise or pairwise deletion, mean substitution) yield biased results. Under these circumstances, a single stochastic regression imputation or a single EM imputation offers an easily implemented, yet principled, way to treat the missing data.

If a maximum likelihood technique is used, a noninclusive implementation of FIML can also provide acceptable results in this situation. Planned missing data designs have been suggested for many years, but only recently have they begun to percolate into the design choices of applied researchers. Three designs are particularly useful for applied pediatric researchers: the multiform questionnaire protocol, the two-method measurement model, and the wave-missing longitudinal design.

For all planned missing data designs, the critical element is random assignment. With true random assignment, the missing data from these designs are, by definition, MCAR. Recall that MCAR produces no bias in the estimated parameters of a given statistical model; only power is diminished. Also recall that the two modern approaches to missing data treatment restore the lost power.

MCAR with a modern treatment is a truly win-win situation for applied pediatric researchers! Rather than creating short forms of different scales or eliminating constructs because of time constraints or concerns about burden and fatigue, a multiform design can be implemented. Multiform designs can also reduce respondent reactivity. For researchers conducting intervention studies, the control condition often will show improvements just by virtue of reacting to the questionnaire protocol. Reducing exposure to all items of a given construct reduces reactivity to the construct as a whole.

With a multiform planned missing design, the analyzed data contain all the needed items and information when a modern treatment of the missing data is used. The simplest multiform design that researchers should use is the three-form design. As the name implies, three different questionnaire forms are created and randomly assigned to participants. The key to a three-form design is assigning items to four different blocks or sets, designated X, A, B, and C. The X block contains items that are administered to all participants; each form then contains two of the three remaining blocks. That is, one of the blocks A, B, or C is intentionally not administered on each form.
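As a concrete sketch, with hypothetical item and block names, the form construction looks like this:

```python
import random

# Hypothetical item names; the X block is administered to everyone.
blocks = {
    "X": ["age", "sex", "ses", "cope1"],
    "A": ["cope2", "cope3"],
    "B": ["cope4", "cope5"],
    "C": ["cope6", "cope7"],
}

# Each form keeps X plus two of the three rotating blocks, so one of
# A, B, or C is intentionally not administered on each form.
forms = {
    "Form 1": blocks["X"] + blocks["A"] + blocks["B"],  # C omitted
    "Form 2": blocks["X"] + blocks["A"] + blocks["C"],  # B omitted
    "Form 3": blocks["X"] + blocks["B"] + blocks["C"],  # A omitted
}

# How often each item is administered across the three forms:
# X items appear on all three forms, rotating items on exactly two.
counts = {}
for items in forms.values():
    for item in items:
        counts[item] = counts.get(item, 0) + 1

# Random assignment of participants to forms is what makes the
# planned missingness MCAR by design.
random.seed(0)
assignment = {pid: random.choice(sorted(forms)) for pid in range(9)}
```

Because every pair of rotating blocks appears together on some form, all between-item covariances remain estimable, which is what lets a modern missing data treatment recover the full information.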

More blocks of items can be generated and put together to create forms that have even fewer administered items. The top tier of Table I shows a schematic of the pattern of complete and missing data that results from using such a design. In assigning items to blocks, a number of considerations are involved. First, the X block, which is administered to all participants, typically will contain the essential demographic variables as well as key variables that are likely to predict the MAR mechanism.

Although the intentional parts of the missing data are MCAR, nearly any study will also have additional, unplanned missing information on top of the randomly controlled MCAR data. In addition to these variables, we recommend that at least one item from each construct be included in the X block.

This one item would be the indicator of the construct with the best item properties. The rest of the variables associated with each construct would be evenly distributed across the A, B, and C blocks. Here, each of the A, B, and C blocks would contain one or more items from each construct if enough items for a given construct are available. If not enough items per construct exist, the pattern of assignment of items to blocks should (a) balance the number of items in each block as equally as possible, and (b) maximize the between-block correlations among the items.

The higher the between-block associations are in this design, the more efficient is the missing data recovery process, which leads to greater power and greater convergence rates when the data are analyzed. The multiform designs are optimal for large sample studies that rely on SEM procedures.


Based on simulation work, the three-form design requires sufficiently large sample sizes to achieve acceptable coverage and convergence (Jia et al.). At sample sizes of this magnitude, SEM procedures can be used (see Little). Unlike the three-form design, which intentionally omits variables to reduce cost and burden, the two-method design is a way to increase the power of an otherwise underpowered study. The two-method design is ideally suited for contexts in which an expensive, but highly valid, method for assessing a construct is desired.



If a cheaper and, by implication, less valid method of measuring the same construct exists, the two methods can be partnered to dramatically increase the sample size while holding the costs of a given study constant. In addition to requiring that two methods of measuring the same construct exist, the two-method planned missing data design is also predicated on a multivariate measurement model to represent the construct of interest.
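The cost logic can be sketched with made-up numbers (the budget and per-participant costs below are entirely hypothetical):

```python
# Hypothetical per-participant costs and total budget.
BUDGET = 10_000.0
COST_CHEAP = 5.0        # e.g., a self-report questionnaire
COST_EXPENSIVE = 100.0  # e.g., a biological assay

def n_expensive_affordable(n_total: int) -> int:
    """After giving all n_total participants the cheap measure,
    how many can also receive the expensive measure?"""
    remaining = BUDGET - n_total * COST_CHEAP
    return max(0, int(remaining // COST_EXPENSIVE))

# Spending the whole budget on the expensive measure alone buys
# only 100 cases.
n_expensive_only = int(BUDGET // COST_EXPENSIVE)

# Two-method design: 1,000 cases get the cheap measure, and a random
# subset also gets the expensive one. The overlap links the two
# measures in the measurement model, so the cheap measure's large
# sample boosts power for the same total cost.
n_total = 1000
n_both = n_expensive_affordable(n_total)
```

Under these assumed costs, the design trades 100 expensive-only cases for 1,000 cases on the cheap measure plus an expensive-measure subsample, with the planned missingness on the expensive measure handled by a modern treatment.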

As with multiform designs, sample sizes for these designs need to be large enough to support the estimation of latent constructs. Figure 1 is a depiction of a two-method analysis of stress as represented by the gold standard of cortisol measured using two assays and by a simple self-report questionnaire of perceived stress.


The items of this self-report measure are parceled into three indicators. The multiform questionnaire protocol can be administered in a longitudinal design, but there may be little reduction in the cost of obtaining measurements on every occasion, even if the battery of measurements were shorter. Suppose there are four waves of measurement, once every 3 months, across which one wishes to estimate the change trajectory of coping during the first year following a cancer diagnosis.

It is unlikely that all participants would participate on all occasions, and it would be difficult to know the mechanism of that missingness. Graham, Taylor, and Cumsille present several similar design possibilities, the simplest being to divide the sample into five parts, four of which would each skip measurement at one of Waves 1 through 4 (see bottom tier of Table I).

One subset would be measured at all occasions (Graham et al.). For example, with five waves of measurement, one can divide the sample into 11 parts and assign 10 of those subsets to be measured on only three of the five occasions.
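The simplest wave-missing schedule described above (one complete group plus one group skipping each wave) can be sketched as a 0/1 measurement schedule; the group and participant counts here are arbitrary:

```python
import numpy as np

# Five groups by four waves: 1 = measured, 0 = planned missing.
# Group 0 is measured at every wave; group g skips wave g.
n_waves = 4
pattern = np.ones((5, n_waves), dtype=int)
for g in range(1, 5):
    pattern[g, g - 1] = 0

# Randomly assign participants to the five groups, which makes the
# planned missingness MCAR.
rng = np.random.default_rng(0)
groups = rng.integers(0, 5, size=200)
measured = pattern[groups]  # one schedule row per participant
```

Every wave is still observed for four of the five groups, so all wave-to-wave covariances remain estimable and a modern missing data treatment can recover the full growth trajectory.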