My GitHub account contains examples of simulating data, including the following (a minimal simulation sketch appears after this list):
Item response theory
Bifactor models
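As a rough illustration of what those simulation scripts look like, here is a minimal sketch of generating dichotomous responses from a unidimensional 2PL model with hypothetical item parameters; the bifactor simulations extend the same idea to multiple latent dimensions, and the actual repository code may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(2016)

n_persons, n_items = 1000, 25
theta = rng.standard_normal(n_persons)        # latent trait values
a = rng.uniform(0.65, 1.2, n_items)           # discriminations (hypothetical range)
b = rng.uniform(-2.0, 2.0, n_items)           # difficulties (hypothetical range)

# 2PL: P(u_ij = 1) = 1 / (1 + exp(-a_j * (theta_i - b_j)))
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
responses = (rng.uniform(size=p.shape) < p).astype(int)
```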
Person-Fit Statistics with Model Misspecification: A Monte Carlo Study. Castaneda, R., Zelinsky, N., & Turitz, M. (May, 2016). <Poster>
Before interpreting test scores, researchers employ various methods to demonstrate that the test has adequate psychometric properties. Sometimes, however, this may not be enough. Person-fit statistics quantify the difference between expected and observed item response patterns. Item response theory (IRT) assumes that the underlying trait is reflected in the response pattern. For response patterns consistent with the IRT model, the scores reflect the trait being measured; for inconsistent response patterns, the resulting score is unlikely to be meaningful or valid. Monte Carlo studies of person fit have focused on the statistics' ability to detect aberrant response patterns. These studies have done so in the context of unidimensional IRT but, to the authors' knowledge, not multidimensional IRT. Most psychological constructs are inherently multidimensional, yet they are often collapsed to a single dimension in application (e.g., scaling a personality measure with a unidimensional 2PL model). This research focuses on the behavior of Drasgow, Levine, and Williams' (1985) Zh person-fit statistic under null conditions when the estimation model is misspecified. We consider three levels of misspecification (1, 2, and 3 factors), three levels of test length (15, 25, and 50 items), and two sample sizes (200 and 1,000). Suggestions for applied researchers are given.
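For readers unfamiliar with the statistic: Drasgow, Levine, and Williams' measure is essentially a standardized log-likelihood of a person's observed response pattern. Below is a minimal sketch of the dichotomous (lz-style) form under a 2PL, assuming item parameters and a trait estimate are already in hand; the Zh statistic examined in the poster generalizes this idea, so treat the code as illustrative rather than the exact computation used.

```python
import numpy as np

def lz_person_fit(u, theta, a, b):
    """Standardized log-likelihood person-fit statistic (lz-style) for a
    dichotomous 2PL. u: 0/1 responses; theta: trait estimate;
    a, b: item discriminations and difficulties."""
    u, a, b = np.asarray(u), np.asarray(a), np.asarray(b)
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))                 # model-implied probabilities
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))       # observed log-likelihood
    e_l0 = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))     # its expectation
    v_l0 = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)      # its variance
    return (l0 - e_l0) / np.sqrt(v_l0)   # large negative values flag potential misfit
```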
Modeling and Linking Speeded Tests: An Application with the VASE Vocabulary Tests. Castaneda, R., Liu, Y., Zelinsky, N., Vevea, J., Scott, J., & Flinspach, S. (May, 2016). <Poster | Talk>
In theory, a pure test of speed is composed of a large set of easy items and administered with a time limit, whereas a pure test of power is composed of items with varying difficulty levels and administered without a time limit. In practice, tests often contain both power and speed components, which call for a proper modeling framework to account for the dependency among item responses coming from both sources. This research uses the Vocabulary Assessment Study in Education (VASE) 2014 dataset to illustrate item calibration, scoring, and linking when neither the speed nor the power component of the test is negligible. We first fit an item response model assuming a pure power test, and assess the violation of local independence among items located towards the end of the test. Then, a multidimensional calibration model is specified to capture the residual dependencies caused by speededness. Finally, linking between multiple test forms is performed via calibrated projection. This study extends previous research in three aspects: a) the influence of speededness is empirically evaluated by local dependence diagnostics, b) the speed nature of the test is considered at the item calibration stage, and c) the score conversion between two speeded test forms is obtained by a recently developed linking method.
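The abstract does not name a specific local dependence diagnostic, but one common choice is Yen's Q3, the correlation matrix of item residuals after fitting the unidimensional model. A rough sketch, assuming 2PL item parameters and trait estimates are already available, is given below; it illustrates the general idea rather than the exact diagnostic used in the study.

```python
import numpy as np

def q3_matrix(responses, theta, a, b):
    """Yen's Q3 local dependence diagnostic for a dichotomous 2PL:
    correlations among item residuals (observed minus model-implied
    probabilities). Elevated off-diagonal values among end-of-test items
    would be consistent with residual dependence due to speededness."""
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))  # n_persons x n_items
    resid = responses - p
    return np.corrcoef(resid, rowvar=False)
```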
Assessing the Performance of Single Item Longitudinal Models over Varying Conditions. Castaneda, R. (May, 2015).
Longitudinal item response theory (IRT) has traditionally been used for a series of exams given to multiple groups of people who respond to shared items (i.e., linking). However, longitudinal IRT models may also be useful for assessing a person's latent trait (assuming it is stable) over time using a single item. For example, if we treat 30 consecutive days as binary items with the question "Have you smoked today?", we can assess a person's latent trait after a 30-day period. Here, item difficulty reflects how likely people were to smoke on a given day, and item discrimination reflects how well that day distinguishes between high-ability (did not smoke) and low-ability (smoked) people. This study compares the performance of longitudinal IRT across various conditions and examines the inferences it may support. We propose the following conditions: 2, 5, 10, and 30 time points crossed with 50, 100, and 200 subjects. Additionally, we simulate two theta conditions: one where theta is stable and normally distributed, and another where theta changes across time. Finally, item discrimination is either set to 1 for all items or allowed to vary from .65 to 1.2. This gives us 48 total cells.
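A minimal sketch of how such a simulation design could be set up is shown below; the difficulty range and the random-walk drift used for the changing-theta condition are hypothetical choices for illustration, not values taken from the study.

```python
import itertools
import numpy as np

# Crossed design: 4 time-point levels x 3 sample sizes x 2 theta conditions
# x 2 discrimination conditions = 48 cells
time_points = [2, 5, 10, 30]
sample_sizes = [50, 100, 200]
theta_conditions = ["stable", "changing"]
disc_conditions = ["fixed_at_1", "varying"]
design = list(itertools.product(time_points, sample_sizes, theta_conditions, disc_conditions))
assert len(design) == 48

def simulate_cell(n_days, n_subj, theta_cond, disc_cond, rng):
    """Simulate one cell: each day is a binary 'item' (e.g., smoked today?)."""
    theta = rng.standard_normal(n_subj)[:, None]
    if theta_cond == "changing":
        # hypothetical drift: small random-walk increments across days
        theta = theta + np.cumsum(rng.normal(0, 0.1, (n_subj, n_days)), axis=1)
    a = np.ones(n_days) if disc_cond == "fixed_at_1" else rng.uniform(0.65, 1.2, n_days)
    b = rng.uniform(-1.5, 1.5, n_days)               # hypothetical daily difficulties
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return (rng.uniform(size=p.shape) < p).astype(int)

responses = simulate_cell(30, 200, "changing", "varying", np.random.default_rng(2015))
```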
The relationship between knowledge-monitoring and test performance: A meta-analysis. Castaneda, R. (May, 2015).
Knowledge-monitoring is a crucial component of learning. Students cannot be expected to engage in higher-level metacognitive activities (e.g., monitoring and planning learning strategies) if they fail to adequately differentiate between what they know and do not know. Researchers have been interested in using knowledge-monitoring measures in tests of ability for various purposes (e.g., to increase reliability, add test information, or serve as a diagnostic instrument). However, studies have reported varying strengths of the relationship between knowledge-monitoring and test performance. These differences may be due to (a) type of monitoring measure, (b) knowledge-monitoring item placement, (c) age, (d) knowledge domain, (e) memory retrieval type, and (f) test reliability. The present study addresses these questions from a random-effects meta-analytic perspective, investigating the relationship between knowledge-monitoring and test performance. Based on a sample of 12 studies with participants ranging from first grade to college, the mean Pearson correlation effect size was .47, with knowledge-monitoring item placement accounting for 59.09% of the between-study variance. These results indicate a strong relationship between knowledge-monitoring and test performance (r = .47). However, the more interesting question was where knowledge-monitoring items should be placed for maximum efficiency. Moderator tests indicate that placing knowledge-monitoring items directly before a test question yields the strongest relationship (r = .60). Researchers interested in knowledge-monitoring should therefore place items asking students how confident they are in their responses immediately before the corresponding test question. Further findings and implications are discussed.
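As an illustration of the random-effects machinery behind a pooled correlation, here is a minimal sketch of a DerSimonian-Laird analysis on the Fisher-z scale; the correlations and sample sizes in the usage line are hypothetical and are not the 12 studies synthesized in the poster.

```python
import numpy as np

def random_effects_meta(r, n):
    """DerSimonian-Laird random-effects meta-analysis of correlations,
    pooling on the Fisher-z scale. r: study correlations; n: sample sizes."""
    z = np.arctanh(np.asarray(r, dtype=float))       # Fisher's z transform
    v = 1.0 / (np.asarray(n, dtype=float) - 3.0)     # sampling variance of z
    w = 1.0 / v
    z_fixed = np.sum(w * z) / np.sum(w)
    q = np.sum(w * (z - z_fixed) ** 2)               # Q heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(z) - 1)) / c)          # between-study variance
    w_star = 1.0 / (v + tau2)
    z_pooled = np.sum(w_star * z) / np.sum(w_star)
    return np.tanh(z_pooled), tau2                   # pooled r and tau^2

# Hypothetical example values, for illustration only:
pooled_r, tau2 = random_effects_meta(r=[0.35, 0.52, 0.47, 0.60], n=[80, 120, 60, 150])
```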
Using a vocabulary measure as a diagnostic tool for knowledge monitoring. Castaneda, R., & Vevea, J. L. (March, 2015).
Researchers have demonstrated that students who engage in metacognitive activities learn faster and develop larger vocabularies than those who do not. Our measure can be used to assess vocabulary depth and self-checking in elementary school students, which may make it a useful tool for screening highly inaccurate (low metacognitive monitoring) students.
I am currently completing my dissertation on modeling local item independence using random-effects meta-analysis.
Item response theory
Person fit statistics
Factor analysis
Latent class analysis
Hierarchical linear modeling
Structural equation modeling