Test-retest reliability and concurrent validity of the South African Early Learning Outcomes Measure


Introduction
United Nations Sustainable Development Goal (SDG) Target 4.2 states that by 2030 countries should 'ensure that all girls and boys have access to quality early childhood development, care and pre-primary education so that they are ready for primary education' (United Nations n.d.).
A key requirement of efforts to assess this outcome is the availability of reliable and valid population-level instruments suitable for children from a wide range of ethnolinguistic backgrounds, which can be used to track country attainment of SDG Target 4.2. The Early Learning Outcomes Measure (ELOM) was developed to address the need for a locally validated, culturally fair and standardised instrument and has been used in studies of early learning programme outcomes.
As Snelling et al. (2019) note, several international efforts have been made in recent years to generate instruments that measure language, numeracy, cognition and motor development in 3- to 5-year-old children. These include the Early Development Index (Janus 2007; https://edi.offordcentre.com), the International Development and Early Learning Assessment (IDELA) (Dowd et al. 2016; Pisani, Borisova & Dowd 2015; Pisani et al. 2017), the Measure of Development and Early Learning Module (MODEL) of the Measuring Early Learning Quality and Outcomes (MELQO) initiative (http://ecdmeasure.org/about-melqo/what-is-melqo/) and other instruments adapted to local cultural developmental settings and largely covering the same domains as IDELA and MODEL. Examples include the East Asia-Pacific Early Child Development Scales (Rao et al. 2014) and the Tongan Early Human Capability Index (Brinkman & Vu 2016).
Limited availability of locally standardised measures adapted for multi-language and multi-cultural contexts is a challenge for research in most countries in the so-called developing world, and increasingly in the global north. With 11 official languages as well as a number of other mother tongues spoken by smaller groups, including refugees and migrants, South Africa is no exception. The ELOM direct assessment (hereafter the ELOM) was developed in response to the need for a psychometrically sound, standardised South African instrument designed to measure developmental domains associated with readiness to learn in school. Its design was informed by South Africa's National Curriculum Framework from Birth to Four and its National Early Learning and Development Standards, which are consistent with the constructs assessed in the international early development instruments referred to above (Snelling et al. 2019). ELOM items were drawn from reliable and valid instruments, particularly those used in Africa and other developing regions. The ELOM is a population-level instrument designed to measure the developmental status of children aged 50-69 months, and it can be administered by trained non-professionals. It comprises 23 individually administered items clustered in five domains: gross motor development (GMD) measures large muscle control; fine motor coordination and visual motor integration (FMC and VMI) measures the proficiency of children's small muscle use and visual motor integration; emergent numeracy and mathematics (ENM) assesses understanding of numerical concepts, space, symbols, shapes and sizes; cognition and executive functioning (CEF) measures working memory, impulse control, problem-solving skills, critical thinking and ability to form concepts; and emergent literacy and language (ELL) assesses language use and communication skills.
Psychometric analysis has established that ELOM domains are unidimensional and internally consistent, that the instrument is reliable, and provides a fair assessment regardless of the socio-economic status (SES) or ethnolinguistic background (Snelling et al. 2019). Examination of item and domain ceiling effects on an older sample (mean age 75.82 months) compared with that used in the standardisation of the ELOM revealed that apart from three items particularly susceptible to maturation effects (one gross and two fine motor items), the remaining 20 items, four domains and ELOM total score distributions were normally distributed, or only slightly skewed (Dawes et al. 2020). Further information on the ELOM may be found at http://elom.org.za.
In this paper, we report on two further studies on the psychometric properties of the ELOM, which were undertaken to complete the requirements for a psychometrically sound and reliable instrument and which have not previously been reported in the literature. These include ELOM test-retest reliability (Study 1) and ELOM concurrent validity (Study 2).

Study 1: Test-retest reliability of the Early Learning Outcomes Measure
Study 1 aimed at examining the test-retest reliability of the ELOM. The research question of interest here is whether the ELOM produces a consistent result for the same child when tested on two occasions separated by an appropriate time interval. It was hypothesised, therefore, that test scores would be significantly correlated between two administrations of the ELOM.

Research method and design
Sample
Participants were a convenience sample of English- or isiXhosa-speaking children attending two preschools that serve low-income children in Cape Town. Class lists were examined to purposively select all children in the classes who were between the ages of 55 and 69 months. G*Power 3.1.9.4 software was used to determine the sample size required for correlation, which was found to be 37. After data cleaning, a sample size of N = 49 children (M = 60.77 months, standard deviation [SD] = 3.70; range 55-67 months) was realised. This sample is sufficient to detect an effect of 0.50 with power set at 0.80 (α = 0.05). The sample consisted of 24 male and 25 female participants, of whom 30 were English-speaking and 19 isiXhosa-speaking children.

Preparation of data for analysis
Records were excluded for children who were likely to show very poor performance because of learning difficulties, and for invalid assessments (incomplete protocols or protocols with scoring errors). Once the data had been checked and cleaned, they were imported into the Statistical Package for the Social Sciences (SPSS) version 25.0 (IBM Corp. 2017).

Measure
Children were assessed on the ELOM as described above. The total score and scores on all five domains were used for the analysis of test-retest reliability.

Test-retest procedure
Test-retest reliability is solely related to variability in a child's performance over time. For developmental tests such as the ELOM, a short interval between the two assessments is recommended so that differences between scores reflect chance error rather than actual changes in the child's characteristics resulting from development (Multon 2010). Whilst a period of up to 4 weeks between assessments may be acceptable for older children and adolescents (depending on the measure), a shorter time period is recommended for preschoolers as they develop at a faster rate (Briggs-Gowan et al. 2016). WPPSI-IV test-retest intervals ranged from 7 to 48 days with an average of 23 days (Syeda & Climie 2014), whilst the testing intervals for the Early Screening Inventory were 7-10 days apart (Meisels et al. 1993). Following this pattern, the interval between test and retest in the current study was 7 days.
The ELOM was administered to each child at preschool by certified ELOM assessors (http://elom.org.za/forassessors/). Each administration took approximately 45 min to an hour. To limit the likelihood of fatigue, which reduces the reliability of assessments, all children were tested in the morning (Furr & Bacharach 2014). Assessors captured the children's information and test performance on tablets programmed to calculate the ELOM domain and total standard scores. Following each assessment, the record was uploaded to a password-protected central server and was kept confidential.

Data analysis
The Pearson product-moment correlation was used to assess the relationship between children's ELOM scores derived at the two times of measurement (Rust & Golombok 2014; Warner 2013). Study 1 drew on other studies of test-retest reliability with similar instruments to set a criterion for an acceptable correlation between the scores derived at the two points of measurement. Bryant and Roffe (1978) reported the test-retest reliability (Pearson's r) of the McCarthy Scales to range from 0.71 to 0.85. As the WPPSI-IV and the ELOM are more comparable instruments (see Study 2), we followed Syeda and Climie (2014) in setting the criterion for an acceptable ELOM test-retest reliability coefficient at 0.75.
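The analysis described above can be illustrated with a minimal Python sketch. It is not the authors' procedure (the study used SPSS), and the scores, the function names `pearson_r` and `meets_criterion`, and the constant `CRITERION` are hypothetical, introduced only for illustration. The sketch computes Pearson's r between two administrations and compares it with the 0.75 criterion.

```python
import math

CRITERION = 0.75  # acceptable test-retest coefficient chosen in the text


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


def meets_criterion(r, criterion=CRITERION):
    """True when a coefficient reaches the chosen reliability criterion."""
    return r >= criterion


# Hypothetical total scores for six children at the two administrations
# (illustrative values only, not data from the study).
time_1 = [48.0, 52.5, 39.0, 61.0, 55.5, 44.0]
time_2 = [50.0, 51.0, 41.5, 63.0, 54.0, 47.0]

r = pearson_r(time_1, time_2)
print(f"r = {r:.2f}, meets criterion: {meets_criterion(r)}")
```

The same comparison can be applied to each domain score in turn, so that the total score and the five domains are each judged against a single threshold.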

Ethical considerations
The study was approved by the University of Cape Town's Humanities Faculty Ethics Committee (PSY2019-024). Participating preschool staff were briefed on the study. Prior to testing, parents or guardians of the participating children were requested to give written informed consent for their child's participation by signing an informed consent form. As there was a high likelihood of parents forgetting to return the forms to school with the child, passive consent was used where necessary (as approved by the Ethics Committee). Children were informed that they could stop the assessment at any time without consequences and could also request a break during the assessment.

Results
As normality was violated and linearity was fairly but weakly upheld, the data were bootstrapped and confidence intervals were established (Field 2013; how2stats 2019; Swank & Mullen 2017). As is evident (Table 1), the ELOM total score (0.90), FMC and VMI (0.79), ENM (0.76) and arguably ELL (0.74) either exceeded or met the criterion chosen. Cognition and executive functioning (0.64) and GMD (0.50) were below the criterion. The ELOM total score test-retest reliability exceeded the level (0.80) put forward for group-level analysis by Cronbach (Polit 2014). None of the confidence intervals crossed zero, and all were narrow, with a width of less than 0.4 (Cumming 2012).
All correlations were statistically significant at p < 0.001.
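The bootstrapped confidence intervals reported above can be sketched as follows. The text does not state which bootstrap interval SPSS produced, so the percentile method used here is an assumption, as are the helper names (`bootstrap_ci`, `is_narrow_and_nonzero`), the number of resamples and the example scores.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    return cross / math.sqrt(ss_x * ss_y)


def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for Pearson's r.

    Children (score pairs) are resampled with replacement, r is recomputed
    for each resample, and the interval is read from the sorted estimates.
    """
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # r is undefined for a constant resample; draw again
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    tail = (1 - level) / 2
    lower = estimates[math.floor(tail * (n_boot - 1))]
    upper = estimates[math.ceil((1 - tail) * (n_boot - 1))]
    return lower, upper


def is_narrow_and_nonzero(lower, upper, max_width=0.4):
    """Check the two properties noted in the text: the interval excludes
    zero and its width is below max_width."""
    excludes_zero = lower > 0 or upper < 0
    return excludes_zero and (upper - lower) < max_width


# Hypothetical scores for eight children (illustrative values only).
time_1 = [48.0, 52.5, 39.0, 61.0, 55.5, 44.0, 58.0, 41.0]
time_2 = [50.0, 51.0, 41.5, 63.0, 54.0, 47.0, 57.5, 40.0]
low, high = bootstrap_ci(time_1, time_2, n_boot=1000)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

Resampling whole score pairs, rather than each administration separately, preserves the within-child link between the two occasions, which is the quantity the correlation describes.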

Discussion
In this study population, the ELOM total score showed excellent test-retest reliability (0.90) over a 7-day period. This finding is in line with the test-retest reliability of the WPPSI-IV Full Scale IQ (0.93) and its composite scores (0.84-0.89) reported by Syeda and Climie (2014), and with coefficients of 0.82-0.92 for the same WPPSI-IV composites reported by Soares and McCrimmon (2013). The FMC and VMI and ENM domains, and arguably ELL, met the criterion for acceptable test-retest reliability chosen for this study.
Study 1 has limitations. A convenience sample was used as random sampling was not practical in the two schools from which the children were drawn (all children of the appropriate age had to be included to make up the sample). In addition, the study sample was drawn from children in lower socio-economic groups. It is possible that the test-retest reliability of the ELOM could differ in a study of children from higher SES backgrounds, although this is very unlikely because this form of reliability is a property of the test and not the population. As Aldridge, Dovey and Wade (2017) stated, test-retest reliability:
[R]efers to the systematic examination of consistency, reproducibility, and agreement among two or more measurements of the same individual, using the same tool, under the same conditions (i.e. when we don't expect the individual being measured to have changed on the given outcome). Test-retest studies help us to understand how dependable our measurement tools are likely to be if they are put into wider use in research. (p. 208)
Further studies of the test-retest reliability of the ELOM should nevertheless be conducted with random samples, with children from higher socio-economic backgrounds and with other language groups, to confirm the results reported here.

Study 2: Concurrent validity of the Early Learning Outcomes Measure
The primary aim of Study 2 was to establish the concurrent validity of the ELOM by comparing children's performance on the instrument with core subtests of the WPPSI-IV that measure the same constructs. We investigated whether concurrent validity was demonstrated between ELOM total and WPPSI-IV Full Scale composite scores, and between three selected ELOM domains (FMC and VMI, CEF and ELL) and WPPSI-IV indices (visual spatial, fluid reasoning, processing speed, working memory and verbal comprehension). Study 2 aimed to strengthen the evidence for the validity of the ELOM. Establishing concurrent validity would mean that ELOM results can be interpreted with greater confidence, and thus applied more widely.

Research method and design
Sample
Participants were already enrolled in the Drakenstein Child Health Study (DCHS), a birth cohort study being conducted in Paarl in the Western Cape of South Africa that follows 1000 mother-child dyads from 20-28 weeks' gestation. The DCHS participants are all of low SES and are vulnerable to substance abuse and human immunodeficiency virus (HIV) (Stein et al. 2015). G*Power 3.1.9.4 software was used to determine the sample size required for correlation, with power set to 0.80, an effect size of 0.40 and significance set to 0.05, for a one-tailed test. These requirements yielded a minimum required sample size of 37 (Faul et al. 2009). After cleaning, the sample size of N = 62 (24 male and 38 female participants) provided sufficient statistical power (> 0.80) to assess concurrent validity. The age range of the sample was from 72.98 to 75.97 months (M = 75.05, SD = 0.75). The sample included 45 isiXhosa, 16 Afrikaans and one English speaker. The demographic characteristics of the whole DCHS sample are provided in Stein et al. (2015). These children were older than the ELOM standardisation range (50-69 months). However, as noted above, ceiling effects in this age group are only evident for three of the 23 ELOM items and are not evident for ELOM total and domain scores. It was, therefore, decided that the ELOM could be used in this age group to investigate concurrent validity.
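The sample-size parameters above can be checked approximately with the Fisher z approximation for a correlation. This is a sketch under stated assumptions, not a reproduction of the G*Power computation: G*Power may use a different (exact) routine, so the approximation should land close to the reported minimum of 37 but need not match it exactly. The function name `n_for_correlation` is hypothetical.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80, tails=1):
    """Approximate sample size needed to detect a population correlation r.

    Uses the Fisher z transformation: the statistic atanh(r) has a standard
    error of about 1 / sqrt(n - 3), which gives
    n = ((z_alpha + z_power) / atanh(r)) ** 2 + 3.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must lie strictly between -1 and 1 and be non-zero")
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    z_alpha = NormalDist().inv_cdf(1 - alpha / tails)
    z_power = NormalDist().inv_cdf(power)
    fisher_z = math.atanh(abs(r))
    return math.ceil(((z_alpha + z_power) / fisher_z) ** 2 + 3)


# Parameters stated in the text: effect size 0.40, alpha 0.05,
# power 0.80, one-tailed test.
print(n_for_correlation(0.40, alpha=0.05, power=0.80, tails=1))
```

Larger expected correlations need fewer participants, and a two-tailed test needs more than a one-tailed test with the same settings.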

Preparation of data for analysis
As for Study 1, ELOM Direct Assessment guidelines were used to exclude records of children likely to show very poor performance because of learning difficulties, and records of invalid assessments. Once the data had been cleaned, they were imported into SPSS version 25.0 (IBM Corp. 2017).

Measures
Participants for Study 2 were tested on the ELOM (described above) and the WPPSI-IV core subtests during the 72-month neurocognitive DCHS testing wave in 2019. The ELOM total scores, and scores on three selected domains (FMC and VMI, CEF and ELL), which are related to areas of the WPPSI-IV core subtests, were used to assess concurrent validity. The WPPSI-IV is a standardised intelligence test used for children between 30 and 91 months (Wechsler 2012a), which has not been standardised for use in South Africa. Strong test-retest reliability and concurrent validity have been established (Thorndike 2014). The WPPSI-IV Full Scale composite score comprises five Primary Index Scales: verbal comprehension, visual spatial, fluid reasoning, working memory and processing speed indices. Children in the DCHS are tested on the WPPSI-IV core subtests: Information, Similarities, Block Design, Matrix Reasoning, Picture Memory and Bug Search (see Table 2). These contribute to the indices, which combine to derive Full-Scale IQ (the WPPSI-IV Full Scale composite score). The core WPPSI-IV subtests were compared, via WPPSI-IV indices, with ELOM domains (Table 3).

Table 2: WPPSI-IV core subtests.
Subtest | Index | Task | Construct measured
Block design | Visual spatial index | | Visual motor coordination and non-verbal problem-solving abilities (Groth-Marnat 2003; Wechsler 2012b).
Matrix reasoning | Fluid reasoning index | The child chooses the response option that best completes an incomplete matrix. | Fluid intelligence and the ability to analyse the relationship between a whole and its parts (Groth-Marnat 2003; Wechsler 2012b).
Picture memory | Working memory index | The child is shown a picture and must remember the stimulus by choosing it out of a set of response options. |
Bug search | Processing speed index | Children choose, out of an array of insects, the one that matches the target insect. |
WPPSI-IV, Wechsler Preschool and Primary Scale of Intelligence, Fourth Edition.

Concurrent validity procedure
DCHS assessors with postgraduate psychology qualifications administered both tests to children in private rooms at study sites. The ELOM was administered in the child's home language (the instrument is available in Afrikaans and isiXhosa). As the WPPSI-IV has not been translated into South African languages, it was administered in English with translation into isiXhosa or Afrikaans by an assistant during the testing sessions. To reduce the likelihood of variation in translations, the DCHS devised standard translations for use by all assistants. All WPPSI-IV translations were forward and back translated; thereafter, translation consensus meetings were held with community nursing staff and the translators to ensure that the translations were age and context appropriate.
Both instruments were administered on the same day, with the ELOM first and then the WPPSI-IV. Children were given a break between the two testing sessions.

Data analysis
Pearson's correlation coefficient (r) was used to measure the strength of relationships between WPPSI-IV core subtests, WPPSI-IV indices, ELOM items and ELOM domains. The criteria for acceptable r (see Table 4) followed Swank and Mullen (2017), who noted that correlation coefficients used in testing validity are lower than in other applications of correlation, because abstract or latent constructs result in measurement complexities.

Results
Descriptive statistics for ELOM and WPPSI-IV scores are presented in Table 5 and Table 6, respectively. The correlations between ELOM and WPPSI-IV scores are provided in Table 7. The very high correlation (r = 0.64; p < 0.001) between the ELOM total score and the WPPSI-IV Full Scale composite score demonstrates strong concurrent validity. All three ELOM domains yielded a high or very high correlation with the WPPSI-IV Full Scale composite score (p < 0.001). The expected correlations from Table 3 are highlighted in Table 8, which shows the strongest relationships between ELOM domains and WPPSI-IV subtests. Significant correlations were observed when the ELOM items were individually correlated with the WPPSI-IV core subtests, with results shown in Table 9.

Discussion
Strong concurrent validity of the ELOM with the WPPSI-IV has been established in this sample. Both tests measure similar constructs. The very high and significant correlation between the ELOM total score and the WPPSI-IV Full Scale composite score suggests that the ELOM total score could be used as a proxy indicator of IQ, particularly as the ELOM is standardised for South Africa, whereas the WPPSI-IV is not. However, investigation of the relationship between the two tests in children from across a wide range of socio-economic backgrounds is necessary before this can be confirmed. The FMC and VMI domain showed the strongest correlation with WPPSI-IV Bug Search, suggesting that they are measuring similar constructs, perhaps a visual aspect. The CEF domain showed the strongest correlation with WPPSI-IV Block Design, suggesting that they are measuring similar constructs, potentially non-verbal problem solving and spatial perception (Groth-Marnat 2003; Wechsler 2012b). As expected, the ELL domain showed the strongest correlation with the WPPSI-IV VCI composite score (see Table 3).
A limitation of Study 2 is that the sample was 6 months older than the ELOM standardisation sample and that all children were from low socio-economic backgrounds, as the DCHS tracks the development of children growing up in high-risk circumstances (Stein et al. 2015). Replication with children from the full range of socio-economic backgrounds is recommended.

Conclusion
The ELOM was developed because of the lack of standardised instruments in South Africa suitable for measuring early learning programme effects and children's readiness to learn in the Grade R year (Snelling et al. 2019). It is the first psychometrically robust population-level South African instrument that can be administered by trained non-professionals at low cost, and it is used to assess preschool children from across a wide range of socio-economic and ethnolinguistic backgrounds. Prior to the current studies, test-retest reliability and concurrent validity had not been established. Whilst the concurrent validity of the GMD and the ENM domains of the ELOM remains to be established, these studies have enhanced the psychometric properties of the measure.

Table 3 (row recovered): Block design | Visual spatial | Fine motor coordination and visual motor integration | Visual motor coordination and visual spatial integration are important aspects of fine motor skills (Carlson, Rowe & Curby 2013; Decker et al. 2011).

Table notes:
‡, CI stands for confidence interval, calculated at the 95% interval. LL stands for lower limit, and UL stands for upper limit. §, It is noted that these confidence intervals span across zero, so these correlations should be interpreted with caution. *, p < 0.05; **, p < 0.01; ***, p < 0.001.
ELOM, early learning outcomes measure; FMC, fine motor coordination; VMI, visual motor integration; CEF, cognition and executive functioning; ELL, emergent literacy and language; CI, confidence interval; LL, lower limit; UL, upper limit. †, It is noted that these confidence intervals span across zero, so these correlations should be interpreted with caution. ‡, r represents the Pearson correlation statistics, in Rho. §, Item-by-item analyses of the gross motor development domain and the emergent numeracy and mathematics domain were excluded from this study as these domains do not fall into the scope of this research. *, p < 0.05; **, p < 0.01; ***, p < 0.001.