Evidence for goal- and mixed evidence for false belief-based action prediction in 2- to 4-year-old children: A large-scale longitudinal anticipatory looking replication study

Unsuccessful replication attempts of paradigms assessing children’s implicit tracking of false beliefs have sparked a debate on whether children have an implicit understanding of false beliefs before the age of four. A novel multi-trial anticipatory looking false belief paradigm yielded evidence of implicit false belief reasoning in 3- to 4-year-old children using a combined score of two false belief conditions (Grosse Wiesmann, C., Friederici, A. D., Singer, T., & Steinbeis, N. [2017]. Developmental Science, 20(5), e12445). The present study is a large-scale replication attempt of this paradigm. The task was administered three times to the same sample of N = 185 children at 2, 3, and 4 years of age. Using the original stimuli, we did not replicate the original finding of above-chance belief-congruent looking in a combined score of two false belief conditions in any of the three age groups. Interestingly, the overall pattern of results was comparable to the original study. Post-hoc analyses revealed, however, that children performed above chance in one false belief condition (FB1) and below chance in the other false belief condition (FB2), thus yielding mixed evidence of children’s false belief-based action predictions. Similar to the original study, participants’ performance did not change with age and was not related to children’s general language skills. This study demonstrates the importance of large-scale replications and adds to the growing body of research questioning the validity and reliability of anticipatory looking false belief paradigms as a robust measure of children’s implicit tracking of beliefs.

In contrast to the traditional view that children only develop false belief understanding around the age of four, recent studies utilizing novel task formats have yielded evidence of false belief understanding at a much earlier age. Here, we will focus on anticipatory looking paradigms. Such paradigms make use of humans' tendency to anticipate actions while observing them, a tendency that develops by the end of the first year of life (e.g., Falck-Ytter et al., 2006; Flanagan & Johansson, 2003). Within anticipatory looking paradigms, actions that are performed based upon certain mental states (such as intentions or true/false beliefs) are presented, and participants' action anticipation is observed to determine whether participants take the agent's mental state into consideration when predicting their action. Using such an intriguing new task design, Clements and Perner (1994) were the first to measure children's false belief understanding employing a nonverbal paradigm. Enacting a standard change-of-location false belief task, they tracked children's anticipatory looks while prompting them to anticipate the agent's behavior. After the anticipatory phase, the explicit false belief action prediction question was asked. Analyzing children's anticipatory looks, Clements and Perner (1994) found that children from 2;11 years on reliably looked at the old location of the target object in false belief trials and the current location of the target object in the true belief trials, thereby indicating an implicit understanding of false belief. Strikingly, the same children were unable to correctly predict the agent's searching behavior verbally in the false belief trials, exhibiting a lack of explicit false belief understanding (for an overview of the conceptual and terminological implicit-explicit distinction, see Perner & Roessler, 2012).
Since then, a number of studies using anticipatory looking paradigms have contributed supporting evidence for the view that even 2-year-old children and infants possess implicit false belief understanding (e.g., Ruffman et al., 2001; Senju et al., 2011; Southgate et al., 2007; Surian & Geraci, 2012; Surian & Franchin, 2020; Thoermer et al., 2012; Wang et al., 2012).
In the past years, competing accounts of early, implicit false belief understanding have been formulated based on these important findings. These accounts have tried to reconcile the new findings with the traditional results from standard explicit false belief tasks. On the one hand, proponents of the conceptual continuity view assume that Theory of Mind abilities are present from infancy on. They attribute the failure of younger children in explicit false belief tasks to children's struggle with the task demands of these explicit tasks, such as their limited inhibitory control (e.g., Wang & Leslie, 2016) or their insufficient pragmatic skills when interpreting the test question (Siegal & Beattie, 1991). Therefore, in spontaneous-response tasks in which inhibitory and pragmatic demands are reduced, children can succeed at a much younger age (Baillargeon et al., 2010; Scott, 2017). On the other hand, supporters of a conceptual-change view assume that only the application of behavioral rules (Perner & Roessler, 2012) or infants' well-developed statistical learning skills (Ruffman, 2014) lead to successful performance in spontaneous-response tasks and that Theory of Mind abilities only develop later. Further, Heyes (2014a, 2014b) argues that children's success on implicit tasks stems from domain-general processes such as perceptual novelty. Dual-systems accounts assume that humans are equipped with an implicit and very efficient system for processing mental states which is already present in infants and a second more flexible, but less efficient, explicit system which only develops in the preschool period and is tied to the presence of language abilities and executive functions (Apperly & Butterfill, 2009; Grosse Wiesmann et al., 2017, 2020).
If implicit task formats indeed measure early false belief competencies, continuity between false belief performance in infancy and childhood would be expected (Setoh et al., 2016). In a multi-measure longitudinal study, significant developmental relations between implicit false belief reasoning at 18 months and explicit false belief understanding at 48 months (Thoermer et al., 2012), and at 50, 60, and 70 months (Kloo et al., 2021) as well as belief-based intention understanding at 60 months (Sodian et al., 2016) were found. A tentative concurrent relation between an explicit false belief task and some measures of anticipatory looking (first look but not relative looking duration) has also been reported.

Research highlights
• We investigated early false belief tracking through a large-scale longitudinal replication study
• In a multi-trial anticipatory looking paradigm, we did not replicate the original finding of above-chance belief-congruent looking in 3- and 4-year-olds with a combined measure of two false belief conditions
• We found above-chance false belief-congruent anticipatory looking in 2-, 3-, and 4-year-olds in false belief condition FB1, but below-chance performance in false belief condition FB2
• Our findings suggest that 2- to 4-year-olds track an agent's goal, but not reliably its false belief in an anticipatory looking task
While numerous findings of implicit false belief understanding in infants fueled the controversial debate on children's early mental state reasoning abilities, the past few years have yielded a fast-growing number of partial or failed replication attempts of anticipatory looking tasks assessing implicit false belief understanding, leading to what some researchers consider a replication crisis. According to a meta-analysis by Barone et al. (2019), results in spontaneous-response paradigms depend on the type of paradigm used, with higher performance levels being obtained for violation-of-expectation paradigms than for anticipatory looking or interactive paradigms. In line with this finding, infants' performance often did not correlate across different types of paradigms or across different gaze measures (Poulin-Dubois & Yott, 2018). There were even meaningful performance differences in the same sample of participants depending on which gaze measure (first look or differential looking score) was analyzed. The meta-analysis by Barone et al. (2019) also found that year of publication as well as sample size influenced children's performance: Positive findings of implicit false belief understanding in infants mainly stem from early studies that assessed only small samples of infants. Another problem with anticipatory looking tasks is high exclusion rates (Schuwerk et al., 2018; Southgate et al., 2007), which often lead to decreases in sample sizes.
Because of the single-trial nature of most anticipatory looking false belief tasks,¹ a large number of children were excluded from the analyses because they did not show correct anticipatory looking in the familiarization phase.
¹ Note that violation-of-expectation and interactive helping paradigms also make use of only a single trial to measure children's implicit false belief understanding. Thus, the single-trial nature is a limitation of most implicit tasks.
Moreover, several anticipatory looking studies found above-chance performance in one false belief condition but not in another false belief condition. In a seminal anticipatory looking study by Southgate et al. (2007), two different false belief conditions (FB1 and FB2) were implemented. In FB1 trials, an agent observed the transfer of a target object from location A to location B. The target object was then removed from the scene in the absence of the agent, leading to the agent's false belief about the target's location. In FB2 trials, the agent was already absent during the transfer of the target object from location A to location B and also missed the removal of the target from the scene, leading to their false belief about the target's location. In FB1 trials, the agent thus believed the object to be in the last location (B), whereas in FB2 trials, the agent believed the object to be in the first location (A). In FB1 trials, anticipatory looking at the belief-congruent location coincides with looking at the last location the object occupied. In FB2 trials, however, looking at the belief-congruent location coincides with looking at the first location the object occupied. Thus, in combination, the two false belief conditions serve as controls for each other to rule out the possibility that participants solve the task using alternative strategies. Compared to FB1 trials, FB2 trials pose increased processing and memory demands due to the added intermediate events. A study comparing FB1 and FB2 performance in the paradigm by Southgate et al. (2007) found that 2- to 4-year-olds performed significantly better in FB1 than in FB2 trials, indicating that FB2 trials might be harder to solve (Grosse Wiesmann et al., 2018).
In subsequent replication attempts, researchers were usually only able to replicate the above-chance performance in children and infants in FB1 trials, but not in FB2 trials (e.g., Kulke, von Duhn et al., 2018). This is problematic because children's performance can be interpreted as solid evidence for implicit false belief understanding only if they pass both FB1 and FB2 trials (Baillargeon et al., 2018).
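The logic by which the two false belief conditions control for each other can be made concrete with a minimal sketch. The encoding below is purely illustrative: the location labels, function names, and the idealized "strategies" are assumptions introduced here, not part of any of the cited paradigms. It shows that an observer who always looks at the last location of the object would score perfectly in FB1, at floor in FB2, and exactly at chance on a combined score.

```python
# Hypothetical encoding of the two conditions: "A" is the first location of
# the object, "B" the last location before the object is removed.
BELIEF_CONGRUENT = {"FB1": "B", "FB2": "A"}


def last_location_strategy(condition):
    """Non-mentalistic observer: always looks where the object was last."""
    return "B"


def belief_tracking_strategy(condition):
    """Idealized observer who tracks what the agent has witnessed."""
    return BELIEF_CONGRUENT[condition]


def score(strategy, conditions=("FB1", "FB2")):
    """Return belief-congruent looking per condition and the combined mean."""
    per_condition = {
        c: float(strategy(c) == BELIEF_CONGRUENT[c]) for c in conditions
    }
    combined = sum(per_condition.values()) / len(per_condition)
    return per_condition, combined


print(score(last_location_strategy))    # ({'FB1': 1.0, 'FB2': 0.0}, 0.5)
print(score(belief_tracking_strategy))  # ({'FB1': 1.0, 'FB2': 1.0}, 1.0)
```

Under these simplifying assumptions, success in FB1 alone cannot distinguish the two observers, which is why performance in both conditions is needed.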
The difficulties in replicating the original findings of false belief-congruent looking in infants and young children are worrisome and call for novel paradigms which can reliably and robustly assess infants' early false belief understanding. A recent promising study by Grosse Wiesmann et al. (2017) addressed the issue that most paradigms rely on only a single trial to measure belief understanding. In their anticipatory looking change-of-location task, each child watched six FB1 and six FB2 trials. In each trial, children anticipated the behavior of an animal agent who was searching for a mouse. Aggregated over trials and over both conditions (FB1 and FB2), the authors found belief-congruent looking above chance level in 3- and 4-year-old children.
In the present longitudinal study, we conducted a large-scale replication attempt of the multi-trial anticipatory looking task by Grosse Wiesmann et al. (2017) in 27-, 36-, and 52-month-old children. First, we aimed at closely replicating the original finding of above-chance false belief performance in 3- and 4-year-old children. While children's mean age in the original study was 39.6 and 51.6 months, the age of the 3- and 4-year-olds in the present study was within the originally tested age range. In addition, we assessed 2-year-olds to explore whether the paradigm is also sensitive to implicit tracking of beliefs in children below the age of three. Second, we were interested in longitudinal performance trajectories in the age range from 2 to 4 years. As in Grosse Wiesmann et al. (2017), we analyzed relations with children's general language abilities and, for the two older age groups, relations with explicit false belief understanding.

METHOD
This study was preregistered using the replication recipe by Brandt et al. (2014). The preregistration and the eye tracking data can be found at OSF (https://osf.io/eyvsr/). We report how we determined the sample size, all data exclusions, all manipulations, and all measures in this study. The individual demographic information cannot be shared for data protection reasons.

Participants
The present study was part of a large longitudinal research project assessing the role of language in Theory of Mind development from 2 to 4 years. We report data from three measurement points. Children were
The small telescopes approach by Simonsohn (2015) recommends that replication attempts should have sample sizes large enough to detect an effect that the original study had 33% power to detect. That is, a sample approximately 2.5 times as large as that of the original study is needed to detect such effects with sufficiently high power (i.e., with 80% power). While our oldest sample was slightly below this criterion, our two younger samples substantially exceeded this recommendation. Moreover, the longitudinal combination of these data sets additionally increases our study's power (Vickers, 2003). We followed the small telescopes approach (Simonsohn, 2015) to determine our sample size because effect sizes obtained in previous studies of implicit false belief understanding measured in anticipatory looking paradigms vary greatly, so that no reliable effect size estimate was available for an a priori power calculation.

Tasks and procedure
The anticipatory looking false belief task was performed at the ages
The true belief trials from the original study were left out because of time constraints in the overarching study and because the main goal was to replicate the above-chance performance in the false belief trials.
In the original study, the true belief trials aimed at keeping up children's anticipatory looking by showing an action outcome of the trial which was not provided in the false belief trials. Further, the true belief trials were meant to provide a performance baseline for children's anticipatory looking which we reasoned could also be provided by the familiarization trials.
In each trial, children watched a mouse enter the scene, followed by another agent (one of eight other animals). Subsequently, the mouse entered a y-shaped tunnel and exited it into one of two boxes situated at the tunnel's arms. The agent witnessed these events. In the familiarization trials (FAM), the agent immediately followed the mouse through the tunnel and opened the box in which the mouse was hiding. The content of the FAM trials should clarify for the participants that it was always the agent's goal to try to find the mouse when entering the tunnel. Once the agent had entered the tunnel, the tunnel's endings and the corresponding boxes were illuminated to elicit children's anticipatory looking. The test phase in which children's anticipatory gaze was recorded commenced 540 ms before this light effect and ended 40 ms before the first part of the animal was visible exiting the tunnel. The test phases of the FAM trials lasted 2500 ms each.
In the false belief trials, the mouse moved from the box in which it was initially hiding to the other box and then left the scene. Two types of false belief trials were used, which differed with regard to whether the agent watched the transfer of the mouse (FB1) or not (FB2). In neither the FB1 nor the FB2 trials did the agent see the mouse finally leave the scene after this transfer. Thus, in both types of false belief trials, the agent held a false belief regarding the mouse's current location. In the FB1 trials, the agent assumed that the mouse was in the final hiding location although it was actually gone. In the FB2 trials, the agent thought the mouse was in the initial hiding location although it was actually gone. Once the mouse had left the scene, the agent re-appeared and entered the tunnel. The agent had tracked the mouse's prior movements with respective head turns. In combination with the events in the FAM trials, the participants should assume that the agent was trying to find the mouse. Next, the tunnel's endings and the boxes were illuminated. The test phase in which children's anticipatory gazes were recorded commenced 540 ms before this light effect and ended 80 ms before the end of the trial as in the original study. In the false belief trials, the agent did not re-appear at either end of the tunnel. The test phases in the false belief trials lasted 2940 ms each.
In Figure 1, the events in the FAM, FB1, and FB2 trials are displayed.
All trials were arranged in two different randomizations, of which each child watched only one per measurement point. The trials were spread out over two blocks with a short break in-between. While in the original study, two FAM or true belief trials depicting the outcome of the trial were conducted before the first FB trial, in the present study, only one FAM trial showing the outcome was presented before the first FB trial.

Low-inhibition false belief task
The low-inhibition false belief task was only conducted at the age of 36 months following the procedure described in Setoh et al. (2016).
In this task, a typical change-of-location story was presented using a picture book. In the story, the protagonist Lilli finds an apple in a bucket and transfers the apple to a basket. While Lilli is outside playing with a ball, her brother finds the apple and takes it away. When

Standard explicit false belief tasks
At the age of 52 months, only the two explicit false belief tasks from the Theory of Mind scale by Wellman and Liu (2004) were conducted to measure explicit false belief understanding. A sum score of both tasks was used.
In the contents false-belief task, children were shown a Smarties box and were asked to guess the content of the box. Once the children had
In the explicit false-belief task, children were shown a picture of a backpack and a picture of a closet and they were told that the figurine
SETK 3-5) was conducted at the age of 36 and 52 months.

SETK 2
The SETK 2 consisted of two language comprehension and two language production subtasks. According to the age-specific norm
For the encoding of semantic relations and morphological rule formation tasks, the stimulus materials were presented on the child's computer via the screen-sharing mode. For the phonological working memory task, the stimulus material was held up to the webcam of the computer. The language memory tasks only required children to repeat sentences and words after the experimenter.

Statistical analysis
All data preprocessing and data analysis were conducted in R 3.4.3 (R Core Team, 2020). Two-tailed tests and a significance level of .05 were used for all analyses. If not indicated otherwise, the original score by Grosse Wiesmann et al. (2017), which is a combination of the first fixation and the longer look score, was used for data analysis. For the analysis of relations between general language and performance in the implicit false belief task, equivalence tests for correlations (Lakens et al., 2018) were used to test whether the relation between general language and implicit false belief was small enough to be considered equivalent to zero. In contrast to null-hypothesis significance testing, equivalence testing can be used to investigate "whether an observed effect is surprisingly small, assuming that a meaningful effect exists in the population" (Lakens et al., 2018, p. 259). Within this procedure, two one-sided t-tests are performed to be able to reject the null hypothesis that there is an effect at least as extreme as a pre-defined smallest effect size of interest. If both tests are significant, the absence of a meaningful effect can be supported. Following one recommendation by Lakens et al. (2018), we determined the smallest effect size of interest such that we had at least 80% power to find it given our sample sizes.
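The two one-sided tests procedure for a correlation can be illustrated with a minimal sketch. It is written in Python rather than R for self-containment and uses Fisher's z transformation with a normal approximation, which is an assumption of this sketch and not necessarily the exact computation used in the present analyses. The function name `tost_correlation` and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def tost_correlation(r, n, bound, alpha=0.05):
    """Two one-sided tests for a Pearson correlation (Fisher z approximation).

    r     : observed correlation
    n     : number of participants
    bound : smallest effect size of interest, used as +/- equivalence bound
    Returns (p_lower, p_upper, p_tost, equivalent).
    """
    if n <= 3 or not 0 < bound < 1 or not -1 < r < 1:
        raise ValueError("need n > 3, 0 < bound < 1, and -1 < r < 1")
    se = 1.0 / math.sqrt(n - 3)          # standard error on the z scale
    z_r = math.atanh(r)                  # Fisher z of the observed r
    norm = NormalDist()
    # Test 1: reject H0 "rho <= -bound" if r lies sufficiently above -bound.
    p_lower = 1.0 - norm.cdf((z_r - math.atanh(-bound)) / se)
    # Test 2: reject H0 "rho >= +bound" if r lies sufficiently below +bound.
    p_upper = norm.cdf((z_r - math.atanh(bound)) / se)
    # Equivalence is supported only if BOTH one-sided tests are significant.
    p_tost = max(p_lower, p_upper)
    return p_lower, p_upper, p_tost, p_tost < alpha


# Hypothetical example values, not taken from the study's data.
print(tost_correlation(r=0.02, n=150, bound=0.23))
```

Reporting the larger of the two p-values is a common convention, since equivalence requires rejecting both one-sided null hypotheses.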
Violin plots were created to visualize the distribution of the data within each measurement point and condition. Violin plots are similar to box plots but also depict the probability density of the data to represent the data distribution. The probability density function is estimated with a kernel density estimator such that the obtained function fits the observed data well. The thicker sections of the plot indicate a higher probability that members of the population take on a value in this range, whereas the thinner sections stand for a lower probability that members of the population fall into this value range.
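How a kernel density estimator produces the outline of a violin plot can be sketched as follows. This is a minimal illustration with a Gaussian kernel and a normal-reference rule-of-thumb bandwidth; both choices, as well as the function name and the example scores, are assumptions of this sketch rather than the settings used for the figures in this study.

```python
import math
from statistics import stdev


def gaussian_kde(data, bandwidth=None):
    """Return a function that estimates the probability density at x."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    if bandwidth is None:
        # Normal-reference rule of thumb: 1.06 * SD * n^(-1/5).
        bandwidth = 1.06 * stdev(data) * n ** (-1 / 5)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm_const = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centered on each observation.
        return norm_const * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Hypothetical looking scores between 0 and 1, for illustration only.
scores = [0.10, 0.25, 0.40, 0.50, 0.55, 0.60, 0.75, 0.90]
density = gaussian_kde(scores)
# The width of the violin at a given score is proportional to density(score).
print(density(0.5), density(0.0))
```

The estimated density integrates to one, so the thicker regions of a violin mark score ranges where observations are concentrated.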

Descriptive statistics and control analyses
In Table 1, descriptive statistics of the anticipatory looking false belief task at all three measurement points can be found. For a graphical display of the data distribution at each measurement point in each condition, see Figure 2. Independent-samples t-tests on performance in the anticipatory looking false belief task were performed to rule out possible gender effects. No effects of gender were observed (all p-values > .05). Performance in the language tests and the explicit false belief tasks is displayed in Table 2.

Confirmatory analyses
In our confirmatory analyses, we mirrored the analysis plan of the original study for a close comparison. For the anticipatory looking task, chance performance lay at 0.50 for all measurement points. First, children's performance in the FAM trials was analyzed. They performed above chance level at all three measurement points using one-sample
TABLE 1 Descriptive statistics and results of one-sample t-tests on performance in the three conditions (FAM, FB1, and FB2) at all three measurement points using the score by
In the attempt to replicate the finding of Grosse Wiesmann et al.
Table 3.
⁴ As suggested in the review process, we also calculated a more comprehensive 3×3 ANOVA investigating simultaneously the effects of measurement point and condition to fully exploit our longitudinal design. In doing so, we found the same pattern of results (see S7). We did not initially plan to conduct such an ANOVA since the rationale of the study was first to follow the analysis plan in Grosse Wiesmann et al. (2017) (including checking for effects of age) and only in a second step to investigate possible reasons for not replicating the original finding (such as comparing performance between FB1 and FB2 trials).
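The comparison of performance against the chance level of 0.50 with a one-sample t-test can be sketched as follows. The function name and the example scores are hypothetical and serve illustration only. The sketch computes the test statistic, the degrees of freedom, and Cohen's d; the p-value would then be obtained from a t distribution with the returned degrees of freedom in any statistics package.

```python
import math
from statistics import mean, stdev

CHANCE = 0.50  # chance level of the anticipatory looking score


def one_sample_t(scores, mu0=CHANCE):
    """Return (t, df, cohens_d) for a one-sample t-test against mu0."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two observations")
    m = mean(scores)
    sd = stdev(scores)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("scores have zero variance")
    t = (m - mu0) / (sd / math.sqrt(n))
    d = (m - mu0) / sd
    return t, n - 1, d


# Hypothetical per-child scores, not taken from the study's data.
example_scores = [0.60, 0.70, 0.50, 0.80, 0.65, 0.55]
print(one_sample_t(example_scores))
```

A positive t indicates performance above chance and a negative t performance below chance, which is the distinction at issue for the FB1 and FB2 conditions.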

Comparison of looking durations at the initial and final hiding location
As a further investigation of differential looking patterns depending on the agent's belief in FB1 and FB2 trials, we calculated children's total looking durations at the initial and final hiding location of the mouse, separately for the FB1 and the FB2 condition at each measurement point. Then, we conducted three repeated-measures ANOVAs on the looking durations with the within-participants factors condition (FB1 and FB2) and hiding location⁵ (initial and final) to analyze whether children's looking durations at the initial and final hiding location differed depending on the agent's belief. At all three measurement points, there was no significant interaction between condition and location
In the Supplemental Material, we further provide a between-participants analysis of performance only in the first FB1 and FB2 trial to allow comparison of our findings with the results of other single-trial studies (see S8), a procedure also adopted in the replication study by Dörrenberg et al. (2018). Moreover, we analyzed children's progression through the task on a trial-by-trial basis. These results can also be found in the Supplemental Material (S9). Lastly, based on an approach by Anderson and Maxwell (2016), we investigated whether the effect obtained in our study is consistent or inconsistent with the effect obtained in the original study and found that the effect was not inconsistent with the original study's findings. This analysis can also be found in the Supplemental Material (S10).

Relations between the anticipatory looking false belief task and language
⁵ Note that the final hiding location corresponds to the correct, belief-based location in FB1 trials, and the initial hiding location corresponds to the correct, belief-based location in FB2 trials.
Based on Grosse Wiesmann et al.'s (2017) results and the assumptions of the dual-systems account (Apperly & Butterfill, 2009), we expected to find no relation between children's general language skills and their performance in the anticipatory looking false belief task. To investigate this hypothesis, we ran equivalence tests. Following an approach described in Lakens et al. (2018), we chose as the smallest effect size of interest the effect that we had 80% power to detect and used it as the equivalence bounds to test against. This resulted in equivalence bounds of ±0.21 for correlations between the SETK at 24 months and the anticipatory looking task at 27 months, bounds of ±0.23 for correlations at 36 months, and bounds of ±0.32 for correlations at 52 months. As a measure of children's false belief performance in the anticipatory looking false belief task, we again used the mean of the FB1 and FB2 score as in Grosse Wiesmann et al. (2017). We additionally ran analyses with children's performance in the FAM trials. Table 4 shows the results of the equivalence tests and Pearson correlations. The significant results of the equivalence tests indicate that the correlations between language and performance in the anticipatory looking task were not more extreme than the pre-defined equivalence bounds. In line with this, Pearson correlations revealed only non-significant, close-to-zero correlations. This pattern of findings suggests that the true relation between general language and performance in the anticipatory looking task lies within the pre-defined equivalence bounds.
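One way to read the derivation of such bounds is as the smallest correlation that a two-sided test would detect with 80% power at a given sample size. The sketch below computes this quantity with the Fisher z approximation. This reading, the function name, and the sample sizes in the example are assumptions made for illustration; they are not the study's actual computation or its actual numbers of participants.

```python
import math
from statistics import NormalDist


def smallest_detectable_r(n, power=0.80, alpha=0.05):
    """Correlation detectable with the given power (two-sided test).

    Uses the Fisher z approximation, in which atanh(r) has a standard
    error of 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("need n > 3")
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = norm.inv_cdf(power)          # quantile for the desired power
    return math.tanh((z_alpha + z_power) / math.sqrt(n - 3))


# Hypothetical sample sizes, chosen only to show how the bound widens
# as the number of participants decreases.
for n in (175, 145, 75):
    print(n, round(smallest_detectable_r(n), 2))
```

Because the bound depends only on the sample size under this approach, smaller samples at later measurement points necessarily yield wider equivalence bounds.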

Relations between the anticipatory looking false belief task and explicit false belief
Finally, for the two older age groups, we analyzed relations between explicit false belief understanding and performance in the anticipatory looking task. At 36 months, there was a trend for a positive relation between children's performance in the FB1 trials and performance in the low-inhibition false belief task which narrowly failed to reach significance (r(108) = .18, p = .067, point-biserial correlation). Neither performance in the FB2 trials nor performance in the FAM trials was positively related to performance in the low-inhibition false belief task

DISCUSSION
In the present study, we attempted to replicate the finding of false belief-congruent anticipatory looking in young children by conducting the multi-trial, anticipatory looking false belief task by Grosse Wiesmann et al. (2017) in a large sample. As in the original study, we found above-chance performance in the familiarization trials in 2-, 3-, and 4-year-olds. However, we did not find the previously reported above-chance performance in any of the three age groups with the combined false belief score used in the original study (an average of performance in two different false belief conditions, FB1 and FB2). Further investigation of the data indicated that all three age groups performed significantly above chance in FB1 trials but significantly below chance in FB2 trials.
TABLE 4 Results of equivalence tests and Pearson correlations between the anticipatory looking false belief task and general language abilities. Abbreviations: SETK2, SETK3, and SETK4 represent performance in the language development test at 2, 3, and 4 years; FAM = familiarization trials; FB = mean of false belief condition 1 trials and false belief condition 2 trials.
The present study assessed children's implicit belief-tracking abilities using an anticipatory looking paradigm. We attempted to replicate the original study's finding of false belief-congruent looking in 3- and 4-year-olds in this multi-trial paradigm. We closely followed the data collection and data preparation procedure and measures described in Grosse Wiesmann et al. (2017) and utilized the same stimuli apart from two deviations: First, the true belief trials from the original study, which were intended to maintain action anticipation, were left out due to time constraints, and second, the first false belief trial was only preceded by one familiarization trial. Despite using very similar procedures, we did not find above-chance false belief-congruent looking in 27-, 36-, and 52-month-old children. This finding is unlikely to be due to children not grasping the story presented in the task since children performed well above chance level in the familiarization trials, which require simple goal-based action predictions. This above-chance performance indicates that children understood the agent's goal-directed behavior.
In the false belief trials, children additionally needed to consider that the agent held a false belief about the target's location when predicting the agent's actions. Further analyses on the false belief data yielded false belief-congruent looking even at the age of 27 months but only in the false belief condition FB1. In the other false belief condition, FB2, all age groups performed significantly below chance level. This pattern of findings resembles the results of other recent anticipatory looking studies (Kampis et al., 2021; Kulke, Reiß et al., 2018, Study 2b). In FB1 trials, the agent observed the displacement of the target object from location A to location B and was only absent while the target left the scene. In FB2 trials, however, the agent was already absent during the displacement of the target. Our results indicate that children might not have taken into consideration that the agent did not watch the target's transfer in the FB2 trials. Rather, they mostly looked at the last place where they themselves observed the mouse going, neglecting that the agent did not have this information. Many researchers argue that above-chance performance in both FB1 and FB2 trials is required to conclude that participants engaged in implicit false belief reasoning (Baillargeon et al., 2018; Southgate et al., 2007).
While the pattern of our results is comparable to the original study (Grosse Wiesmann et al., 2017), only in the large sample that we collected did the within-participant differences in FB1 and FB2 performance become pronounced enough to suggest that FB1 and FB2 trials might be processed differently. Our finding that children looked longer at the target's final than at the target's initial hiding location independent of the false belief condition suggests that children in all three age groups treated both false belief conditions equally. Thus, children might have neglected the absence of the agent during the last transfer in the FB2 trials, leading to above-chance performance in FB1 but below-chance performance in FB2 trials and overall longer looking durations at the final hiding location. This finding demonstrates the importance of sufficiently large samples to detect such performance differences with sufficient power.
Children's successful performance in FB1 trials could therefore also be explained by the application of a strategy such as 'looking at the last location the target was at'. In other replication attempts of the anticipatory looking task, chance performance in FB2 trials and low performance levels in the familiarization trials were often observed. According to the original authors, this contradicts the idea that infants follow a last-location strategy (Baillargeon et al., 2018). In our study, we found high success rates on FAM trials and below-chance performance in FB2 trials. However, we also observed positive correlations of explicit false belief understanding with FB1 performance, but not with FAM performance, in the older two age groups. This provides a tentative indication that success in FB1 trials might be related to success in a mental state reasoning task and might therefore tap a similar skill. Together with performance patterns observed in other replication studies, it seems unlikely that above-chance performance in FB1 trials can be explained solely by the child applying non-mentalistic behavioral rules (Baillargeon et al., 2018). However, without suitable control conditions, this possibility cannot be ruled out.
Not finding evidence for belief-congruent looking in FB2 trials is well in line with previous research using anticipatory looking false belief tasks (Baillargeon et al., 2018; Grosse Wiesmann et al., 2018; Kampis et al., 2021; Schuwerk et al., 2018).
A direct comparison of FB1 and FB2 performance in our sample showed that participants in all three age groups performed significantly better in FB1 than in FB2 trials, which constitutes a conceptual replication [...] It is possible that children looked at the last location where they saw the target going because they incorrectly remembered that the agent had also observed these actions. Consequently, FB2 trials might draw more heavily on working memory, attention capacity, and inhibitory skills while tracking the target's actions and representing the agent's belief (Baillargeon et al., 2018; Grosse Wiesmann et al., 2018). Also, Senju et al. (2010) argue that participants must maintain the agent's epistemic state longer in FB2 than in FB1 trials, which makes this condition more challenging. The finding that heightened cognitive load in a dual-task design hindered implicit false belief processing in adults (Schneider et al., 2012) is in line with this interpretation and indicates that even low-level, implicit processing of beliefs requires executive functions to some extent. In our study, we measured FB2 performance longitudinally and found no age-related improvement in performance, which limits the argument that young children's memory limitations hindered successful FB2 performance. However, executive functions and working memory capacity still develop beyond the age of 4 years (e.g., Evers, 2019; Garon et al., 2008), such that the requirements of the FB2 tasks might still have been too challenging for the 52-month-olds. In conclusion, FB2 trials may be a less reliable measure of implicit tracking of beliefs due to the additional demands they impose and may therefore not constitute a suitable control condition to assess false belief tracking (Baillargeon et al., 2018).
Despite finding evidence of false belief-congruent anticipatory looking in only one type of false belief trial (FB1) but not in the other (FB2), we observed that all three age groups performed well above chance level in the familiarization trials. Since these trials required children to understand the protagonist's goal (which was to follow the mouse), children's successful performance in these trials might indicate their ability to perform goal-based action predictions. Using the multi-trial paradigm, we observed the same pattern of above-chance FB1 and below-chance FB2 performance known from single-trial studies (Grosse Wiesmann et al., 2018; Kulke, Reiß et al., 2018, Study 2b). Yet, in our study, we found tentative positive, concurrent relations between implicit and explicit false belief reasoning, which lends some support to the view that there is conceptual continuity between implicit and explicit mental state reasoning (Sodian et al., 2020). These relations, however, only emerged for the FB1 condition and not for the FB2 condition. This again indicates that the FB2 condition may not be a suitable measure of implicit false belief understanding.
A limiting factor of our study is the percentage of trials that were excluded from the analysis. More than 20% of all presented trials were excluded because the child did not pay attention during vital moments of the trial, looked away during the test phase, or failed to anticipate. Part of these exclusions can be explained by decreased motivation towards the end of the task and is inherent to the task's multi-trial nature and the young age groups we assessed with it. Nevertheless, due to the paradigm's multi-trial design, we did not need to exclude any participant from the analysis and retained a sufficient amount of data from each child and measurement point.
Moreover, while the anticipatory looking task was administered first at 52 months, it was performed last at the earlier two measurement points due to the design of our underlying longitudinal study. The order of tasks might have influenced performance and motivation in the task differently. Nevertheless, the pattern of performance in all three conditions was comparable across measurement points, indicating that, even if task order had an effect, its size was negligible. A further limitation lies in the considerable variance observed across the progression of the task (see S9) and the fact that participants contributing only a few trials to a specific condition were treated identically to participants contributing all trials of a condition. In the Supplemental Material (S12), we provide information on how many participants contributed fewer than half of the trials to a condition. A final limitation of this study lies in two deviations from the original study. First, due to time constraints, the true belief trials from the original study were not conducted. These trials were intended to maintain children's motivation and action anticipation by showing the action outcome of the trial. We reasoned that the familiarization trials would also serve this purpose, but it is possible that omitting the true belief trials caused decreased action anticipation towards the end of the task.
Yet, the pattern of performance across the entire task was comparable to performance in only the first FB1 or FB2 trial (see S8), which mitigates this limitation. Second, in the present study, only one FAM trial depicting the action outcome of the trial was presented before the first FB1 or FB2 trial, whereas in the original study two trials with action outcome were shown prior to the first false belief trial. Since children still succeeded in the FB1 trials, it seems unlikely that these procedural changes only affected performance in the FB2 trials.

CONCLUSION
The present study is an attempted large-scale replication of the false belief-congruent looking behavior in young children in the anticipatory looking false belief task by Grosse Wiesmann et al. (2017). We conducted the task in three age groups, two of which fell into the same age ranges as in the original study. Further, we closely followed the original study's data preparation and analysis procedure and used the original stimuli and scores. Nevertheless, we did not replicate the finding of overall above-chance false belief-congruent looking in either of

DATA AVAILABILITY STATEMENT
The eyetracking data is available at https://osf.io/eyvsr.

DECLARATIONS OF INTEREST
None.