Estimation of HII Bubble Size Distribution from 21cm Power Spectrum with Artificial Neural Networks

2022-05-24 14:21:38HayatoShimabukuroYiMaoandJianrongTan

Hayato Shimabukuro,Yi Mao,and Jianrong Tan,3

1 Yunnan University,SWIFAR,No.2 North Green Lake Road,Kunming 650500,China;shimabukuro@ynu.edu.cn (HS)

2 Department of Astronomy,Tsinghua University,Beijing 100084,China;ymao@tsinghua.edu.cn (YM)

3 Department of Physics &Astronomy,University of Pennsylvania,209 South 33rd Street,Philadelphia,PA 19104,United States of America

Abstract The bubble size distribution of ionized hydrogen regions probes information about the morphology of H II bubbles during reionization.Conventionally,the H II bubble size distribution can be derived from the tomographic imaging data of the redshifted 21 cm signal from the epoch of reionization,which,however,is observationally challenging even for upcoming large radio interferometer arrays.Given that these interferometers promise to measure the 21 cm power spectrum accurately,we propose a new method,which is based on artificial neural networks,to reconstruct the H II bubble size distribution from the 21 cm power spectrum.We demonstrate that reconstruction from the 21 cm power spectrum can be almost as accurate as being directly measured from the imaging data with fractional error ≾10%,even with thermal noise at the sensitivity level of the Square Kilometre Array.Nevertheless,the reconstruction implicitly exploits the modeling in reionization simulations,and hence the recovered H II bubble size distribution is not an independent summary statistic from the power spectrum,and should be used only as an indicator for understanding H II bubble morphology and its evolution.

Key words:methods:data analysis–methods:numerical–(cosmology:) dark ages–reionization–first stars–(cosmology:) diffuse radiation–cosmology:theory

1.Introduction

The epoch of reionization (EOR) is a unique period of time in cosmic evolution,during which ultraviolet (UV) and X-ray photons emitted from the first luminous objects(e.g.,first stars and galaxies) ionize hydrogen atoms first in the surrounding intergalactic medium (IGM) and form bubbles of H II regions,and eventually these H II bubbles fill the whole Universe by z ≃6 (e.g.,Fan et al.2006).

To unveil the nature of cosmic reionization,the cosmic 21 cm background has emerged as a promising probe of the EOR.The 21 cm line of atomic hydrogen results from the hyperfine transition due to spin coupling (Scott &Rees 1990;Madau et al.1997).The tomographic images of 21 cm brightness temperature can directly tell the spatial distribution of H II bubbles and the complete history of cosmic reionization (e.g.,Furlanetto et al.2006;Pritchard &Loeb 2012).However,making threedimensional(3D)21 cm maps requires high sensitivity and spatial resolution,so it is technically extremely difficult.Instead,ongoing large radio interferometer array experiments,e.g.,the Giant Metrewave Radio Telescope (GMRT)4http://gmrt.ncra.tifr.res.in,the LOw Frequency Array (LOFAR)5http://www.lofar.org,the Murchison Widefield Array (MWA)6http://www.mwatelescope.organd the Precision Array for Probing the Epoch of Reionization(PAPER)7http://eor.berkeley.edu,have first attempted to detect the 21 cm power spectrum from the EOR,a two-point statistic of 21 cm brightness temperature fluctuations (e.g.,Furlanetto et al.2006;Pritchard &Loeb 2012),and have put upper limits on the 21 cm power spectrum(Paciga et al.2013;Yatawatta et al.2013;Tingay et al.2013;Patil et al.2014,2017;Parsons et al.2014;Jelić et al.2014;Ali et al.2015;Dillon et al.2015;Jacobs et al.2015;Pober et al.2015;Mertens et al.2020).Furthermore,upcoming experiments such as the Square Kilometre Array(SKA)8http://www.skatelescope.org(Mellema et al.2013;Koopmans et al.2015)and the Hydrogen Epoch of Reionization Array (HERA)9http://reionization.org(DeBoer et al.2017) promise to measure the 21 cm power spectrum from the EOR for the first time and with high sensitivity (The HERA Collaboration et al.2021).

Understanding the morphology and topology of ionized bubbles(Mellema et al.2015;Kulkarni et al.2016,2017;Hassan et al.2018)is a key question related to the EOR.While the topological features of the 21 cm maps can be described by Minkowski functionals (Gleser et al.2006;Lee et al.2008;Friedrich et al.2011;Hong et al.2014;Yoshiura et al.2017;Chen et al.2019),the morphology of ionized regions can be quantified by measuring the size distribution of H II bubbles (Zahn et al.2007;Mesinger &Furlanetto 2007;Zahn et al.2011;Pan &Barkana 2012;Majumdar et al.2014;Lin et al.2016),which can be measured from the 21 cm maps(Kakiichi et al.2017;Giri et al.2018a,2018b).For example,Kakiichi et al.(2017) suggested a novel technique called“granulometry”for such purpose,based on the idea that granulometry counts the number of ionized bubbles when their sizes are smaller than some threshold.

However,conventional methods for measuring the H II bubble size distribution require high signal-to-noise ratio imaging data of the redshifted 21 cm signal obtained with upcoming radio interferometers such as the SKA.While it is indeed one of the major science goals for the SKA(Koopmans et al.2015),the 21 cm imaging is observationally more challenging than the 21 cm power spectrum measurements.This is because it will take significantly more integration time to reduce the thermal noise at individual pixels in the 21 cm images,in order to compensate the information loss in the process of Fourier transform of visibility data to obtain the imaging maps.But before the 21 cm imaging data become available,can we learn more information about reionization from the 21 cm power spectrum? Specifically,can we reconstruct the H II bubble size distribution from the 21 cm power spectrum measurements?

The 21 cm power spectrum and H II bubble size distribution are distinct statistical quantities,so from an informational point of view,one quantity cannot be used to infer the other directly,if no additional information is utilized.However,if we employ reionization simulations based on underlying reionization modeling,which can predict both observables from the same set of model parameters(“reionization parameters”),then this underlying reionization modeling essentially provides additional information on the connection between the 21 cm power spectrum and H II bubble size distribution.In principle,we can first obtain the best fit values of reionization parameters constrained by the 21 cm power spectrum (Greig &Mesinger 2015,2017a,2018;Shimabukuro &Semelin 2017),and then the H II bubble size distribution can be inferred by running the reionization simulation with the best fit reionization parameters.The disadvantage of this indirect approach is that the degeneracies in reionization parameters may bias the best fitting parameter inference—because such estimations are explicitly model-dependent—and thus result in errors in the estimations of H II bubble size distribution.

Recently,machine learning techniques have been widely applied to 21 cm cosmology in three regimes—parameter estimation(Shimabukuro&Semelin 2017;Gillet et al.2019;Hassan et al.2020;Zhao et al.2022),emulation (Kern et al.2017;Schmit &Pritchard 2018;Jennings et al.2019) and classification (Hassan et al.2019).These examples of applications demonstrate that the Artificial Neural Network(ANN) technique can easily establish the connection between two multi-dimensional variables,or “vectors”,if they are correlated.In this paper,we propose a new method wherein the H II bubble size distribution is reconstructed directly from the 21 cm power spectrum using the ANN.Basically,the networks that connect the input (the 21 cm power spectrum) and the output (the H II bubble size distribution) are trained to match the predicted output to their true values,relying on a large number of simulation samples.Since the intermediate step of reionization parameter inference is bypassed,in principle,the reconstruction of H II bubble size distribution with this direct,data-driven,method can be more accurate than the aforementioned indirect approach—we shall test this point herein.

Note that this ANN-based method is implicitly modeldependent—the training data sets are based on reionization simulations and their modeling.When this method is applied to future 21 cm power spectrum observational data,caution should be taken about the consequence of the modeldependence—the reconstructed H II bubble size distribution is not an independent summary statistic from the power spectrum,and therefore should not be used for reionization parameter inference.Instead,the reconstructed H II bubble size distribution should be regarded only as an indicator for understanding the H II bubble morphology and its evolution.

The rest of this paper is organized as follows.In Section 2,we describe the modeling of cosmic reionization,the 21 cm signal,and the bubble size distribution.In Section 3,we outline the ANN technique.We show our results in Section 4,and give concluding remarks in Section 5.

2.Simulation Data Preparation

2.1.Reionization Simulations

We perform semi-numerical simulations of reionization with the publicly available code 21cmFAST(Mesinger et al.2011).This code is based on the semi-numerical treatment of cosmic reionization and thermal history of the IGM.It quickly generates the fields of density,velocity,ionization field,spin temperature and 21 cm brightness temperature on a grid.This code utilizes the excursion-set approach(Furlanetto et al.2004) to identify ionized regions.Specifically,cells inside a spherical region are identified as ionized,if the number of ionizing photons in that region is larger than that of neutral hydrogen atoms or fcoll(x,R,z)≥ζ-1.Here,ζ is the ionizing efficiency,fcoll(x,R,z)is the collapsed fraction smoothed over a sphere with radius R and center at x and redshift z.The smoothing scale R proceeds from large to small radius until the above condition is satisfied.If this does not happen with R down to the cell size,then the cell at x is considered as partially ionized with the ionized fraction of ζfcoll(x,Rcell,z).While this formalism is based on several simplified assumptions,the ionized field obtained by this formalism is in reasonably good agreement with that generated with full radiative transfer simulations(Zahn et al.2011).

Our simulations were performed on a cubic box of 200 comoving Mpc on each side,with 2563grid cells.We apply the Latin Hypercube Sampling method (McKay et al.1979)to scan the EOR parameter space,with one realization for each set of parameter values.This method is designed to be more efficient than the naive exhaustive grid-based search.To sample N points in an n-dimensional parameter space,we first divide the parameter space into Nnequal interval grids,and then choose a set of parameters from each row and column exclusively at the Latin Hypercube of the parameter space,so there are totally N points chosen.While there are several designs that satisfy that condition,we use the maximum Latin Hypercube algorithm that maximizes the minimum distance between the pairs (Morris &Mitchell 1995),which prevents highly clustered regions and ensures homogeneous sampling.

Our EOR model is parametrized with three parameters as follows.(1)ζ,the ionizing efficiency.(Furlanetto et al.2004,2006),which is a combination of several parameters related to ionizing photons.Here,fescis the fraction of ionizing photons escaping from galaxies into the IGM,f* is the fraction of baryons locked in stars,Nγis the number of ionizing photons produced per baryon in stars andis the mean recombination rate per baryon.The values of these parameters are very uncertain at high redshift(Gnedin et al.2008;Wise &Cen 2009).In our data set,we explore the range of 5 ≤ζ ≤100.

(2) Tvir,the minimum virial temperature of haloes that host ionizing sources.Typically,Tviris about 104K,corresponding to the temperature above which atomic cooling becomes effective.In our data set,we explore the range of 104≤Tvir≤105K.

(3) Rmfp,the mean free path of ionizing photons.The propagation of ionizing photons through the ionized IGM strongly depends on the presence of absorption systems,and the sizes of ionized regions are determined by the balance between the sinks and sources of ionizing photons(see,e.g.,McQuinn et al.2011).This process is modeled by the mean free path of ionizing photons,Rmfp(Sobacchi &Mesinger 2014),i.e.,the typical distance traveled by photons inside ionized regions before they are absorbed.Rmfpis determined by the number density of Lymanlimit systems and the optical depth of ionizing photons to them.In our data set,we explore the range of 2 ≤Rmfp≤20 Mpc in comoving scales.

In this paper,we adopt the standard ΛCDM cosmology with fixed values of cosmological parameters based on the Planck 2016 results (Planck Collaboration et al.2016),(0.678,0.308,0.0484,0.692,0.815,0.968).

2.2.21 cm Power Spectrum

The 21 cm brightness temperature is given by(e.g.,Mellema et al.2013),

Here,TSis the spin temperature of the IGM,Tγis the cosmic microwave background (CMB) temperature and dv||dr||is a peculiar velocity along the line of sight.xHIis neutral fraction of the hydrogen atom gas,andmatter density fluctuations.All variables are evaluated at the redshift z=ν0/ν-1.We focus on the regime in which the gas has been significantly heated,so that TS≫Tγ.For simplicity,we compute the 21 cm signal without accounting for the redshift space distortion.

The simplest observable that radio interferometer arrays can measure is the 21 cm power spectrum which characterizes the fluctuations in the 21 cm brightness temperature.The 21 cm power spectrum is defined by (e.g.,Furlanetto et al.2006)We also use the dimensionless 21 cm power spectrum,k3P21(k)/2π2.

2.3.H II Bubble Size Distribution

In this subsection,we briefly describe how we measure the H II bubble size distribution from the 3D ionization field map directly.While several different methods have been suggested to measure the bubble size and its distribution (e.g.,Mesinger&Furlanetto 2007;Zahn et al.2007,2011;Majumdar et al.2014;Lin et al.2016),there is no consensus on method,due to the fact that the connectivity in 3D ionized regions is highly irregular and complex.In this paper,after the 3D ionized fraction field is obtained from the reionization simulation using 21cmFAST,the bubble size distribution can be measured from the map of ionization field with the method employed by 21cmFAST(for details,please refer to Furlanetto et al.(2004),Zahn et al.(2007),Mesinger &Furlanetto (2007),Zahn et al.(2011),Mesinger et al.(2011)).Specifically,after a pixel of an ionized region is randomly chosen,the distance from this pixel to the nearest pixel of a neutral region along a random direction is measured.This Monte-Carlo procedure is repeated 107times,after which the bubble size distribution can be obtained by taking the volume-weighted average (Zahn et al.2007;Mesinger &Furlanetto 2007).The probability distribution function (PDF) is

where n is the number of bubbles with the bubble size in the range from R to R+dR.Note that the PDF is normalized to unity.

3.Artificial Neural Networks

In this section,we briefly describe the architecture of the ANN.The ANN is a machine learning technique inspired by the natural neuron networks in a human brain.It can be regarded as approximate functions that associate the input data with the output data.By repeated “training” with a set of simulation data (a.k.a.“training data”),the ANN can optimize itself in terms of its capability of predicting the output for a new set of data (a.k.a.“test data”).A typical ANN has a simple architecture that consists of three layers,as illustrated in Figure 1—an input layer,a hidden layer and an output layer,with each layer having a number of neurons.More generally,the number of hidden layers and that of neurons at each layer can be chosen arbitrarily.

Figure 1.Typical architecture of an ANN.The architecture of the ANN consists of an input layer,a hidden layer and an output layer of neurons.Each neuron in a layer is connected to the neurons in the next layer.

Figure 2.MSE evaluated for training samples as a function of the iteration number.

In our paper,for example,we set 14 neurons at the input layer,corresponding to the number of k-bins in the 21 cm power spectrum at each redshift.In the output layer,we set 212 neurons,which is the number of radius bins in the H II size PDF.We set up five hidden layers,each of which contains 212 neurons.

The ANN works as follows.The input data {xj} are fed to the neurons in the input layer.The ithneuron siin the first hidden layer is connected to the jthneuron in the input layer linearly with an associated weighti.e.,

where n is the dimension of the input data.In the hidden layer,the ithneuron is then activated by an activation function φ(s),i.e.,the output of this neuron ti=φ(si).Generally,the activation function is a nonlinear function.We employ the sigmoid function φ(s)=1/(1+e-s),because it has nice properties that it saturates to constant values when |s| is large,and that it is a smooth and differentiable function.Neurons in the next hidden layer are linearly connected to the activated neurons in the previous hidden layer,and then activated by φ(s)in a similar manner.Thanks to the nonlinear activation function,a trained ANN can approximate any function,in principle.The output data in the output layer is a linear combination of the activated neurons in the last hidden layer,

where k is the number of neurons in the last hidden layer,and L-1 is the number of all hidden layers.Note that the output data in the output layer are not activated.

The ANN trains its weights in such a manner that,for a set of training data with known values of input and output vectors,the output data generated by the networks are sufficiently close to the true values.Quantitatively,the weights are adjusted to minimize the cost function which is defined as

where Ntrainis the number of training data sets,m is the number of neurons at the output layer,and y and d are the networkgenerated and the true values of output data,respectively.We need to compute the partial derivative of E with respect to the individual weightsand find the local minimum of E using gradient descent.For this purpose,we employ the back propagation algorithm to compute the trained weights(Rumelhart et al.1986).The number of iterations for this algorithm should be large enough to ensure the convergence of results.Once we have trained the network weights using the training samples,we can make predictions for the output data for test samples,or apply the network to observational data.

In our paper,the input data are the 21 cm power spectrum P21(k,z)at some redshift z,with the wavenumber ranging from k=0.12 to 1.1 Mpc-1in 14 logarithmic k-bins (unless noted otherwise).We choose to avoid the larger-scale modes(k <0.1 Mpc-1) because of foreground contamination (e.g.,Liu et al.2014).The output data are the H II bubble size distribution PDF(R) at the corresponding redshift z,with the bubble size radius distributed in the range of 0.78 ≤R ≤1000 Mpc in Nradius=212 logarithmic R-bins.Our data sets consist of N=1000 realizations of simulations,with one realization for each set of values in the EOR parameter space{ζ,Tvir,Rmfp}.For each realization,we sample the data in 50 different redshifts in the range of z=7–12 (i.e.,Δz=0.1).Thus our total data sets contain 50,000 samples of input and output data.We first use Ntrain=48,000 random samples as the training data (with 9600 samples as the validation data sets)to train our neural network.After that,we apply the trained network to 2000 test samples of 21 cm power spectra,and generate 2000 H II bubble size distributions.These networkgenerated PDFs can be compared with the actual PDFs that are computed from the ionized maps directly(dubbed with“ANN”and“true”in figures or subscripts of quantities throughout this paper,respectively).For the purpose of illustration,unless noted otherwise,we choose a reference case with parameter values ζ=52.0,Tvir=4.5×104K and Rmfp=18.3 Mpc,consistent with current observational constraints on the reionization history (e.g.,Greig &Mesinger 2017b).Note that the allowed regime of mean neutral fraction is0.01 ≤in our data sets and the data withandare excluded from the samples.

For training the networks,we test the convergence of the back propagation algorithm and plot the mean squared error(MSE)

as a function of the iteration number for this algorithm in Figure 2.Here PDF(Ri,α) represents the number of bubbles with size Ri,αin the ithR-bin for the αthtraining sample.We find that the MSE converges to much below 10-4after 2000 iterations,corresponding to a numerical absolute error of 0.01 in the value of PDF,so we set 2000 back propagation iterations for all computations throughout this paper.This means that the PDF generated by our ANN has a numerical limit of 0.01,below which the PDF can be dominated by numerical errors.

Figure 3.(Top)H II bubble size distribution measured from the ionization field(black solid line)and that reconstructed from the 21 cm power spectrum by the ANN(red dashed line)at=0.39for our fiducial test EOR model.The KL divergence in this case is DKL=9.00×10-5.(Bottom) relative error (“RE”)of the ANN-reconstructed PDF with respect to the “true” bubble size distribution.We cut it off at the radius wherein the PDF is smaller than 0.01(the numerical limit set by our ANN).

The networks are tested by evaluating the accuracy of the recovered H II bubble size distribution in terms of the Kullback-Leibler (KL) divergence (Kullback &Leibler 1951).The KL divergence is useful in quantifying the similarity between two PDFs Piand Qi(here i is the index for the data points).It is defined as

DKLis close to zero if two PDFs are similar.In our case,Piand Qirepresent PDFtrue(Ri,α)and PDFANN(Ri,α)for a given data sample α,respectively.

4.Results

We apply our trained ANN to the test samples,and,in this section,show the results of reconstructing the H II bubble size distribution PDF from the 21 cm power spectrum,as compared with the PDF measured from the ionized fraction field.We first assume that the input 21 cm power spectrum is the pure signal from the simulation,and will later consider the scenario wherein the input power spectrum contains the thermal noise from radio interferometers.

4.1.ANN Recovery with Pure Signal

In the absence of thermal noise,the ANN-reconstructed H II bubble PDF is compared with the PDF from the ionized fraction field in Figure 3,for our fiducial test model at the stage

Figure 4.Same as Figure 3 but for different stages of reionization at =0.30,0.51,0.67,0.80.Larger corresponds to the earlier stage of reionization.The KL divergence is DKL=2.85×10-5,3.54×10-5,1.44×10-3 and 2.32×10-4,respectively.

when the mean neutral fractionThe KL divergence in this case is 9.00×10-5,and the relative error of the reconstruction,i.e.,systematic error using the ANN,is 10-3-10-1.This demonstrates that the reconstruction method works well.Note that the PDF generated by our ANN has a numerical limit of 0.01,below which the PDF is comparable to the numerical error set by the number of iterations in the back propagation algorithm,so we only calculate the relative error of the reconstructed PDF at the radii wherein the PDF ≥0.01.

We further test the accuracy of reconstruction at different stages of the reionization,and 0.80,in Figure 4.As reionization proceeds from high to low ¯xHI,the peak of the PDF shifts from small to large radii,due to the growth of ionized bubbles.This is consistent with the evolution of the peak of the 21 cm power spectrum which shifts from large to small k,as depicted in Figure 5.The KL divergence for the reconstruction of PDF at all stages of reionization is very close to zero (DKL~10-4),and the relative error is ≾10% for R ≾100 Mpc at all time.

Furthermore,we test the reconstruction for different test models of reionization (see Table 1).For the same mean xHI,the PDFs and 21 cm power spectra can vary for different models of reionization.Figure 6 affirms that at the same meanour fiducial model(Model 1)has the largest radius of the peak PDF,while the peak radius for Model 2 is the smallest.This is consistent with the fact that the peak of the 21 cm power spectrum appears in the smallest k for Model 1 and the largest k for Model 2,as displayed in the right panel of Figure 6.We compare the reconstruction among three different reionization models at the same fixedin the left panel of Figure 6,and find good accuracy for all cases.This indicates that the ANN can distinguish different H II bubble size distributions even if the ionization field is at the same global mean stage.

Figure 5.The 21 cm power spectrum at different stages of reionization,(purple/black/red/blue/green),respectively,for our fiducial test EOR model.=0.30 0.39 0.51 0.67 0.80

Table 1 Parameter Values of Reionization Models used for Test Samples in Figure 6

Figure 6.(Left)Same as Figure 3 but for three different test models given in Table 1(black/red/blue curves for Model 1/2/3,respectively).The KL divergence for Models 1,2 and 3 is DKL=6.78×10-5,4.77×10-3 and 2.35×10-3,respectively.(Right) The 21 cm power spectrum at the same fixed =0.39 for these models.

Figure 7.Distribution of the relative errors of bubble size distribution from all test models at some fixed bubble radii R=15(left),30(middle)and 60 Mpc(right).

To evaluate the accuracy for all test data (with different EOR models and at different redshifts),we plot the distribution histogram of relative error of the reconstructed PDF with respect to the“true”PDF,for some fixed bubble sizes,in Figure 7.In most cases,the relative error is <10%.This demonstrates that the H II bubble size distribution can be recovered successfully with good accuracy from the 21 cm power spectrum using the ANN technique,regardless of the stages of reionization and reionization models.

Figure 8.Same as Figure 3 but using only part of the information in the power spectrum,withk=0.15min,0.21,0.31 and 0.45 Mpc-1 (blue/purple/orange/green dashed,respectively),in comparison with the fiducial case of using all modes (red dotted line).The KL divergence is DKL=9.00×10-5,4.34×10-4,9.77×10-4, Mpc-1,respectively.

Figure 9.Same as the top panel of Figure 8 but for different stages of reionization at =0.30,0.51,0.67,0.80.

Figure 10.Same as Figure 3 but the reconstruction is made by the indirect approach (blue dotted) as well as recovered directly by the 21 cm power spectrum with the ANN (red dashed).

4.2.Scale Dependence of ANN Recovery

In practical observations,the large-scale power may be lost due to the removal of foreground contamination.Therefore,it is important to understand how the minimum wavenumberkminof the 21 cm power spectrum affects the reconstruction of the bubble size distribution,and test this convergence using simulation data.In Figure 8,we compare the H II bubble size distributions recovered by the ANN using the 21 cm power spectrum with varyingk=0.12min(all bins),0.15,0.21,0.31 and 0.45 Mpc-1,atfor our fiducial test EOR model,while the maximum wavenumberis fixed.We find that the reconstructed PDF fromalmost indistinguishable from that using all modes,and the KL divergence is just as good.This value ofconsistent with the peak in the 21 cm power spectrum that contains information on the characteristic bubble size.Also,losing more large-scale information (i.e.,enlargingkmin) can hurt the reconstruction and result in a relative error larger than 10%.This implies that the large-scale information in the 21 cm power spectrum,particularly at the peak of power,is indeed essential for reconstructing the PDF.We further test the scale dependence at different stages of reionization in Figure 9.We find that the largest possiblekmin,which compromises to give a good reconstruction,can depend on the stage of reionization,because the scale of power spectrum should be large enough,compared to the typical bubble size at that moment.For example,powers withcan also give as good reconstruction of PDF as powers of all modes,before reionization proceeds halfway

Figure 11.The 21 cm power spectrum signal (black solid),the total noise power spectrum for the confgiuration of SKA1 (purple dot–dashed) including the contributions from thermal noise (red dashed) and cosmic variance (blue dotted),for the fiducial test model at =0.39.

Figure 12.Same as Figure 3 but the reconstruction with the ANN is from the 21 cm power spectrum with total noise including cosmic variance and thermal noise(with the shaded region representing the 1σ confidence region).The mean and variance of PDF are computed from 10 realizations of random noise.

For the complete information,we also varykmaxwith fixedand find that the recovered PDF and KL divergence only weakly depend onkmax.

4.3.Comparison with Indirect Reconstruction Approach

In Section 1,we commented that there could be an alternative,indirect,reconstruction approach,which is to infer the H II bubble size distribution from the prediction of the underlying reionization simulation with the reionization model parameters that are best fit by the 21 cm power spectrum.This indirect approach is more interpretable and intuitive than the direct ANN-based reconstruction.In this subsection,we compare the accuracies of both approaches in Figure 10.We find that while the indirect approach can outline the approximate shape of the PDF,it can only capture the characteristics in a biased manner,in terms of both location and height of the PDF peak.Quantitatively,the indirect approach results in an error of tens of per cent.This should be due to the degeneracies in reionization parameters that can result in large errors in best fitting the parameter values,which,in turn,leads to errors in the H II bubble size distribution in the indirect approach.In comparison,the direct,ANN-based approach can have an error within 10%,and is therefore more accurate than the indirect approach.

4.4.ANN Recovery with Thermal Noise

In previous subsections,we assume that the input 21 cm power spectrum is the pure signal from simulations.In practical observations,however,the measurements of 21 cm power spectrum contain random noise.For large radio interferometer arrays like the SKA,the noise is dominated by thermal noise at small scales,but cosmic variance becomes important at large scales.In this subsection,we will take into account both thermal noise and cosmic variance,and investigate the effect of the noise power spectrum on the reconstruction of H II bubble size distribution.

The thermal noise power spectrum for a mode k is given by McQuinn et al.(2006);Mao et al.(2008,2013)

Here dA(z) is the comoving angular diameter distance at z,where λ21=λ(z)/(1+z)=0.21 m and H(z)is the Hubble parameter at z.Ω=λ2/Aeis solid angle spanning the field of view and t is the total integration time.Tsysis the system temperature of antenna,which is the sum of the receiver temperature of~100 K and the sky temperature Tsky=60(ν/300 MHz)-2.55K.Compact layout of the radio interferometer array can repeatedly measure one visibility mode,thereby reducing the thermal noise.denotes the number of redundant baselinesL⊥kcorresponding to k⊥within the baseline area equal to the effective area per station Ae.The thermal noise for the mode k depends on the projection of k on the sky planewhereμ=cosθand θ is the angle between the mode k and the line of sight.

The thermal noise for the spherically averaged power spectrum over a k-shell is given by Lidz et al.(2011)

Nc(k,μ) is the number of modes in the ring with μ on the spherical k-shell with the logarithmic step size δk/k=∈,Nc=∈k3Δμ ×vol/4π2,and vol is the survey volume of the sky.The sum here accounts for the noise reduction by combining independent modes.Thus it runs over the upper half shell with positive μ since the brightness temperature is a real-valued field,and only half of the Fourier modes are independent.

The cosmic variance for the 21 cm power spectrum is estimated by

where Nmodes=∈k3×vol/4π2is the number of modes in the upper half k-shell.

In this paper,we consider an experiment similar to the lowfrequency array of SKA Phase 1 (SKA1).Specifically,we assume a configuration such that 224 stations are compactly laid out in the core with 1000 m in diameter,and the minimum baseline between stations is 60 m.We assume that the field of view of a single primary beam is full width at half maximum(FWHM)the effective area per station Ae≈421 m2at z~8,the total integration time is 1000 hr,the bandwidth of a redshift-bin is 10 MHz and the step size of a k-bin is ∈=δk/k=0.1.Figure 11 plots the total noise power spectrum PN(k)=Pthermal(k)+Pcv(k)as well as the respective contributions from cosmic variance and thermal noise.Our result is consistent with previous studies,e.g.,Koopmans et al.(2015).For SKA1,the cosmic variance is always negligible compared to the signal,and the thermal noise is smaller than the signal for k ≤1 Mpc-1.In other words,the 21 cm signal dominates over the noise except at small scales.Since the reconstruction is not sensitive tokmax,as shown in Section 4.2,we expect that the reconstructed PDF should not be significantly affected by the noise.

Figure 13.Same as Figure 12 but for different stages of reionization at =0.30,0.51,0.67,0.80.

Figure 14.Same as Figure 12 but for different integration time of 100 hr (red dot–dashed)and 1000 hr(blue dashed).We only show one realization here,for illustrative purpose.

We model the measured 21 cm power spectrum as P(k)=P21(k)+N(k),where P21(k) is the 21 cm power spectrum signal,and N(k) is a random draw from a Gaussian probability distribution with zero mean and variance equal to the square of total noise power spectrumFor each test EOR model,we generate 10 independent realizations from the total noise power spectrum as the input data for the ANN,and from the 10 different outputs of reconstructed PDFs,compute the mean and variance.Figure 12 displays the 1σ confidence level region of the 10 different outputs of reconstructed PDFs for the fiducial test EOR model atWe also show the evolution of the reconstruction in Figure 13.We find that even if the total noise is accounted for at the sensitivity level of SKA1,the reconstruction of H II bubble size distribution with the ANN still works well at the relative error level of 10%(except for large radii ≿100 Mpc).This finding is confirmed for other test EOR models.

Nevertheless,if we reduce the total integration time from 1000 hr to 100 hr,then the thermal noise will be much larger.As a result,as depicted in Figure 14,the reconstruction of H II bubble size distribution will cause significant systematic error,which is ≿10%at scales smaller than the peak of PDF,but can be tens of per cent at larger scales.

5.Summary and Conclusions

In this paper,we propose a new,ANN-based method to reconstruct the H II bubble size distribution directly from the 21 cm power spectrum.Our method will allow tracing the evolution of H II bubble size distribution during cosmic reionization when only 21 cm power spectrum measurements will be available,but direct measurements of bubble size distribution cannot be done without 3D 21 cm imaging data.Nevertheless,our reconstruction method implicitly exploits the modeling in reionization simulations,and hence the recovered H II bubble size distribution is not an independent summary statistic from the power spectrum,and should be used only as an indicator for understanding H II bubble morphology and its evolution.

We train our neural networks with 48,000 training data sets and tested the networks with 2000 test data sets.These data sets are generated by varying EOR parameters for 1000 realizations with the semi-numerical code 21cmFAST.We use the 21 cm power spectrum for k=0.12–1.1 Mpc-1in 14 k-bins as the input of the networks,and generate the H II bubble size distribution PDF(R) for R=0.78–1000 Mpc in 212 R-bins as the output,at z=7–12.We train the weights of ANN using the back propagation algorithm.

We apply the trained networks to the test data sets to test the accuracy of recovery.We demonstrate that the recovered H II bubble size distribution can be almost as accurate as that directly measured from the ionization map with the fractional error <10%for R ≾100 Mpc at all stages of reionization,with the KL divergence DKL≪1 at all time.This result is generic for a number of EOR models.

We further investigate the main contributions to the reconstruction,and find that the large-scale modes are particularly important.The reconstruction results are sensitive to the minimum wavenumber cutoffkmin,while weakly depending on the maximum wavenumber cutoffkmax.Thekminshould correspond to the scale that is much larger than the typical bubble size,so it depends on the stage of reionization.For the early and middle stageskminmust be smaller than 0.21 Mpc-1,in order for the reconstruction results to converge.For the later stage,e.g.,atkminshould be smaller than 0.15 Mpc-1.

In principle,the reconstruction of PDF can be achieved alternatively with an indirect approach,which is to first constrain the best fit values of reionization parameters from the 21 cm power spectrum measurements and then obtain the H II bubble size distribution from the prediction of reionization simulations using the best fit reionization parameters.However,this indirect approach can result in an error of tens of per cent in the reconstructed PDF,in comparison with the <10% error in our direct,ANN-based method.

Our reconstruction is tested when the thermal noise and cosmic variance at the SKA1 noise level are applied to the 21 cm power spectrum.Since the total noise for SKA1 is subdominant for k ≾1 Mpc-1,assuming the integration time of 1000 hrs,our reconstruction results are not affected much by the noise,i.e.,the recovered PDF agrees with that directly measured from the ionization map with the fractional error<10% for the radii R ≾100 Mpc at all stages of reionization.

Note that this fractional error of ≾10% refers to the difference in the reconstructed PDF with respect to the true value.These are the systematic errors using the ANN.However,an estimation of statistical uncertainties is not performed in this paper.In principle,this can be done by neural network techniques such as the density-estimation likelihood-free inference (Alsing et al.2018,2019;Zhao et al.2022).We defer the implementation of this technique to future work.

Acknowledgments

This work is supported by the National SKA Program of China (Grant No.2020SKA0110401),NSFC (Grant Nos.12103044,11821303 and 11850410429) and National Key R&D Program of China (Grant No.2018YFA0404502).H.S.was also supported in part by grants from Yunnan University.We thank Rennan Barkana,Xuelei Chen,Anastasia Fialkov,Nicolas Gillet,Andrei Mesinger and Benjamin Wandelt for fruitful discussions and valuable feedbacks.

ORCID iDs