Title:

Bayesian estimation of luminosity distributions and model-based classification of astrophysical sources

The distribution of the flux (observed luminosity) of astrophysical objects is of great interest as a measure of the evolution of various types of astronomical source populations and for testing theoretical assumptions about the Universe. This distribution is examined using the cumulative number of sources (N) detected at or above a given flux (S), known to astronomers as the log(N) − log(S) curve. Estimating the log(N) − log(S) curve from observational data can be challenging, however, because statistical fluctuations in the measurements and detector biases lead to measurement uncertainties. Moreover, the location of a source relative to the centre of the observation, together with the background contamination, can lead to non-detection of sources (missing data). Non-detection is more likely for low-flux objects, which indicates that the missing data mechanism is non-ignorable. To avoid inferential biases, it is vital that the different sources of uncertainty, the potential biases and the missing data mechanism be properly accounted for. However, the majority of the methods in the relevant literature for estimating the log(N) − log(S) curve assume complete surveys with no missing data. In this thesis, we present a Bayesian hierarchical model that properly accounts for the missing data mechanism and the other sources of uncertainty. More specifically, we model the joint distribution of the complete data and the model parameters, and then derive the posterior distribution of the model parameters marginalised over the missing data. We use a blocked Gibbs sampler to draw samples from the joint posterior distribution of the parameters of interest. By using a Bayesian approach, we produce a posterior distribution for the log(N) − log(S) curve instead of a single best-fit estimate. We apply this method to the Chandra Deep Field South (CDFS) dataset.
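As a point of reference, the naive complete-survey calculation that the hierarchical model is meant to improve upon can be sketched as follows. This is only an illustration under stated assumptions, not the method of this thesis: the fluxes are simulated from a single power law, every source is assumed to be detected, fluxes are assumed to be measured without error, and the names `empirical_logn_logs`, `theta` and `s_min` are hypothetical.

```python
import numpy as np

def empirical_logn_logs(fluxes):
    """Return (log10 S, log10 N(>=S)) for a sample assumed to be complete."""
    s = np.sort(np.asarray(fluxes, dtype=float))[::-1]  # brightest source first
    n = np.arange(1, s.size + 1)  # number of sources with flux >= s[i]
    return np.log10(s), np.log10(n)

# Hypothetical complete sample: power-law (Pareto) fluxes above s_min,
# so that N(>S) is proportional to S**(-theta). Values are illustrative only.
rng = np.random.default_rng(0)
theta, s_min = 1.5, 1e-16
u = 1.0 - rng.random(5000)  # uniform on (0, 1], avoids division by zero
fluxes = s_min * u ** (-1.0 / theta)

log_s, log_n = empirical_logn_logs(fluxes)

# Maximum-likelihood slope for a single power law with known s_min.
# It ignores measurement error and non-detections entirely.
theta_hat = fluxes.size / np.sum(np.log(fluxes / s_min))
```

With real survey data the faint end of such a curve flattens because faint sources are preferentially missed, which is the non-ignorable missingness described above.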
Furthermore, approaching this complicated problem from a fully Bayesian angle enables us to appropriately model the uncertainty about the conversion factor between observed source photon counts and observed luminosity. Using relevant spectral data for the observed sources, the uncertainty about the flux-to-count conversion factor γ for each observed source is expressed through MCMC draws from the posterior distribution of γ for that source. To account for this uncertainty in the non-detected sources, we develop a novel statistical approach for fitting a hierarchical prior on the flux-to-count conversion factor based on the MCMC samples from the observed sources (an approach that can be used in many modelling problems of a similar nature). We derive in a similar manner the posterior distribution of the model parameters, marginalised over the missing data, and we explore the impact on our posterior estimates of the parameters of interest in the CDFS dataset.
Studying the log(N) − log(S) relationship for different source populations can give us further insight into the differences between the various types of astronomical populations. Hence, we propose a new soft-clustering scheme for classifying galaxies into different activity classes (Star Forming Galaxies, LINERs, Seyferts and Composites) using 4 optical emission-line ratios simultaneously ([NII]/Hα, [SII]/Hα, [OI]/Hα and [OIII]/Hβ). The most widely used classification approach is based on 3 diagnostic diagrams, which are 2-dimensional projections of those emission-line ratios. Those diagnostics assume fixed classification boundaries, which are derived from theoretical models. However, using multiple diagnostic diagrams independently of one another often gives contradictory classifications for the same galaxy, and the fact that those diagrams are 2-dimensional projections of a complex multidimensional space limits the power of those diagnostics.
In contrast, we present a data-driven soft-clustering scheme that estimates the posterior probability of each galaxy belonging to each activity class. More specifically, we fit a large number of multivariate Gaussian distributions to the Sloan Digital Sky Survey (SDSS) dataset in order to capture local structures, and subsequently group the multivariate Gaussian distributions to represent the complex multidimensional structure of the joint distribution of the 4 galaxy activity classes. Finally, we discuss how this soft-clustering can lead to estimates of population-specific log(N) − log(S) relationships.
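The final step of such a scheme, turning grouped Gaussian components into class probabilities, can be sketched as below. This is a minimal illustration under assumptions rather than the procedure used in the thesis: the mixture weights, means and covariances are taken as already fitted, the component-to-class assignment `comp_to_class` is supplied by hand, and all numerical values and function names are hypothetical.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at each row of x."""
    d = mean.size
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, (x - mean).T)  # whitened residuals, shape (d, n)
    maha = np.sum(z ** 2, axis=0)            # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)

def class_posteriors(x, weights, means, covs, comp_to_class, n_classes):
    """P(class | x): sum the component responsibilities within each class."""
    log_joint = np.stack(
        [np.log(w) + gaussian_logpdf(x, m, c)
         for w, m, c in zip(weights, means, covs)],
        axis=1,
    )                                                  # shape (n, K)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(log_joint)
    resp /= resp.sum(axis=1, keepdims=True)            # component responsibilities
    post = np.zeros((x.shape[0], n_classes))
    for k, c in enumerate(comp_to_class):
        post[:, c] += resp[:, k]
    return post

# Toy example in a 4-dimensional space (standing in for the 4 line ratios),
# with 3 components grouped into 2 classes. All values are made up.
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0, 0.0, 0.0],
                  [0.5, 0.5, 0.5, 0.5],
                  [3.0, 3.0, 3.0, 3.0]])
covs = [0.25 * np.eye(4)] * 3
comp_to_class = [0, 0, 1]  # components 0 and 1 form class 0

x = np.array([[0.1, 0.0, 0.0, 0.1],
              [2.9, 3.0, 3.1, 3.0]])
post = class_posteriors(x, weights, means, covs, comp_to_class, n_classes=2)
```

Each row of `post` sums to one, so a galaxy near a boundary receives split probabilities instead of a hard label.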
