Title:

The importance of statistical measures when describing phenotype

Data collected in life sciences studies mostly include a genotype description of the organism, a phenotype characterisation of the organism, and experiment-specific covariates, including a description of the experimental procedures and the laboratory (environmental) conditions. Here, phenotype measurements are taken for Neurospora crassa (wild type) growing on agar under standard laboratory conditions. I define a phenotype as a set of traits comprising apical extension velocity, branching angle, and branching distance. I use these traits to model the biologically complex network of a filamentous fungus as a simplified 'In Silico Fungus' consisting of a series of straight lines. Phenotype data are often summarised by means and standard deviations, with the central limit theorem invoked as justification. Subsequently, P values are used to show statistical validity. Here, I question whether assuming normality simply because the approach is popular is always justified. I therefore test three scenarios, each making different assumptions about the collected data. (1) Firstly, I use the most popular approach: I assume the phenotype data come from a continuous normal (Gaussian) distribution, and I predict future measurement outcomes using a parametric normal approximation. (2) Secondly, I use the most intuitive approach: I make no assumptions about the collected data, and I predict future measurement outcomes by drawing values pseudorandomly from the actual, raw, discrete dataset. (3) Finally, I use a strategy balanced between the previous two: I construct a customised, continuous, nonparametric distribution from the collected data, and I predict future measurement outcomes using kernel density estimation. I then implement all three strategies, (1), (2), and (3), in the in silico fungus programme to compare the outcomes of the computer simulations.
More specifically, I compare the surface coverage, expressed as the proportion of the surface occupied by the fungus. The results show that the differences between the three data regimes (1), (2), and (3) are significant. I therefore conclude that a correct assessment of data normality is crucial for the correct interpretation and implementation of scientific observations. I suspect that this data classification process determines whether biological findings are implemented successfully, especially in fields such as medicine and engineering.
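
The three data regimes differ only in how a new trait value is drawn, which can be illustrated with a minimal sketch. It is not the in silico fungus programme itself. The function name `make_samplers`, the example angles, and the bandwidth rule (a normal-reference rule of thumb, h = 1.06 s n^(-1/5)) are all assumptions made for illustration; the abstract does not state which kernel or bandwidth was used.

```python
import random
import statistics


def make_samplers(data, seed=0):
    """Return one sampler per data regime for a single trait (hypothetical helper)."""
    rng = random.Random(seed)
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    # Assumption: Gaussian kernel with a normal-reference rule-of-thumb bandwidth.
    bandwidth = 1.06 * sigma * len(data) ** (-1 / 5)

    def normal():
        # Regime (1): parametric normal approximation fitted to the data.
        return rng.gauss(mu, sigma)

    def empirical():
        # Regime (2): draw a raw observation pseudorandomly from the dataset.
        return rng.choice(data)

    def kde():
        # Regime (3): sampling from a Gaussian kernel density estimate is
        # equivalent to picking an observation and adding kernel noise.
        return rng.gauss(rng.choice(data), bandwidth)

    return {"normal": normal, "empirical": empirical, "kde": kde}


# Hypothetical branching angles in degrees (illustrative, not measured data).
angles = [38.0, 41.5, 44.0, 47.5, 52.0, 55.5, 61.0, 70.0, 84.5]
samplers = make_samplers(angles, seed=42)
draws = {name: [draw() for _ in range(5)] for name, draw in samplers.items()}

for name, values in draws.items():
    print(name, [round(v, 1) for v in values])
```

Regime (2) can only ever return values that were actually observed, regime (1) can return any real number (including biologically implausible ones in the tails), and regime (3) stays close to the observed values while still producing a continuous range.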
