Description: |
The processes which govern the function of biological organisms are inherently dynamic, and studying their behaviour over time is critical for gaining insight into their underlying mechanisms. They are also incredibly complex, with tens of thousands of interacting variables comprising their state. In recent years, the development of high-throughput assaying technologies such as microarrays and nuclear magnetic resonance spectroscopy has revolutionised the fields of genomics and metabolomics respectively, through their ability to quickly and easily interrogate these states at a single moment in time. When these assaying technologies are used to collect measurements repeatedly on the same biological unit, such as a human patient, laboratory rat or cell line, the temporal behaviour of the system can begin to emerge. Furthermore, when several of these units are studied simultaneously, the experiment is said to be biologically replicated, and such data sets permit the inference of systemic behaviour in the population as a whole. The time series data sets arising from these replicated 'omics experiments possess unique characteristics that make for challenging statistical analysis. They are very short (3-10 time points is typical), heterogeneous, noisy, frequently irregularly sampled and often have missing observations, in addition to being very high-dimensional. To overcome some of these difficulties, researchers in the field of genomics have turned to functional data analysis, which has proven successful in modelling unreplicated data sets. Replicated data sets, however, have received far less attention, owing to the complexity introduced by the extremely small sample sizes and the multiple levels of variation: between-variable and between-replicate. Furthermore, despite the remarkable similarities between genomics and metabolomics time series data sets, these methods have been far less successful at establishing themselves in the latter field.
In this thesis we present a general statistical framework for the analysis of replicated, high-dimensional biological time series data sets. Supported by three case studies, we develop novel models and algorithms for tackling the unique challenges that each data set presents. We show how these fitted models can be used in dimensionality reduction, summarising the thousands of observed time series into a small number of representative temporal profiles that are eminently biologically interpretable. We introduce a novel moderated functional t-statistic that can be used for detecting variables that differ significantly between two biological groups, leveraging the high dimensionality of the data in order to increase power. In all instances detailed simulation studies are used to demonstrate that the methods outperform existing state-of-the-art approaches. With practical data analysis in mind, careful consideration is given to the implementation of the methods in software that is computationally efficient, with parallel programming exploited wherever possible. In most instances, the methods have resulted in novel biological findings when applied to real data, and represent, as far as we are aware, the first application of such functional data analysis models to metabolomics time series experiments. |