Andrej-Nikolai Spiess, UKE Hamburg, Germany
Abstract
In the last ten years, the amount of experimental data acquired by high-throughput technologies such as microarrays and RNA sequencing (RNA-seq) has increased exponentially, resulting in expression matrices that can reach gigabyte size. It is not uncommon for the researcher to be faced with tables of 20,000 rows (transcripts, genes) and 2,000 columns (samples), necessitating mathematical, computational and visual approaches that are specifically tailored to these high-dimensional datasets. Frequently, the wet-lab scientist "outsources" these analyses to an associated bioinformatics department, receiving in return a sophisticated but often black-box analysis on which to rely. It is therefore important to establish common ground on the existing approaches for analyzing this kind of data. In my talk, I will give a concise and comprehensive overview of existing methods to analyze large-scale gene expression data. Without going into deep mathematical detail – this can be obtained from the literature – I will outline the important aspects and idiosyncrasies of current methodology, based largely on 2D and 3D visual depictions of the data. Starting from basic topics such as data cleaning, normalization and scaling, I will emphasize efforts to uncover the intrinsic signature of the data (without imposing any presumptions), based on unsupervised clustering methods such as hierarchical clustering and on dimension-reduction methods such as PCA (linear) or the recent t-SNE approach (non-linear). I will demonstrate that in published datasets, the intrinsic structure of the data can differ substantially from the one assumed or defined by the experimental setup (for instance due to batch effects). Next, I will summarize how to identify signatures that discriminate between different cellular states and how to use computationally expensive methods (bootstrapping, cross-validation) to avoid extracting signatures that perform well on the training set but poorly on independent data (overfitting). Along these lines, I will give a short introduction to recent machine learning approaches such as random forests, neural networks and gradient boosting, and to their advantage in finding predictive biomarkers and reduced discriminator sets through feature selection. For all the discussed approaches, I will also highlight the different pitfalls, for instance when to correct for multiple testing, why one should never perform a statistical test before clustering, and (quite crucially) how differential expression can be mimicked by shifts in cellular proportions.
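As a rough illustration of the unsupervised part of the workflow (not taken from the talk itself), the following Python/scikit-learn sketch applies per-gene scaling, linear dimension reduction (PCA) and a non-linear embedding (t-SNE) to a hypothetical samples-by-genes expression matrix; the matrix dimensions and the implanted group structure are invented for demonstration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical expression matrix: 200 samples (rows) x 5000 genes (columns)
X = rng.normal(size=(200, 5000))
# Implant a signature in the first 100 samples so an intrinsic structure exists
X[:100, :50] += 2.0

# Per-gene centering and scaling before dimension reduction
X_scaled = StandardScaler().fit_transform(X)

# Linear dimension reduction: project onto the first 50 principal components
X_pca = PCA(n_components=50).fit_transform(X_scaled)

# Non-linear embedding on top of the PCA scores for 2D visualization
X_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(X_pca)
```

Plotting `X_tsne` colored by known covariates (batch, group, run date) is one way to check whether the intrinsic structure of the data matches the experimental design.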
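For the supervised part, a minimal sketch of cross-validated performance estimation and feature selection with a random forest is shown below; sample sizes, class labels and the implanted discriminating genes are again hypothetical and serve only to illustrate the idea of evaluating on held-out data rather than the training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 2000))   # 120 samples, 2000 genes
y = np.repeat([0, 1], 60)          # two hypothetical cellular states
X[y == 1, :25] += 1.0              # small discriminating signature

clf = RandomForestClassifier(n_estimators=500, random_state=0)

# Estimate generalization performance on held-out folds (guards against overfitting)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print("cross-validated AUC:", scores.mean().round(3))

# Rank genes by impurity-based importance to obtain a reduced discriminator set
clf.fit(X, y)
top_genes = np.argsort(clf.feature_importances_)[::-1][:25]
```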