UCD alum and UCD Core Facility statistician Vince Buffalo gives his take on the enormous challenge facing the scientific community: integrating high-quality statistical analysis of virtually limitless “omics” and next-generation sequencing data.
When discussing R and Bioconductor with other researchers, it’s easy to convince them to adopt both for analyzing statistical data – the data that comes in the very final stages of a bioinformatics analysis. It’s usually much more difficult to convince them to consider working with high-throughput sequencing data in Bioconductor. Folks complain that (1) it’s not worth it to process sequencing data with Bioconductor tools or (2) it’s not fast enough. I’ll address the second point in a bit; more importantly, I want to emphasize that it’s absolutely worth it to process sequencing data in Bioconductor.
In analyzing genomic data, we take very, very, very high-dimensional data and try to condense it into biologically meaningful conclusions without being misleading or getting something wrong. Every step is about taking dense data and making it understandable: we take sequence reads and try to assemble them into larger contigs and scaffolds, we take cDNA reads and try to map them back to genomes to understand expression, and so on. At each step, our tools make heuristic or statistical choices for us. Pipelines woefully ignore these choices because, in most cases, once a step is completed a script simply jumps to the next one.
When I think about these steps, I try to assess what I think of as “information leakage” in bioinformatics processing. Each step summarizes something, hopefully without introducing bias or too much noise. Information leakage is the information that’s lost between steps. Catastrophic information leakage occurs when we lose information that could have indicated whether the data is biased or incorrect. We can hedge the risk of information leakage by computing summary statistics between steps that try to capture this leaked information.
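As a minimal sketch of this idea in R (using simulated data, not output from any real pipeline), one might carry a small set of summary statistics forward from an alignment step instead of discarding everything but the alignments themselves:

```r
# Simulated per-read mapping qualities standing in for an alignment step's output.
# In a real workflow these would come from a BAM file via a Bioconductor package.
set.seed(1)
mapq <- sample(0:60, 1000, replace = TRUE)

# Summary statistics retained between steps, so later stages can detect
# problems that the raw next-step input would no longer reveal.
step_summary <- list(
  n_reads   = length(mapq),
  prop_low  = mean(mapq < 10),                    # fraction of poorly mapped reads
  quartiles = quantile(mapq, c(0.25, 0.5, 0.75))  # spread of mapping quality
)

# A crude red flag: an excess of low-MAPQ reads may indicate bias upstream.
if (step_summary$prop_low > 0.25)
  warning("many low-MAPQ reads; check for upstream bias")
```

The threshold of 0.25 here is arbitrary and purely illustrative; the point is that the check is only possible because the summary was captured before the pipeline moved on.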
The whole article can be read here.