Computational RNA biology
Over the last two decades, biology has increasingly become a data-driven discipline. Thanks to the remarkable rise in the availability and diffusion of "omics" methods, the typical research lab now generates extraordinary amounts of data, which provide an invaluable resource for the scientific community to make predictions or draw unforeseen conclusions. This explosion in the breadth and depth of readily available data has created the need for innovative analytical methods and statistical approaches capable of dealing efficiently with large volumes of data and extracting insightful information. Along these lines, the "Computational RNA biology" lab develops algorithms, software and data-analysis strategies to study the function, evolution and regulation of RNA molecules. We have a particular focus on the application and development of methods based on Nanopore direct RNA sequencing, a new technology that for the first time makes it possible to sequence full-length, native RNA molecules without the need for reverse transcription or amplification. By designing and creating dedicated tools, we can leverage the potential of this technology to dissect the transcriptional output of complex loci, quantify isoform expression and determine transcript features that are not encoded in the genome, such as epitranscriptomic modifications or poly(A) tail length.
Our lab adheres to the principles of open science and open source software, and we are committed to developing cutting-edge analytical methods and releasing them to the scientific community as open source software packages. We follow modern software development best practices, such as thorough unit testing, continuous integration and testing, and heavy use of containerization for dependency management and full reproducibility.
Algorithms for the analysis of Nanopore direct RNA Sequencing data
The recent advent of direct RNA sequencing with the Nanopore technology has made it possible, for the first time, to sequence full-length RNA molecules natively, without the need to reverse-transcribe them into cDNA, thus overcoming many of the limitations and biases of short-read sequencing. Furthermore, Nanopore native RNA sequencing is also potentially capable of identifying RNA modifications from the raw sequencing signal, thus combining RNA quantification and sequence-specific detection of RNA modifications in a single technique. To leverage this feature we are developing Nanocompore, a statistical analysis method that compares samples at the raw signal level in order to identify putative RNA modification sites. Applying this technique to samples in which RNA modification writer enzymes were knocked down allowed us to identify several pseudouridine and m6A sites in a panel of human long non-coding RNAs that were targeted for sequencing. This method will soon be released as an open-source Python package and will allow researchers to identify virtually any RNA modification relative to an unmodified reference (e.g. knock-downs or in vitro transcribed RNAs). This project is in collaboration with Dr Adrien Leger (Birney lab, EMBL-EBI) and Dr Paulo Amaral (Kouzarides lab, University of Cambridge).
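To give a concrete picture of what comparing two samples at the raw signal level can mean, the sketch below scores each transcript position by how differently the signal values are distributed in a control and in a knock-down sample. It is only an illustration and not Nanocompore's implementation: the choice of the two-sample Kolmogorov-Smirnov statistic, the function names `ks_statistic` and `rank_positions`, and the input layout (a dictionary mapping each position to a list of signal values) are all assumptions made for this example.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distributions of the two samples."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    largest_gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for value in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, value) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, value) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def rank_positions(control, knockdown):
    """Score every position present in both samples and return
    (position, statistic) pairs, most different first.

    Hypothetical input layout: {position: [signal values, one per read]}.
    """
    shared = sorted(set(control) & set(knockdown))
    scores = {pos: ks_statistic(control[pos], knockdown[pos]) for pos in shared}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up signal values for two positions of one transcript.
control = {10: [100.1, 101.3, 99.8, 100.5], 11: [85.0, 86.2, 84.7, 85.5]}
knockdown = {10: [100.3, 100.9, 99.9, 100.6], 11: [91.0, 92.4, 90.8, 91.7]}

for position, score in rank_positions(control, knockdown):
    print(position, score)
```

In this toy data the signal at position 11 shifts completely between the two conditions, so it ranks first. A real analysis would also need significance estimates, correction for multiple testing and handling of low-coverage positions, none of which is attempted here.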
Development of genome analysis/annotation tools
Efficient and reliable tools that do one thing and do it well are the workhorses of bioinformatics. The scientific community relies heavily on this type of software, as it saves time, reduces bugs and code duplication and improves the reproducibility of analysis pipelines. In line with this, the lab strives to release all the tools that we develop as stand-alone, well-documented software packages. One recent example is bedparse, a Python tool that aims to simplify and standardize many of the operations and feature extractions commonly performed on BED files. Bedparse can annotate and filter transcripts, convert between formats and extract useful transcript features, such as CDS, introns, exons or promoters. It is rigorously tested through an automated test suite, has comprehensive documentation and a usage tutorial, and is released under the MIT open source software license.
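As an illustration of the kind of feature extraction described above, the following sketch derives intron coordinates from a single BED12 line using only the block fields defined by the BED format. It is not bedparse's code or interface: the function name `bed12_introns` and the tuple it returns are assumptions made for this example.

```python
def bed12_introns(line):
    """Return the introns of one BED12 transcript as a list of
    (chrom, start, end, name, strand) tuples in 0-based, half-open coordinates.

    Introns are the gaps between consecutive blocks (exons).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected a BED12 line with 12 tab-separated fields")

    chrom, tx_start = fields[0], int(fields[1])
    name, strand = fields[3], fields[5]
    # blockSizes and blockStarts are comma-separated and may end with a comma.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]

    introns = []
    for i in range(len(block_sizes) - 1):
        # blockStarts are relative to the transcript start (chromStart).
        intron_start = tx_start + block_starts[i] + block_sizes[i]
        intron_end = tx_start + block_starts[i + 1]
        introns.append((chrom, intron_start, intron_end, name, strand))
    return introns


# Made-up transcript with three exons: 1000-1100, 1400-1600 and 1700-2000.
example = "chr1\t1000\t2000\ttx1\t0\t+\t1100\t1900\t0\t3\t100,200,300\t0,400,700"
for intron in bed12_introns(example):
    print(intron)
```

Coordinates are kept in the order they appear on the reference, regardless of strand. A single-exon transcript yields an empty list.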
Expression profiling of Endogenous Retroviruses
Over the course of evolution, retroviral particles have repeatedly infected the germline and become part of the human genome, with the result that ancient proviral genomes now account for approximately 5-8% of the genome of present-day humans. In response to these insults we have evolved precise defense mechanisms, which under normal conditions control and regulate the expression and activity of these endogenous retroviruses (ERVs). However, when these control mechanisms fail, the de-regulated expression of ERVs can interfere with the physiological processes of the cell and may cause or contribute to the development of diseases such as cancer. Our research aims to provide an accurate characterization of ERV expression profiles in health and disease and to identify the molecular mechanisms responsible for their aberrant expression in cancer.
We have previously developed a computational method to deconvolute transposon-driven transcription in spermatogenesis based on the expression signal from Illumina short-read RNA-Seq datasets. We have since improved this method using the Markov Cluster Algorithm to accurately quantify the expression of ERV families, and preliminary data support the validity of the approach. At present, we are further refining ERV detection through the use of Nanopore direct RNA-Seq, which makes it possible to sequence and quantify full-length transcripts derived from individual ERV insertions. The combination of these approaches will allow us to produce a database of ERV expression signatures, providing for the first time a consistent and accurate picture of the landscape of ERV expression in physiological and pathological conditions.
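For readers unfamiliar with the Markov Cluster Algorithm, the sketch below shows its generic form on a small similarity graph: a column-stochastic matrix is repeatedly expanded (squared) and inflated (raised element-wise to a power and re-normalized) until it stops changing, and clusters are then read from the rows that retain weight. How the algorithm is applied to ERV quantification is not detailed here, so the graph of five loci, the edge weights and the function names are purely hypothetical.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    result = [row[:] for row in matrix]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                result[i][j] = matrix[i][j] / total
    return result


def expand(matrix):
    """Expansion step: square the matrix (two steps of a random walk)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def inflate(matrix, power):
    """Inflation step: element-wise power, then column re-normalization."""
    return normalize_columns([[x ** power for x in row] for row in matrix])


def markov_cluster(adjacency, inflation=2.0, max_iterations=50, tolerance=1e-9):
    """Cluster the nodes of a weighted, undirected graph given as a square
    adjacency matrix. Returns a sorted list of sorted node-index lists."""
    n = len(adjacency)
    # Self-loops are added before normalizing, as is customary for MCL.
    matrix = normalize_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iterations):
        updated = inflate(expand(matrix), inflation)
        change = max(abs(updated[i][j] - matrix[i][j])
                     for i in range(n) for j in range(n))
        matrix = updated
        if change < tolerance:
            break
    # Each row that keeps weight defines a cluster made of the columns it attracts.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if matrix[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical similarity graph of five loci: 0-2 are strongly similar,
# 3-4 are strongly similar, and a weak edge links locus 2 to locus 3.
similarity = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
print(markov_cluster(similarity))
```

The inflation parameter controls granularity: larger values tend to produce more, smaller clusters. Dense pure-Python matrices are used only to keep the example self-contained and would not scale to genome-wide graphs.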