Ensemble Modeling Approach Targeting Heterogeneous RNA-Seq Data

Enrico Capobianco

By accessing The Cancer Genome Atlas (TCGA), Dr. Capobianco and his team built a wide-spectrum transcriptome landscape of skin cutaneous melanoma (SKCM) from 103 primary tumor samples. They measured the expression levels of protein-coding genes and non-coding RNAs (ncRNAs). While recapitulating the biotypes from the SKCM differential expression profile, they found that no single method offered a convincing solution, which impeded the identification of a consensus of significant biotype values. An ensemble modeling approach was therefore proposed to reconcile all the evidence.

By moving from a redundant consensus set with thousands of significant detections to a substantially reduced, model-driven coreset, their methodological strategy offered three advantages:

  • First, a constrained space of differentially expressed bioentities was created to select high-confidence values enriching for relevant pathway terms.
  • Second, the ensemble approach used alternate regression modeling principles to embed the profiles obtained from each method.
  • Third, a comparative assessment of predictive power became possible.
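
As a rough illustration of the step from a redundant consensus set to a reduced coreset, the sketch below pools the significant calls from several differential expression methods and keeps only the features supported by a minimum number of them. This is not the team's regression-based ensemble: plain vote counting is swapped in here for simplicity, and the function name `consensus_and_coreset`, the `min_votes` threshold, the method labels, and the feature identifiers are all hypothetical placeholders rather than anything taken from the poster.

```python
from collections import Counter


def consensus_and_coreset(calls_by_method, min_votes):
    """Pool significant calls across methods (hypothetical helper).

    calls_by_method: dict mapping a method label to the set of feature IDs
                     that method called differentially expressed.
    min_votes:       number of methods that must agree for a feature to
                     enter the reduced coreset (an assumed, tunable cutoff).
    """
    votes = Counter()
    for features in calls_by_method.values():
        # Each method contributes at most one vote per feature.
        votes.update(set(features))

    consensus = set(votes)  # union of all calls: the redundant set
    coreset = {f for f, n in votes.items() if n >= min_votes}
    return consensus, coreset


# Toy, made-up calls from three unnamed methods (not real results).
calls = {
    "method_a": {"feature_1", "feature_2", "feature_3"},
    "method_b": {"feature_1", "feature_2"},
    "method_c": {"feature_1", "feature_4"},
}

consensus, coreset = consensus_and_coreset(calls, min_votes=2)
print(sorted(consensus))
print(sorted(coreset))
```

Raising `min_votes` shrinks the coreset toward the features every method agrees on. A model-driven variant, as described above, would replace the vote count with weights learned by regression on the per-method profiles.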

Notably, the approach can be generalized across cancer types and across the experimental conditions that generate such profiles. Their strategy yielded a multitude of high-quality protein-coding gene and ncRNA identifications, and revealed potential associations between pseudogenes and parental genes. Among their validations, PINK1 was detected as a differentially expressed target parental gene; it is well known to be associated with Parkinson's disease and cancer (especially melanoma).

The poster “Ensemble Modeling Approach Targeting Heterogeneous RNA-Seq data: Application to Melanoma Pseudogenes” was presented at the Big Data 2017 Conference: Big Data in Biomedicine: Transforming Lives Through Precision Health at Stanford University, May 24-25, 2017.