The Gene Thicket

TUM Master's Thesis

PDF and CODE

The gene thicket is my Master’s thesis project. It is a deep learning-based model whose goal is to build gene regulatory networks that are capable of predicting future gene expression.

What is so amazing about gene regulatory networks?

Gene regulatory networks are graphs that describe the regulatory process of a biological system. Their nodes are genes and their edges represent the regulatory relationship between two genes. Each edge has a direction (which gene regulates which other gene), a sign (activation or repression) and a weight (strength of the relationship). If we understand how a system works, then we can develop better products to help people, animals and every other living organism.
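
To make the graph description concrete, here is a minimal sketch of a signed, weighted, directed network in plain Python. The gene names, weights and helper functions are hypothetical and chosen only for illustration; they are not taken from the thesis code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    regulator: str  # gene that exerts the regulation (edge source)
    target: str     # gene being regulated (edge destination)
    sign: int       # +1 for activation, -1 for repression
    weight: float   # non-negative strength of the relationship


# Hypothetical toy network with three genes.
edges = [
    Edge("GeneA", "GeneB", +1, 0.8),
    Edge("GeneA", "GeneC", -1, 0.5),
    Edge("GeneB", "GeneC", +1, 0.3),
]


def regulators_of(target, edges):
    """Return (regulator, signed weight) pairs for one target gene."""
    return [(e.regulator, e.sign * e.weight) for e in edges if e.target == target]


def signed_adjacency(edges):
    """Build a matrix where entry [i][j] is the signed weight of gene i -> gene j."""
    genes = sorted({g for e in edges for g in (e.regulator, e.target)})
    index = {g: i for i, g in enumerate(genes)}
    matrix = [[0.0] * len(genes) for _ in genes]
    for e in edges:
        matrix[index[e.regulator]][index[e.target]] = e.sign * e.weight
    return genes, matrix


genes, matrix = signed_adjacency(edges)
```

Because the edges are directed, the matrix is not symmetric: the entry for GeneA -> GeneC is negative (repression), while the entry for GeneC -> GeneA stays at zero.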

The inference of gene regulatory networks (GRNs) has been an area of research for more than twenty years! GRNs are an amazing challenge in Computational Biology since they connect to many other topics, such as perturbations, trajectory inference, RNA velocity and much more! We can also be creative and incorporate different types of data.

The gene thicket

The gene thicket is a GRN inference method that predicts the future expression of each cell. It combines scATAC-seq data as a prior, temporal data, a CNN architecture that predicts using only past time points, and modified attention filters that capture the sign and magnitude of the effect of the transcription factors on the target genes.

The motivation behind the gene thicket is to have a method that, like SCENIC, considers non-linear relationships between target genes and transcription factors, but that also takes time series data into account. Changes in the expression of transcription factors have an impact on changes in the expression of target genes, and this becomes visible when the expression profiles are ordered in time.

An advantage of a time series approach is that we can forecast the expression values of target genes from the previous expression values of transcription factors. We could then compute displacements and simulate trajectories, similarly to CellOracle. The idea of using deep learning comes from the fact that neural networks are universal non-linear approximators. Temporal convolutional neural networks with attention are intended to keep the model both scalable and interpretable.
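
The forecasting idea can be sketched with a single causal filter in plain Python. This is a deliberately simplified, linear stand-in for the temporal CNN, assuming one transcription factor and one target gene; the function names, the kernel and the toy values are hypothetical and do not come from the thesis implementation.

```python
def causal_conv1d(series, kernel):
    """Apply a filter so that output[t] depends only on series[t-k+1 .. t].

    The series is left-padded with zeros, so no future value ever leaks
    into an output. The last kernel entry multiplies the current time point.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(series)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(series))
    ]


def displacements(target_series, forecasts):
    """Displacement at t = forecast for t + 1 minus the observed value at t."""
    return [f - x for x, f in zip(target_series, forecasts)]


# Hypothetical expression values ordered in time.
tf_expression = [1.0, 2.0, 3.0, 4.0]      # transcription factor
target_expression = [0.5, 1.0, 2.0, 3.0]  # target gene
kernel = [0.5, 0.5]                       # assumed, not learned

# forecasts[t] is read as the predicted target expression at time t + 1.
forecasts = causal_conv1d(tf_expression, kernel)
steps = displacements(target_expression, forecasts)
```

Changing the last value of tf_expression leaves all earlier forecasts untouched, which is the property that "predicting using only the past" refers to. A real model would stack several such filters with non-linearities and learn the kernels from data.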

Besides using scATAC-seq data, the gene thicket offers interpretability through attention scores, a form of causal validation through temporal data, and signed interactions that indicate how each transcription factor influences its target genes.
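
As a rough illustration of how a prior, attention scores and signs could fit together, the sketch below masks out transcription factors that the prior does not support, normalises the remaining scores into magnitudes, and keeps the sign of each raw score. This is one possible construction under my own assumptions, not the attention mechanism used in the thesis; all names and values are hypothetical.

```python
import math


def signed_masked_attention(raw_scores, prior_mask):
    """Turn raw TF scores for one target gene into signed attention weights.

    raw_scores: dict mapping TF name -> real-valued score.
    prior_mask: dict mapping TF name -> True if the prior (for example,
                scATAC-seq evidence) supports the TF-target link.
    TFs without prior support get weight 0. For the others, a softmax over
    absolute scores gives the magnitude and the raw score gives the sign.
    """
    allowed = [tf for tf in raw_scores if prior_mask.get(tf, False)]
    if not allowed:
        return {tf: 0.0 for tf in raw_scores}

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(abs(raw_scores[tf]) for tf in allowed)
    exps = {tf: math.exp(abs(raw_scores[tf]) - peak) for tf in allowed}
    total = sum(exps.values())

    weights = {}
    for tf, score in raw_scores.items():
        if tf in exps:
            weights[tf] = math.copysign(exps[tf] / total, score)
        else:
            weights[tf] = 0.0
    return weights


# Hypothetical scores: TF3 has the largest score but no prior support.
scores = {"TF1": 2.0, "TF2": -2.0, "TF3": 5.0}
mask = {"TF1": True, "TF2": True, "TF3": False}
attention = signed_masked_attention(scores, mask)
```

In this toy case TF1 and TF2 share the attention equally in magnitude, with TF1 read as an activator and TF2 as a repressor, while TF3 is ignored because the prior rules it out.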

The results

Evaluating the quality of a reconstructed gene regulatory network is extremely challenging because no ground truth exists. Moreover, diverse factors have an impact on experimental data and could affect the interactions in a system, such as primary cell types, environmental conditions, technology platforms and cell lines. To work around these limitations, I evaluated the model’s performance on synthetic and curated data, and investigated the model’s behavior on real biological data describing pancreatic endocrinogenesis.

The gene thicket is able to predict linear interactions between transcription factors and target genes in the synthetic data (left). However, it struggles to predict bifurcations in the curated data, since this behavior was not considered during modelling.
On the pancreas data, the gene thicket is able to predict many trajectories (as we can see on the left side). However, it struggles with some abrupt changes in transcription (on the right side).

Conclusions and Outlook

  • The gene thicket is a first step towards an iterative GRN-velocity approach.
  • We can compute cell displacements using a GRN inference method.
  • CNNs are flexible when computing GRNs.
    • able to look at many points in the past
    • approximate non-linear trends
    • scalable
    • can be interpretable (even discover delays)
  • We need to consider:
    • multiple trajectories
    • abrupt changes in gene expression trends
    • uncertainty