MOTIVATION

Many current studies run several "omics" platforms in parallel to profile the transcriptome, genome, proteome and epigenome of a cell. The outcome of most of these techniques is commonly reduced to a list of genes (differentially expressed, hypo-/hypermethylated, mutated, amplified/deleted and so on). Although the underlying processes are evidently dependent, the gene lists reported by different "omics" techniques often do not overlap. Finding functional associations between such lists poses a bioinformatics challenge. Here we propose a network-based framework that incorporates robust statistical principles to estimate the significance of the inferred models.

OUR MODEL

Our model takes two input gene lists (referred to as gene list a and gene list b). Each has been selected, based on some experimental results, from a more general set of genes (denoted list A and list B, respectively) that represents either all genes of a genome or the genes profiled by an "omics" platform (for example, all genes on the chip). We also use a reference gene network (external knowledge), which is meant to model how a signal spreads in the cell from one gene to another. At the moment our tool employs two reference gene networks: the Reactome pathway database and the IntAct database of protein interactions.

To start the computations, you need to provide:

  • gene list a
  • gene list b
  • list B, the reference gene set for list b (optional)

Ideally, you should also provide list B (the reference set of genes from which list b was selected). This is optional: if list B is not provided, we assume that it contains all known genes.

METHODS

Gene vs. Gene List

The core of our approach is a statistical model that relates a gene (gene "a") to a gene list (list b), given a reference gene network and a list B. To implement this we use the scheme presented in figure 1:

Figure 1. Statistical model relating a gene "a" to a gene list (list b)

  • First, we compute the distance from gene "a" to every gene in list b using the reference gene network. The distance is defined as the minimal number of steps required to get from one gene to the other along the edges of the reference network.
  • Second, we define the connectivity score Sab (between gene "a" and list b) based on the number of genes from list b at distance 1, 2, 3, ..., n from gene "a".
  • Third, to find out whether the connectivity score is significant, we apply a Monte Carlo procedure. We randomly sample from list B a gene list "r" of the same size as list b and compute the connectivity score Sr (between gene "a" and list "r"). We repeat this N times (up to N = 10 000 if required) to obtain the distribution Srj (j = 1, 2, ..., N) of the connectivity score between gene "a" and a random gene list of the same size as the input list b. The significance (p-value) of the score Sab is computed as p = k/N, where k is the number of random scores Srj that are greater than or equal to Sab.
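
The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not the code of the tool. The network is assumed to be an undirected adjacency dictionary, and all function names are hypothetical. The exact formula for the connectivity score is not given here, so the sketch assumes that each gene at distance d (1 <= d <= n) contributes 1/d to the score. Any other function of the distance counts could be substituted.

```python
import random
from collections import deque


def bfs_distances(network, source):
    """Minimal number of edges from `source` to every reachable gene."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        gene = queue.popleft()
        for neighbour in network.get(gene, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[gene] + 1
                queue.append(neighbour)
    return dist


def connectivity_score(dist, gene_list, max_dist=3):
    """ASSUMED score: a gene at distance d (1..max_dist) contributes 1/d."""
    return sum(1.0 / dist[g] for g in gene_list
               if g in dist and 1 <= dist[g] <= max_dist)


def gene_vs_list_pvalue(network, gene_a, list_b, list_B, n_iter=1000, seed=0):
    """Monte Carlo p-value p = k/N for the score between gene_a and list_b."""
    rng = random.Random(seed)
    dist = bfs_distances(network, gene_a)   # step 1: distances from gene "a"
    s_ab = connectivity_score(dist, list_b)  # step 2: observed score Sab
    background = sorted(list_B)              # fixed order for reproducibility
    k = 0
    for _ in range(n_iter):                  # step 3: random lists "r"
        r = rng.sample(background, len(list_b))
        if connectivity_score(dist, r) >= s_ab:
            k += 1
    return s_ab, k / n_iter
```

Note that the distances from gene "a" are computed once and reused for every random list, so each Monte Carlo iteration only costs one pass over the sampled genes. The sketch requires list B to contain at least as many genes as list b.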

Gene List vs. Gene List

The "Gene vs. Gene List" procedure is repeated for each gene "a" from the input list a. In this case we test as many hypotheses as there are genes in list a, so a standard FDR procedure has to be applied to adjust for multiple testing.
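
The adjustment step can be illustrated with a short Python sketch. The text does not specify which FDR procedure is used. The sketch assumes the Benjamini-Hochberg procedure, which is a common choice, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    ASSUMPTION: Benjamini-Hochberg stands in for the unspecified
    "standard FDR procedure".
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Here each entry of `pvalues` would be the Monte Carlo p-value obtained for one gene "a" from list a. Genes whose adjusted value falls below the chosen FDR threshold are reported as significant.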

OUTPUT

As output, the genes from list a are ranked by the significance of the inferred model in relation to list b. The p-value of the model for a gene "a" can be interpreted as the probability of obtaining the same (or a better) connectivity score Sab for a random gene list of the same size as list b. For each gene "a" from list a with a significant p-value, a visualization of the network model is provided. An example is presented in figure 2.

Figure 2. Example of a network model for gene "a"

Please note that you can produce high quality figures of the network models. Please read the section "How to produce high quality Network Figures".