eXtasy: Variant Prioritization by Genomic Data Fusion


What is eXtasy?

eXtasy is a pipeline for ranking nonsynonymous single nucleotide variants given a specific phenotype. It takes into account the putative deleteriousness of the variant, haploinsufficiency predictions of the underlying gene and the similarity of the given gene to known genes in the given phenotype.

Who develops eXtasy?

eXtasy was developed in the Bioinformatics group at the Department of Electrical Engineering of the University of Leuven (part of the iMinds Future Health Department). It was implemented by Alejandro Sifrim and Dusan Popovic under the supervision of Prof. Jan Aerts, Prof. Bart de Moor and Prof. Yves Moreau.

What is the input of eXtasy?

One can run eXtasy on any VCF file mapped to hg19/GRCh37. As a second input, the user can choose any of the precomputed gene prioritization files for a given HPO term (downloadable here). In the near future we will provide the possibility of creating custom gene prioritizations from a set of phenotype-associated genes.



Run eXtasy online:

Example Data: miller.vcf, schinzel_giedion.vcf
We provide two example VCF files, which were generated by adding published disease-causing variants for Miller syndrome (causative gene: DHODH, Ng et al., 2010, Nature Genetics) or Schinzel-Giedion syndrome (causative gene: SETBP1, Hoischen et al., 2010, Nature Genetics) to a publicly available VCF file of the exome of a healthy individual (obtained from here). These files can be prioritized against any of the phenotype terms that characterize the syndromes. For Schinzel-Giedion syndrome this could, for example, be HP:0009924 (Hypoplasia/aplasia involving the nose), and for Miller syndrome HP:0000347 (Micrognathia).
Human Phenotype Ontology (HPO) terms are autocompleted (e.g. typing 'polyd' will autocomplete to 'polydactyly'). Example HPO terms are HP:0002752 or HP:0100258. A list of available phenotypes for eXtasy can be found here. If you have trouble finding the correct HPO ID, we recommend the Phenomizer tool, which can resolve frequently used synonyms. HPO IDs from Phenomizer can serve directly as inputs to eXtasy (if no prioritization is available for a given HPO ID, a warning is given).
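As an illustration of the HPO ID format described above, the following minimal Python sketch checks identifiers before they are submitted. It assumes that several terms are passed as one comma-separated string; the function name parse_hpo_terms is hypothetical and is not part of eXtasy.

```python
import re

# HPO identifiers have the form "HP:" followed by seven digits.
HPO_ID = re.compile(r"HP:\d{7}")


def parse_hpo_terms(text):
    """Split a comma-separated string into HPO IDs and reject malformed ones."""
    terms = [term.strip() for term in text.split(",") if term.strip()]
    malformed = [term for term in terms if not HPO_ID.fullmatch(term)]
    if malformed:
        raise ValueError("Malformed HPO ID(s): " + ", ".join(malformed))
    return terms


print(parse_hpo_terms("HP:0002752, HP:0100258"))
```

A check like this only validates the format. It cannot tell whether a precomputed prioritization exists for the term.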
The generated output is a list of missense variants annotated with all the features used by eXtasy and two eXtasy scores: "complete" and "imputed". Each score ranges from 0 to 1 and is the proportion of individual decision trees in the Random Forest that voted to call the variant "disease-causing". A score of 1 means that every tree voted disease-causing, and a score of 0 means that none did. In the benchmark reported in the eXtasy manuscript, a majority vote (a score > 0.5) was used as the threshold for calling a variant disease-causing. Because some input features can be missing for a variant, we trained two different Random Forests: "complete" and "imputed". When all features are present, the "complete" model gives the best theoretical performance, as it uses the most information. When values are missing, only the "imputed" score is given. It is calculated with a smaller model that excludes haploinsufficiency scores and replaces any missing deleteriousness score with the median of that score across all variants.
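The scoring logic described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not eXtasy's implementation (which uses the R randomForest package): the votes are made up, and the function names are hypothetical.

```python
from statistics import median


def vote_fraction(votes):
    """Proportion of trees that voted 'disease-causing' (True) for one variant."""
    return sum(votes) / len(votes)


def is_disease_causing(score, threshold=0.5):
    """Majority-vote call: strictly more than half of the trees."""
    return score > threshold


def impute_with_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [value for value in values if value is not None]
    fill = median(observed)
    return [fill if value is None else value for value in values]


# Toy forest of 1000 trees, 620 of which vote 'disease-causing'.
votes = [True] * 620 + [False] * 380
score = vote_fraction(votes)
print(score, is_disease_causing(score))

# Toy deleteriousness scores for four variants, one of them missing.
print(impute_with_median([0.1, None, 0.9, 0.4]))
```

The median is taken per feature across variants, so a variant with a missing value receives a typical value for that feature rather than an extreme one.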
The output file is compressed (and can be opened with tools like 7-zip, winzip or winrar). The compressed archive contains a file with the .extasy extension. This is a tab-separated text file that can be opened in any text editor or spreadsheet tool (like OpenOffice Calc or Microsoft Excel).
As a guideline, in preliminary benchmarks obtained by injecting single disease-causing variants (from HGMD) into simulated whole-exome data (~9,000 non-disease-related variants), the disease-causing variants ranked by eXtasy had a median rank of 5 (1st Qu.: 1, 3rd Qu.: 29, mean 41) (unpublished results). This naturally depends on the background variation present and the stringency of filtering before prioritization.
Input file uploads and results are systematically removed after 2 weeks. We guarantee the confidentiality of uploaded files and will only inspect uploaded files or results in case of a reported error or at the request of the user.

Run eXtasy offline:

To install eXtasy:

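mkdir eXtasy
cd eXtasy
wget https://github.com/asifrim/eXtasy/archive/0.1.tar.gz -O 0.1.tar.gz
tar -vxzf 0.1.tar.gz
# GitHub tag archives normally unpack into <repository>-<tag>; adjust if the directory name differs
cd eXtasy-0.1
./install.rb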

The install.rb script will check if R (and the required libraries), Tabix, bgzip and Bedtools intersectBed are found in the user's PATH environment variable. It will also download the necessary input data for eXtasy (~10GB). eXtasy will require 30GB of free disk space once all the input data is extracted. To add a directory to the PATH variable one can use the export command:

export PATH=$PATH:/path/of/folder/to/be/added/
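Before running install.rb, it can help to check which of the required tools are visible on the PATH. The sketch below uses Python's shutil.which for this. The executable names are assumptions: the text names R, Tabix, bgzip and Bedtools intersectBed, but not the exact binaries that install.rb looks for.

```python
import shutil

# Assumed executable names for the tools listed in the text.
REQUIRED_TOOLS = ("R", "tabix", "bgzip", "intersectBed")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found in the PATH environment variable."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found in PATH: " + ", ".join(missing))
else:
    print("All required tools were found in PATH.")
```

This does not check the required R libraries, which install.rb verifies separately.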

Example eXtasy run:

./extasy.rb -i ~/data/example.vcf -g ./geneprios/res/HP_0000033_fgs.tsv


The -i option takes as its argument the path to the VCF file with the variants that you want to prioritize. The -g option takes an Endeavour gene prioritization file, which can be found in the "geneprios/res/" folder. A list of all precomputed phenotypes to choose from is available on the website (here). In upcoming versions, custom gene prioritization based on a user-defined group of genes will be added.
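For scripted runs, the command above can be assembled programmatically. The following Python sketch only builds the argument list. It assumes that every precomputed file follows the naming pattern of the single example shown (HP_0000033_fgs.tsv for HP:0000033), which is not confirmed here; the helper names are hypothetical.

```python
import os


def gene_prio_file(hpo_id, prio_dir="./geneprios/res"):
    """Map an HPO ID such as 'HP:0000033' to its assumed prioritization file."""
    # Naming pattern inferred from one example in the text (assumption).
    return prio_dir + "/" + hpo_id.replace(":", "_") + "_fgs.tsv"


def extasy_command(vcf_path, hpo_id, prio_dir="./geneprios/res"):
    """Build the argument list for one eXtasy run using the -i and -g options."""
    # Expand '~' here because no shell is involved when the list is executed.
    vcf_path = os.path.expanduser(vcf_path)
    return ["./extasy.rb", "-i", vcf_path, "-g", gene_prio_file(hpo_id, prio_dir)]


command = extasy_command("/data/example.vcf", "HP:0000033")
print(" ".join(command))
# To execute from the eXtasy directory: subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems with spaces in file names.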

The output can be found in the same directory as the input VCF file. It contains two eXtasy scores and all the features used to compute them. The eXtasy scores are in the last two columns. If a variant has no missing values, the complete random forest model can run and returns a score. If values are missing, a less complete model is run, which omits haploinsufficiency scores (the most frequently missing feature) and replaces any further missing values with the median of that feature. The complete model outperforms the imputed model in most cases.

For a more complete guide to eXtasy read the README.


Output format:

Column number Name Description
1 chromosome Chromosome
2 refbase Reference Base
3 altbase Alternative Base
4 position Chromosomal position (GRCh37b)
5 genename HUGO gene symbol
6 carol_score CAROL score: Aggregate deleteriousness prediction score combining SIFT and Polyphen scores (Lopes et al., 2012, Human Heredity)
7 haploinsufficiency Haploinsufficiency prediction score (Huang et al., 2010, PLOS Genetics)
8 lrt_score Deleteriousness prediction score based on Likelihood ratio tests (LRT) (Chun et al., 2009, Genome Research). Precomputed scores were gathered from dbNSFP.
9 mutationtaster_score MutationTaster deleteriousness prediction score (Schwarz et al., 2010, Nature Methods). Precomputed scores were gathered from dbNSFP.
10 phastconsplacentalmammals PhastCons conservation scores for the placental mammals group based on hidden Markov models. It considers not only the base in question but also the flanking bases. Gathered from the UCSC Genome Browser.
11 phastconsprimates PhastCons conservation scores for the primates group based on hidden Markov models. It considers not only the base in question but also the flanking bases. Gathered from the UCSC Genome Browser.
12 phastconsvertebrate PhastCons conservation scores for the vertebrate group based on hidden Markov models. It considers not only the base in question but also the flanking bases. Gathered from the UCSC Genome Browser.
13 phylopplacentalmammals phyloP conservation scores for the placental mammals group. These scores consider only the individual base. Gathered from the UCSC Genome Browser.
14 phylopprimates phyloP conservation scores for the primates group. These scores consider only the individual base. Gathered from the UCSC Genome Browser.
15 phylopvertebrate phyloP conservation scores for the vertebrate group. These scores consider only the individual base. Gathered from the UCSC Genome Browser.
16 polyphen_score Polyphen2 deleteriousness prediction score (Adzhubei et al., 2010, Nature Methods). Precomputed scores were gathered from dbNSFP.
17 sift_score SIFT deleteriousness prediction score (Kumar et al., 2009, Nature Protocols). Precomputed scores were gathered from dbNSFP.
18 variant_id Unique eXtasy internal variant ID
19 blast Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the Blast data source. This data source is based on the sequence similarities between all human proteins. It is obtained by using NCBI Blast on the protein sequences from EnsEMBL.
20 expression_suetal Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the Su et al. data source. This dataset comes from a large-scale analysis of the human transcriptome (Su et al., PNAS 2002). It was obtained by profiling gene expression from 91 human tissues, organs or cell lines.
21 gene_ontology Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the Gene Ontology data source.
22 interaction_biogrid Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the BIOGRID data source. This data source is a PPI network and is based on the integration of several large-scale experimental datasets. It contains both direct physical interactions and indirect interactions such as genetic interactions. This data comes from the BioGRID database.
23 interaction_hprd Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the HPRD data source. This data source is a manually verified human PPI network and is based on the integration of several large-scale experimental datasets. It contains both direct physical interactions and indirect interactions such as co-localization. This data comes from the HPRD database.
24 interaction_innetdb Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the IntNetDB data source. The IntNetDB database integrates various gene networks, including PPI and functional networks, into one global gene network that contains both known and predicted gene interactions.
25 interaction_string Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the String data source. This data source comes from the String database and is organized as a gene network with edges connecting the genes that are functionally related (or predicted to be). String integrates several types of information, including PPI networks, gene expression datasets and text-mining data from the scientific literature.
26 interpro Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the INTERPRO data source. This data source is based on associations between genes and domains identified from the sequences of the associated gene products. These domains describe the functional elements of the proteins and come from the InterPro consortium.
27 kegg Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the KEGG data source. This data source is based on bio-molecular pathways from the KEGG database. These pathways describe common molecular processes and list the genes playing a role in these pathways.
28 nb Endeavour (Aerts et al., 2006, Nature Biotechnology) global rank across data sources in the prioritization against known phenotype-associated genes.
29 p_val Endeavour (Aerts et al., 2006, Nature Biotechnology) P-value of the order statistics across all data sources.
30 precalculated_ouzounis Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the Ouzounis data source. This dataset represents a priori disease probabilities (Lopez-Bigas et al., NAR 2004). It is based on the use of sequence features (e.g., sequence length, UTR length, number of introns, intron length) and a statistical framework to discriminate the human disease causing genes from the rest of the genome.
31 q_int Endeavour (Aerts et al., 2006, Nature Biotechnology) normalized version of p_val.
32 swissprot Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the Swissprot data source. This datasource is extracted from the UniProt -TrEMBL/SwissProt database. In particular, the keywords that are associated to the genes by UniProt are collected and organized in a pseudo-ontology. These keywords describe in general the main function of the genes and their products.
33 text Endeavour (Aerts et al., 2006, Nature Biotechnology) similarity score compared to known phenotype-associated genes for the Text mining data source. This datasource associates genes with keywords found in publications associated with these genes (from GeneRIFs). This text-mining data was obtained using a framework derived from the TxtGate tool.
34 extasy_complete Global eXtasy score from the complete Random Forest model, which uses all features. Only variants that have no missing values in the previous features can have this score. It ranges from 0 to 1 and represents the proportion of trees (n=1000) in the Random Forest voting for the variant being "disease-causing".
35 extasy_imputed Global eXtasy score from the imputed Random Forest model, which does not use haploinsufficiency scores. Missing deleteriousness prediction scores (SIFT, Polyphen2, MutationTaster, CAROL, LRT) are substituted with the median across all variants for those scores. All variants have an extasy_imputed score. It ranges from 0 to 1 and represents the proportion of trees (n=1000) in the Random Forest voting for the variant being "disease-causing".
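Given the column layout above, a results file can be ranked with a short script. The Python sketch below rests on several assumptions that the text does not confirm: that the .extasy file starts with a header row using the column names listed, and that a missing extasy_complete score appears as an empty or non-numeric field such as "NA". It prefers the complete score and falls back to the imputed one; whether the two scores are directly comparable is not stated, so the model used is kept next to each score. The demo rows are invented.

```python
import csv
import io
import math


def to_score(text):
    """Parse a score; return None for anything that is not a finite number."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(value) else value


def rank_variants(handle):
    """Return (score, model, gene, position) tuples, highest score first."""
    ranked = []
    for row in csv.DictReader(handle, delimiter="\t"):
        complete = to_score(row.get("extasy_complete"))
        imputed = to_score(row.get("extasy_imputed"))
        if complete is not None:
            ranked.append((complete, "complete", row["genename"], row["position"]))
        elif imputed is not None:
            ranked.append((imputed, "imputed", row["genename"], row["position"]))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Invented demo data with only the columns this sketch needs.
demo = (
    "position\tgenename\textasy_complete\textasy_imputed\n"
    "100\tGENE_A\t0.20\t0.25\n"
    "200\tGENE_B\tNA\t0.70\n"
    "300\tGENE_C\t0.95\t0.90\n"
)
for score, model, gene, position in rank_variants(io.StringIO(demo)):
    print(gene, position, model, score)
```

For a real file, the same function can be called on an opened file object, for example one created with open(path, newline=""), where path points to the extracted .extasy file.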

What are the prerequisites for eXtasy?
  • A Unix environment
  • Ruby 1.9.1 or higher
  • A working installation of the R statistical framework with the "randomForest" package installed
  • Tabix
  • Bedtools

Speed considerations:

Average job completion time over all jobs submitted to the eXtasy webtool: 00:59:32.7972
Running eXtasy currently takes about 5-10 minutes for a single exome (~40,000 variants) on a standard single core (we do not currently parallelize within a job). It uses only a small amount of RAM, so it can be run on almost any computer. Most of the time is spent annotating the variants. Significant speed-ups can be achieved by performing this step only once (using the -k and -r options) when prioritizing against multiple phenotypes.


News:

  • July 5, 2013: Major changes to the webtool and standalone versions: ability to keep intermediate files for speeding up the pipeline, multiple HPO term inputs and automatic aggregation of cross-phenotype results, ability to specify output filenames
  • June 13, 2013: The webtool now throws errors when there are no nonsynonymous variants in the input file (this also happens if the input file is in a format that eXtasy cannot read) or when the HPO term is incorrectly formatted. A bug was also fixed where white space in the input file's name was not escaped properly.
  • April 23, 2013: eXtasy can now be run online by uploading a VCF file and specifying an HPO term. A link containing the results is e-mailed once the computation is finished.
  • April 8, 2013: Initial release of eXtasy!

Download eXtasy:

  • eXtasy-0.1.tar.gz (Stable build)
  • eXtasy-master.zip (Development build)
  • README
  • List of available phenotypes


GitHub:

eXtasy's GitHub page