Standalone & Datasets

The AGAIN standalone version is for users who want to run the tool locally from the command line. It is also available on GitHub.


Current Version: 1


Programs:


AGAIN_VCF.py
AGAIN_VCF_batch.py

Dependency:


The code is written in Python 3 and requires bedtools to be installed.
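
A quick way to confirm the dependency before running AGAIN is to check that bedtools is reachable on the PATH. The following is a minimal sketch, not part of the AGAIN distribution; the function name `find_missing_tools` is hypothetical.

```python
import shutil


def find_missing_tools(tools=("bedtools",)):
    """Return the names of required command-line tools not found on PATH."""
    # shutil.which returns None when the executable cannot be located.
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for tool in find_missing_tools():
        print(f"WARNING: {tool} was not found on PATH")
```

An empty list means every listed tool was found.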

Reference Datasets:


Download them and put them in the same folder as the AGAIN programs.

  • Data_AGAIN_coding_map_GRCh37.txt
  • Data_AGAIN_detection_GRCh37_all.bed
  • Data_AGAIN_detection_GRCh37_canonical.bed

  • Data_AGAIN_coding_map_GRCh38.txt
  • Data_AGAIN_detection_GRCh38_all.bed
  • Data_AGAIN_detection_GRCh38_canonical.bed
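
To verify that the datasets listed above were placed next to the programs, a small check along these lines can help. This is a sketch only: the helper name `missing_reference_files` is hypothetical, and it simply looks for all six files rather than assuming which ones a given genome or transcript option needs.

```python
from pathlib import Path

# File names exactly as listed in the Reference Datasets section.
REFERENCE_FILES = [
    "Data_AGAIN_coding_map_GRCh37.txt",
    "Data_AGAIN_detection_GRCh37_all.bed",
    "Data_AGAIN_detection_GRCh37_canonical.bed",
    "Data_AGAIN_coding_map_GRCh38.txt",
    "Data_AGAIN_detection_GRCh38_all.bed",
    "Data_AGAIN_detection_GRCh38_canonical.bed",
]


def missing_reference_files(folder="."):
    """Return the reference file names that are absent from `folder`."""
    folder = Path(folder)
    return [name for name in REFERENCE_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    for name in missing_reference_files("."):
        print("missing:", name)
```
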
File Format:


    Input: Variants in VCF format, with 5 mandatory, tab-delimited fields (CHROM, POS, ID, REF, ALT).
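
The input requirement can be pre-checked with a few lines of Python. This is a hedged sketch, not AGAIN's own parser: the function name `check_vcf_line` is hypothetical, lines starting with "#" are assumed to be VCF header lines, and extra columns after ALT are tolerated here although the source does not say whether AGAIN accepts them.

```python
REQUIRED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT"]


def check_vcf_line(line):
    """Return None if the line looks usable, otherwise a short problem message."""
    if line.startswith("#"):
        return None  # assumed: header/meta lines are skipped
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(REQUIRED_FIELDS):
        return f"expected at least 5 tab-delimited fields, found {len(fields)}"
    chrom, pos, _variant_id, ref, alt = fields[:5]
    if not chrom:
        return "CHROM is empty"
    if not pos.isdigit():
        return "POS is not an integer"
    if not ref or not alt:
        return "REF or ALT is empty"
    return None


if __name__ == "__main__":
    print(check_vcf_line("1\t12345\t.\tA\tG"))   # a well-formed record
    print(check_vcf_line("1 12345 . A G"))       # spaces instead of tabs
```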

    Output: AGAIN-detected variants are reported with the following annotations.

  • SAMPLE (only for AGAIN_VCF_batch.py)
  • CHROM, POS, ID, REF, ALT (exactly the same as input)
  • STRAND
  • VAR_TYPE (snv, x nt-del, x nt-ins)
  • GENE
  • TRANSCRIPT_IVS (e.g., ENST123456789_IVS10)
  • CANONICAL (canonical transcript_IVS)
  • AGAIN_ZONE (ZONE1/ZONE2; ZONE1 spans from the 1st BP (branchpoint) to the ACC (acceptor site), ZONE2 spans from the 2nd BP to the 1st BP)
  • AGAIN_YAG (YES/NO, whether the AG-gain variant also fits the YAG motif)
  • AGAIN_BP_DIST (the distance from the created AG to the BP)
  • AGAIN_ACC_DIST (the distance from the created AG to the ACC)
  • AGAIN_HIGHRISK (YES/NO, whether the AG-gain variant falls inside the high-risk [BP+8, ACC-4] region)
  • AGAIN_SCORE (score of the AG-gain variant; suggested cutoff >= 3, max = 5)
  • PROT_SEQ_WT (wild-type protein sequence)
  • PROT_SEQ_NEW_ACC (resulting protein sequence if a new acceptor site is created)
  • HGVS_NEW_ACC (protein-level HGVS annotation if a new acceptor site is created)
  • PROT_SEQ_EXON_SKIP (resulting protein sequence if exon skipping occurs)
  • HGVS_EXON_SKIP (protein-level HGVS annotation if exon skipping occurs)
Command:


    
        python AGAIN_VCF.py -i variants.vcf
    
        python AGAIN_VCF.py -g GRCh37 -t all -i variants.vcf
    
        arguments:
          -h, --help         show help message
          -g, --genome       human genome assembly {GRCh37, GRCh38}, default: GRCh37
          -t, --transcript   all/canonical transcript? {all, canonical}, default: all
          -i, --input        input variants in VCF format file
        
    
        python AGAIN_VCF_batch.py -d directory -s samplelist.txt -o output.txt
        
        python AGAIN_VCF_batch.py -g GRCh37 -t all -d directory -s samplelist.txt -o output.txt
    
        arguments:
          -h, --help         show help message
          -g, --genome       human genome assembly {GRCh37, GRCh38}, default: GRCh37
          -t, --transcript   all/canonical transcript? {all, canonical}, default: all
          -d, --dir          directory of VCF files
          -s, --sample       sample list (without .vcf extension) in the above directory
          -o, --output       output filename
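
The sample list passed to -s can be generated from the VCF directory. The sketch below assumes the list is a plain text file with one sample name per line, which the source does not state explicitly; the helper names `sample_names` and `write_sample_list` are hypothetical.

```python
from pathlib import Path


def sample_names(filenames):
    """Return sorted sample names: file names ending in .vcf, extension removed."""
    return sorted(Path(f).stem for f in filenames if Path(f).suffix == ".vcf")


def write_sample_list(vcf_dir, list_path):
    """Write one sample name per line (assumed format) and return the names."""
    names = sample_names(p.name for p in Path(vcf_dir).iterdir())
    Path(list_path).write_text("".join(name + "\n" for name in names))
    return names


if __name__ == "__main__":
    # File names here are illustrative only.
    print(sample_names(["patient2.vcf", "patient1.vcf", "notes.txt"]))
```

The resulting file can then be supplied as samplelist.txt in the batch command.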
        

AGAIN Scoring Scheme:


    We recommend focusing on AG-gain variants with AGAIN_HIGHRISK = 'YES', and then prioritizing them by AGAIN_SCORE.
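
This recommendation can be applied to an AGAIN result file with a short script. The sketch below assumes the output is tab-delimited text with a header row that uses the annotation names listed under File Format; that layout is an assumption, and the function names `load_rows` and `prioritize` are hypothetical.

```python
import csv


def load_rows(path):
    """Read an AGAIN output file (assumed tab-delimited, with a header row)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def prioritize(rows, min_score=3):
    """Keep high-risk variants at or above the cutoff, highest score first."""
    kept = [
        row for row in rows
        if row["AGAIN_HIGHRISK"] == "YES"
        and float(row["AGAIN_SCORE"]) >= min_score
    ]
    return sorted(kept, key=lambda row: float(row["AGAIN_SCORE"]), reverse=True)


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for row in prioritize(load_rows(sys.argv[1])):
            print(row["CHROM"], row["POS"], row["GENE"], row["AGAIN_SCORE"], sep="\t")
```

The default `min_score=3` mirrors the suggested cutoff; pass `min_score=0` to keep every high-risk variant and only sort by score.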