BPHunter_fastq2BP is for users who want to identify BP positions from RNA-seq data.
Package:
BPHunter_fastq2BP.zip
Dependency:
The code is written in
python3,
and requires
Biopython and
BLAST+ installed.
Workflow:
Preparation: users should have aligned the RNA-seq raw reads to human reference genome first, and retain the unmapped reads (in fastq) for BP identification.
Pre-construction of BLAST index databases (inside the package) for the 20-nt intronic sequences downstream of 5’ss (as 20-nt 5’ss library), and the inverted 200-nt intronic sequences upstream of 3’ss (200-nt 3’ss library).
Align all input RNA-seq reads to the 20-nt 5’ss library, and retain the reads that have a perfect 20-nt match or only one mismatch, as BP-searching 5’ss-hit reads.
For each 5’ss-hit read, trim away the read sequence from the start of its alignment to the 20-nt 5’ss library, and invert the remaining sequence.
Align the trimmed and inverted 5’ss-hit reads to the 200-nt 3’ss library, and retain the reads that have at least a 20-nt alignment with at least 95% identity in the same intron, as BP-searching 5’ss-3’ss-hit reads.
For 5’ss-3’ss-hit reads, the ends of the aligned sequence in the 200-nt 3’ss library were used to determine the genomic positions of BP.
File Format:
Input: RNA-seq reads, that are unmapped to human genome, in fastq format.
Output: The identified BP positions with following informaiton: CHROM, POS, STRAND, NUCLEOTIDE, MOTIF SEQUENCE [-7, +3], GENE, and #READS.
The output genomic position is on human genome assembly hg19. To convert to hg38 coordinates, users could use
LiftOver tool.
Command:
python BPHunter_fastq2BP.py -i reads.fastq
python BPHunter_fastq2BP.py -i reads.fastq -e 0.01 -t 5 -r no
arguments:
-h, --help show help message
-i, --input input RNA-seq reads in fastq-format file
-e, --evalue evalue cutoff in BLAST alignment {float}, default: 0.01
-t, --threads number of threads in running BLAST {int}, default: 1
-r, --blast_removal remove the intermediate files generated by BLAST? {yes/no}, default: no