Ahsoka is a cluster of 24 nodes, each with 48 GB of RAM and 16 processors. A Fibre Channel RAID stores around 15 terabytes of biological data. The cluster runs both serial and parallel biocomputing jobs, and is dedicated mainly to sequence analysis research and Next Generation Sequencing projects.
The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences at once. These sequences are analyzed on this cluster, which is specifically designed to store large amounts of data and to process large numbers of reads at the same time.
These reads are becoming important because they make it possible to sequence whole genomes quickly and cheaply. The techniques are used for a variety of sequencing assays, such as gene expression and splicing studies. One of the challenges associated with this technology is the so-called ‘read mapping’ problem. Many specialized programs exist that aim to map millions of short DNA reads in the minimum processing time.
Some of these programs are commercial, such as the ELAND program from Illumina, and others are open source.
Some of the open-source software for NGS projects installed on Ahsoka:
BWA: Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It implements two algorithms, bwa-short and BWA-SW. The former works for query sequences shorter than 200bp and the latter for longer sequences up to around 100kbp. Both algorithms do gapped alignment. They are usually more accurate and faster on queries with low error rates.
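The length ranges above can be turned into a small helper that picks which of the two algorithms a query falls under. This is only an illustrative sketch: the function names are hypothetical, the returned strings are the labels used in the description (not command-line subcommands), and the limits are the approximate values quoted above.

```python
# Length limits quoted in the description above; treat them as approximate.
SHORT_QUERY_LIMIT_BP = 200       # below this, the short-read algorithm applies
LONG_QUERY_LIMIT_BP = 100_000    # "around 100kbp" upper range for BWA-SW


def choose_bwa_algorithm(query_length_bp: int) -> str:
    """Return the label of the algorithm the text associates with this length.

    The returned strings are labels taken from the description, not
    command-line subcommands.
    """
    if query_length_bp <= 0:
        raise ValueError("query length must be a positive number of bases")
    if query_length_bp < SHORT_QUERY_LIMIT_BP:
        return "bwa-short"
    return "BWA-SW"


def exceeds_documented_range(query_length_bp: int) -> bool:
    """True when a query is longer than the ~100 kbp range mentioned above."""
    return query_length_bp > LONG_QUERY_LIMIT_BP
```

For example, a 36 bp read would be routed to the short-read algorithm, while a 5 kbp contig would go to BWA-SW.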
BFast: BFAST facilitates the fast and accurate mapping of short reads to reference sequences. Some advantages of BFAST include:
- Speed: enables billions of short reads to be mapped quickly.
- Accuracy: a priori probabilities for mapping reads with a defined set of variants.
- An easy way to measurably tune accuracy at the expense of speed.
FastQC: FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
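To give an idea of what a quality control check on raw sequence data looks like, the sketch below computes the mean base quality of each read in a FASTQ file and lists the reads that fall under a threshold. It is not how FastQC is implemented. It assumes well-formed four-line FASTQ records and Phred+33 quality encoding, and the function names and the threshold of 20 are hypothetical choices made for the example.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) tuples.

    Assumes every record is exactly four lines: '@id', sequence, '+', quality.
    """
    it = iter(lines)
    # zip over the same iterator consumes the input four lines at a time.
    for header, seq, _separator, qual in zip(it, it, it, it):
        header, seq, qual = header.strip(), seq.strip(), qual.strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def mean_phred(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 ASCII encoding."""
    if not quality_string:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def low_quality_reads(lines, threshold=20):
    """Return the IDs of reads whose mean Phred score is below the threshold."""
    return [
        read_id
        for read_id, _seq, qual in parse_fastq(lines)
        if mean_phred(qual) < threshold
    ]
```

With a real file, the lines could come from `open("reads.fastq")`, where the file name is a placeholder.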
SAMtools: SAMtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
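As an illustration of the kind of data these utilities handle, the sketch below parses the eleven mandatory tab-separated fields of a SAM alignment line and sorts records by reference name and position. It is a simplified stand-in and not SAMtools itself: the class and function names are hypothetical, optional tag fields are ignored, and reference names are ordered alphabetically rather than in header order.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of a SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; optional tag fields after column 11 are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


def sort_by_coordinate(lines):
    """Skip '@' header lines and sort alignments by (reference name, position).

    Simplification: references are ordered alphabetically here.
    """
    records = [parse_sam_line(line) for line in lines if not line.startswith("@")]
    return sorted(records, key=lambda rec: (rec.rname, rec.pos))
```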
BAMtools: BamTools provides a fast, flexible C++ API & toolkit for reading, writing, and manipulating BAM files.
PileLine: PileLine is a flexible command-line toolkit for efficient handling, filtering, and comparison of genomic position (GP) files produced by next-generation sequencing experiments.
TopHat: TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
The vast amount of data produced by next-generation sequencing (NGS) techniques poses significant computational challenges, and many computational steps are required to translate this output into high-quality results. Many of these bioinformatic analyses consist of a set of stages that are repetitive and routinely executed. Developing a workflow able to perform all stages in an automatic and reliable way is crucial to eliminate manual steps and to speed up result generation.
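The idea of chaining repetitive stages into an automatic workflow can be sketched in a few lines. The runner below executes named stages in order, passes the output of each stage to the next, and stops with a clear error at the first failing stage. Everything here is hypothetical and for illustration only: the stage functions are toy placeholders, not real alignment or quality control tools.

```python
class WorkflowError(RuntimeError):
    """Raised when a stage fails; records which stage stopped the workflow."""

    def __init__(self, stage_name, cause):
        super().__init__(f"workflow stopped at stage '{stage_name}': {cause}")
        self.stage_name = stage_name


def run_workflow(stages, data):
    """Run (name, function) stages in order, feeding each output to the next.

    Returns the final data and the list of completed stage names.
    """
    completed = []
    for name, stage in stages:
        try:
            data = stage(data)
        except Exception as exc:
            raise WorkflowError(name, exc) from exc
        completed.append(name)
    return data, completed


# Toy placeholder stages (hypothetical, standing in for real tools).
def drop_ambiguous_reads(data):
    """Keep only reads without the ambiguous base 'N'."""
    return {**data, "reads": [r for r in data["reads"] if "N" not in r]}


def count_reads(data):
    """Add a read count to the running result."""
    return {**data, "read_count": len(data["reads"])}


if __name__ == "__main__":
    pipeline = [("filter", drop_ambiguous_reads), ("count", count_reads)]
    result, done = run_workflow(pipeline, {"reads": ["ACGT", "ANNT", "GGCC"]})
    print(done, result["read_count"])
```

Keeping each stage as a plain function with one input and one output makes it easy to reorder, replace, or rerun a single step without touching the rest of the workflow.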