Show icon Show search tips...
Hide icon Hide search tips...

[CCICADA-announce] DIMACS/CCICADA Interdisciplinary SeminarSeries - Monday, April 16, 2012

Linda Casals lindac at
Mon Apr 16 08:45:58 EDT 2012


DIMACS/CCICADA Interdisciplinary Seminar Series Presents
Title: Indel-sensitive Read Mapping with 2 and 3-gram 
        Frequencies and Cache Oblivious kd-Trees

Speaker: Pavel Mahmud, Rutgers University 

Date: Monday, April 16, 2012 11:00am - 12:00pm

Location: DIMACS Center, CoRE Bldg, Room 431, Rutgers University                 
             Busch Campus, Piscataway, NJ


Mapping billions of reads from Next Generation Sequencing (NGS)
experiments to reference genomes is a crucial task which requires
thousands of hours of running time on a single CPU even for the
fastest known implementations. The most frequently used methods rely
on the existence of exact q-gram matches, which are either used
explicitly in the form of q-gram filtering, or implicitly to
accelerate computations. Current approaches have difficulties dealing
with matches of large edit distance, particularly in the presence of
frequent or large insertions and deletions (indels). This is a serious
obstacle both in determining the spectrum and abundance of genetic
variations and in personal genomics.

We demonstrate the effectiveness of embedding genomes and reads in a
vector space by mapping fixed-length reads and sub-sequences of the
reference genomes to q-gram frequency vectors for q = 2 or 3. The
approximate matching problem is thus rephrased as one of finding
nearest neighbors using spatial index data structures; all q-grams in
a putative match are considered simultaneously. The L_1 distance
between frequency vectors also has the benefit of providing lower
bounds for an edit distance with affine gap-costs. With the use of a
cache-oblivious kd-tree we realize running times which match the
state-of-the-art. Our approach however is the first which can map
large proportions of reads containing insertions and deletions of
length 1--15bp. Additionally, running time and memory requirements are
constant for read lengths between 100--1000bp.

DIMACS/CCICADA Interdisciplinary Series, Complete Spring Calendar 2012

More information about the Dimacs-ccicada-announce mailing list