We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match.
Secondary storage is modeled as a magnetic disk capable of transferring P blocks each containing B records in a single time unit; the B records in each block must be input from or output to contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular, we show for P = 1 that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case B = P = O(1).
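For orientation, the sorting bound established for this model is commonly stated as follows (a sketch in the paper's parameters: N records, internal memory of size M, block size B, and P blocks transferred per I/O):

\[
\mathrm{Sort}(N) \;=\; \Theta\!\Bigl(\frac{N}{PB}\,\frac{\log(N/B)}{\log(M/B)}\Bigr)
\;=\; \Theta\!\Bigl(\frac{N}{PB}\,\log_{M/B}\frac{N}{B}\Bigr)
\quad\text{I/Os.}
\]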
In this paper we introduce input/output (I/O) overhead as
a complexity measure for VLSI implementations of two-dimensional
lattice computations of the type arising in the simulation of
physical systems. We show by pebbling arguments a lower bound on the I/O overhead when there are P processing elements available. If the results are required to be observed at every generation, and no on-chip storage is allowed, we show that the lower bound is the constant 2. We then examine four VLSI architectures and show that one of them, the multi-generation sweep architecture, also has I/O overhead proportional to this lower bound. We compare the constants of proportionality between the lower bound and the architecture. Finally, we derive a closed form for the discrete minimization equation giving the optimal number of generations to compute for the multi-generation sweep architecture.
Parallel algorithms for several graph and geometric problems are
presented, including transitive closure and topological sorting in
planar st-graphs, preprocessing planar subdivisions for point location queries, and construction of visibility representations and drawings of planar graphs. Most of these algorithms achieve optimal O(log n) running time using n/log n processors in the EREW PRAM model, n being the number of vertices.
We provide the first optimal algorithms in terms of the number of
input/outputs (I/Os) required between internal memory and multiple
secondary storage devices for the problems of sorting, FFT, matrix
transposition, standard matrix multiplication, and related problems.
Our two-level memory model is new and gives a realistic treatment of
parallel block transfer, in which during a single I/O each of the D secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with D disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into D separate
physical devices. Our algorithms' performance can be significantly better than that obtained by the well-known but nonoptimal technique of disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than l times the optimal number of I/Os is exponentially small in l (log l) log(M/B), where M is the internal memory size.
In this paper we introduce parallel versions of two hierarchical memory
models and give optimal algorithms in these models for sorting, FFT, and
matrix multiplication. In our parallel models, there are P memory hierarchies operating simultaneously; communication among the hierarchies takes place at a base memory level. Our optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer. The probability of using l times the optimal running time is exponentially small in l (log l) log M, where M is the size of the base memory level.
We present an algorithm for sorting efficiently with parallel two-level memories. Our
main result is an elegant, easy-to-implement, optimal, deterministic algorithm for external sorting with D disk drives. This result answers in the affirmative the open problem posed by Vitter and Shriver of whether an optimal algorithm exists that is deterministic. Our measure of performance is the number of parallel input/output (I/O) operations, in which each of the D disks can simultaneously transfer a block of B contiguous records. We assume that internal memory can hold M records. Our algorithm sorts N records in the optimal bound of Θ((N/(DB)) log(N/B)/log(M/B)) I/Os deterministically, and thus it improves upon Vitter and Shriver's optimal randomized algorithm as well as the well-known deterministic but nonoptimal technique of disk striping. It is also practical to implement.
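As a hedged numerical illustration (with hypothetical parameter values chosen only for this sketch, not taken from the paper), the bound above amounts to a small number of passes over the data:

\[
N = 2^{30},\quad M = 2^{24},\quad B = 2^{15},\quad D = 2^{3}
\;\Longrightarrow\;
\log_{M/B}\frac{N}{B} \;=\; \frac{\log(N/B)}{\log(M/B)} \;=\; \frac{15}{9} \;\approx\; 1.7,
\]
\[
\Theta\!\Bigl(\frac{N}{DB}\,\log_{M/B}\frac{N}{B}\Bigr)
\;\approx\; 2 \cdot \frac{2^{30}}{2^{18}} \;=\; 2^{13}
\text{ parallel I/Os, i.e., roughly two read/write passes over the file.}
\]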
We present several efficient algorithms for sorting on the uniform memory hierarchy (UMH), introduced by Alpern, Carter, and Feig, and its parallelization P-UMH. We give optimal and nearly-optimal algorithms for a wide range of bandwidth degradations, including a parsimonious algorithm for constant bandwidth. We also develop optimal sorting algorithms for all bandwidths for other versions of UMH and P-UMH, including natural restrictions we introduce called RUMH and P-RUMH, which more closely correspond to current programming languages.
We present a load balancing technique that leads to an optimal
deterministic algorithm called Balance Sort for external sorting
on multiple disks. Our measure of performance is the number of
input/output (I/O) operations. In each I/O, each of the D disks can simultaneously transfer a block of data. Our
algorithm improves upon the randomized optimal algorithm of
Vitter and Shriver as well as the (non-optimal) commonly-used
technique of disk striping. It also improves upon our earlier
merge-based sorting algorithm in that it has smaller constants
hidden in the big-oh notation, and it is possible to implement
using only striped writes (but independent reads). In a
companion paper, we show how to modify the algorithm to achieve
optimal CPU time, even on parallel processors and parallel
memory hierarchies.
We present a practical deterministic load balancing strategy for distribution sort that is applicable to parallel disks and parallel memory hierarchies with both single and parallel processors. The simplest application of the strategy is an optimal deterministic algorithm called Balance Sort for external sorting on multiple disks with a single CPU, as described in the companion paper. However, the internal processing of Balance Sort does not seem parallelizable. In this paper, we develop an elegant variation that achieves full parallel speedup. The algorithms so derived are optimal for all parallel memory hierarchies with any type of a PRAM base-level interconnection and are either optimal or best-known for a hypercube interconnection. We show how to achieve optimal internal processing time as well as optimal number of I/Os in parallel two-level memories.
In this paper, we extend Valiant's sequential model of concept learning
from examples [Valiant 1984] and introduce models for the efficient
learning of concept classes from examples in parallel. We say
that a concept class is NC-learnable if it can be learned in polylog time with a polynomial number of processors. We show that several concept classes which are polynomial-time learnable are NC-learnable in constant time. Some other classes can be shown to be NC-learnable in logarithmic time, but not in constant time. Our main result shows that other classes, such as k-fold unions of geometrical objects in Euclidean space, which are polynomial-time learnable by a greedy set cover technique, are NC-learnable using a non-greedy technique. We also show that (unless P = NC) several polynomial-time learnable concept classes related to linear programming are not NC-learnable.
Equivalence of various parallel learning models and issues of
fault-tolerance are also discussed.
We show that high-resolution images can be encoded and decoded efficiently in parallel. We present an algorithm based on the hierarchical MLP method, used either with Huffman coding or with a new variant of arithmetic coding called quasi-arithmetic coding. The coding step can be parallelized, even though the codes for different pixels are of different lengths; parallelization of the prediction and error modeling components is straightforward.
Fractional cascading is a technique designed to allow efficient
sequential search in a graph with catalogs of total size n. The search consists of locating a key in the catalogs along a path. In this paper we show how to preprocess a variety of fractional cascaded data structures whose underlying graph is a tree so that searching can be done efficiently in parallel. The preprocessing takes O(log n) time with n/log n processors on an EREW PRAM. For a balanced binary tree, cooperative search along root-to-leaf paths can be done in O(log n / log p) time using p processors on a CREW PRAM. Both of these time/processor constraints are optimal. The searching in the fractional
cascaded data structure can be either explicit, in which the search path
is specified before the search starts, or implicit, in which the
branching is determined at each node. We apply this technique to a
variety of geometric problems, including point location, range search,
and segment intersection search.
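As a standalone illustration of cooperative searching (my own sketch of the general idea, not the paper's fractional-cascading algorithm): p processors probing a sorted catalog together shrink the search range by a factor of about p + 1 per round, so the number of rounds drops from about log2 n to about log n / log(p + 1).

def cooperative_search(catalog, key, p):
    """Locate `key` in the sorted list `catalog` using p simulated processors.

    Each round, the p processors probe p evenly spaced positions of the
    current range, so the range shrinks by a factor of about p + 1.
    Returns the number of elements smaller than `key` (the insertion point).
    """
    lo, hi = 0, len(catalog)                 # search within catalog[lo:hi]
    while hi - lo > p:
        # Positions probed "in parallel" by the p processors this round.
        probes = [lo + (i + 1) * (hi - lo) // (p + 1) for i in range(p)]
        new_lo, new_hi = lo, hi
        for pos in probes:                   # simulate the parallel comparisons
            if catalog[pos] < key:
                new_lo = max(new_lo, pos + 1)
            else:
                new_hi = min(new_hi, pos)
        lo, hi = new_lo, new_hi
    while lo < hi and catalog[lo] < key:     # finish the tiny remaining range
        lo += 1
    return lo

# Example: with 4 processors only ~log_5(n) rounds are needed instead of ~log_2(n).
data = list(range(0, 1000, 3))
assert cooperative_search(data, 301, 4) == len([x for x in data if x < 301])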
We give efficient parallel algorithms to compute shortest-paths in
planar layered digraphs. We show that these digraphs admit special kinds
of separators, called one-way separators, which allow paths in the
graph to cross them only once. We use these separators to give
divide-and-conquer solutions to the problem of finding the shortest
paths. We first give a simple algorithm that works on the CREW PRAM model and computes the shortest path between any two vertices of an n-node planar layered digraph. We also give a CRCW version of this algorithm. We then improve the time bounds of both the CREW and the CRCW algorithms, while the processor bounds remain the same.
In recent years, I/O-efficient algorithms for a wide variety of problems have appeared in the literature. Thus far, however, systems specifically designed to assist programmers in implementing such algorithms have remained scarce. TPIE is a system designed to fill this void. It supports I/O-efficient paradigms for problems from a variety of domains, including computational geometry, graph algorithms, and scientific computation. The TPIE interface frees programmers from having to deal not only with explicit read and write calls, but also with the complex memory management that must be performed for I/O-efficient computation.
In this paper, we discuss applications of TPIE to problems in scientific computation. We discuss algorithmic issues underlying the design and implementation of the relevant components of TPIE and present performance results of programs written to solve a series of benchmark problems using our current TPIE prototype. Some of the benchmarks we present are based on the NAS parallel benchmarks, while others are of our own creation.
We demonstrate that the CPU overhead required to manage I/O is small and that even with just a single disk the I/O overhead of I/O-efficient computation ranges from negligible to the same order of magnitude as CPU time. We conjecture that if we use a number of disks in parallel this overhead can be all but eliminated.
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O, in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a
randomized mergesort algorithm called SRM that uses a
forecast-and-flush approach to overcome the inherent difficulties of
simple merging on parallel disks. SRM exhibits a limited use of
randomization and also has a useful deterministic version.
Generalizing the technique of forecasting, our
algorithm is able to read in, at any time, the “right” block from
any disk, and using the technique of flushing, our algorithm evicts,
without any I/O overhead, just the “right” blocks from memory to
make space for new ones to be read in. The disk layout of SRM is
such that it enjoys perfect write parallelism, avoiding fundamental
inefficiencies of previous mergesort algorithms. By analysis of
generalized maximum occupancy problems we are able to
derive an analytical upper bound on SRM's expected overhead valid
for arbitrary inputs.
The upper bound derived on the expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic values of the parameters D, M, and B. Average-case simulations show further improvement over the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number D of parallel disks is small.
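A hedged, single-disk sketch of the classical forecasting idea that SRM builds on (my own simplification; SRM's actual forecast-and-flush mechanism handles D parallel disks and memory flushing, which are not modeled here):

import heapq

def forecast_next_read(buffers, remaining):
    """Forecasting rule: among runs that still have unread blocks, the buffer
    holding the smallest maximum key will be exhausted first, so its run's
    next block is the "right" block to read ahead of time."""
    live = [i for i in buffers if remaining[i]]
    return min(live, key=lambda i: buffers[i][-1]) if live else None

def merge_with_forecasting(runs, B):
    """Single-disk sketch of a forecasting merge pass (not SRM itself)."""
    remaining = {i: run[B:] for i, run in enumerate(runs)}      # unread suffix
    buffers = {i: run[:B] for i, run in enumerate(runs) if run} # one block each
    heap = [(buf[0], i, 0) for i, buf in buffers.items()]
    heapq.heapify(heap)
    out = []
    while heap:
        key, i, j = heapq.heappop(heap)
        out.append(key)
        if j + 1 < len(buffers[i]):
            heapq.heappush(heap, (buffers[i][j + 1], i, j + 1))
        elif remaining[i]:
            # The forecast made just before this buffer emptied points at run i,
            # so the block could have been fetched ahead of time.
            assert forecast_next_read(buffers, remaining) == i  # keys distinct
            buffers[i] = remaining[i][:B]                       # "read" next block
            remaining[i] = remaining[i][B:]
            heapq.heappush(heap, (buffers[i][0], i, 0))
    return out

runs = [[1, 4, 7, 13], [2, 5, 8, 14], [3, 6, 9, 15]]
assert merge_with_forecasting(runs, 2) == sorted(sum(runs, []))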
For a wide variety of computational tasks, disk I/O continues to be a serious obstacle to high performance. To meet demanding I/O requirements, systems are designed to use multiple disk drives that share one or more I/O ports to form a disk farm or RAID array. The focus of the present paper is on systems that use multiple disks per SCSI bus. We measured the performance of concurrent random I/Os for three types of SCSI disk drives and three types of computers. The measurements enable us to study bus-related phenomena that impair performance. We describe these phenomena and present a new I/O performance model that incorporates bus effects to predict the average throughput achieved by concurrent random I/Os that share a SCSI bus. This model, although relatively simple, predicts performance on these platforms to within 11% for fixed I/O sizes in the range 16-128 KB. We then describe a technique to improve the I/O throughput. This technique increases the percentage of disk head positioning time that is overlapped with data transfers, and increases the percentage of transfers that occur at bus bandwidth rather than at disk-head bandwidth. Our technique is most effective for large I/Os and high concurrency, an important performance region for large-scale computing; in that region our technique performs 10-20% better than the naive method for random workloads.
This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.
The data sets for many of today's computer applications are too large to fit within the computer's internal memory and must instead be stored on external storage devices such as disks. A major performance bottleneck can be the input/output communication (or I/O) between the external and internal memories. In this paper we discuss a variety of online data structures for external memory, some very old and some very new, such as hashing (for dictionaries), B-trees (for dictionaries and 1-D range search), buffer trees (for batched dynamic problems), interval trees with weight-balanced B-trees (for stabbing queries), priority search trees (for 3-sided 2-D range search), and R-trees and other spatial structures. We also discuss several open problems along the way.
In this paper we consider the problem of constructing
planar orthogonal grid drawings (or more simply,
layouts) of graphs, with the goal of minimizing the number of
bends along the edges. We present optimal parallel algorithms
that construct graph layouts with small maximum edge length and area, and with at most a linear number of bends (with separate bend bounds for biconnected graphs and for simply connected graphs). All three of these quality measures for the layouts are optimal in the worst case for biconnected graphs. The algorithm runs on a CREW PRAM in O(log n) time with n/log n processors, thus achieving optimal time and processor utilization.
Applications include VLSI layout, graph drawing, and wireless communication.
This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.
Parallel independent disks can enhance the performance of external memory (EM) algorithms, but the programming task is often difficult. In this paper we develop randomized variants of distribution sort for use with parallel independent disks. We propose a simple variant called randomized cycling distribution sort (RCD) and prove that it has optimal expected I/O complexity. The analysis uses a novel reduction to a model with significantly fewer probabilistic interdependencies. Experimental evidence is provided to support its practicality. Other simple variants are also examined experimentally and appear to offer similar advantages to RCD. Based upon ideas in RCD we propose general techniques that transparently simulate algorithms developed for the unrealistic multihead disk model so that they can be run on the realistic parallel disk model. The simulation is optimal for two important classes of algorithms: the class of multipass algorithms, which make a complete pass through their data before accessing any element a second time, and the algorithms based upon the well-known distribution paradigm of EM computation.
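A hedged sketch of the randomized cycling idea as described above (an illustration of the disk-assignment rule only, with details simplified; not the full RCD algorithm):

import random

def randomized_cycling_layout(num_buckets, blocks_per_bucket, D, seed=0):
    """Assign each bucket's output blocks to disks by randomized cycling.

    Each bucket independently draws a random permutation of the D disks and
    writes its 0th, 1st, 2nd, ... block to the permuted disks cyclically.
    Returns layout[b][k] = disk holding block k of bucket b.
    (Illustrative sketch of the assignment rule, not the full RCD algorithm.)
    """
    rng = random.Random(seed)
    layout = []
    for b in range(num_buckets):
        perm = list(range(D))
        rng.shuffle(perm)                    # independent permutation per bucket
        layout.append([perm[k % D] for k in range(blocks_per_bucket)])
    return layout

# Within each bucket, any D consecutive blocks land on D distinct disks, so a
# bucket's writes are spread evenly; the independent permutations decorrelate
# which disks different buckets keep busy at the same time.
layout = randomized_cycling_layout(num_buckets=4, blocks_per_bucket=8, D=4)
for row in layout:
    assert all(len(set(row[i:i + 4])) == 4 for i in range(0, 8, 4))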
Parallel disks promise to be a cost effective means for achieving high bandwidth in applications involving massive data sets, but algorithms for parallel disks can be difficult to devise. To combat this problem, we define a useful and natural duality between writing to parallel disks and the seemingly more difficult problem of prefetching. We first explore this duality for applications involving read-once accesses using parallel disks. We get a simple linear time algorithm for computing optimal prefetch schedules and analyze the efficiency of the resulting schedules for randomly placed data and for arbitrary interleaved accesses to striped sequences. Duality also provides an optimal schedule for the integrated caching and prefetching problem, in which blocks can be accessed multiple times. Another application of this duality gives us the first parallel disk sorting algorithms that are provably optimal up to lower order terms. One of these algorithms is a simple and practical variant of multiway merge sort, addressing a question that has been open for some time.
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
This study explores an alternative way of storing text files to answer
exact match queries faster. We decompose the original file into two
parts as filter and payload. The filter part contains the k most informative bits of each byte, and the remaining bits of the bytes are concatenated in order of appearance to generate the payload. We refer to this structure as the k-bit filtered format. When an input pattern is to be searched on the k-bit filtered structure, the same decomposition is performed on the pattern. The k bits from each byte of the pattern form the pattern filter bit sequence, and the rest is the payload. The pattern filter is first scanned on the filter part of the file. At each match position detected in the filter part, the pattern payload is verified against the corresponding location in the payload part of the text. Thus, instead of searching an m-byte pattern on an n-byte text, the first k·m bits of the pattern are scanned over the k·n bits of the text filter, followed by a verification of the remaining (8−k)·m payload bits at the respective locations of the matching positions. Experiments
conducted on natural language texts, plain ASCII DNA sequences, and
random byte sequences showed that the search performance with the
proposed scheme is on average two times faster than the tested best
exact pattern matching algorithms. The highest gain is obtained on
plain ASCII DNA sequences. We also developed an effective bitwise
pattern matching algorithm of possible independent interest within
this study.
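A minimal sketch of the filter/payload scheme described above, assuming the filter simply takes the top k bits of every byte (the paper's choice of the most informative bits is more careful than this):

def split_filter_payload(data, k):
    """Split each byte into its top k bits (filter) and low 8-k bits (payload),
    concatenated in order of appearance; bit-strings are used for clarity."""
    filt = "".join(format(b, "08b")[:k] for b in data)
    payload = "".join(format(b, "08b")[k:] for b in data)
    return filt, payload

def filtered_search(text, pattern, k):
    """Report all positions where `pattern` occurs in `text` by scanning the
    k-bit filters first and verifying payload bits only at filter matches."""
    tf, tp = split_filter_payload(text, k)
    pf, pp = split_filter_payload(pattern, k)
    w = 8 - k
    hits = []
    start = tf.find(pf)
    while start != -1:
        if start % k == 0:                    # filter match at a byte boundary
            pos = start // k                  # byte position in the text
            if tp[pos * w:(pos + len(pattern)) * w] == pp:
                hits.append(pos)
        start = tf.find(pf, start + 1)
    return hits

text = b"the quick brown fox jumps over the lazy dog"
assert filtered_search(text, b"the", 3) == [0, 31]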
Background: Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than does direct sequencing. Among many strategies proposed for this alignment process, indexing the reference genome and short read searching over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to catch the massive short reads produced by the next generation high-throughput DNA sequencing technology.
Results: We concentrate on indexing DNA sequences via sparse suffix arrays
(SSAs) and propose a new short read aligner named Ψ-RA (PSI-RA:
parallel sparse index read aligner). The motivation in using SSAs is
the ability to trade memory against time. It is possible to fine tune
the space consumption of the index based on the available memory of
the machine and the minimum length of the arriving pattern queries.
Although SSAs have been studied before for exact matching of short
reads, an elegant way of approximate matching capability was missing.
We provide this by defining the rightmost mismatch criteria that
prioritize the errors towards the end of the reads, where errors are
more probable.
Ψ-RA supports any number of mismatches in aligning
reads. We give comparisons with some of the well-known short read
aligners, and show that indexing a genome with SSA is a good
alternative to the Burrows-Wheeler transform or seed-based solutions.
Conclusions: Ψ-RA is expected to serve as a valuable tool in the alignment of
short reads generated by the next generation high-throughput
sequencing technology.
Ψ-RA is very fast in exact matching and
also supports rightmost approximate matching. The SSA structure that
Ψ-RA is built on naturally incorporates the modern multicore
architecture and thus further speed-up can be gained. All the
information, including the source code of
Ψ-RA, can be downloaded
at http://www.busillis.com/o_kulekci/PSIRA.zip.
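A generic sketch of how a sparse suffix array can support exact matching when the pattern is at least as long as the sampling step (an illustration of the SSA space/time trade-off, not necessarily Ψ-RA's actual search procedure):

from bisect import bisect_left, bisect_right

def build_ssa(text, step):
    """Sparse suffix array: sorted positions of suffixes starting at multiples
    of `step` only, so it uses roughly 1/step the space of a full suffix array."""
    return sorted(range(0, len(text), step), key=lambda i: text[i:])

def ssa_search(text, ssa, step, pattern):
    """Exact matching with an SSA, assuming len(pattern) >= step.

    For each offset j < step, the pattern suffix starting at j begins at some
    sampled position; binary-search that suffix, then verify the first j
    characters of the pattern just to the left of the sampled position."""
    assert len(pattern) >= step
    hits = set()
    for j in range(step):
        suffix = pattern[j:]
        keys = [text[p:p + len(suffix)] for p in ssa]   # built here for clarity
        lo = bisect_left(keys, suffix)
        hi = bisect_right(keys, suffix)
        for p in ssa[lo:hi]:
            start = p - j
            if start >= 0 and text[start:start + j] == pattern[:j]:
                hits.add(start)
    return sorted(hits)

text = "abracadabra"
ssa = build_ssa(text, 2)
assert ssa_search(text, ssa, 2, "abra") == [0, 7]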
Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size with the best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory (actual internal memory usage is less than 6GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
Let T = T_1 φ T_2 φ ... φ T_d be a text of total length n, where the characters of each T_i are chosen from an alphabet Σ of size σ, and φ denotes a wildcard symbol. The text indexing with wildcards problem is to index T such that when we are given a query pattern P, we can locate the occurrences of P in T efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNP can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only nH_h + o(n log σ) bits of space, where H_h is the h-th order empirical entropy (h = o(log_σ n)) of the text.
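A tiny sketch of the matching semantics being indexed (a naive scan purely to illustrate the definition; the paper's contribution is a compressed index, not this scan). Here '?' stands for the wildcard symbol φ:

WILDCARD = "?"   # stands for a wildcard position in the text (e.g., a SNP site)

def wildcard_occurrences(text, pattern):
    """Positions where `pattern` occurs in `text`, where a wildcard character
    in the text matches any single symbol of the pattern."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if all(text[i + j] == WILDCARD or text[i + j] == pattern[j]
                   for j in range(m))]

# 'ACG?T...' models a sequence with one SNP position.
assert wildcard_occurrences("ACG?TACGAT", "CGAT") == [1, 6]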
In recent years, there has been increasing interest in planted motif search (PMS) with applications to
discovering significant segments in biological sequences. However, there has been little discussion about PMS over large
alphabets. This paper focuses on motif stem search (MSS), which was recently introduced to search motifs on large-alphabet
inputs. A motif stem is an l-length string with some wildcards. The goal of the MSS problem is to find a set of stems that represents a superset of all (l, d) motifs present in the input sequences, and the superset is expected to be as small as possible.
The three main contributions of this paper are as follows: (1) We build motif stem representation more precisely by using regular
expressions. (2) We give a method for generating all possible motif stems without redundant wildcards. (3) We propose an
efficient exact algorithm, called StemFinder, for solving the MSS problem. Compared with the previous algorithms, StemFinder
runs much faster and first solves the (17, 8), (19, 9) and (21, 10) challenging instances on protein sequences; moreover,
StemFinder reports fewer stems, which represent a smaller superset of all (l, d) motifs. StemFinder is freely available at
http://sites.google.com/site/feqond/stemfinder.
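As a toy illustration of representing a motif stem with wildcards as a regular expression (my own minimal example; StemFinder's stem representation is more precise than a bare wildcard pattern):

import re

def stem_to_regex(stem):
    """Turn a motif stem such as 'AC?GT' (with '?' wildcards) into a regular
    expression that matches any concrete l-length motif the stem represents."""
    return re.compile("".join("." if c == "?" else re.escape(c) for c in stem))

stem = stem_to_regex("AC?GT")
assert stem.fullmatch("ACAGT") and stem.fullmatch("ACTGT")
assert not stem.fullmatch("AAAGT")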
In this paper we describe compressed indexes that support pattern matching queries for strings with wildcards. We present a data structure that uses compressed space and reports all occ occurrences of a wildcard string, with query time depending on the pattern length, the alphabet size σ, and the number of wildcard symbols in the query string. We also present two additional indexes that offer different trade-offs between space and query time. These are the first data structures for this problem that need only O(n log σ) bits of space.
In this paper, we develop a simple and practical storage scheme for
compressed suffix arrays (CSA). Our CSA can be constructed in linear time and
needs O(nH_k + n) bits of space simultaneously for any k ≤ α log_σ n and any constant 0 < α < 1, where H_k denotes the k-th order entropy. We compare the
performance of our method with two established compressed indexing methods,
viz. the FM-index and Sadakane's CSA. Experiments on the Canterbury Corpus and
the Pizza&Chili Corpus show significant advantages of our algorithm over two
other indexes in terms of compression and query time. Our storage scheme
achieves better performance on all types of data present in these two corpora,
except for evenly distributed data, such as DNA. The source code for our CSA is
available online.
Planted motif discovery has been successfully used to locate transcription factor binding sites in dozens of promoter sequences over the past decade. However, there has not been enough work done in identifying (l, d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and thereby bring a new challenge: making a good identification in reasonable time. To meet this need, we propose a new planted (l, d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. In particular, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings in a distributed fashion. Experimental results on the simulated data show that i) MCES is able to identify (l, d) motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than state-of-the-art (l, d) motif discovery algorithms, such as F-motif and TraverStringsR; ii) MCES is able to identify motifs without known lengths, and has a better identification accuracy than the competing algorithm CisFinder. Also, the validity of MCES is tested on real data sets. MCES is freely available at http://sites.google.com/site/feqond/mces.
The planted motif search (PMS) is an important yet challenging problem in computational biology. Pattern-driven PMS algorithms usually use k out of t input sequences as reference sequences to generate candidate motifs, and they can find all the (l, d) motifs in the input sequences. However, most of them simply take the first k sequences in the input as reference sequences without an elaborate selection process, and thus they may exhibit sharp fluctuations in running time, especially for large alphabets.
In this paper, we formulate the reference sequence selection problem and propose a method named RefSelect to quickly solve it by evaluating the number of candidate motifs generated for the reference sequences. RefSelect can bring a practical time improvement to state-of-the-art pattern-driven PMS algorithms. Experimental results show that RefSelect (1) makes the tested algorithms solve the PMS problem steadily in an efficient way, and (2) in particular, makes them achieve a speedup of up to about 100.
The proposed algorithm RefSelect can be used to address the execution time instability exhibited by many pattern-driven PMS algorithms. RefSelect requires a small amount of storage space and is capable of selecting reference sequences efficiently and effectively. Also, a parallel version of RefSelect is provided for handling large data sets.
Graph similarity search has received considerable attention in many applications, such as bioinformatics, data mining, pattern recognition, and social networks. Existing methods for this problem have limited scalability because of the huge amount of memory they consume when handling very large graph databases with millions or billions of graphs.
In this paper, we study the problem of graph similarity search under the graph edit distance constraint. We present a space-efficient index structure based upon the q-gram tree that incorporates succinct data structures and hybrid encoding to achieve improved query time performance with minimal space usage. Specifically, the space usage of our index requires only 5%-15% of the previous state-of-the-art indexing size on the tested data while at the same time achieving 2-3 times acceleration in query time with small data sets. We also boost the query performance by augmenting the global filter with range search, which allows us to perform a query in a reduced region. In addition, we propose two effective filters that combine degree structures and label structures. Extensive experiments demonstrate that our proposed approach is superior in space and competitive in filtering to the state-of-the-art approaches. To the best of our knowledge, our index is the first in-memory index for this problem that successfully scales to cope with the large dataset of 25 million chemical structure graphs from the PubChem dataset.
We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures, and special tables, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate, and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides a speedup of 70-167.16 times, 1.44-35.57 times, and 1.3-55.4 times. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.
The software is available on Github: https://github.com/Hongweihuo-Lab/CIndex.
Compressed self-indexes are used widely in string processing applications, such as information retrieval, genome analysis, data mining, and web searching. The index not only indexes the data but also encodes the data, and it is in compressed form. Moreover, the index and the data it encodes can be operated upon directly, without the need to decompress the entire index, thus saving time while maintaining small storage space. In some applications, such as genome analysis, existing methods do not exploit the full possibilities of compressed self-indexes, and thus we seek faster and more space-efficient indexes. In this paper, we propose a practical high-order entropy-compressed self-index for efficient pattern matching in a text. We give practical implementations of compressed suffix arrays using a hybrid encoding in the representation of the neighbor function. We analyze the performance, in theory and practice, of our recommended indexing method, called GeCSA. We can improve retrieval time further using an iterated version of the neighbor function. Experimental results on the tested data demonstrate that the proposed index GeCSA has good overall advantages in space usage and retrieval time over state-of-the-art indexing methods, especially on repetitive data.
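As a hedged illustration of the neighbor function that GeCSA encodes (a plain, uncompressed computation of Psi for a toy string; GeCSA's hybrid encoding and iterated neighbor function are not shown):

def suffix_array(text):
    """Plain O(n^2 log n) suffix array, adequate for a small example."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def neighbor_function(sa):
    """Psi[i] = position in the suffix array of the suffix that starts one
    character after suffix SA[i]; iterating Psi walks through the text."""
    rank = {pos: i for i, pos in enumerate(sa)}   # inverse suffix array
    n = len(sa)
    return [rank[(sa[i] + 1) % n] for i in range(n)]

text = "banana$"                  # '$' is the unique end-of-string sentinel
sa = suffix_array(text)
psi = neighbor_function(sa)

# Extract the text by following Psi starting from the rank of the full suffix:
i = sa.index(0)
decoded = ""
for _ in range(len(text)):
    decoded += text[sa[i]]        # in a real CSA this character is recovered
    i = psi[i]                    # from the compressed index, not from `text`
assert decoded == "banana$"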