7 PARALLEL ALGORITHMS AND COMPUTATIONAL SCIENCE

A. Aggarwal and J. S. Vitter. “The Input/Output Complexity of Sorting and Related Problems,” Communications of the ACM, 31(9), September 1988, 1116-1127. A shorter version appears in “The I/O Complexity of Sorting and Related Problems,” Proceedings of the 14th Annual International Colloquium on Automata, Languages, and Programming (ICALP '87), Karlsruhe, West Germany, July 1987, published in Lecture Notes in Computer Science, 267, Springer-Verlag, Berlin.
We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match.
Secondary storage is modeled as a magnetic disk capable of transferring blocks each containing records in a single time unit; the records in each block must be input from or output to contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case .
M. H. Nodine, D. P. Lopresti and J. S. Vitter. “I/O Overhead and Parallel VLSI Architectures for Lattice Computations,” IEEE Transactions on Computers, 40 (7), July 1991, 843-852.
In this paper we introduce input/output (I/O) overhead $\psi$ as a complexity measure for VLSI implementations of two-dimensional lattice computations of the type arising in the simulation of physical systems. We show by pebbling arguments that $\psi=\Omega(n^{-1})$ when there are processing elements available. If the results are required to be observed at every generation, and no on-chip storage is allowed, we show the lower bound is the constant 2. We then examine four VLSI architectures and show that one of them, the multi-generation sweep architecture, also has I/O overhead proportional to $n^{-1}$ . We compare the constants of proportionality between the lower bound and the architecture. Finally, we prove a closed-form for the discrete minimization equation giving the optimal number of generations to compute for the multi-generation sweep architecture.
R. Tamassia and J. S. Vitter. “Parallel Transitive Closure and Point Location in Planar Structures,” SIAM Journal on Computing, 20(4), August 1991, 708-725. A shorter version appears in “Optimal Parallel Algorithms for Transitive Closure and Point Location in Planar Structures,” Proceedings of the 1st Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '89), Sante Fe, N. M., June 1989, 399-408. Also appears as an invited paper in Proceedings of the International Workshop on Discrete Algorithms and Complexity, Fukuoka, Japan, November 1989, 169-178.
Parallel algorithms for several graph and geometric problems are presented, including transitive closure and topological sorting in planar -graphs, preprocessing planar subdivisions for point location queries, and construction of visibility representations and drawings of planar graphs. Most of these algorithms achieve optimal $O(\log n)$ running time using $n /\log n$ processors in the EREW PRAM model, being the number of vertices.
J. S. Vitter and E. A. M. Shriver. “Algorithms for Parallel Memory I: Two-Level Memories,” double special issue on large-scale memories in Algorithmica, 12(2-3), 1994, 110-147. A shorter version appears in “Optimal Disk I/O with Parallel Block Transfer,” Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (STOC '90), Baltimore, MD, May 1990, 159-169.
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the secondary storage devices can simultaneously transfer a contiguous block of records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into separate physical devices. Our algorithms' performance can be significantly better than those obtained by the well-known but nonoptimal technique of disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than $\ell$ times the optimal number of I/Os is exponentially small in $\ell (\log \ell) \log (M/B)$ , where is the internal memory size.
J. S. Vitter and E. A. M. Shriver. “Algorithms for Parallel Memory II: Hierarchical Multilevel Memories,” double special issue on large-scale memories in Algorithmica, 12(2-3), 1994, 148-169. A shorter version appears in “Optimal Disk I/O with Parallel Block Transfer,” Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (STOC '90), Baltimore, MD, May 1990, 159-169.
In this paper we introduce parallel versions of two hierarchical memory models and give optimal algorithms in these models for sorting, FFT, and matrix multiplication. In our parallel models, there are memory hierarchies operating simultaneously; communication among the hierarchies takes place at a base memory level. Our optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer. The probability of using $\ell$ times the optimal running time is exponentially small in $\ell (\log \ell) \log P$ .
M. H. Nodine and J. S. Vitter. “Greed Sort: An Optimal Sorting Algorithm on Parallel Disks,” Journal of the ACM, 42(4), July 1995, 919-933. An extended abstract appears in “Large-Scale Sorting in Parallel Memories,” Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '91), Hilton Head, SC, July 1991, 29-39.
We present an algorithm for sorting efficiently with parallel two-level memories. Our main result is an elegant, easy-to-implement, optimal, detemzinistic algorithm for external sorting with disk drives. This result answers in the affirmative the open problem posed by Vitter and Shriver of whether an optimal algorithm exists that is deterministic. Our measure of performance is the number of parallel input/output (I/0) operations, in which each of the disks can simultaneously transfer a block of contiguous records. We assume that internal memory can hold records. Our algorithm sorts records in the optimal bound of deterministically, and thus it improves upon Vitter and Shriver's optimal randomized algorithm as well as the well-known deterministic but nonoptimal technique of disk striping. It is also practical to implement.
J. S. Vitter and M. H. Nodine. “Large-Scale Sorting in Uniform Memory Hierarchies,” special issue on parallel I/O systems in Journal of Parallel and Distributed Computing, 17, January 1993, 107-114. A shorter version appears in “Large-Scale Sorting in Parallel Memories,” Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '91), Hilton Head, SC, July 1991, 29-39.
We present several efficient algorithms for sorting on the uniform memory hierarchy (UMH), introduced by Alpern, Carter, and Feig, and its parallelization P-UMH. We give optimal and nearly-optimal algorithms for a wide range of bandwidth degradations, including a parsimonious algorithm for constant bandwidth. We also develop optimal sorting algorithms for all bandwidths for other versions of UMH and P-UMH, including natural restrictions we introduce called RUMH and P-RUMH, which more closely correspond to current programming languages.
M. H. Nodine and J. S. Vitter. “Optimal Deterministic Sorting on Parallel Disks.” manuscript. A shorter version appears in Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '93), Velen, Germany, June-July 1993, 120-129.
We present a load balancing technique that leads to an optimal deterministic algorithm called Balance Sort for external sorting on multiple disks. Our measure of performance is the number of input/output (I/O) operations. In each I/O, each of the disks can simultaneously transfer a block of data. Our algorithm improves upon the randomized optimal algorithm of Vitter and Shriver as well as the (non-optimal) commonly-used technique of disk striping. It also improves upon our earlier merge-based sorting algorithm in that it has smaller constants hidden in the big-oh notation, and it is possible to implement using only striped writes (but independent reads). In a companion paper, we show how to modify the algorithm to achieve optimal CPU time, even on parallel processors and parallel memory hierarchies.
M. H. Nodine and J. S. Vitter. “Optimal Deterministic Sorting on Parallel Processors and Parallel Memory Hierarchies,” manuscript. A shorter version appears in Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '93), Velen, Germany, June-July 1993, 120-129.
We present a practical deterministic load balancing strategy for distribution sort that is applicable to parallel disks and parallel memory hierarchies with both single and parallel processors. The simplest application of the strategy is an optimal deterministic algorithm called Balance Sort for external sorting on multiple disks with a single CPU, as described in the companion paper. However, the internal processing of Balance Sort does not seem parallelizable. In this paper, we develop an elegant variation that achieves full parallel speedup. The algorithms so derived are optimal for all parallel memory hierarchies with any type of a PRAM base-level interconnection and are either optimal or best-known for a hypercube interconnection. We show how to achieve optimal internal processing time as well as optimal number of I/Os in parallel two-level memories.
J. S. Vitter and J.-H. Lin. “Learning in Parallel,” Information and Computation, 92(2), February 1992, 179-202. A shorter version appears in Proceedings of the 1st Annual ACM Workshop on Computational Learning Theory (COLT '88), Cambridge, MA, August 1988, published by Morgan Kaufmann, San Mateo, CA, 106-124.
In this paper, we extend Valiant's sequential model of concept learning from examples [Valiant 1984] and introduce models for the efficient learning of concept classes from examples in parallel. We say that a concept class is ${\cal NC}$ -learnable if it can be learned in polylog time with a polynomial number of processors. We show that several concept classes which are polynomial-time learnable are ${\cal NC}$ -learnable in constant time. Some other classes can be shown to be ${\cal NC}$ -learnable in logarithmic time, but not in constant time. Our main result shows that other classes, such as -fold unions of geometrical objects in Euclidean space, which are polynomial-time learnable by a greedy set cover technique, are ${\cal NC}$ -learnable using a non-greedy technique. We also show that (unless ${\cal P}\subseteq {\cal RNC}$ ) several polynomial-time learnable concept classes related to linear programming are not ${\cal NC}$ -learnable. Equivalence of various parallel learning models and issues of fault-tolerance are also discussed.
P. G. Howard and J. S. Vitter. “Parallel Lossless Image Compression Using Huffman and Arithmetic Coding,” Information Processing Letters, 59, 1996, 65-73. A shorter version appears in Proceedings of the 1992 IEEE Data Compression Conference (DCC '92), Snowbird, UT, March 1992, 299-308.
We show that high-resolution images can be encoded and decoded efficiently in parallel. We present an algorithm based on the hierarchical MLP method, used either with Huffman coding or with a new variant of arithmetic coding called quasi-arithmetic coding. The coding step can be parallelized, even though the codes for different pixels are of different lengths; parallelization of the prediction and error modeling components is straightforward.
R. Tamassia and J. S. Vitter. “Optimal Cooperative Search in Fractional Cascaded Data Structures,” invited paper in special issue on on parallel computing in Algorithmica 15(2), February 1996, 154-171. A shorter version appears in Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '90), Crete, Greece, July 1990, 307-316.
Fractional cascading is a technique designed to allow efficient sequential search in a graph with catalogs of total size . The search consists of locating a key in the catalogs along a path. In this paper we show how to preprocess a variety of fractional cascaded data structures whose underlying graph is a tree so that searching can be done efficiently in parallel. The preprocessing takes $O(\log n)$ time with $n /\log n$ processors on an EREW PRAM. For a balanced binary tree cooperative search along root-to-leaf paths can be done in $O((\log n)/\log p)$ time using processors on a CREW PRAM. Both of these time/processor constraints are optimal. The searching in the fractional cascaded data structure can be either explicit, in which the search path is specified before the search starts, or implicit, in which the branching is determined at each node. We apply this technique to a variety of geometric problems, including point location, range search, and segment intersection search.
S. Subramanian, R. Tamassia, and J. S. Vitter. “An Efficient Parallel Algorithm for Shortest Paths in Planar Layered Digraphs,” Algorithmica, 14, 1995, 322-339. A shorter version appears in “A Divide and Conquer Approach to Shortest Paths in Planar Layered Graphs,” Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, Dallas, TX, December 1992, 176-183.
We give efficient parallel algorithms to compute shortest-paths in planar layered digraphs. We show that these digraphs admit special kinds of separators, called one-way separators, which allow paths in the graph to cross them only once. We use these separators to give divide-and-conquer solutions to the problem of finding the shortest paths. We first give a simple algorithm that works on the CREW PRAM model and computes the shortest path between any two vertices of an -node planar layered digraph in time $O(\log^3n)$ using $n /\log n$ processors. A CRCW version of this algorithm runs in $O(\log^2 n\log\log n)$ time and uses $O(n/\log\log n)$ processors. We then improve the time bound to $O(\log^2 n)$ on the CREW model and $O(\log n\log \log n)$ on the CRCW model. The processor bounds still remain $n /\log n$ for the CREW model and $n/ \log\log n$ for the CRCW model.
TPIE: Transparent Parallel I/O Programming Environment. The TPIE software project, initially begun by Darren Vengroff as part of his PhD dissertation work, is being carried on at Duke University and Aarhus University. TPIE provides a high-level implementation platform for applications that require efficient external memory access. The manual and distribution information can be found on the web site.
D. E. Vengroff and J. S. Vitter. “I/O-Efficient Scientific Computation using TPIE,” Proceedings of the Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, September 1996, published in NASA Conference Publication 3340, Volume II, 553-570.
In recent years, I/O-efficient algorithms for a wide variety of problems have appeared in the literature. Thus far, however, systems specifically designed to assist programmers in implementing such algorithms have remained scarce. TPIE is a system designed to fill this void. It supports I/O-efficient paradigms for problems from a variety of domains, including computational geometry, graph algorithms, and scientific computation. The TPIE interface frees programmers from having to deal not only of explicit read and write calls, but also the complex memory management that must be performed for I/O-efficient computation.
In this paper, we discuss applications of TPIE to problems in scientific computation. We discuss algorithmic issues underlying the design and implementation of the relevant components of TPIE and present performance results of programs written to solve a series of benchmark problems using our current TPIE prototype. Some of the benchmarks we present are based on the NAS parallel benchmarks, while others are of our own creation.
We demonstrate that the CPU overhead required to manage I/O is small and that even with just a single disk the I/O overhead of I/O-efficient computation ranges from negligible to the same order of magnitude as CPU time. We conjecture that if we use a number of disks in parallel this overhead can be all but eliminated.
R. D. Barve, E. F. Grove and J. S. Vitter. “Simple Randomized Mergesort on Parallel Disks,” special issue on parallel I/O in Parallel Computing, 23(4), 1997. A shorter version appears in Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96), Padua, Italy, June 1996, 109-118.
We consider the problem of sorting a file of records on the -disk model of parallel I/O in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of contiguous records. In each I/O operation, up to one block can be transferred to or from each of the disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting, our algorithm is able to read in, at any time, the “right” block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the “right” blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. By analysis of generalized maximum occupancy problems we are able to derive an analytical upper bound on SRM's expected overhead valid for arbitrary inputs.
The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values , , and . Average-case simulations show further improvement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number of parallel disks is small.
R. D. Barve, E. A. M. Shriver, P. Gibbons, B. Hillyer, Y. Matias, and J. S. Vitter. “Modeling and Optimizing I/O Throughput of Multiple Disks on a Bus,” Proceedings of the Joint International ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '99), Atlanta, GA, May 1999, 83-92. An abstract appears in Proceedings of the Joint ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98/PERFORMANCE '98), Madison, WI, June 1998, 264-265. Our modeling and optimization work forms the basis of two patent applications currently filed and pending with the patent office.
Slides for talk (gzip-compressed postscript)
For a wide variety of computational tasks, disk I/O continues to be a serious obstacle to high performance. To meet demanding I/O requirements, systems are designed to use multiple disk drives that share one or more I/O ports to form a disk farm or RAID array. The focus of the present paper is on systems that use multiple disks per SCSI bus. We measured the performance of concurrent random I/Os for three types of SCSI disk drives and three types of computers. The measurements enable us to study bus-related phenomena that impair performance. We describe these phenomena, and present a new I/O performance model that incorporates bus effects to predict the average throughput achieved by concurrent random I/Os that share a SCSI bus. This model, although relatively simple, predicts performance on these platforms to within 11% for fixed I/O sizes in the range 16-128 KB/s. We then describe a technique to improve the I/O throughput. This technique increases the percentage of disk head positioning time that is overlapped with data transfers, and increases the percentage of transfers that occur at bus bandwidth, rather than at disk-head bandwidth. Our technique is most effective for large I/Os and high concurrency--an important performance region for large-scale computing--our improvements are 10-20% better than the naive method for random workloads.
J. S. Vitter. “Online Data Structures in External Memory,” invited keynote paper in Proceedings of the 26th Annual International Colloquium on Automata, Languages, and Programming (ICALP '99), Prague, Czech Republic, July 1999, published in Lecture Notes in Computer Science, 1644, Springer-Verlag, Berlin, 119-133.
This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.
Slides for ICALP '99 talk (gzip-compressed postscript)
The data sets for many of today's computer applications are too large to fit within the computer's internal memory and must instead be stored on external storage devices such as disks. A major performance bottleneck can be the input/output communication (or I/O) between the external and internal memories. In this paper we discuss a variety of online data structures for external memory, some very old and some very new, such as hashing (for dictionaries), B-trees (for dictionaries and 1-D range search), buffer trees (for batched dynamic problems), interval trees with weight-balanced B-trees (for stabbing queries), priority search trees (for 3-sided 2-D range search), and R-trees and other spatial structures. We also discuss several open problems along the way.
R. Tamassia, I. G. Tollis, and J. S. Vitter. “A Parallel Algorithm for Planar Orthogonal Grid Drawings,” Parallel Processing Letters, March 2000. An extended abstract appears in “Lower Bounds and Parallel Algorithms for Planar Orthogonal Grid Drawings,” Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing (SPDP '91), Dallas, TX, December 1991, 386-393.
In this paper we consider the problem of constructing planar orthogonal grid drawings (or more simply, layouts) of graphs, with the goal of minimizing the number of bends along the edges. We present optimal parallel algorithms that construct graph layouts with maximum edge length, area, and at most bends (for biconnected graphs) and bends (for simply connected graphs). All three of these quality measures for the layouts are optimal in the worst case for biconnected graphs. The algorithm runs on a CREW PRAM in $O(\log n)$ time with $n /\log n$ processors, thus achieving optimal time and processor utilization. Applications include VLSI layout, graph drawing, and wireless communication.
J. S. Vitter. “External Memory Algorithms and Data Structures: Dealing with Massive Data,” ACM Computing Surveys, 33(2), June 2001, 209-271. Shorter versions appear in “External Memory Algorithms,” invited keynote paper in Proceedings of the 6th Annual European Symposium on Algorithms (ESA '98), Venice, August 1998, published in Lecture Notes in Computer Science, 1461, Springer, Berlin, Germany, 1-25, and in an invited tutorial in Proceedings of the 17th Annual ACM Symposium on Principles of Database Systems (PODS '98), Seattle, WA, June 1998, 119-128.
Slides for a talk (Adobe pdf format)
This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.
J. S. Vitter and D. A. Hutchinson. “Distribution Sort with Randomized Cycling,” Journal of the ACM, 53(7), July 2006, 656-680. A shorter version appeared in Proceedings of the 12th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '01), Washington, DC, January 2001.
Parallel independent disks can enhance the performance of external memory (EM) algorithms, but the programming task is often difficult. In this paper we develop randomized variants of distribution sort for use with parallel independent disks. We propose a simple variant called randomized cycling distribution sort (RCD) and prove that it has optimal expected I/O complexity. The analysis uses a novel reduction to a model with significantly fewer probabilistic interdependencies. Experimental evidence is provided to support its practicality. Other simple variants are also examined experimentally and appear to offer similar advantages to RCD. Based upon ideas in RCD we propose general techniques that transparently simulate algorithms developed for the unrealistic multihead disk model so that they can be run on the realistic parallel disk model. The simulation is optimal for two important classes of algorithms: the class of multipass algorithms, which make a complete pass through their data before accessing any element a second time, and the algorithms based upon the well-known distribution paradigm of EM computation.
D. A. Hutchinson, P. Sanders, and J. S. Vitter. “Duality Between Prefetching and Queued Writing with Parallel Disks,” SIAM Journal on Computing, 34(6), 1443-1463, June 2005. An extended abstract appears in Proceedings of the 9th Annual European Symposium on Algorithms (ESA '01), Århus, Denmark, August 2001, published in Lecture Notes in Computer Science, 2161, Springer-Verlag, Berlin, Germany.
Parallel disks promise to be a cost effective means for achieving high bandwidth in applications involving massive data sets, but algorithms for parallel disks can be difficult to devise. To combat this problem, we define a useful and natural duality between writing to parallel disks and the seemingly more difficult problem of prefetching. We first explore this duality for applications involving read-once accesses using parallel disks. We get a simple linear time algorithm for computing optimal prefetch schedules and analyze the efficiency of the resulting schedules for randomly placed data and for arbitrary interleaved accesses to striped sequences. Duality also provides an optimal schedule for the integrated caching and prefetching problem, in which blocks can be accessed multiple times. Another application of this duality gives us the first parallel disk sorting algorithms that are provably optimal up to lower order terms. One of these algorithms is a simple and practical variant of multiway merge sort, addressing a question that has been open for some time.
W.-K. Hon, R. Shah, and J. S. Vitter. “Compression, Indexing, and Retrieval for Massive String Data,” invited keynote paper in Proceedings of the 19th Annual Conference on Combinatorial Pattern Matching (CPM '10), New York, NY, June 2010, published in Lecture Notes in Computer Science, 4580 Springer, Berlin, Germany, 40-51. This keynote address from CPM 2010 gives a quick history of indexing using compressed data structures, as well as some of the current challenges to realize their full potential.
Slides for CPM '10 keynote talk (Adobe pdf)
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
M. Oguzhan Külekci, J. S. Vitter, and Bojian Xu. “Fast Pattern Matching via -bit Filtering Based Text Decomposition," invited paper from ISCIS 2010 in The Computer Journal, December 20, 2010. An extended abstract appears in “Boosting Pattern Matching Performance via -bit Filtering," Proceedings of the 25th International Symposium on Computer and Information Sciences (ISCIS '10), London, September 2010, published in Lecture Notes in Electrical Engineering, 1, 62, Springer, Berlin, Germany, 27-32.
This study explores an alternative way of storing text files to answer exact match queries faster. We decompose the original file into two parts as filter and payload. The filter part contains the most informative bits of each byte, and the remaining bits of the bytes are concatenated in order of appearance to generate the payload. We refer to this structure as -bit filtered format. When an input pattern is to be searched on the -bit filtered structure, the same decomposition is performed on the pattern. The bits from each byte of the pattern form the pattern filter bit sequence, and the rest is the payload. The pattern filter is first scanned on the filter part of the file. At each match position detected in the filter part, the pattern payload is verified against the corresponding location in the payload part of the text. Thus, instead of searching an -byte pattern on an -byte text, the first bits are scanned on bits, followed by a verification of bits on the respective locations of the matching positions. Experiments conducted on natural language texts, plain ASCII DNA sequences, and random byte sequences showed that the search performance with the proposed scheme is on average two times faster than the tested best exact pattern matching algorithms. The highest gain is obtained on plain ASCII DNA sequences. We also developed an effective bitwise pattern matching algorithm of possible independent interest within this study.
M. O. Külekci, W.-K. Hon, R. Shah, J. S. Vitter, and B. Xu. “ $\Psi$ -RA: a Parallel Sparse Index for Genomic Read Alignment,” invited paper from BIBM 2010 in BMC Genomics, 12(Suppl. 2), 2011, S7. An extended abstract appears in “ $\Psi$ -RA: A Parallel Sparse Index for Read Alignment on Genomes,” Proceedings of the IEEE International Conference on Bioinformatics & Biomedicine (BIBM '10), Hong Kong, December 2010, 663-668.
Background: Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than does direct sequencing. Among many strategies proposed for this alignment process, indexing the reference genome and short read searching over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to catch the massive short reads produced by the next generation high-throughput DNA sequencing technology.
Results: We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new short read aligner named $\Psi$ -RA (PSI-RA: parallel sparse index read aligner). The motivation in using SSAs is the ability to trade memory against time. It is possible to fine tune the space consumption of the index based on the available memory of the machine and the minimum length of the arriving pattern queries. Although SSAs have been studied before for exact matching of short reads, an elegant way of approximate matching capability was missing. We provide this by defining the rightmost mismatch criteria that prioritize the errors towards the end of the reads, where errors are more probable. $\Psi$ -RA supports any number of mismatches in aligning reads. We give comparisons with some of the well-known short read aligners, and show that indexing a genome with SSA is a good alternative to the Burrows-Wheeler transform or seed-based solutions.
Conclusions: $\Psi$ -RA is expected to serve as a valuable tool in the alignment of short reads generated by the next generation high-throughput sequencing technology. $\Psi$ -RA is very fast in exact matching and also supports rightmost approximate matching. The SSA structure that $\Psi$ -RA is built on naturally incorporates the modern multicore architecture and thus further speed-up can be gained. All the information, including the source code of $\Psi$ -RA, can be downloaded at http://www.busillis.com/o_kulekci/PSIRA.zip.
M. O. Külekci, J. S. Vitter, and B. Xu. “Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree,” under submission. An extended abstract appears in “Time- and Space-efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Trees,” Proceedings of the IEEE International Conference on Bioinformatics & Biomedicine (BIBM '10), Hong Kong, December 2010, 622-625.
Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size with the best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory (actual internal memory usage is less than 6GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
T.-H. Ku, W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. “Compressed Text Indexing With Wildcards,” in preparation. An extended abstract appears in Proceedings of the 18th International Conference on String Processing and Information Retrieval (SPIRE '11), Pisa, Italy, October 2011, published in Lecture Notes in Computer Science, Springer, Berlin, Germany.
Let $T=T_1\phi^{k_1}T_2\phi^{k_2}\cdots\phi^{k_d}T_{d+1}$ be a text of total length , where characters of each are chosen from an alphabet $\Sigma$ of size $\sigma$ , and $\phi$ denotes a wildcard symbol. The text indexing with wildcards problem is to index such that when we are given a query pattern , we can locate the occurrences of in efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNP can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only $nH_h +o(n\log \sigma)+O(d\log n)$ bits space, where is the th-order empirical entropy ( $h=o(\log_{\sigma} n)$ ) of .
Q. Yu, H. Huo, J. S. Vitter, J. Huan, and Y. Nekrich. “An Efficient Exact Algorithm for the Motif Stem Search Problem over Large Alphabets,”Categorical Range Maxima Queries IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(2), March/April 2015, 384-397, dx.doi.org/10.1109/TCBB.2014.2361668. A shorter version appears in “StemFinder: An Efficient Algorithm for Searching Motif Stems over Large Alphabets,” Proceedings of 2013 IEEE International Conference on Bioinformatics and Biomedicine (BIBM '13), Shanghai, December 2013.
In recent years, there has been increasing interest in planted $(\ell, d)$ motif search (PMS) with applications to discovering significant segments in biological sequences. However, there has been little discussion about PMS over large alphabets. This paper focuses on motif stem search (MSS), which was recently introduced to search motifs on large-alphabet inputs. A motif stem is an $\ell$ -length string with some wildcards. The goal of the MSS problem is to find a set of stems that represents a superset of all $(\ell, d)$ motifs present in the input sequences, and the superset is expected to be as small as possible. The three main contributions of this paper are as follows: (1) We build motif stem representation more precisely by using regular expressions. (2) We give a method for generating all possible motif stems without redundant wildcards. (3) We propose an efficient exact algorithm, called StemFinder, for solving the MSS problem. Compared with the previous algorithms, StemFinder runs much faster and first solves the (17, 8), (19, 9) and (21, 10) challenging instances on protein sequences; moreover, StemFinder reports fewer stems which represent a smaller superset of all $(\ell, d)$ motifs. StemFinder is freely available at http://sites.google.com/site/feqond/stemfinder.
M. Lewenstein, Y. Nekrich, and J. S. Vitter. “Space-Efficient String Indexing for Wildcard Pattern Matching,” Proceedings of the 31st Symposium on Theoretical Aspects of Computer Science (STACS '14), Lyon, France, March 2014, published in Leibniz International Proceedings in Informatics, Schloss Dagstuhl, Wadern, Germany.
In this paper we describe compressed indexes that support pattern matching queries for strings with wildcards. We present a data structure that uses $O(n \log^\epsilon n)$ bits for any $\epsilon > 0$ and reports all occ occurrences of a wildcard string in $O(\sigma^g + \log^{(3)} n + \mathop{occ})$ time, where $\sigma$ is the alphabet size, is the number of alphabet symbols, and is the number of wildcard symbols in the query string. We also present an -bit index with $O((m+ \sigma^g + \mathop{occ}) \log^\epsilon n)$ query time and an $O(n(\log \log n)^2)$ -bit index with $O((m + \sigma^g + \mathop{occ}) \log \log n)$ query time. These are the first data structures for this problem that need only bits of space.
H. Huo, L. Chen, J. S. Vitter, and Y. Nekrich. “A Practical Implementation for Compressed Suffix Arrays with Applications to Self-indexing,” in preparation. An extended abstract appears in Proceedings of the 2014 IEEE Data Compression Conference (DCC '14), Snowbird, UT, March 2014.
In this paper, we develop a simple and practical storage scheme for compressed suffix arrays (CSA). Our CSA can be constructed in linear time and needs bits of space simultaneously for any $k \leq c \log_\sigma n - 1$ and any constant , where denotes the -th order entropy. We compare the performance of our method with two established compressed indexing methods, viz. the FM-index and Sadakane's CSA. Experiments on the Canterbury Corpus and the Pizza&Chili Corpus show significant advantages of our algorithm over two other indexes in terms of compression and query time. Our storage scheme achieves better performance on all types of data present in these two corpora, except for evenly distributed data, such as DNA. The source code for our CSA is available online.
Q. Yu, H. Huo, X. Chen, H. Guo, J. S. Vitter, and J. Huan. “An Efficient Algorithm for Discovering Motifs in Large DNA Data Sets,” IEEE Transactions on NanoBioscience, 14(5), July 2015, epublished April 9, 2015. A shorter version appears in “An Efficient Motif Finding Algorithm for Large DNA Data Sets,” Proceedings of 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM '14), Belfast, UK, November 2014.
The planted $(\ell, d)$ motif discovery has been successfully used to locate transcription factor binding sites in dozens of promoter sequences over the past decade. However, there has not been enough work done in identifying $(\ell, d)$ motifs in the next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and thereby bring new challenge to make a good identification in reasonable time. To cater this need, we propose a new planted $(\ell, d)$ motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specially, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings distributedly. Experimental results on the simulated data show that i) MCES is able to identify $(\ell, d)$ motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than the state-of-the-art $(\ell, d)$ motif discovery algorithms, such as F-motif and TraverStringsR; ii) MCES is able to identify motifs without known lengths, and has a better identification accuracy than the competing algorithm CisFinder. Also, the validity of MCES is tested on real data sets. MCES is freely available at http://sites.google.com/site/feqond/mces.
Q. Yu, H. Huo, R. Zhao, D. Feng, J. S. Vitter, and J. Huan. “RefSelect: A Reference Sequence Selection Algorithm for Planted $(\ell, d)$ Motif Search," BMC Bioinformatics, 17(Suppl 9), July 19, 2016, dx.doi.org10.1186s12859-016-1130-6. A shorter version appears in “Reference Sequence Selection for Motif Searches,” Proceedings of 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM '15), Bethesda, MD, November 2015, 569-574.
The planted $(\ell, d)$ motif search (PMS) is an important yet challenging problem in computational biology. Pattern-driven PMS algorithms usually use out of input sequences as reference sequences to generate candidate motifs, and they can find all the $(\ell, d)$ motifs in the input sequences. However, most of them simply take the first sequences in the input as reference sequences without elaborate selection processes, and thus they may exhibit sharp fluctuations in running time, especially for large alphabets.
In this paper, we build the reference sequence selection problem and propose a method named RefSelect to quickly solve it by evaluating the number of candidate motifs for the reference sequences. RefSelect can bring a practical time improvement of the state-of-the-art pattern-driven PMS algorithms. Experimental results show that RefSelect (1) makes the tested algorithms solve the PMS problem steadily in an efficient way, (2) particularly, makes them achieve a speedup of up to about 100.
The proposed algorithm RefSelect can be used to solve the problem that many pattern-driven PMS algorithms present execution time instability. RefSelect requires a small amount of storage space and is capable of selecting reference sequences efficiently and effectively. Also, the parallel version of RefSelect is provided for handling large data sets.
X. Chen, H. Huo, J. Huan, J. S. Vitter, W. Zheng, and L. Zou. “MSQ-Index: A Succinct Index for Fast Graph Similarity Search,” IEEE Transactions on Knowledge and Data Engineering, November 20, 2019.
Graph similarity search has received considerable attention in many applications, such as bioinformatics, data mining, pattern recognition, and social networks. Existing methods for this problem have limited scalability because of the huge amount of memory they consume when handling very large graph databases with millions or billions of graphs.
In this paper, we study the problem of graph similarity search under the graph edit distance constraint. We present a space-efficient index structure based upon the q-gram tree that incorporates succinct data structures and hybrid encoding to achieve improved query time performance with minimal space usage. Specifically, the space usage of our index requires only 5%-15% of the previous state-of-the-art indexing size on the tested data while at the same time achieving 2-3 times acceleration in query time with small data sets. We also boost the query performance by augmenting the global filter with range search, which allows us to perform a query in a reduced region. In addition, we propose two effective filters that combine degree structures and label structures. Extensive experiments demonstrate that our proposed approach is superior in space and competitive in filtering to the state-of-the-art approaches. To the best of our knowledge, our index is the first in-memory index for this problem that successfully scales to cope with the large dataset of 25 million chemical structure graphs from the PubChem dataset.
H. Huo, P. Liu, C. Wang, H. Jiang, and J. S. Vitter. “CIndex: Compressed Indexes for Fast Retrieval of FASTQ Files,” Bioinformatics, btab655, published electronically September 15, 2021, doi.org/10.1093/bioinformatics/btab655, 9 pages.
We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures, and special tables, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate, and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides a speedup of 70-167.16 times, 1.44-35.57 times, and 1.3-55.4 times. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.
The software is available on Github: https://github.com/Hongweihuo-Lab/CIndex.
H. Huo, C. Hong, and J. S. Vitter. “Practical High-order Entropy-compressed Text Indexing,” IEEE Transactions on Knowledge and Data Engineering, 35(3), 2023, 2943-2960, doi.org/10.1109/TKDE.2021.3114401, preliminary version published electronically September 23, 2021. Source code published in Code Ocean, codeocean.com/capsule/3554560/tree/v1, November 2020.
Compressed self-indexes are used widely in string processing applications, such as information retrieval, genome analysis, data mining, and web searching. The index not only indexes the data, but also encodes the data, and it is in compressed form. Moreover, the index and the data it encodes can be operated upon directly, without need to uncompress the entire index, thus saving time while maintaining small storage space. In some applications, such as in genome analysis, existing methods do not exploit the full possibilities of compressed self-indexes, and thus we seek faster and more space-efficient indexes. In this paper, we propose a practical high-order entropy-compressed self-index for efficient pattern matching in a text. We give practical implementations of compressed suffix arrays using a hybrid encoding in the representation of the neighbor function. We analyze the performance in theory and practice of our recommended indexing method, called GeCSA. We can improve retrieval time further using an iterated version of the neighbor function. Experimental results on the tested data demonstrate that the proposed index GeCSA has good overall advantages in space usage and retrieval time over the state-of-the-art indexing methods, especially on the repetitive data.