We examine a version of the dynamic dictionary problem in which stored items have expiration times and can be removed from the dictionary once they have expired. We show that under several reasonable assumptions about the distribution of the items, hashing with lazy deletion uses little more space than methods that use eager deletion. The simple algorithm suggested by this observation was used in a program for analyzing integrated circuit artwork.
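As a concrete illustration (a minimal sketch under assumed names such as LazyDict, not the circuit-artwork program itself), hashing with lazy deletion keeps expired items in the table and discards them only when their hash bucket is next touched:

class LazyDict:
    """Chained hash table with lazy deletion: expired items are purged only
    when their bucket is touched by a later insert or lookup. The interface
    (explicit 'now' timestamps) is illustrative, not taken from the paper."""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def _purge(self, bucket, now):
        # Lazy deletion: drop expired items from this bucket only.
        bucket[:] = [(k, v, exp) for (k, v, exp) in bucket if exp > now]

    def insert(self, key, value, expire_time, now):
        bucket = self._bucket(key)
        self._purge(bucket, now)
        bucket.append((key, value, expire_time))

    def lookup(self, key, now):
        bucket = self._bucket(key)
        self._purge(bucket, now)
        for k, v, _ in bucket:
            if k == key:
                return v
        return None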
Several new methods are presented for selecting n records at random without replacement from a file containing N records. Each algorithm selects the records for the sample in a sequential manner -- in the same order the records appear in the file. The algorithms are online in that the records for the sample are selected iteratively with no preprocessing. The algorithms require a constant amount of space and are short and easy to implement. The main result of this paper is the design and analysis of Algorithm D, which does the sampling in O(n) time, on the average; roughly n uniform random variates are generated, and approximately n exponentiation operations (of the form a^b, for real numbers a and b) are performed during the sampling. This solves an open problem in the literature. CPU timings on a large mainframe computer indicate that Algorithm D is significantly faster than the sampling algorithms in use today.
For an improved and optimized version of the random sampling method, see a related paper.
For reservoir methods, where N is not known in advance, see a related paper.
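For orientation, here is a minimal Python sketch of the classical sequential technique (selection sampling, often called Algorithm S) that Algorithm D improves upon; it is an illustration of that baseline, not Vitter's Algorithm D itself.

import random

def sequential_sample(n, N):
    """Select n of the record indices 0..N-1 sequentially, without replacement.

    Each record is accepted with probability (remaining sample size) /
    (remaining records), which yields a uniform random sample in one
    left-to-right pass using constant extra space. This is the classical
    O(N)-time method; Algorithm D achieves O(n) expected time by skipping
    directly over rejected records.
    """
    sample = []
    needed = n
    for i in range(N):
        if random.random() < needed / (N - i):
            sample.append(i)
            needed -= 1
            if needed == 0:
                break
    return sample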
We introduce fast algorithms for selecting a random sample of n records without replacement from a pool of N records, where the value of N is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in O(n(1 + log(N/n))) expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.
For sampling methods where N is known in advance, see a related paper.
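For orientation, the naive one-pass reservoir technique (Algorithm R) that Algorithm Z speeds up can be sketched as follows; the stream interface is illustrative.

import random

def reservoir_sample(stream, n):
    """Maintain a uniform random sample of size n from a stream of unknown
    length (Algorithm R). Each arriving record t (0-indexed) replaces a
    random reservoir slot with probability n/(t+1). Algorithm Z produces the
    same distribution but skips over many records per random variate,
    giving roughly O(n(1 + log(N/n))) expected time (as stated in the
    abstract above) instead of O(N).
    """
    reservoir = []
    for t, record in enumerate(stream):
        if t < n:
            reservoir.append(record)
        else:
            j = random.randint(0, t)   # uniform in 0..t
            if j < n:
                reservoir[j] = record
    return reservoir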
The aim of this chapter is to describe the main mathematical methods and applications in the average-case analysis of algorithms and data structures. It comprises two parts: First, we present basic combinatorial enumerations based on symbolic methods and asymptotic methods with emphasis on complex analysis techniques (such as singularity analysis, saddle point, Mellin transforms). Next, we show how to apply these general methods to the analysis of sorting, searching, tree data structures, hashing, and dynamic algorithms. The emphasis is on algorithms for which exact “analytic models” can be derived.
We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match.
Secondary storage is modeled as a magnetic disk capable of transferring P blocks each containing B records in a single time unit; the B records in each block must be input from or output to B contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for P = 1 that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case B = P = 1.
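For reference, with N records, internal memory M, block size B, and P blocks transferred per I/O (the parameters defined above), the tight sorting bound takes the standard form below; this restatement follows the usual presentation of the result and is added here only for convenience.

\[
  \mathrm{Sort}(N) \;=\;
  \Theta\!\left( \frac{N}{PB} \, \log_{M/B} \frac{N}{B} \right)
  \;=\;
  \Theta\!\left( \frac{N}{PB} \cdot \frac{\log (N/B)}{\log (M/B)} \right)
  \quad \text{I/Os.}
\]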
This paper presents an improved and optimized version of the random sampling method from J. S. Vitter, “Faster Methods for Random Sampling,” Communications of the ACM, 27(7), July 1984, 703-718. The object is to choose in sequential online fashion a random sample of size n from a universe of size N.
For reservoir methods, where N is not known in advance, see a related paper.
We answer questions about the distribution of the maximum
size of queues and data structures as a function of time.
The concept of "maximum" occurs in many issues of resource
allocation. We consider several models of growth, including
general birth-and-death processes, the M/G/∞ model, and a
non-Markovian process (data structure) for processing
plane-sweep information in computational geometry, called
"hashing with lazy deletion" (HwLD). It has been shown that
HwLD is optimal in terms of expected time and dynamic space;
our results show that it is also optimal in terms of
expected preallocated space, up to a constant factor.
We take two independent and complementary approaches: first, in
Section 2, we use a variety of algebraic and analytical
techniques to derive exact formulas for the distribution of
the maximum queue size in stationary birth-and-death
processes and in a nonstationary model related to file
histories. The formulas allow numerical evaluation and some
asymptotics. In our second approach, in Section 3, we
consider the M/G/∞ model (which includes M/M/∞ as a
special case) and use techniques from the analysis of
algorithms to get optimal big-oh bounds on the expected
maximum queue size and on the expected maximum amount of
storage used by HwLD in excess of the optimal amount. The
techniques appear extendible to other models, such as M/M/1.
This paper develops two probabilistic methods that allow the analysis of the maximum data structure size encountered during a sequence of insertions and deletions in data structures such as priority queues, dictionaries, linear lists, and symbol tables, and in sweepline structures for geometry and Very-Large-Scale-Integration (VLSI) applications. The notion of the "maximum" is basic to issues of resource preallocation. The methods here are applied to combinatorial models of file histories and probabilistic models, as well as to a non-Markovian process (algorithm) for processing sweepline information in an efficient way, called "hashing with lazy deletion" (HwLD). Expressions are derived for the expected maximum data structure size that are asymptotically exact, that is, correct up to lower-order terms; in several cases of interest the expected value of the maximum size is asymptotically equal to the maximum expected size. This solves several open problems, including longstanding questions in queueing theory. Both of these approaches are robust and rely upon novel applications of techniques from the analysis of algorithms. At a high level, the first method isolates the primary contribution to the maximum and bounds the lesser effects. In the second technique the continuous-time probabilistic model is related to its discrete analog, the maximum slot occupancy in hashing.
We present efficient new randomized and deterministic methods for transforming optimal solutions for a type of relaxed integer linear program into provably good solutions for the corresponding NP-hard discrete optimization problem. Without any constraint violation, the ε-approximation problem for many problems of this type is itself NP-hard. Our methods provide polynomial-time ε-approximations while attempting to minimize the packing constraint violation.
Our methods lead to the first known approximation algorithms with
provable performance guarantees for the k-median problem, the
tree pruning problem, and the generalized assignment
problem. These important problems have numerous applications to data
compression, vector quantization, memory-based learning, computer
graphics, image processing, clustering, regression, network location,
scheduling, and communication. We provide evidence via reductions that
our approximation algorithms are nearly optimal in terms of the packing
constraint violation. We also discuss some recent applications of our
techniques to scheduling problems.
In this paper we present approximation algorithms for median problems
in metric spaces and fixed-dimensional Euclidean space. Our
algorithms use a new method for transforming an optimal solution of
the linear program relaxation of the k-median problem into a
provably good integral solution. This transformation technique is
fundamentally different from the methods of randomized and
deterministic rounding by Raghavan and the methods proposed in the
authors' earlier work in the following way: Previous techniques never
set variables with zero values in the fractional solution to the value
of 1. This departure from previous methods is crucial for the success
of our algorithms.
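For background only, here is a minimal Python sketch of generic Raghavan-style randomized rounding; it is not the transformation technique of this paper, which, as noted above, can also set variables with zero fractional value to 1.

import random

def randomized_round(fractional_x):
    """Round a fractional LP solution x in [0,1]^n to a 0-1 vector by setting
    each coordinate to 1 independently with probability x_i. This is the
    classical randomized-rounding baseline; in particular it never rounds a
    zero-valued variable up to 1, which is exactly the limitation the new
    technique removes."""
    return [1 if random.random() < xi else 0 for xi in fractional_x]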
We present new vector quantization algorithms based on the theory developed by the authors. The new approach is to formulate a vector quantization problem as a 0-1 integer linear program. We first solve its relaxed linear program by linear programming techniques. Then we transform the linear program solution into a provably good solution for the vector quantization problem. These methods lead to the first known polynomial-time full-search vector quantization codebook design algorithm and tree pruning algorithm with provable worst-case performance guarantees. We also introduce the notion of pseudo-random pruned tree-structured vector quantizers. Initial experimental results on image compression are very encouraging.
A memory-based learning system is an extended memory management system
that decomposes the input space either statically or dynamically into
subregions for the purpose of storing and retrieving functional
information. The main generalization techniques employed by memory-based
learning systems are the nearest-neighbor search, space decomposition
techniques, and clustering. Research on memory-based learning is still
in its early stage. In particular, there are very few rigorous
theoretical results regarding memory requirement, sample size, expected
performance, and computational complexity. In this paper, we propose a
model for memory-based learning and use it to analyze several
methods -- ε-covering, hashing, clustering, tree-structured
clustering, and receptive-fields -- for learning smooth functions. The
sample size and system complexity are derived for each method. Our
model is built upon the generalized PAC learning model of Haussler and
is closely related to the method of vector quantization in data
compression. Our main result is that we can build memory-based learning
systems using new clustering algorithms of Lin and Vitter to PAC-learn
in polynomial time using only polynomial storage in typical situations.
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the D secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with D disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into D separate physical devices. Our algorithms' performance can be significantly better than those obtained by the well-known but nonoptimal technique of
disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than ℓ times the optimal number of I/Os is exponentially small in ℓ (log ℓ) log(M/B), where M is the internal memory size.
In this paper we introduce parallel versions of two hierarchical memory
models and give optimal algorithms in these models for sorting, FFT, and
matrix multiplication. In our parallel models, there are P memory hierarchies operating simultaneously; communication among the hierarchies takes place at a base memory level. Our optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer. The probability of using ℓ times the optimal running time is exponentially small in ℓ (log ℓ) log M.
We present an algorithm for sorting efficiently with parallel two-level memories. Our main result is an elegant, easy-to-implement, optimal, deterministic algorithm for external sorting with D disk drives. This result answers in the affirmative the open problem posed by Vitter and Shriver of whether an optimal algorithm exists that is deterministic. Our measure of performance is the number of parallel input/output (I/O) operations, in which each of the D disks can simultaneously transfer a block of B contiguous records. We assume that internal memory can hold M records. Our algorithm sorts N records in the optimal bound of Θ((N/(DB)) log_{M/B}(N/B)) deterministically, and thus it improves upon Vitter and Shriver's optimal randomized algorithm as well as the well-known deterministic but nonoptimal technique of disk striping. It is also practical to implement.
We present several efficient algorithms for sorting on the uniform memory hierarchy (UMH), introduced by Alpern, Carter, and Feig, and its parallelization P-UMH. We give optimal and nearly-optimal algorithms for a wide range of bandwidth degradations, including a parsimonious algorithm for constant bandwidth. We also develop optimal sorting algorithms for all bandwidths for other versions of UMH and P-UMH, including natural restrictions we introduce called RUMH and P-RUMH, which more closely correspond to current programming languages.
We present a load balancing technique that leads to an optimal
deterministic algorithm called Balance Sort for external sorting
on multiple disks. Our measure of performance is the number of
input/output (I/O) operations. In each I/O, each of the D disks can simultaneously transfer a block of data. Our
algorithm improves upon the randomized optimal algorithm of
Vitter and Shriver as well as the (non-optimal) commonly-used
technique of disk striping. It also improves upon our earlier
merge-based sorting algorithm in that it has smaller constants
hidden in the big-oh notation, and it is possible to implement
using only striped writes (but independent reads). In a
companion paper, we show how to modify the algorithm to achieve
optimal CPU time, even on parallel processors and parallel
memory hierarchies.
We present a practical deterministic load balancing strategy for distribution sort that is applicable to parallel disks and parallel memory hierarchies with both single and parallel processors. The simplest application of the strategy is an optimal deterministic algorithm called Balance Sort for external sorting on multiple disks with a single CPU, as described in the companion paper. However, the internal processing of Balance Sort does not seem parallelizable. In this paper, we develop an elegant variation that achieves full parallel speedup. The algorithms so derived are optimal for all parallel memory hierarchies with any type of a PRAM base-level interconnection and are either optimal or best-known for a hypercube interconnection. We show how to achieve optimal internal processing time as well as optimal number of I/Os in parallel two-level memories.
In this paper we introduce the notion of approximate data structures, in which a small amount of error is tolerated in the output. Approximate data structures trade error of approximation for faster operation, leading to theoretical and practical speedups for a wide variety of algorithms. We give approximate variants of the van Emde Boas data structure, which support the same dynamic operations as the standard van Emde Boas data structure, except that answers to queries are approximate. The variants support all operations in constant time provided the error of approximation is 1/polylog(n), and in O(log log n) time provided the error is 1/polynomial(n), for n elements in the data structure.
We consider the tolerance of prototypical algorithms to approximate data
structures. We study in particular Prim's minimum spanning tree
algorithm, Dijkstra's single-source shortest paths algorithm, and an
on-line variant of Graham's convex hull algorithm. To obtain output
which approximates the desired output with the error of approximation
tending to zero, Prim's algorithm requires only linear time, Dijkstra's
algorithm requires O(m log log n) time (for a graph with n vertices and m edges), and the on-line variant of Graham's algorithm requires constant amortized time per operation.
In this paper, we consider the problem of using disk blocks efficiently
in searching graphs that are too large to fit in internal memory. Our
model allows a vertex to be represented any number of times on the disk
in order to take advantage of redundancy. We give matching upper and
lower bounds for complete d-ary trees and d-dimensional grid
graphs, as well as for classes of general graphs that intuitively
speaking have a close to uniform number of neighbors around each vertex.
We present a collection of new techniques for designing and analyzing efficient external-memory algorithms for graph problems and illustrate how these techniques can be applied to a wide variety of specific problems.
Our techniques apply to a number of problems, including list ranking, which we discuss in detail, finding Euler tours, expression-tree evaluation, centroid decomposition of a tree, least-common ancestors, minimum spanning tree verification, connected and biconnected components, minimum spanning forest, ear decomposition, topological sorting, reachability, graph drawing, and visibility representation.
In the design of algorithms for large-scale applications it is essential to consider the problem of minimizing I/O communication. Geographical information systems (GIS) are good examples of such large-scale applications as they frequently handle huge amounts of spatial data. In this paper we develop efficient new external-memory algorithms for a number of important problems involving line segments in the plane, including trapezoid decomposition, batched planar point location, triangulation, red-blue line segment intersection reporting, and general line segment intersection reporting. In GIS systems, the first three problems are useful for rendering and modeling, and the latter two are frequently used for overlaying maps and extracting information from them.
To solve these problems, we combine and modify in novel ways several of the previously known techniques for designing efficient algorithms for external memory. We also develop a powerful new technique that can be regarded as a practical external memory version of fractional cascading. Except for the batched planar point location problem, no algorithms specifically designed for external memory were previously known for these problems. Our algorithms for triangulation and line segment intersection partially answer previously posed open problems, while the batched planar point location algorithm improves on the previously known solution, which applied only to monotone decompositions. Our algorithm for the red-blue line segment intersection problem is provably optimal.
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting, our algorithm is able to read in, at any time, the “right” block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the “right” blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. By analysis of generalized maximum occupancy problems we are able to derive an analytical upper bound on SRM's expected overhead valid for arbitrary inputs.

The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values of D, M, and B. Average-case simulations show further improvement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number D of parallel disks is small.
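The classical forecasting rule that SRM generalizes can be illustrated with a single-disk sketch; the parallel-disk layout, flushing, and randomized striping of SRM are not modeled here, and the function below is only an assumed illustration of that rule.

def merge_with_forecasting(runs, block_size):
    """Illustrative multiway merge using Knuth-style forecasting on one disk.

    The run whose in-memory block has the smallest last key is the run whose
    buffer empties first, so its next block can be read before it is needed.
    A single prefetch slot is modeled. Runs must be sorted and non-empty.
    """
    blocks = [[run[i:i + block_size] for i in range(0, len(run), block_size)]
              for run in runs]
    buffers = [blks.pop(0) for blks in blocks]     # one in-memory block per run

    def prefetch():
        # Forecast: fetch next for the run whose buffered block ends with the
        # smallest key, among runs that still have blocks on disk.
        live = [r for r in range(len(runs)) if buffers[r] and blocks[r]]
        if not live:
            return None
        r = min(live, key=lambda r: buffers[r][-1])
        return (r, blocks[r].pop(0))

    slot = prefetch()
    output = []
    while any(buffers):
        r = min((i for i in range(len(runs)) if buffers[i]),
                key=lambda i: buffers[i][0])       # smallest head key wins
        output.append(buffers[r].pop(0))
        if not buffers[r]:                         # buffer exhausted: refill
            if slot and slot[0] == r:
                buffers[r] = slot[1]               # forecasted block is ready
                slot = prefetch()
            elif blocks[r]:
                buffers[r] = blocks[r].pop(0)      # corner case with equal keys
    return output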
In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting strings of total length N is Θ(N log N). By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting strings is O((N/B) log_{M/B}(N/B)), but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where one is not allowed to break the strings into their characters, and we show that the I/O complexity of string sorting in this model is Θ((N_1/B) log_{M/B}(N_1/B) + N_2 log_{M/B} N_2 + N/B), where N_1 is the total length of all strings shorter than B and N_2 is the number of strings longer than B. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms without assuming the comparison model.
External sorting is a fundamental operation in many large scale data processing systems not only for producing sorted output but also as a core subroutine in many operations. Technology trends indicate that developing techniques that effectively use multiple disks in parallel in order to speed up the performance of external sorting is of prime importance. The simple randomized merging (SRM) mergesort algorithm proposed in our earlier work is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth (in the new edition of The Art of Computer Programming, Vol. 3: Sorting and Searching) recently identified SRM (which he calls “randomized striping”) as the method of choice for sorting with parallel disks.
In this paper, we present an efficient implementation of SRM, based upon novel data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to efficiently read blocks of input runs during external merging.
We present the performance of SRM over a wide range of input sizes
and compare its performance with that of disk-striped
mergesort (DSM), the commonly used technique to
implement external mergesort on parallel disks. DSM consists
of using a standard mergesort algorithm in conjunction with
striped I/O for parallel disk access. SRM merges together
significantly more runs at a time compared with DSM, and thus it
requires fewer merge passes. We demonstrate in practical scenarios
that even though the streaming speeds for merging with DSM are a
little higher than those for SRM (since DSM merges fewer runs at a
time), sorting using SRM is significantly faster than with DSM,
since SRM requires fewer passes.
The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort, multiway partitioning of a file into several other files, and some potential multimedia streaming applications.
This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.
The data sets for many of today's computer applications are too large to fit within the computer's internal memory and must instead be stored on external storage devices such as disks. A major performance bottleneck can be the input/output communication (or I/O) between the external and internal memories. In this paper we discuss a variety of online data structures for external memory, some very old and some very new, such as hashing (for dictionaries), B-trees (for dictionaries and 1-D range search), buffer trees (for batched dynamic problems), interval trees with weight-balanced B-trees (for stabbing queries), priority search trees (for 3-sided 2-D range search), and R-trees and other spatial structures. We also discuss several open problems along the way.
We consider the problem of devising external memory algorithms whose memory allocations can change dynamically and unpredictably at run-time. The investigation of “memory-adaptive” algorithms, which are designed to adapt to dynamically changing memory allocations, can be considered a natural extension of the investigation of traditional, non-adaptive external memory algorithms. Our study is motivated by high performance database systems and operating systems in which applications are prioritized and internal memory is dynamically allocated in accordance with the priorities. In such situations, external memory applications are expected to perform as well as possible for the current memory allocation. The computation must be reorganized to adapt to the sequence of memory allocations in an online manner.
In this paper we present a simple and natural dynamic memory allocation model. We define memory-adaptive external memory algorithms and specify what is needed for them to be dynamically optimal. Using novel techniques, we design and analyze dynamically optimal memory-adaptive algorithms for the problems of sorting, permuting, FFT, permutation networks, (standard) matrix multiplication and LU decomposition. We also present a dynamically optimal (in an amortized sense) memory-adaptive version of the buffer tree, a generic external memory data structure for a large number of batched dynamic applications. We show that a previously devised approach to memory-adaptive external mergesort is provably nonoptimal because of fundamental drawbacks. The lower bound proof techniques for sorting and matrix multiplication are fundamentally distinct techniques, and they are invoked by most other external memory lower bounds; hence we anticipate that the techniques presented here will apply to many external memory problems.
Many data sets to be sorted consist of a limited number of
distinct keys. Sorting such data sets can be thought of as
bundling together identical keys and having the bundles placed in
order; we therefore denote this as bundle sorting. We
describe an efficient algorithm for bundle sorting in external
memory that requires at most c (N/B) log_{M/B} k disk accesses, where N is the number of keys, M is the size of internal memory, k is the number of distinct keys, B is the transfer block size, and c is a small constant. For moderately sized k, this bound circumvents the Θ((N/B) log_{M/B}(N/B)) I/O lower bound known for general sorting. We show that our algorithm is
optimal by proving a matching lower bound for bundle sorting. The
improved running time of bundle sorting over general sorting can
be significant in practice, as demonstrated by experimentation. An
important feature of the new algorithm is that it is executed
“in-place”, requiring no additional disk space.
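The bundling idea itself can be illustrated in memory as follows; the external-memory distribution passes of the actual algorithm are not modeled in this sketch.

def bundle_sort(items):
    """In-memory illustration of bundle sorting: records are grouped into
    bundles of equal keys, and the bundles are emitted in key order. With k
    distinct keys this costs O(N + k log k) time regardless of how large N
    is; the external-memory algorithm in the paper achieves the analogous
    I/O saving by distributing keys to buckets in O(log_{M/B} k) passes.
    """
    bundles = {}
    for key, payload in items:
        bundles.setdefault(key, []).append(payload)
    ordered = []
    for key in sorted(bundles):                 # only the k distinct keys are sorted
        ordered.extend((key, payload) for payload in bundles[key])
    return ordered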
This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.
Parallel independent disks can enhance the performance of external memory (EM) algorithms, but the programming task is often difficult. In this paper we develop randomized variants of distribution sort for use with parallel independent disks. We propose a simple variant called randomized cycling distribution sort (RCD) and prove that it has optimal expected I/O complexity. The analysis uses a novel reduction to a model with significantly fewer probabilistic interdependencies. Experimental evidence is provided to support its practicality. Other simple variants are also examined experimentally and appear to offer similar advantages to RCD. Based upon ideas in RCD we propose general techniques that transparently simulate algorithms developed for the unrealistic multihead disk model so that they can be run on the realistic parallel disk model. The simulation is optimal for two important classes of algorithms: the class of multipass algorithms, which make a complete pass through their data before accessing any element a second time, and the algorithms based upon the well-known distribution paradigm of EM computation.
Modern computer systems have increasingly complex memory systems. Common machine models for algorithm analysis do not reflect many of the features of these systems, e.g., large register sets, lockup-free caches, cache hierarchies, associativity, cache line fetching, and streaming behavior. Inadequate models lead to poor algorithmic choices and an incomplete understanding of algorithm behavior on real machines.
A key step toward developing better models is to quantify the performance effects of features not reflected in the models. This paper explores the effect of memory system features on sorting performance. We introduce a new cache-conscious sorting algorithm, R-MERGE, which achieves better performance in practice over algorithms that are superior in the theoretical models. R-MERGE is designed to minimize memory stall cycles rather than cache misses by considering features common to many system designs.
Parallel disks promise to be a cost effective means for achieving high bandwidth in applications involving massive data sets, but algorithms for parallel disks can be difficult to devise. To combat this problem, we define a useful and natural duality between writing to parallel disks and the seemingly more difficult problem of prefetching. We first explore this duality for applications involving read-once accesses using parallel disks. We get a simple linear time algorithm for computing optimal prefetch schedules and analyze the efficiency of the resulting schedules for randomly placed data and for arbitrary interleaved accesses to striped sequences. Duality also provides an optimal schedule for the integrated caching and prefetching problem, in which blocks can be accessed multiple times. Another application of this duality gives us the first parallel disk sorting algorithms that are provably optimal up to lower order terms. One of these algorithms is a simple and practical variant of multiway merge sort, addressing a question that has been open for some time.
This paper investigates the problem of high-level querying of multimedia data by imposing arbitrary domain-specific constraints among multimedia objects. We argue that the current structured query model, and the query-by-content model, are insufficient for many important applications, and we propose an alternative query framework that unifies and extends the previous two models. The proposed framework is based on the querying-by-concept paradigm, where the query is expressed simply in terms of concepts, regardless of the complexity of the underlying multimedia search engines. The query-by-concept paradigm was previously illustrated by the CAMEL system. The present paper builds upon and extends that work by adding arbitrary constraints and multiple levels of hierarchy in the concept representation model.
We consider queries simply as descriptions of virtual data sets, and that allows us to use the same unifying concept representation for query specification, as well as for data annotation purposes. We also identify some key issues and challenges presented by the new framework, and we outline possible approaches for overcoming them. In particular, we study the problems of concept representation, extraction, refinement, storage, and matching.
This paper investigates the problem of incremental joins of multiple
ranked data sets when the join condition is a list of arbitrary
user-defined predicates on the input tuples. This problem arises in
many important applications dealing with ordered inputs and multiple
ranked data sets, and requiring the top k solutions. We use
multimedia applications as the motivating examples but the problem is
equally applicable to traditional database applications involving
optimal resource allocation, scheduling, decision making, ranking,
etc.
We propose an algorithm that enables querying of ordered
data sets by imposing arbitrary user-defined join predicates.
The basic version of the algorithm does not use any random access, but a variation can exploit available indexes for efficient random access based on the join predicates. A special case includes
the join scenario considered by Fagin for joins
based on identical keys, and in that case, our algorithms perform as
efficiently as Fagin's. Our main contribution, however, is the
generalization to join scenarios that were previously unsupported,
including cases where random access in the algorithm is not possible
due to lack of unique keys. In addition, the algorithm can support multiple join levels, or nested join hierarchies, which are the norm for modeling multimedia data. We also give ε-approximation
versions of both of the above algorithms. Finally, we give strong
optimality results for some of the proposed algorithms, and we study
their performance empirically.
In this paper we consider aggregate predicates and their support in
database systems. Aggregate predicates are the predicate equivalent
to aggregate functions in that they can be used to search for tuples
that satisfy some aggregate property over a set of tuples (as opposed
to simply computing an aggregate property over a set of tuples). The
importance of aggregate predicates is exemplified by many modern
applications that require ranked search, or top-k queries. Such
queries are the norm in multimedia and spatial databases.
In order to support the concept of aggregate predicates in DBMS, we introduce several extensions in the query language and the database engine. Specifically, we extend the SQL syntax to handle aggregate predicates and work out the semantics of such extensions so that they behave correctly in the existing database model. We also introduce a new rk_SORT operator into the database engine, and study relevant indexing and query optimization issues.
Our approach provides several advantages, including enhanced usability and improved performance. By supporting aggregate predicates natively in the database engine, we are able to reuse existing indexing and query optimization techniques, without sacrificing generality or incurring the runtime overhead of database-external approaches. To the best of our knowledge, the proposed framework is the first to support user-defined indexing with aggregate predicates and search based upon user-defined ranking. We also provide empirical results from a simulation study that validates the effectiveness of our approach.
Parallel disks provide a cost effective way of speeding up
I/Os in applications that work with large amounts of data.
The main challenge is to achieve as much parallelism as
possible, using prefetching to avoid bottlenecks in disk
access.
Efficient algorithms have been developed for some particular
patterns of accessing the disk blocks. In this paper, we
consider general request sequences. When the request
sequence consists of unique block requests, the problem is
called prefetching and is a well-solved problem for
arbitrary request sequences. When the reference sequence can
have repeated references to the same block, we need to
devise an effective caching policy as well. While optimum
offline algorithms have been recently designed for the
problem, in the online case, no effective algorithm was
previously known.
Our main contribution is a deterministic online algorithm threshold-LRU, which achieves an O((MD/L)^{2/3}) competitive ratio, and a randomized online algorithm threshold-MARK, which achieves an O(√((MD/L) log(MD/L))) competitive ratio for the caching/prefetching problem on the parallel disk model (PDM), where D is the number of disks, M is the size of fast memory buffer, and L is the amount of lookahead available in the request sequence. The best-known lower bound on the competitive ratio is Ω(√(MD/L)) for lookahead L in both models. We also show that if the deterministic online algorithm is allowed to have twice the memory of the offline then a tight competitive ratio of Θ(√(MD/L)) can be achieved. This problem generalizes the well-known paging problem on a single disk to the parallel disk model.
The emergence of extensible index structures, e.g., GiST
(Generalized Search Tree) and SP-GiST
(Space-Partitioning Generalized Search Tree),
calls for a set of extensible algorithms to support
different operations (e.g., insertion, deletion, and
search). Extensible bulk operations (e.g., bulk loading and
bulk
insertion) are of the same importance and need to be
supported in these index engines. In this paper, we propose
two extensible buffer-based algorithms for bulk operations
in the class of space-partitioning trees, a class of
hierarchical data structures that recursively decompose the
space into disjoint partitions. The main idea of these
algorithms is to build an in-memory tree of the target
space-partitioning index. Then, data items are recursively
partitioned into disk-based buffers using the in-memory
tree. Although the second algorithm is designed for bulk
insertion, it can be used in bulk loading as well. The
proposed extensible algorithms are implemented inside
SP-GiST, a framework for supporting the class of space-partitioning trees. Both algorithms have I/O bound O((N/B) H), where N is the number of data items to be bulk loaded/inserted, B is the number of tree nodes that can fit in one disk page, and H is the tree height in terms of pages after applying a clustering algorithm. Experimental
results are provided to show the scalability and
applicability of the proposed algorithms for the class of
space-partitioning trees. A comparison of the two proposed
algorithms shows that the first algorithm performs better in
case of bulk loading. However, the second algorithm is more
general and can be used for efficient bulk insertion.
One of the central tasks in managing, monitoring and mining data streams is that of identifying outliers. There is a long history of study of various outliers in statistics and databases, and a recent focus on mining outliers in data streams. Here, we adopt the notion of deviants from Jagadish et al. as outliers. Deviants are based on one of the most fundamental statistical concepts, standard deviation (or variance). Formally, deviants are defined based on a representation sparsity metric, i.e., deviants are values whose removal from the dataset leads to an improved compressed representation of the remaining items. Thus, deviants are not global maxima/minima, but rather these are appropriate local aberrations. Deviants are known to be of great mining value in time series databases. We present the first known algorithms for identifying deviants on massive data streams. Our algorithms monitor streams using very small space (polylogarithmic in data size) and are able to quickly find deviants at any instant, as the data stream evolves over time. For all versions of this problem--univariate vs multivariate time series, optimal vs near-optimal vs heuristic solutions, offline vs streaming--our algorithms have the same framework of maintaining a hierarchical set of candidate deviants that are updated as the time series data gets progressively revealed. We show experimentally using real network traffic data (SNMP aggregate time series) as well as synthetic data that our algorithm is remarkably accurate in determining the deviants.
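For intuition only (this is not the streaming algorithm of the paper), the effect of a deviant can be seen by exhaustively checking which single value's removal most improves a trivial one-bucket, mean-based representation:

def best_single_deviant(values):
    """Toy illustration of the deviant notion: find the one value whose
    removal most reduces the squared error of a single bucket summarized by
    its mean. The paper's algorithms find many deviants over massive streams
    in polylogarithmic space, which this exhaustive in-memory check does not
    attempt."""
    def sq_error(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    base = sq_error(values)
    best_i, best_gain = None, 0.0
    for i in range(len(values)):
        rest = values[:i] + values[i + 1:]
        gain = base - sq_error(rest)
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i, best_gain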
Ranking is an important property that needs to be fully supported by current relational query engines. Recently, several rank-join query operators have been proposed based on rank aggregation algorithms. Rank-join operators progressively rank the join results while performing the join operation. The new operators have a direct impact on traditional query processing and optimization. We introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators. We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. To our knowledge, this paper is the first effort in estimating the needed input size for optimal rank aggregation algorithms. Costing ranking plans, although challenging, is key to the full integration of rank-join operators in real-world query processing engines. We experimentally evaluate our framework by modifying the query optimizer of an open-source database management system. The experiments show the validity of our framework and the accuracy of the proposed estimation model.
It is infeasible for a sensor database to contain the exact value of each sensor at all points in time. This uncertainty is inherent in these systems due to measurement and sampling errors, and resource limitations. In order to avoid drawing erroneous conclusions based upon stale data, the use of uncertainty intervals that model each data item as a range and associated probability density function (pdf) rather than a single value has recently been proposed. Querying these uncertain data introduces imprecision into answers, in the form of probability values that specify the likeliness the answer satisfies the query. These queries are more expensive to evaluate than their traditional counterparts but are guaranteed to be correct and more informative due to the probabilities accompanying the answers. Although the answer probabilities are useful, for many applications, it is only necessary to know whether the probability exceeds a given threshold; we term these Probabilistic Threshold Queries (PTQ). In this paper we address the efficient computation of these types of queries.
In particular, we develop two index structures and associated algorithms to efficiently answer PTQs. The first index scheme is based on the idea of augmenting uncertainty information to an R-tree. We establish the difficulty of this problem by mapping one-dimensional intervals to a two-dimensional space, and show that the problem of interval indexing with probabilities is significantly harder than interval indexing which is considered a well-studied problem. To overcome the limitations of this R-tree based structure, we apply a technique we call variance-based clustering, where data points with similar degrees of uncertainty are clustered together. Our extensive index structure can answer the queries for various kinds of uncertainty pdfs, in an almost optimal sense. We conduct experiments to validate the superior performance of both indexing schemes.
Rank-aware query processing has emerged as a key requirement in modern applications. In these applications, efficient and adaptive evaluation of top-k queries is an integral part of the application semantics. In this article, we introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting physical property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators. We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. To our knowledge, this is the first effort in estimating the needed input size for optimal rank aggregation algorithms. Costing ranking plans is key to the full integration of rank-join operators in real-world query processing engines.
Since optimal execution strategies picked by static query optimizers lose their optimality due to estimation errors and unexpected changes in the computing environment, we introduce several adaptive execution strategies for top-k queries that respond to these unexpected changes and costing errors. Our reactive reoptimization techniques change the execution plan at runtime to significantly enhance the performance of running queries. Since top-k query plans are usually pipelined and maintain a complex ranking state, altering the execution strategy of a running ranking query is an important and challenging task.
We conduct an extensive experimental study to evaluate the performance of the proposed framework. The experimental results are twofold: (1) we show the effectiveness of our cost-based approach of integrating ranking plans in dynamic programming cost-based optimizers; and (2) we show a significant speedup (up to 300%) when using our adaptive execution of ranking plans over the state-of-the-art mid-query reoptimization strategies.
In an uncertain database, each data item is modeled as a range associated with a probability density function. Previous works for this kind of data have focused on simple queries such as range and nearest-neighbor queries. Queries that join multiple relations have not been addressed in earlier work despite the significance of joins in databases. In this paper, we address probabilistic join over uncertain data, essentially a query that augments the results with probability guarantees to indicate the likelihood of each join tuple being part of the result. We extend the notion of join operators, such as equality and inequality, for uncertain data. We also study the performance of probabilistic join. We observe that a user may only need to know whether the probability of the results exceeds a given threshold, instead of the precise probability value. By incorporating this constraint, it is possible to achieve much better performance. In particular, we develop three sets of optimization techniques, namely item-level, page-level and index-level pruning, for different join operators. These techniques facilitate pruning with little space and time overhead, and are easily adapted to most join algorithms. We verify the performance of these techniques experimentally.
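As a minimal illustration of a thresholded probabilistic join predicate (the item-, page-, and index-level pruning of the paper is not modeled), the probability that one uniformly distributed uncertain value is smaller than another can be computed in closed form and compared against the threshold:

def prob_less_than(a_lo, a_hi, b_lo, b_hi):
    """Exact P(A < B) for independent A ~ Uniform(a_lo, a_hi) and
    B ~ Uniform(b_lo, b_hi), via P(A < B) = E[F_A(B)].
    Intervals are assumed non-degenerate."""
    def cdf_a_integral(x):
        # Antiderivative of the CDF of A, i.e. the integral of F_A up to x.
        if x <= a_lo:
            return 0.0
        if x >= a_hi:
            return (a_hi - a_lo) / 2.0 + (x - a_hi)
        return (x - a_lo) ** 2 / (2.0 * (a_hi - a_lo))
    return (cdf_a_integral(b_hi) - cdf_a_integral(b_lo)) / (b_hi - b_lo)

def threshold_join(tuples_r, tuples_s, threshold):
    """Toy thresholded probabilistic join on the predicate R.value < S.value,
    where each uncertain value is an interval with a uniform pdf. Only pairs
    whose probability exceeds the threshold are reported."""
    result = []
    for rid, (r_lo, r_hi) in tuples_r:
        for sid, (s_lo, s_hi) in tuples_s:
            p = prob_less_than(r_lo, r_hi, s_lo, s_hi)
            if p > threshold:
                result.append((rid, sid, p))
    return result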
We present a framework to dynamize succinct data structures, to encourage their use over non-succinct versions in a wide variety of important application areas. Our framework can dynamize most state-of-the-art succinct data structures for dictionaries, ordinal trees, labeled trees, and text collections. Of particular note is its direct application to XML indexing structures that answer subpath queries. Our framework focuses on achieving information-theoretically optimal space along with near-optimal update/query bounds.
As the main part of our work, we consider the following problem central to text indexing: Given a text T over an alphabet Σ, construct a compressed data structure answering the queries char(i), rank_s(i), and select_s(i) for a symbol s ∈ Σ. Many data structures consider these queries for static text T. We build on these results and give the best known query bounds for the dynamic version of this problem, supporting arbitrary insertions and deletions of symbols in T. Specifically, with an amortized update time of O(n^ε), where n is the length of T, any static succinct data structure D for T, taking t(n) time for queries, can be converted by our framework into a dynamic succinct data structure that supports char, rank_s, and select_s queries in O(t(n) + log log n) time, for any constant ε > 0. When |Σ| = polylog(n), we achieve O(1) query times. Our update/query bounds are near-optimal with respect to the lower bounds.
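For readers new to these queries, the semantics of char, rank, and select can be pinned down with a deliberately naive, non-succinct, non-dynamic reference implementation; the class below is illustrative only, and the bit-level structures that achieve the stated bounds are far more involved.

import bisect

class NaiveRankSelect:
    """Reference semantics for the three text queries over a text T:
      char(i)       -- the symbol at position i of T (0-indexed here),
      rank(s, i)    -- number of occurrences of symbol s in T[0..i],
      select(s, j)  -- position of the j-th occurrence of s (1-indexed j).
    A compressed or dynamic structure answers these without storing the
    per-symbol position lists explicitly."""

    def __init__(self, text):
        self.text = text
        self.positions = {}
        for i, s in enumerate(text):
            self.positions.setdefault(s, []).append(i)

    def char(self, i):
        return self.text[i]

    def rank(self, s, i):
        # Number of occurrences of s among T[0..i], via binary search
        # over the stored position list.
        return bisect.bisect_right(self.positions.get(s, []), i)

    def select(self, s, j):
        occ = self.positions.get(s, [])
        return occ[j - 1] if 1 <= j <= len(occ) else None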
We introduce a new variant of the popular Burrows-Wheeler transform (BWT) called Geometric Burrows-Wheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2-dimensional geometry. Using this transform, we can answer many open questions in compressed text indexing: (1) Can compressed data structures be designed in external memory with similar performance as the uncompressed counterparts? (2) Can compressed data structures be designed for position restricted pattern matching? We also introduce a reverse transform, called Points2Text, which converts a set of points into text. This transform allows us to derive the best known lower bounds in compressed text indexing. We show strong equivalence between data structural problems in geometric range searching and text pattern matching. This provides a way to derive new results in compressed text indexing by translating the results from range searching.
Run-Length-Encoding (RLE) is a data compression technique
that is used in various applications, e.g., biological sequence databases, multimedia, and facsimile transmission.
One of the main challenges is how to operate, e.g.,
indexing, searching, and retrieval, on the compressed data
without decompressing it. In this paper, we present the
String B-tree for Compressed sequences, termed the SBC-tree,
for indexing and searching RLE-compressed sequences of
arbitrary length. The SBC-tree is a two-level index
structure based on the well-known String B-tree and a
3-sided range query structure. The SBC-tree supports
substring as well as prefix matching, and range search
operations over RLE-compressed sequences. The SBC-tree has
an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences and B is the disk page size. The insertion and deletion of all suffixes of a compressed sequence of length m take O(m log_B (N + m)) I/O operations. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + (|p| + T)/B) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. We
present also two variants of the SBC-tree: the SBC-tree that
is based on an R-tree instead of the 3-sided structure, and
the one-level SBC-tree that does not use a two-dimensional
index.
These variants do not have provable worst-case theoretical
bounds for search operations, but perform well in practice.
The SBC-tree index is realized inside PostgreSQL in the
context of a biological protein database
application. Performance results illustrate that using the
SBC-tree to index
RLE-compressed sequences achieves up to an order of
magnitude reduction in storage, up to 30% reduction in I/Os
for
the insertion operations, and retains the optimal search
performance achieved by the String B-tree over the uncompressed sequences.
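For readers unfamiliar with the underlying compression, run-length encoding replaces each maximal run of a repeated character by a (character, run length) pair; the SBC-tree indexes sequences stored in this form without expanding them. A two-function sketch:

from itertools import groupby

def rle_encode(sequence):
    """Run-length encode a string: 'AAACCG' -> [('A', 3), ('C', 2), ('G', 1)]."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(sequence)]

def rle_decode(pairs):
    """Inverse transform: expand (character, run length) pairs back to a string."""
    return "".join(ch * length for ch, length in pairs)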
We consider the natural extension of the well-known single disk caching problem to the parallel disk I/O model (PDM) [17]. The main challenge is to achieve as much parallelism as possible and avoid I/O bottlenecks. We are given a fast memory (cache) of size M memory blocks along with a request sequence Σ = b_1, b_2, ..., b_n, where each block b_i resides on one of D disks. In each parallel I/O step, at most one block from each disk can be fetched. The task is to serve Σ in the minimum number of parallel I/Os. Thus, each I/O is analogous to a page fault. The difference here is that during each page fault, up to D blocks can be brought into memory, as long as all of the new blocks entering the memory reside on different disks. The problem has a long history. Note that this problem is non-trivial even if all requests in Σ are unique. This restricted version is called read-once. Despite the progress in the online version and read-once version, the general online problem still remained open. Here, we provide comprehensive results with a full general solution for the problem with asymptotically tight competitive ratios.
To exploit parallelism, any parallel disk algorithm needs a certain amount of lookahead into future requests, and to provide effective caching, an online algorithm must achieve a good competitive ratio. We show a lower bound that states, for lookahead L, that any online algorithm must be Ω(√(MD/L))-competitive. For lookahead L greater than M(1 + 1/ε), where ε is a constant, the tight upper bound of O(√(MD/L)) on competitive ratio is achieved by our algorithm SKEW. The previous algorithm tLRU was O((MD/L)^{2/3}) competitive and this was also shown to be tight for an LRU-based strategy. We achieve the tight ratio using a fairly different strategy than LRU. We also show tight results for randomized algorithms against an oblivious adversary and give an algorithm achieving better bounds in the resource augmentation model.
Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this book we discuss the state of the art in the design and analysis of external memory (or EM) algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory.
For the batched problem of sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss prefetching, distribution, and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices (such as matrix multiplication and transposition), geometric data (such as finding intersections and constructing convex hulls) and graphs (such as list ranking, connected components, topological sorting, and shortest paths). In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide a convenient means in online data structures to make effective use of the data accessed from disk. We also reexamine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length (e.g., text strings), when the internal data representations are compressed, or when the allocated amount of internal memory can change dynamically.
Programming tools and environments are available for simplifying the EM programming task. During the course of the book, we report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than methods currently used in practice.
This book is an expanded version of an earlier survey article.
This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text of length n and a positive integer k, we want to construct an index of the text such that for any input pattern, we can find all its k-error matches efficiently. This problem is well studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies disk pages and finds all k-error matches with I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require I/Os. The second index reduces the space to disk pages, with a correspondingly different I/O complexity.
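For reference, a k-error match of a pattern is an occurrence of a text substring within edit distance k of the pattern. The hedged sketch below checks this definition by straightforward dynamic programming; it is the naive computation that indexes such as the ones above are designed to avoid.

# Brute-force k-error (edit distance) matching: report end positions j in the
# text such that some substring ending at j is within edit distance k of the
# pattern (Sellers-style dynamic programming).
def k_error_matches(text, pattern, k):
    m = len(pattern)
    # prev[j] = edit distance between pattern[:i] and the best substring of
    # text ending at position j (row 0 is all zeros: a match may start anywhere).
    prev = [0] * (len(text) + 1)
    for i in range(1, m + 1):
        cur = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # substitute / match
                         prev[j] + 1,         # delete from pattern
                         cur[j - 1] + 1)      # insert into pattern
        prev = cur
    return [j for j in range(1, len(text) + 1) if prev[j] <= k]

if __name__ == "__main__":
    print(k_error_matches("acgtacgtt", "acgg", 1))  # end positions of 1-error matches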
The past few years have witnessed several exciting results on compressed representations of a string that support efficient pattern matching, and the space complexity has been reduced to a number of bits proportional to the kth-order empirical entropy of the string, where σ denotes the size of the alphabet. In this paper we study compressed representations for another classical string-indexing problem, called dictionary matching in the literature. Precisely, a collection of strings (called patterns) of total length n is to be indexed so that, given a text, the occurrences of the patterns in the text can be found efficiently. We show how to exploit a sampling technique to compress the existing O(n)-word index to an entropy-compressed index, measured in bits, with only a small sacrifice in search time.
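For comparison, the naive solution to dictionary matching (shown below purely for illustration) scans the text and tests every pattern at every position; the indexes discussed above avoid exactly this per-position work.

# Naive dictionary matching: report (position, pattern) pairs for every
# occurrence of any pattern in the text.  Real indexes (Aho-Corasick or the
# compressed structure above) avoid scanning all patterns at each position.
def dictionary_matches(text, patterns):
    hits = []
    for pos in range(len(text)):
        for pat in patterns:
            if text.startswith(pat, pos):
                hits.append((pos, pat))
    return hits

if __name__ == "__main__":
    print(dictionary_matches("abracadabra", ["abra", "cad", "ra"]))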
Current data structures for searching large string collections are limited in that they either fail to achieve minimum space or they cause too many cache misses. In this paper, we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and storable in compressed space. The widely known front-coding scheme is one example of such a linearization; it is at the core of Prefix B-trees and many other disk-conscious compressed indexes for string collections. However, it is largely thought of as a space-effective heuristic without efficient search support.
In this paper, we introduce new insights on front-coding and other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. The second contribution of this paper engineers these linearizations to design a novel dictionary encoding scheme that achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries and obtain a succinct variant, whose space is close to the information-theoretic minimum.
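As an illustration of front coding (a minimal sketch under the assumption of a lexicographically sorted input, not the engineered scheme of the paper), each string is stored as the length of the prefix shared with its predecessor plus the remaining suffix.

# Minimal front-coding of a sorted string collection: store, for each string,
# the length of the longest common prefix with its predecessor and the suffix.
def front_code(sorted_strings):
    coded, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        coded.append((lcp, s[lcp:]))
        prev = s
    return coded

def front_decode(coded):
    out, prev = [], ""
    for lcp, suffix in coded:
        s = prev[:lcp] + suffix
        out.append(s)
        prev = s
    return out

if __name__ == "__main__":
    words = ["tria", "trial", "trie", "tries", "trip"]
    enc = front_code(words)
    print(enc)                          # (0, 'tria'), (4, 'l'), (3, 'e'), (4, 's'), (3, 'p')
    assert front_decode(enc) == words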
We introduce a new variant of the popular Burrows-Wheeler transform (BWT), called the Geometric Burrows-Wheeler Transform (GBWT), which converts a text into a set of points in 2-dimensional geometry. We also introduce a reverse transform, called Points2Text, which converts a set of points into text. Using these two transforms, we show a strong equivalence between data structural problems in geometric range searching and text pattern matching. This allows us to apply the lower bounds known in the field of orthogonal range searching to problems in compressed text indexing. In addition, we give the first succinct (compact) index for I/O-efficient pattern matching in external memory, and we show how this index can be further improved to achieve higher-order entropy-compressed space.
Given a set of strings of total length n, our task is to report the "most relevant" strings for a given query pattern. This involves somewhat more advanced query functionality than usual pattern matching, since some notion of "most relevant" is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be far more than the number of relevant documents required.
The first formal framework to study such retrieval problems was given by Muthukrishnan. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking words of space. We study this problem in a slightly different framework of reporting the top k most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives linear-space data structures with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems considered by Muthukrishnan while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
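For concreteness, the naive baseline below (my sketch, not the paper's linear-space structure) answers a top-k query under the frequency metric by scanning every document, which is precisely the per-occurrence work the framework above avoids.

# Naive top-k document retrieval under the "frequency" relevance metric:
# score each document by the number of occurrences of the pattern and return
# the k highest-scoring documents in sorted order.
def count_occurrences(doc, pattern):
    count, start = 0, 0
    while True:
        idx = doc.find(pattern, start)
        if idx < 0:
            return count
        count += 1
        start = idx + 1   # count overlapping occurrences as well

def top_k_documents(docs, pattern, k):
    scored = [(count_occurrences(d, pattern), i) for i, d in enumerate(docs)]
    scored = [(c, i) for c, i in scored if c > 0]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]    # list of (frequency, document id)

if __name__ == "__main__":
    docs = ["banana band", "bandana", "a banal ban"]
    print(top_k_documents(docs, "ban", 2))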
Background: Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short-read alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than direct sequencing does. Among the many strategies proposed for this alignment process, indexing the reference genome and searching the short reads over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to cope with the massive short reads produced by next-generation high-throughput DNA sequencing technology.
Results: We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new short-read aligner named Ψ-RA (PSI-RA: parallel sparse index read aligner). The motivation for using SSAs is the ability to trade memory against time: the space consumption of the index can be fine-tuned based on the available memory of the machine and the minimum length of the arriving pattern queries. Although SSAs have been studied before for exact matching of short reads, an elegant way of providing approximate matching was missing. We provide this by defining the rightmost-mismatch criterion, which prioritizes errors towards the end of the reads, where errors are more probable. Ψ-RA supports any number of mismatches in aligning reads. We give comparisons with some of the well-known short-read aligners and show that indexing a genome with an SSA is a good alternative to the Burrows-Wheeler transform or seed-based solutions.
Conclusions: Ψ-RA is expected to serve as a valuable tool in the alignment of short reads generated by next-generation high-throughput sequencing technology. Ψ-RA is very fast in exact matching and also supports rightmost approximate matching. The SSA structure that Ψ-RA is built on naturally lends itself to modern multicore architectures, so further speedups can be gained. All the information, including the source code of Ψ-RA, can be downloaded at: http://www.busillis.com/o_kulekci/PSIRA.zip.
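A minimal sketch of the sparse-suffix-array idea follows (an illustration under simplifying assumptions, not the Ψ-RA code): only every s-th suffix is indexed, which divides the index space by roughly s, and a pattern of length at least s is located by searching for each of its s possible tails and verifying the skipped prefix.

# Sparse suffix array sketch: index only suffixes starting at positions that
# are multiples of the sampling step s.  Patterns of length >= s can still be
# found, since some indexed position falls inside every occurrence.
import bisect

def build_ssa(text, s):
    positions = list(range(0, len(text), s))
    positions.sort(key=lambda p: text[p:])           # sort the sampled suffixes
    return positions

def ssa_find(text, ssa, s, pattern):
    hits = set()
    suffixes = [text[p:] for p in ssa]                # materialized for clarity only
    for r in range(min(s, len(pattern))):
        tail = pattern[r:]
        lo = bisect.bisect_left(suffixes, tail)
        hi = bisect.bisect_right(suffixes, tail + "\xff")   # assumes characters below '\xff'
        for p in ssa[lo:hi]:
            start = p - r
            if start >= 0 and text.startswith(pattern, start):
                hits.add(start)
    return sorted(hits)

if __name__ == "__main__":
    text = "acgtacgtgacgt"
    ssa = build_ssa(text, s=2)
    print(ssa_find(text, ssa, 2, "acgt"))             # expect [0, 4, 9]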
In recent years, there has been increasing interest in planted motif search (PMS), with applications to discovering significant segments in biological sequences. However, there has been little discussion of PMS over large alphabets. This paper focuses on motif stem search (MSS), which was recently introduced to search for motifs on large-alphabet inputs. A motif stem is an l-length string with some wildcards. The goal of the MSS problem is to find a set of stems that represents a superset of all (l, d) motifs present in the input sequences, where the superset is expected to be as small as possible. The three main contributions of this paper are as follows: (1) we build the motif stem representation more precisely by using regular expressions; (2) we give a method for generating all possible motif stems without redundant wildcards; and (3) we propose an efficient exact algorithm, called StemFinder, for solving the MSS problem. Compared with previous algorithms, StemFinder runs much faster and is the first to solve the (17, 8), (19, 9), and (21, 10) challenging instances on protein sequences; moreover, StemFinder reports fewer stems, which represent a smaller superset of all (l, d) motifs. StemFinder is freely available at http://sites.google.com/site/feqond/stemfinder.
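To illustrate the regular-expression view of a motif stem (a toy sketch, not StemFinder itself), a stem with wildcard positions can be compiled into a regular expression and tested against the input sequences; the wildcard symbol '*' below is an arbitrary choice.

# A motif stem is a fixed-length string in which some positions are wildcards.
# Writing the stem as a regular expression makes "does this stem occur (up to
# its wildcards) in a sequence?" a standard regex search.
import re

def stem_to_regex(stem, wildcard="*"):
    # Wildcard positions match any single symbol; other symbols match literally.
    return re.compile("".join("." if c == wildcard else re.escape(c) for c in stem))

def sequences_containing(stem, sequences):
    pattern = stem_to_regex(stem)
    return [seq for seq in sequences if pattern.search(seq)]

if __name__ == "__main__":
    seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GGTAYCAKPP"]
    print(sequences_containing("TAY*AK", seqs))   # matches all three sequences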
In this paper we describe compressed indexes that support pattern matching queries for strings with wildcards. We present a data structure that uses a compressed number of bits and reports all occ occurrences of a wildcard string efficiently, with query time depending on the alphabet size σ, the number of alphabet symbols in the query string, and the number of wildcard symbols in the query string. We also present two additional indexes offering different space (in bits) and query-time trade-offs. These are the first data structures for this problem to achieve space bounds of this form, measured in bits.
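The query semantics supported by these indexes can be stated with a naive matcher (illustration only), in which the wildcard symbol, written '?' here, matches any single character.

# Naive matching of a pattern containing wildcard symbols: report every text
# position where the pattern matches, with a wildcard allowed to match any
# single character.
def wildcard_matches(text, pattern, wildcard="?"):
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        if all(pc == wildcard or pc == text[start + i]
               for i, pc in enumerate(pattern)):
            hits.append(start)
    return hits

if __name__ == "__main__":
    print(wildcard_matches("abcabdabe", "ab?"))   # expect [0, 3, 6]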
In this paper, we develop a simple and practical storage scheme for compressed suffix arrays (CSA). Our CSA can be constructed in linear time, and its space usage, in bits, is bounded in terms of the kth-order entropy H_k simultaneously for all admissible k and any fixed constant. We compare the performance of our method with that of two established compressed indexing methods, viz. the FM-index and Sadakane's CSA. Experiments on the Canterbury Corpus and the Pizza&Chili Corpus show significant advantages of our algorithm over the two other indexes in terms of compression and query time. Our storage scheme achieves better performance on all types of data present in these two corpora, except for evenly distributed data, such as DNA. The source code for our CSA is available online.
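As background for the storage scheme (a toy sketch, not the paper's implementation), a compressed suffix array is usually described in terms of the suffix array SA and the function Psi, where Psi[i] is the rank of the suffix starting one position after the suffix of rank i; Psi is compressible because it is increasing within each first-character block.

# Toy compressed-suffix-array ingredients: the suffix array SA and the Psi
# function, Psi[i] = rank of the suffix starting one position after the suffix
# of rank i (indices taken modulo n, with '$' as a unique terminator).
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

def psi_array(text):
    sa = suffix_array(text)
    isa = [0] * len(sa)                  # inverse suffix array
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    n = len(text)
    return sa, [isa[(sa[r] + 1) % n] for r in range(n)]

if __name__ == "__main__":
    text = "mississippi$"
    sa, psi = psi_array(text)
    print(sa)    # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
    print(psi)   # piecewise-increasing runs, one per first character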
Given an array A[1...n] of n distinct elements from the set {1, 2, ..., n}, a range maximum query RMQ(a, b) returns the highest element in A[a...b] along with its position. In this paper, we study a generalization of this classical problem called the Categorical Range Maxima Query (CRMQ) problem, in which each element A[i] in the array has an associated category (color) given by C[i]. A query then asks to report each distinct color c appearing in C[a...b] along with the highest element (and its position) in A[a...b] with color c. Let p_c denote the position of the highest element in A[a...b] with color c. We investigate two variants of this problem: a threshold version and a top-k version. In the threshold version, we need to output only the colors with A[p_c] greater than an input threshold, whereas the top-k variant asks for the k colors with the highest A[p_c] values. In the word-RAM model, we achieve a linear-space structure with O(k) query time that can report colors in sorted order of A[·]. In external memory, we present a data structure that answers queries in optimal O(1 + k/B) I/Os using almost-linear O(n log* n) space, as well as a linear-space data structure with O(log* n + k/B) query I/Os. Here k represents the output size, log* n is the iterated logarithm of n, and B is the block size. CRMQ has applications to document retrieval and categorical range reporting, giving a one-shot framework to obtain improved results in both of these problems. Our results for CRMQ not only improve the existing best known results for three-sided categorical range reporting but also overcome the hurdle of maintaining color uniqueness in the output set.
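The two query variants can be pinned down with a small brute-force sketch (illustration only; the structures above answer the same queries in O(k) time or O(1 + k/B) I/Os).

# Brute-force categorical range maxima: for each distinct color in C[a..b],
# find the maximum value (and its position) in A[a..b] with that color, then
# answer the threshold and top-k variants from those per-color maxima.
def color_maxima(A, C, a, b):
    best = {}                                    # color -> (value, position)
    for i in range(a, b + 1):
        if C[i] not in best or A[i] > best[C[i]][0]:
            best[C[i]] = (A[i], i)
    return best

def threshold_query(A, C, a, b, tau):
    return {c: vp for c, vp in color_maxima(A, C, a, b).items() if vp[0] > tau}

def top_k_query(A, C, a, b, k):
    maxima = color_maxima(A, C, a, b).items()
    return sorted(maxima, key=lambda cv: -cv[1][0])[:k]

if __name__ == "__main__":
    A = [5, 9, 2, 7, 4, 8, 1]
    C = ['r', 'g', 'r', 'b', 'g', 'b', 'r']
    print(threshold_query(A, C, 1, 5, 6))   # colors whose range maximum exceeds 6
    print(top_k_query(A, C, 1, 5, 2))       # two colors with the highest maxima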
Planted motif discovery has been successfully used to locate transcription factor binding sites in dozens of promoter sequences over the past decade. However, not enough work has been done on identifying motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and thereby pose a new challenge: making a good identification in reasonable time. To cater to this need, we propose a new planted motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. In particular, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings in a distributed fashion. Experimental results on simulated data show that (i) MCES is able to identify motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than state-of-the-art motif discovery algorithms such as F-motif and TraverStringsR; and (ii) MCES is able to identify motifs without knowing their lengths, and has better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets. MCES is freely available at http://sites.google.com/site/feqond/mces.
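The mining step can be illustrated in miniature (my simplification; MCES distributes this counting via MapReduce and uses more careful statistics): an l-mer is treated as an emerging substring when its support in the input sequences sufficiently exceeds its support in a background set.

# Count l-mers in a foreground and a background sequence set and report those
# whose foreground support exceeds the background support by a given ratio.
# The threshold and the notion of "support" are simplified for illustration.
from collections import Counter

def lmer_counts(sequences, l):
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            counts[seq[i:i + l]] += 1
    return counts

def emerging_substrings(foreground, background, l, min_ratio=2.0):
    fg, bg = lmer_counts(foreground, l), lmer_counts(background, l)
    return sorted(s for s, c in fg.items()
                  if c >= min_ratio * (bg.get(s, 0) + 1))   # +1 smooths zero counts

if __name__ == "__main__":
    fg = ["ttgacgtcaa", "ccgacgtcgg", "agacgtcatt"]
    bg = ["ttttttttt", "cgcgcgcgc"]
    print(emerging_substrings(fg, bg, l=6))    # e.g., 'gacgtc' stands out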
The planted motif search (PMS) is an important yet challenging problem in computational biology. Pattern-driven PMS algorithms usually use a subset of the input sequences as reference sequences to generate candidate motifs, and they can find all the motifs in the input sequences. However, most of them simply take the first few sequences in the input as reference sequences, without an elaborate selection process, and thus they may exhibit sharp fluctuations in running time, especially for large alphabets.
In this paper, we formulate the reference sequence selection problem and propose a method named RefSelect to solve it quickly by evaluating the number of candidate motifs generated for the reference sequences. RefSelect brings a practical time improvement to state-of-the-art pattern-driven PMS algorithms. Experimental results show that RefSelect (1) makes the tested algorithms solve the PMS problem steadily and efficiently, and (2) in particular, enables them to achieve a speedup of up to about 100.
The proposed algorithm RefSelect can be used to address the execution-time instability exhibited by many pattern-driven PMS algorithms. RefSelect requires a small amount of storage space and is capable of selecting reference sequences efficiently and effectively. A parallel version of RefSelect is also provided for handling large data sets.
Graph similarity search has received considerable attention in many applications, such as bioinformatics, data mining, pattern recognition, and social networks. Existing methods for this problem have limited scalability because of the huge amount of memory they consume when handling very large graph databases with millions or billions of graphs.
In this paper, we study the problem of graph similarity search under the graph edit distance constraint. We present a space-efficient index structure based upon the q-gram tree that incorporates succinct data structures and hybrid encoding to achieve improved query-time performance with minimal space usage. Specifically, our index requires only 5%-15% of the space of the previous state-of-the-art index on the tested data, while at the same time achieving a 2-3 times speedup in query time on small data sets. We also boost the query performance by augmenting the global filter with range search, which allows us to perform a query in a reduced region. In addition, we propose two effective filters that combine degree structures and label structures. Extensive experiments demonstrate that our proposed approach is superior in space usage and competitive in filtering ability compared to the state-of-the-art approaches. To the best of our knowledge, our index is the first in-memory index for this problem that successfully scales to cope with the large dataset of 25 million chemical structure graphs from the PubChem dataset.
In this paper we present some practical modifications of the higher-order entropy-compressed text indexing method of Foschini et al. [6], based upon the Burrows-Wheeler transform and the FM-index. Our method, called FM-Adaptive, applies a wavelet tree to the entire BWT. It partitions each bit vector of the wavelet tree's nodes into blocks and applies hybrid encoding, along with a run-length Gamma code rather than the fixed-length code of [14], to each block, thereby exploiting data-aware compression. FM-Adaptive retains the theoretical performance of previous work and introduces some improvements in practice. At the same time, broad experiments indicate that our index achieves superior performance, especially in terms of compression, in comparison to state-of-the-art indexing techniques. The source code is available online.
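To make the wavelet-tree component concrete, here is a plain pointer-based sketch (without the block partitioning, hybrid encoding, or run-length Gamma coding of FM-Adaptive): the alphabet is halved at each node, one bit is stored per symbol, and rank(c, i) is answered by a single root-to-leaf descent.

# Plain wavelet tree over a string: each node stores a bit vector marking which
# symbols go to the right half of its alphabet; rank(c, i) = number of
# occurrences of c in s[:i] is answered by one descent from the root.
class WaveletTree:
    def __init__(self, s, alphabet=None):
        self.alphabet = sorted(set(s)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None                     # leaf: a single symbol
            return
        mid = len(self.alphabet) // 2
        left_set = set(self.alphabet[:mid])
        self.bits = [0 if c in left_set else 1 for c in s]
        self.left = WaveletTree([c for c in s if c in left_set], self.alphabet[:mid])
        self.right = WaveletTree([c for c in s if c not in left_set], self.alphabet[mid:])

    def rank(self, c, i):
        # Number of occurrences of symbol c among the first i positions.
        if self.bits is None:
            return i
        mid = len(self.alphabet) // 2
        ones = sum(self.bits[:i])                # rank1 on this node's bit vector
        if c in self.alphabet[:mid]:
            return self.left.rank(c, i - ones)
        return self.right.rank(c, ones)

if __name__ == "__main__":
    bwt = "ipssm$pissii"                          # BWT of "mississippi$"
    wt = WaveletTree(bwt)
    print(wt.rank("s", 10))                       # 's' occurs 4 times in bwt[:10]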
Next-generation sequencing technologies generate enormous amounts of short reads, which poses a significant computational challenge for short read alignment. Furthermore, because of sequence polymorphisms in a population, repetitive sequences, and sequencing errors, difficulties remain in correctly aligning all reads. We propose a space-efficient compressed suffix array-based method for short read alignment (CS2A) whose space achieves the high-order empirical entropy of the input string. Unlike BWA, which uses two bits to represent a nucleotide and is thus suited to constant-sized alphabets, our encoding scheme can be applied to strings over any alphabet. In addition, we present approximate pattern matching on the compressed suffix array (CSA) for short read alignment. Our CS2A supports both mismatch and gapped alignments for single-end and paired-end read mapping, and is capable of efficiently aligning short sequencing reads to genome sequences. The experimental results show that CS2A can compete with popular aligners in memory usage and mapping accuracy. The source code is available online.
Many real-world applications, such as bioinformatics, data mining, pattern recognition, and social network analysis, benefit from efficient solutions for the graph similarity search problem. Existing methods have limited scalability when they handle large graph databases, for example those with millions or billions of graphs that cannot fit in main memory. In this paper, we study the problem of graph similarity search under the graph edit distance constraint in external memory. We present an efficient framework for arbitrary q-gram based representations of a graph. Specifically, we propose a q-gram matrix index stored in a hybrid layout in external memory to achieve efficient query processing, by converting the q-gram counting filter into a sparse matrix-vector multiplication (SpMV) problem. Furthermore, we boost the query performance by transforming the global filter into a two-dimensional query rectangle, which allows us to perform a query in a reduced region, significantly reducing the number of query I/Os in practice. Extensive experiments on real datasets confirm that (1) our method can compete with the state-of-the-art in-memory methods in index size and filtering ability, and outperforms them in scalability, coping with the PubChem dataset of 25 million chemical structure graphs; and (2) compared with the popular q-gram-based external inverted index, our external index structure requires far fewer query I/Os on the PubChem dataset.
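The SpMV view of the counting filter can be sketched as follows (a simplification in which q-grams are treated as present or absent and are assumed to be extracted already): row g of a sparse matrix is the q-gram indicator vector of database graph g, and multiplying by the query's indicator vector yields, for every graph at once, the number of distinct q-grams shared with the query.

# Sparse matrix-vector view of a (simplified) q-gram counting filter.
def build_qgram_matrix(db_qgrams):
    vocab = {}                               # q-gram -> column index
    rows = []                                # sparse rows: sets of nonzero columns
    for grams in db_qgrams:
        cols = set()
        for g in grams:
            cols.add(vocab.setdefault(g, len(vocab)))
        rows.append(cols)
    return rows, vocab

def spmv_common_qgrams(rows, vocab, query_qgrams):
    qcols = {vocab[g] for g in set(query_qgrams) if g in vocab}
    # Sparse matrix times a 0/1 query vector: per row, count the nonzero
    # columns that are also set in the query vector.
    return [len(cols & qcols) for cols in rows]

if __name__ == "__main__":
    db = [{"AB", "BC", "CA"}, {"AB", "BD"}, {"CC", "CD"}]
    rows, vocab = build_qgram_matrix(db)
    print(spmv_common_qgrams(rows, vocab, {"AB", "BC", "CD"}))   # [2, 1, 1]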
Classification and recognition of graph data are crucial problems in many fields, such as bioinformatics, chemoinformatics, and data mining. In graph kernel-based classification methods, the similarity among substructures is not fully considered; in addition, poorly discriminative substructures affect the graph classification accuracy. To improve the graph classification accuracy, we propose a feature reduction algorithm based on semantic similarity for graph classification. In the algorithm, we first learn vector representations of subtree patterns using neural language models and then merge semantically similar subtree patterns into a new feature. We then provide a new feature discrimination score to select highly discriminative features. Comprehensive experiments on real datasets demonstrate that the proposed algorithm achieves a significant improvement in classification accuracy over the compared graph classification methods.
The graph edit distance (GED) is a well-established distance measure widely used in many applications, such as bioinformatics, data mining, pattern recognition, and graph classification. However, existing solutions for computing the GED suffer from several drawbacks: large search spaces, excessive memory requirements, and many expensive backtracking calls. In this paper, we present BSS_GED, a novel vertex-based mapping method that calculates the GED in a reduced search space created by identifying invalid and redundant mappings. BSS_GED employs the beam-stack search paradigm, a widely utilized search algorithm in AI, combined with two specially designed heuristics to improve the GED computation, achieving a trade-off between memory utilization and expensive backtracking calls. Through extensive experiments, we demonstrate that BSS_GED is highly efficient on both sparse and dense graphs and outperforms the state-of-the-art methods. Furthermore, we apply BSS_GED to solve the well-investigated graph similarity search problem. The experimental results show that this method is dozens of times faster than state-of-the-art graph similarity search methods.
Keywords: Graph Edit Distance, Reduced Search Space, Beam-stack Search, Heuristics, Graph Similarity Search
The development of next-generation, high-throughput sequencing technologies has dramatically reduced the cost of producing next-generation sequencing (NGS) data, thereby leading to explosive growth in such data.
In this paper, we focus upon the important problem of indexing and searching highly repetitive DNA sequence collections. Given a collection of sequences of the same length, we can represent the collection succinctly, in a number of bits (and with a construction time) that depends on H_k, the kth-order empirical entropy of the sequence used as the reference; on the total number of variations between the reference and the sequences in the collection; and on a small fixed constant. We can restore any substring of a stored sequence and report all occurrences of a query pattern in the collection efficiently. In addition, we propose a method to find the variations between the reference and the sequences in the collection, with which we can build succinct structures to enable fast search. For highly repetitive sequences, experimental results on the tested data demonstrate that the proposed method has significant advantages in space usage and retrieval time over the current state-of-the-art methods. The source code is available online.
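A minimal way to see the reference-plus-variations idea (an illustration restricted to substitutions, not the paper's succinct structures) is to store, for each sequence, its single-character differences from the reference and to restore substrings on demand.

# Represent a collection of near-identical sequences as a reference plus, for
# each sequence, a sparse list of (position, substituted character) variations.
# Substrings are restored by copying from the reference and patching variations
# that fall inside the requested range.
def variations(reference, sequence):
    # Assumes equal-length sequences with substitutions only (no indels).
    return {i: c for i, (r, c) in enumerate(zip(reference, sequence)) if r != c}

def restore(reference, var, start, length):
    chars = list(reference[start:start + length])
    for pos, c in var.items():
        if start <= pos < start + length:
            chars[pos - start] = c
    return "".join(chars)

if __name__ == "__main__":
    ref = "ACGTACGTACGT"
    seq = "ACGAACGTACCT"
    var = variations(ref, seq)
    print(var)                          # {3: 'A', 10: 'C'}
    print(restore(ref, var, 2, 6))      # 'GAACGT' == seq[2:8]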
The Gromov-Hausdorff distance proves to be a useful distance measure between shapes. In order to approximate it for compact subsets of Euclidean space, we look into its relationship with the infimum Hausdorff distance under Euclidean isometries. As is already known in higher dimensions, the one distance cannot be bounded above by a constant factor times the other. In dimension one, however, we prove that such a bound does hold, and we show that the bound is tight. In effect, this gives rise to an efficient algorithm for approximating the Gromov-Hausdorff distance of one-dimensional sets to within a constant factor.
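As a companion illustration (not the paper's algorithm), the sketch below computes the Hausdorff distance between finite subsets of the line and then crudely minimizes it over one-dimensional Euclidean isometries, i.e., translations and reflections, by grid search.

# Hausdorff distance between two finite subsets of the real line, plus a crude
# grid search over 1-D isometries (translation, optionally a reflection) to
# approximate the infimum Hausdorff distance under isometries.
def hausdorff(X, Y):
    d_xy = max(min(abs(x - y) for y in Y) for x in X)
    d_yx = max(min(abs(x - y) for x in X) for y in Y)
    return max(d_xy, d_yx)

def hausdorff_under_isometries(X, Y, step=0.01, radius=2.0):
    best = float("inf")
    for reflect in (False, True):
        Z = [-y for y in Y] if reflect else list(Y)
        base = sum(X) / len(X) - sum(Z) / len(Z)     # align the means first
        steps = int(2 * radius / step) + 1
        for k in range(steps):
            t = base - radius + k * step             # candidate translation
            best = min(best, hausdorff(X, [z + t for z in Z]))
    return best

if __name__ == "__main__":
    X = [0.0, 1.0, 3.0]
    Y = [10.0, 12.0, 13.0]
    print(hausdorff(X, Y))                    # large: the sets are far apart
    print(hausdorff_under_isometries(X, Y))   # near zero: Y is a reflected, shifted copy of X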