We examine a version of the dynamic dictionary problem in which stored items have expiration times and can be removed from the dictionary once they have expired. We show that under several reasonable assumptions about the distribution of the items, hashing with lazy deletion uses little more space than methods that use eager deletion. The simple algorithm suggested by this observation was used in a program for analyzing integrated circuit artwork.
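To make the idea concrete, here is a minimal Python sketch (our own toy, not the paper's implementation) of a hash table with lazy deletion: expired items are physically removed only when their bucket is next touched, so no separate garbage-collection pass is needed. The class name and parameters are illustrative.

```python
import time

class LazyExpiryDict:
    """Chained hash table with lazy deletion: expired items are removed
    only when their bucket is next touched (hypothetical toy class)."""

    def __init__(self, num_buckets=64, clock=time.monotonic):
        self.buckets = [[] for _ in range(num_buckets)]
        self.clock = clock

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def _purge(self, bucket):
        now = self.clock()
        bucket[:] = [e for e in bucket if e[2] > now]  # drop expired entries

    def insert(self, key, value, expires_at):
        bucket = self._bucket(key)
        self._purge(bucket)                 # deletion happens lazily, here
        bucket.append((key, value, expires_at))

    def lookup(self, key):
        bucket = self._bucket(key)
        self._purge(bucket)
        for k, v, _ in bucket:
            if k == key:
                return v
        return None
```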
We answer questions about the distribution of the maximum size of queues and data structures as a function of time. The concept of "maximum" occurs in many issues of resource allocation. We consider several models of growth, including general birth-and-death processes, the M/G/∞ model, and a non-Markovian process (data structure) for processing plane-sweep information in computational geometry, called "hashing with lazy deletion" (HwLD). It has been shown that HwLD is optimal in terms of expected time and dynamic space; our results show that it is also optimal in terms of expected preallocated space, up to a constant factor.
We take two independent and complementary approaches. First, in Section 2, we use a variety of algebraic and analytical techniques to derive exact formulas for the distribution of the maximum queue size in stationary birth-and-death processes and in a nonstationary model related to file histories; the formulas allow numerical evaluation and some asymptotics. In our second approach, in Section 3, we consider the M/G/∞ model (which includes M/M/∞ as a special case) and use techniques from the analysis of algorithms to get optimal big-oh bounds on the expected maximum queue size and on the expected maximum amount of storage used by HwLD in excess of the optimal amount. The techniques appear extendible to other models, such as M/M/1.
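As an illustration of the M/M/∞ quantity being bounded, the following small Python simulation (our own sketch, not from the paper) estimates the maximum queue size over a finite horizon; the stationary queue size is Poisson with mean λ/μ, and the results above say the expected maximum exceeds that mean only by lower-order terms.

```python
import heapq, random

def max_queue_mm_inf(lam=50.0, mu=1.0, horizon=1000.0, seed=1):
    """Simulate an M/M/infinity system and return the maximum number
    of items simultaneously present during [0, horizon]."""
    rng = random.Random(seed)
    t, departures, size, max_size = 0.0, [], 0, 0
    while True:
        t += rng.expovariate(lam)            # next Poisson arrival
        if t > horizon:
            break
        while departures and departures[0] <= t:
            heapq.heappop(departures)        # retire items that have left
            size -= 1
        heapq.heappush(departures, t + rng.expovariate(mu))
        size += 1
        max_size = max(max_size, size)
    return max_size

# Mean occupancy is lam/mu = 50; the observed maximum is only modestly above it.
print(max_queue_mm_inf())
```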
This paper develops two probabilistic methods that allow the analysis of the maximum data structure size encountered during a sequence of insertions and deletions in data structures such as priority queues, dictionaries, linear lists, and symbol tables, and in sweepline structures for geometry and Very-Large-Scale-Integration (VLSI) applications. The notion of the "maximum" is basic to issues of resource preallocation. The methods here are applied to combinatorial models of file histories and probabilistic models, as well as to a non-Markovian process (algorithm) for processing sweepline information in an efficient way, called "hashing with lazy deletion" (HwLD). Expressions are derived for the expected maximum data structure size that are asymptotically exact, that is, correct up to lower-order terms; in several cases of interest the expected value of the maximum size is asymptotically equal to the maximum expected size. This solves several open problems, including longstanding questions in queueing theory. Both of these approaches are robust and rely upon novel applications of techniques from the analysis of algorithms. At a high level, the first method isolates the primary contribution to the maximum and bounds the lesser effects. In the second technique the continuous-time probabilistic model is related to its discrete analog: the maximum slot occupancy in hashing.
Parallel algorithms for several graph and geometric problems are presented, including transitive closure and topological sorting in planar st-graphs, preprocessing planar subdivisions for point location queries, and construction of visibility representations and drawings of planar graphs. Most of these algorithms achieve optimal O(log n) running time using n/log n processors in the EREW PRAM model, n being the number of vertices.
Fractional cascading is a technique designed to allow efficient sequential search in a graph with catalogs of total size n. The search consists of locating a key in the catalogs along a path. In this paper we show how to preprocess a variety of fractional cascaded data structures whose underlying graph is a tree so that searching can be done efficiently in parallel. The preprocessing takes O(log n) time with n/log n processors on an EREW PRAM. For a balanced binary tree, cooperative search along root-to-leaf paths can likewise be done efficiently on a CREW PRAM; both of these time/processor bounds are optimal. The searching in the fractional cascaded data structure can be either explicit, in which the search path is specified before the search starts, or implicit, in which the branching is determined at each node. We apply this technique to a variety of geometric problems, including point location, range search, and segment intersection search.
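For readers unfamiliar with the underlying sequential technique, here is a toy Python sketch of fractional cascading on a path of sorted catalogs: each catalog is augmented with every other element of its successor, so after one binary search each later catalog needs only a bridge hop plus a small local adjustment. This is a simplification of the classic construction, not the parallel algorithm of the paper.

```python
import bisect

def build_cascade(catalogs):
    """Augment each sorted catalog with every other element of the next
    augmented catalog, and record bridge pointers between levels."""
    aug = [list(catalogs[-1])]
    for cat in reversed(catalogs[:-1]):
        aug.insert(0, sorted(set(cat).union(aug[0][::2])))
    bridges = [[bisect.bisect_left(aug[i + 1], v) for v in aug[i]]
               for i in range(len(aug) - 1)]
    return aug, bridges

def cascade_search(key, catalogs, aug, bridges):
    """Successor of `key` in every catalog: one binary search at the top,
    then a bridge hop plus a local walk per level (O(1) amortized in the
    classic structure; the walks here are not tightly bounded)."""
    results = []
    pos = bisect.bisect_left(aug[0], key)
    for i, cat in enumerate(catalogs):
        members, j = set(cat), pos
        while j < len(aug[i]) and aug[i][j] not in members:
            j += 1                                   # skip promoted elements
        results.append(aug[i][j] if j < len(aug[i]) else None)
        if i + 1 < len(catalogs):
            pos = bridges[i][pos] if pos < len(aug[i]) else len(aug[i + 1])
            while pos > 0 and aug[i + 1][pos - 1] >= key:
                pos -= 1                             # realign to key's successor
    return results

catalogs = [[2, 5, 9], [1, 5, 7, 13], [4, 6, 8]]     # sorted catalogs on a path
aug, bridges = build_cascade(catalogs)
print(cascade_search(6, catalogs, aug, bridges))     # [9, 7, 6]
```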
In this paper we present approximation algorithms for median problems in metric spaces and fixed-dimensional Euclidean space. Our algorithms use a new method for transforming an optimal solution of the linear-program relaxation of the k-median problem into a provably good integral solution. This transformation technique is fundamentally different from the methods of randomized and deterministic rounding by Raghavan and the methods proposed in the authors' earlier work in the following way: previous techniques never set variables with zero values in the fractional solution to the value 1. This departure from previous methods is crucial for the success of our algorithms.
In this paper we give a practical and efficient output-sensitive algorithm for constructing the display of a polyhedral terrain. It runs in O((d + n) α(n) log n) time and uses O(n) space, where d is the size of the final display and α(·) is a (very slowly growing) functional inverse of Ackermann's function. Our implementation is especially simple and practical, because we try to take full advantage of the specific geometrical properties of the terrain. The asymptotic speed of our algorithm has been improved upon theoretically by other authors, but at the cost of higher space usage and/or high overhead and complicated code. Our main data structure maintains an implicit representation of the convex hull of a set of points that can be dynamically updated in O(log n) time. It is especially simple and fast in our application since there are no rebalancing operations required in the tree.
Keywords: display, hidden-line elimination, polyhedral terrain, output-sensitive, convex hull.
In this paper, we give new techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory, and we use these techniques to develop optimal and practical algorithms for a number of important large-scale problems in computational geometry. Our algorithms are optimal for a wide range of two-level and hierarchical multilevel memory models, including parallel models. The algorithms are optimal in terms of both I/O cost and internal computation.
Our results are built on four fundamental techniques: distribution
sweeping, a generic method for externalizing plane-sweep
algorithms; persistent B-trees, for which we have both on-line
and off-line methods; batch filtering, a general method for
performing simultaneous external-memory searches in any data
structure that can be modeled as a planar layered dag; and
external marriage-before-conquest, an external-memory analog of the
well-known technique of Kirkpatrick and Seidel. Using these techniques
we are able to solve a very large number of problems in computational
geometry, including batched range queries, 2-d and 3-d convex hull
construction, planar point location, range queries, finding all nearest
neighbors for a set of planar points, rectangle intersection/union
reporting, computing the visibility of segments from a point, performing
ray-shooting queries in constructive solid geometry (CSG) models, as
well as several geometric dominance problems.
These results are significant because large-scale problems involving geometric data are ubiquitous in spatial databases, geographic information systems (GIS), constraint logic programming, object-oriented databases, statistics, virtual reality systems, and graphics. This work represents a significant step, both theoretically and in practice, toward the effective management and manipulation of geometric data in external memory, which is an essential component of these applications.
We examine I/O-efficient data structures that provide indexing support for new data models. The database languages of these models include concepts from constraint programming (e.g., relational tuples are generalized to conjunctions of constraints) and from object-oriented programming (e.g., objects are organized in class hierarchies). Let n be the size of the database, c the number of classes, B the page size on secondary storage, and t the size of the output of a query. (1) Indexing by one attribute in many constraint data models is equivalent to external dynamic interval management, which is a special case of external dynamic 2-dimensional range searching. We present a semi-dynamic data structure for this problem that has worst-case space O(n/B) pages, query I/O time O(log_B n + t/B), and amortized insert I/O time O(log_B n). Note that, for the static version of this problem, this is the first worst-case optimal solution. (2) Indexing by one attribute and by class name in an object-oriented model, where objects are organized as a forest hierarchy of classes, is also a special case of external dynamic 2-dimensional range searching. Based on this observation, we first identify a simple algorithm with good worst-case query, update, and space performance for the class indexing problem. Using the forest structure of the class hierarchy and techniques from the constraint indexing problem, we then improve its query I/O time.
We define a new complexity measure, called object complexity, for hidden-surface elimination algorithms. This model is more appropriate than the standard scene complexity measure used in computational geometry for predicting the performance of these algorithms on current graphics rendering systems.
We also present an algorithm to determine the set of visible windows in 3-D scenes consisting of n isothetic windows. It takes O(n log n) time, which is optimal. The algorithm solves in the object complexity model the same problem that Bern addressed for the standard scene complexity model.
We consider the practical problem of constructing binary space partitions (BSPs) for a set S of n orthogonal, non-intersecting, two-dimensional rectangles in three-dimensional Euclidean space such that the aspect ratio of each rectangle in S is at most α, for some constant α ≥ 1. We present an n·2^O(√log n)-time algorithm to build a binary space partition of size n·2^O(√log n) for S. We also show that if m of the n rectangles in S have aspect ratios greater than α, we can construct a BSP of size n·√m·2^O(√log n) for S in n·√m·2^O(√log n) time. The constants of proportionality in the big-oh terms are linear in log α. We extend these results to cases in which the input contains non-orthogonal or intersecting objects.
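The following Python sketch shows the general shape of such a BSP construction in a simplified two-dimensional setting (axis-aligned rectangles in the plane): prefer "free" cuts that slice no rectangle, and split rectangles that straddle the chosen cut. It illustrates the mechanics only; it does not implement the paper's size guarantees.

```python
def build_bsp(rects, depth=0):
    """BSP for axis-aligned rectangles (xmin, ymin, xmax, ymax) in the
    plane: prefer a 'free' cut crossing no rectangle; otherwise cut at a
    median coordinate and split the rectangles that straddle it."""
    if len(rects) <= 1:
        return ('leaf', rects)
    axis = depth % 2                        # alternate x- and y-cuts
    lo, hi = axis, axis + 2
    coords = sorted({r[lo] for r in rects} | {r[hi] for r in rects})
    cut = next((c for c in coords[1:-1]
                if all(not (r[lo] < c < r[hi]) for r in rects)),
               coords[len(coords) // 2])
    left, right = [], []
    for r in rects:
        if r[hi] <= cut:
            left.append(r)
        elif r[lo] >= cut:
            right.append(r)
        else:                               # straddles the cut: split in two
            a, b = list(r), list(r)
            a[hi], b[lo] = cut, cut
            left.append(tuple(a))
            right.append(tuple(b))
    if not left or not right:               # degenerate cut; stop here
        return ('leaf', rects)
    return ('node', axis, cut,
            build_bsp(left, depth + 1), build_bsp(right, depth + 1))
```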
We present a new approach to designing data structures for the important problem of external-memory range searching in two and three dimensions. We construct data structures for answering range queries in O(log_B N + T/B) I/O operations, where N is the number of points in the data structure, B is the I/O block size, and T is the number of points in the answer to the query. We base our data structures on the novel concept of ε-approximate boundaries, which are manifolds that partition space into regions based on the output size of queries at points within the space.
Our data structures answer a longstanding open problem by providing three-dimensional results comparable to those provided by Sairam and Ramaswamy for the two-dimensional case, though completely new techniques are used. Ours is the first 3-D range search data structure that simultaneously achieves both a base-B logarithmic search overhead (namely, O(log_B N)) and a fully blocked output component (namely, O(T/B)). This gives us an overall I/O complexity extremely close to the well-known lower bound of Ω(log_B N + T/B). The space usage is more than linear by a logarithmic or polylogarithmic factor, depending on the type of range search.
We present a space- and I/O-optimal external-memory data structure for answering stabbing queries on a set of dynamically maintained intervals. Our data structure settles an open problem in databases and I/O algorithms by providing the first optimal external-memory solution to the dynamic interval management problem, which is a special case of 2-dimensional range searching and a central problem for object-oriented and temporal databases and for constraint logic programming. Our data structure simultaneously uses optimal linear space (that is, O(N/B) blocks of disk space) and achieves the optimal O(log_B N + T/B) I/O query bound and O(log_B N) I/O update bound, where N is the number of intervals, B is the I/O block size, and T the number of elements in the answer to a query. Our structure is also the first optimal external data structure for a 2-dimensional range searching problem that has worst-case as opposed to amortized update bounds. Part of the data structure uses a novel balancing technique for efficient worst-case manipulation of balanced trees, which is of independent interest.
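For intuition, a classic internal-memory interval tree answers the same stabbing queries; the paper's contribution is an external-memory counterpart with blocked, worst-case-optimal bounds. A minimal Python sketch of the internal-memory structure:

```python
class IntervalTree:
    """Centered interval tree answering stabbing queries in O(log n + t)
    time in memory; the structure above blocks a comparable decomposition
    onto disk pages with worst-case update bounds."""

    def __init__(self, intervals):
        self.mid = None
        if not intervals:
            return
        pts = sorted(p for iv in intervals for p in iv[:2])
        self.mid = pts[len(pts) // 2]
        here = [iv for iv in intervals if iv[0] <= self.mid <= iv[1]]
        self.by_left = sorted(here)                          # ascending left ends
        self.by_right = sorted(here, key=lambda iv: -iv[1])  # descending right ends
        self.left = IntervalTree([iv for iv in intervals if iv[1] < self.mid])
        self.right = IntervalTree([iv for iv in intervals if iv[0] > self.mid])

    def stab(self, q):
        """All stored intervals (lo, hi) with lo <= q <= hi."""
        if self.mid is None:
            return []
        if q < self.mid:
            return [iv for iv in self.by_left if iv[0] <= q] + self.left.stab(q)
        return [iv for iv in self.by_right if iv[1] >= q] + self.right.stab(q)

tree = IntervalTree([(1, 4), (3, 9), (7, 8)])
print(tree.stab(3.5))   # [(3, 9), (1, 4)]
```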
Large-scale problems involving geometric data arise in numerous settings, and severe communication bottlenecks can arise in solving them. Work is needed in the development of I/O-efficient algorithms, as well as those that effectively utilize hierarchical memory. In order for new algorithms to be implemented efficiently in practice, the machines they run on must support fundamental external-memory operations. We discuss several advantages offered by TPIE (Transparent Parallel I/O Programming Environment) to enable I/O-efficient implementations.
General Terms: Algorithms, Design, Languages, Performance, Theory. Additional Key Words and Phrases: computational geometry, I/O, external memory, secondary memory, communication, disk drive, parallel disks.
We present the first systematic comparison of the performance of algorithms that construct Binary Space Partitions for orthogonal rectangles in three-dimensional Euclidean space. We compare known algorithms with our implementation of a variant of a recent algorithm of Agarwal et al. We show via an empirical study that their algorithm constructs BSPs of near-linear size in practice and performs better than most of the other algorithms in the literature.
We describe the first known algorithm for efficiently maintaining a Binary Space Partition (BSP) for n continuously moving segments in the plane. Under reasonable assumptions on the motion, we show that the total number of times the BSP changes is O(n^2), and that we can update the BSP in O(log n) expected time per change. We also consider the problem of constructing a BSP for n triangles in three-dimensional Euclidean space. We present a randomized algorithm that constructs a BSP of expected size O(n^2) in O(n^2 log^2 n) expected time. We also describe a deterministic algorithm that constructs a BSP of size O((n + k) log n) and height O(log n) in O((n + k) log^2 n) time, where k is the number of intersection points between the edges of the projections of the triangles onto the xy-plane.
For a polyhedral terrain, the contour at z-coordinate h is defined to be the intersection of the plane z = h with the terrain. In this paper, we study the contour-line extraction problem, where we want to preprocess the terrain into a data structure so that given a query z-coordinate h, we can report the h-contour quickly. This problem is central to geographic information systems (GIS), where terrains are often stored as Triangular Irregular Networks (TINs). We present an I/O-optimal algorithm for this problem which stores a terrain with N vertices using O(N/B) blocks, where B is the size of a disk block, so that for any query h, the h-contour can be computed using O(log_B N + K/B) I/O operations, where K denotes the size of the h-contour.
We also present an improved algorithm for a more general problem of blocking bounded-degree planar graphs such as TINs (i.e., storing them on disk so that any graph traversal algorithm can traverse the graph in an I/O-efficient manner). We apply it to two problems that arise in GIS.
We describe a powerful framework for designing efficient batch algorithms for certain large-scale dynamic problems that must be solved using external memory. The class of problems we consider, which we call colorable external-decomposable problems, include rectangle intersection, orthogonal line segment intersection, range searching, and point location. We are particularly interested in these problems in two and higher dimensions. They have numerous applications in geographic information systems (GIS), spatial databases, and VLSI and CAD design. We present simplified algorithms for problems previously solved by more complicated approaches (such as rectangle intersection), and we present efficient algorithms for problems not previously solved in an efficient way (such as point location and higher-dimensional versions of range searching and rectangle intersection).
We give experimental results concerning the running time for our approach applied to the red-blue rectangle intersection problem, which is a key component of the extremely important database operation spatial join. Our algorithm scales well with the problem size, and for large problem sizes it greatly outperforms the well-known sweepline approach.
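A bare-bones version of the sweepline component is easy to state: sweep a vertical line over x, keep the rectangles currently cut by the line in red and blue active sets, and test y-overlap on each insertion. The Python sketch below is an illustrative toy, with linear scans standing in for the efficient structures discussed above:

```python
def red_blue_intersections(red, blue):
    """Sweep a vertical line over x; report every red-blue pair of
    overlapping rectangles (xmin, ymin, xmax, ymax)."""
    events = []                     # (x, kind, color, rect); starts before ends
    for color, rects in (('R', red), ('B', blue)):
        for r in rects:
            events.append((r[0], 0, color, r))
            events.append((r[2], 1, color, r))
    events.sort(key=lambda e: (e[0], e[1]))
    active = {'R': set(), 'B': set()}
    out = []
    for x, kind, color, r in events:
        if kind == 1:
            active[color].discard(r)
            continue
        other = 'B' if color == 'R' else 'R'
        for s in active[other]:
            if r[1] < s[3] and s[1] < r[3]:          # y-intervals overlap
                out.append((r, s) if color == 'R' else (s, r))
        active[color].add(r)
    return out

print(red_blue_intersections([(0, 0, 4, 4)], [(2, 2, 6, 6), (5, 0, 7, 1)]))
# [((0, 0, 4, 4), (2, 2, 6, 6))]
```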
We show how to preprocess a set S of N points in d-dimensional Euclidean space to get an external-memory data structure that efficiently supports linear-constraint queries. Each query is in the form of a linear constraint a·x ≤ b; the data structure must report all the points of S that satisfy the query. (This problem is called halfspace range searching in the computational geometry literature.) Our goal is to minimize the number of disk blocks required to store the data structure and the number of disk accesses (I/Os) required to answer a query. For d = 2 and 3, we present the first near-linear size data structures that can answer linear-constraint queries using an optimal number of I/Os. We also present a linear-size data structure that can answer queries efficiently in the worst case. We combine these two approaches to obtain tradeoffs between space and query time. Finally, we show that some of our techniques extend to higher dimensions.
In this paper, we examine the spatial join problem. In particular, we focus on the case when neither of the inputs is indexed. We present a new algorithm, Scalable Sweep-based Spatial Join (SSSJ), that is based on the distribution-sweeping technique recently proposed in computational geometry, and that is the first to achieve theoretically optimal bounds on internal computation time as well as I/O transfers. We present experimental results based on an efficient implementation of the SSSJ algorithm, and compare it to the state-of-the-art Partition-Based Spatial-Merge (PBSM) algorithm of Patel and DeWitt.
Our SSSJ algorithm performs an initial sorting step along the vertical axis, after which we use the distribution-sweeping technique to partition the input into a number of vertical strips, such that the data in each strip can be efficiently processed by an internal-memory sweepline algorithm. A key observation that allowed us to greatly improve the practical performance of our algorithm is that in most sweepline algorithms not all input data is needed in main memory at the same time. In our initial experiments, we observed that on real-life two-dimensional spatial data sets of size N, the internal-memory sweepline algorithm requires only O(√N) memory space. This behavior (also known as the square-root rule in the VLSI literature) implies that for real-life two-dimensional data sets, we can bypass the vertical partitioning step and directly perform the sweepline algorithm after the initial external sorting step. We implemented SSSJ such that partitioning is only done when it is detected that the sweepline algorithm exhausts the internal memory. This results in an algorithm that not only is extremely efficient for real-life data but also offers guaranteed worst-case bounds and predictable behavior on skewed and/or bad input data: our experiments show that SSSJ performs considerably better than PBSM on real-life data sets, and that it robustly handles skewed data on which PBSM suffers a serious performance degeneration.
As part of our experimental work we experimented with a number of different techniques for performing the internal sweepline. By using an efficient partitioning heuristic, we were able to speed up the internal sweeping used by PBSM considerably on average for real-life data sets. The resulting improved PBSM then performs modestly better than SSSJ on the real-life data we used, and it is thus a good choice of algorithm when the data is known not to be too skewed.
In this paper, we develop a simple technique for constructing a Binary Space Partition (BSP) for a set of orthogonal rectangles in three dimensions. Our algorithm has the novel feature that it tunes its performance to the geometric properties of the rectangles, e.g., their aspect ratios.
We have implemented our algorithm and tested its performance on real data sets. We have also systematically compared the performance of our algorithm with that of other techniques presented in the literature. Our studies show that our algorithm constructs BSPs of near-linear size and small height in practice, has fast running times, and answers queries efficiently. It is a method of choice for constructing BSPs for orthogonal rectangles.
In recent years there has been an upsurge of interest in spatial databases. A major issue is how to efficiently manipulate massive amounts of spatial data stored on disk in multidimensional spatial indexes (data structures). Construction of spatial indexes (bulk loading) has been researched intensively in the database community. The continuous arrival of massive amounts of new data makes it important to efficiently update existing indexes (bulk updating).
In this article we present a simple technique for performing bulk update and query operations on multidimensional indexes. We present our technique in terms of the so-called R-tree and its variants, as they have emerged as practically efficient indexing methods for spatial data. Our method uses ideas from the buffer tree lazy buffering technique and fully utilizes the available internal memory and the page size of the operating system. We give a theoretical analysis of our technique, showing that it is efficient in terms of I/O communication, disk storage, and internal computation time. We also present the results of an extensive set of experiments showing that in practice our approach performs better than the previously best known bulk update methods with respect to update time, and that it produces a better quality index in terms of query performance. One important novel feature of our technique is that in most cases it allows us to perform a batch of updates and queries simultaneously. To be able to do so is essential in environments where queries have to be answered even while the index is being updated and reorganized.
We present an efficient external-memory dynamic data structure for point location in monotone planar subdivisions. Our data structure uses O(N/B) disk blocks to store a monotone subdivision of size N, where B is the size of a disk block. It supports queries in O(log_B^2 N) I/Os (worst-case) and updates in O(log_B^2 N) I/Os (amortized).
We also propose a new variant of B-trees, called level-balanced B-trees, which allow insert, delete, merge, and split operations in O((1 + (b/B) log_{M/B} (N/B)) log_b N) I/Os (amortized), 2 ≤ b ≤ B/2, even if each node stores a pointer to its parent. Here M is the size of main memory. Besides being essential to our point-location data structure, we believe that level-balanced B-trees are of significant independent interest. They can, for example, be used to dynamically maintain a planar st-graph using O((1 + (b/B) log_{M/B} (N/B)) log_b N) = O(log_B^2 N) I/Os (amortized) per update, so that reachability queries can be answered in O(log_B N) I/Os (worst case).
In this paper we consider the problem of constructing planar orthogonal grid drawings (or more simply, layouts) of graphs, with the goal of minimizing the number of bends along the edges. We present optimal parallel algorithms that construct graph layouts with O(n) maximum edge length, O(n^2) area, and at most 2.4n + 2 bends (for biconnected graphs) and 2.4n + 4 bends (for simply connected graphs). All three of these quality measures for the layouts are optimal in the worst case for biconnected graphs. The algorithm runs on a CREW PRAM in O(log n) time with n/log n processors, thus achieving optimal time and processor utilization. Applications include VLSI layout, graph drawing, and wireless communication.
In this paper we settle several longstanding open problems in theory of indexability and external orthogonal range searching. In the first part of the paper, we apply the theory of indexability to the problem of two-dimensional range searching. We show that the special case of 3-sided querying can be solved with constant redundancy and access overhead. From this, we derive indexing schemes for general 4-sided range queries that exhibit an optimal tradeoff between redundancy and access overhead.
In the second part of the paper, we develop dynamic external-memory data structures for the two query types. Our structure for 3-sided queries occupies O(N/B) disk blocks, and it supports insertions and deletions in O(log_B N) I/Os and queries in O(log_B N + T/B) I/Os, where B is the disk block size, N is the number of points, and T is the query output size. These bounds are optimal. Our structure for general (4-sided) range searching occupies O((N/B) log N / log log_B N) disk blocks and answers queries in O(log_B N + T/B) I/Os, which is optimal. It also supports updates in O((log_B N) log N / log log_B N) I/Os.
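The internal-memory ancestor of the 3-sided structure is the priority search tree: a max-heap on y layered over a balanced split on x. The following Python sketch (ours, for intuition; the paper's structure is its external-memory, blocked analogue) answers queries x1 <= x <= x2, y >= y0:

```python
class PrioritySearchTree:
    """Static priority search tree: each node holds the max-y point of
    its subtree; the rest of the points are split at a median x."""

    def __init__(self, points):
        self.root = self._build(sorted(points))

    def _build(self, pts):                  # pts sorted by x
        if not pts:
            return None
        i = max(range(len(pts)), key=lambda j: pts[j][1])
        top, rest = pts[i], pts[:i] + pts[i + 1:]
        mid = len(rest) // 2
        xsplit = rest[mid][0] if rest else None
        return (top, xsplit,
                self._build(rest[:mid]), self._build(rest[mid:]))

    def query(self, x1, x2, y0):
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            top, xsplit, left, right = node
            if top[1] < y0:
                continue                    # max-heap on y prunes the subtree
            if x1 <= top[0] <= x2:
                out.append(top)
            if xsplit is not None:
                if x1 <= xsplit:
                    stack.append(left)
                if x2 >= xsplit:
                    stack.append(right)
        return out

pst = PrioritySearchTree([(1, 5), (2, 1), (4, 7), (6, 3), (8, 6)])
print(pst.query(2, 7, 3))   # [(4, 7), (6, 3)] in some order
```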
This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.
Slides for ICALP '99 talk (gzip-compressed postscript)
The data sets for many of today's computer applications are too large to fit within the computer's internal memory and must instead be stored on external storage devices such as disks. A major performance bottleneck can be the input/output communication (or I/O) between the external and internal memories. In this paper we discuss a variety of online data structures for external memory, some very old and some very new, such as hashing (for dictionaries), B-trees (for dictionaries and 1-D range search), buffer trees (for batched dynamic problems), interval trees with weight-balanced B-trees (for stabbing queries), priority search trees (for 3-sided 2-D range search), and R-trees and other spatial structures. We also discuss several open problems along the way.
Most spatial join algorithms either assume the existence of a spatial index structure that is traversed during the join process, or solve the problem by sorting, partitioning, or on-the-fly index construction. In this paper, we develop a simple plane-sweeping algorithm that unifies the index-based and non-index based approaches. This algorithm processes indexed as well as non-indexed inputs, extends naturally to multi-way joins, and can be built easily from a few standard operations. We present the results of a comparative study of the new algorithm with several index-based and non-index based spatial join algorithms. We consider a number of factors, including the relative performance of CPU and disk, the quality of the spatial indexes, and the sizes of the input relations. An important conclusion from our work is that using an index-based approach whenever indexes are available does not always lead to the best execution time, and hence we propose the use of a simple cost model to decide when to follow an index-based approach.
The potential and use of Geographic Information Systems (GIS) is rapidly increasing due to the increasing availability of massive amounts of geospatial data from projects like NASA's Mission to Planet Earth. However, the use of these massive datasets also exposes scalability problems with existing GIS algorithms. These scalability problems are mainly due to the fact that most GIS algorithms have been designed to minimize internal computation time, while I/O communication often is the bottleneck when processing massive amounts of data.
In this paper, we consider I/O-efficient algorithms for problems on grid-based terrains. Detailed grid-based terrain data is rapidly becoming available for much of the earth's surface. We describe O(sort(N))-I/O algorithms for several problems on √N by √N grids for which only O(N)-I/O algorithms were previously known, where N is the number of grid cells and sort(N) = Θ((N/B) log_{M/B} (N/B)) is the number of I/Os needed to sort N elements. Here M denotes the size of the main memory and B the size of a disk block.
We demonstrate the practical merits of our work by comparing the empirical performance of our new algorithm for the flow accumulation problem with that of the previously best known algorithm. Flow accumulation, which models flow of water through a terrain, is one of the most basic hydrologic attributes of a terrain. We present the results of an extensive set of experiments on real-life terrain datasets of different sizes and characteristics. Our experiments show that while our new algorithm scales nicely with dataset size, the previously known algorithm “breaks down” once the size of the dataset becomes bigger than the available main memory. For example, while our algorithm computes the flow accumulation for the Appalachian Mountains in about three hours, the previously known algorithm takes several weeks.
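To fix ideas, here is a compact in-memory version of single-direction flow accumulation (a D8-style sketch of ours, not the paper's I/O-efficient algorithm): cells are processed from highest to lowest, each passing its accumulated flow to its steepest strictly lower neighbor. It is exactly this kind of computation that "breaks down" once the grid exceeds main memory:

```python
import numpy as np

def flow_accumulation(height):
    """Each cell passes its accumulated flow to its steepest strictly
    lower neighbor, processing cells from highest to lowest.
    (Flats and pits are left unrouted in this toy version.)"""
    rows, cols = height.shape
    acc = np.ones_like(height, dtype=np.int64)       # every cell contributes 1
    order = sorted(((height[r, c], r, c)
                    for r in range(rows) for c in range(cols)), reverse=True)
    for h, r, c in order:
        best, target = 0.0, None
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                    if h - height[rr, cc] > best:
                        best, target = h - height[rr, cc], (rr, cc)
        if target is not None:
            acc[target] += acc[r, c]
    return acc

h = np.array([[3., 2.],
              [2., 1.]])
print(flow_accumulation(h))   # [[1 1] [1 4]]: all flow drains to the low corner
```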
The problem of content-based image searching has received considerable attention in the last few years. Thousands of images are now available on the internet, and many important applications require searching of images in domains such as E-commerce, medical imaging, weather prediction, satellite imagery, and so on. Yet content-based image querying is still largely unestablished as a mainstream field and is not widely used by search engines. We believe that two of the major hurdles behind this poor acceptance are poor retrieval quality and poor usability.
In this paper, we introduce the CAMEL system--an acronym for Concept Annotated iMagE Libraries--as an effort to address both of the above problems. The CAMEL system provides an easy-to-use, and yet powerful, text-only query interface, which allows users to search for images based on visual concepts, identified by specifying relevant keywords. Conceptually, CAMEL annotates images with the visual concepts that are relevant to them. In practice, CAMEL defines visual concepts by looking at sample images off-line and extracting their relevant visual features. Once defined, such visual concepts can be used to search for relevant images on the fly, using content-based search methods. The visual concepts are stored in a Concept Library and are represented by an associated set of wavelet features, which in our implementation were extracted by the WALRUS image querying system. Even though the CAMEL framework applies independently of the underlying query engine, for our prototype we have chosen WALRUS as a back-end, due to its ability to extract and query with image region features.
CAMEL improves retrieval quality because it allows experts to build very accurate representations of visual concepts that can be used even by novice users. At the same time, CAMEL improves usability by supporting the familiar text-only interface currently used by most search engines on the web. Both improvements represent a departure from traditional approaches to improving image query systems--instead of focusing on query execution, we emphasize query specification by allowing simpler and yet more precise query specification.
Slides for a talk (Adobe pdf format)
This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.
It is infeasible for a sensor database to contain the exact value of each sensor at all points in time. This uncertainty is inherent in these systems due to measurement and sampling errors, and resource limitations. In order to avoid drawing erroneous conclusions based upon stale data, the use of uncertainty intervals that model each data item as a range and associated probability density function (pdf) rather than a single value has recently been proposed. Querying these uncertain data introduces imprecision into answers, in the form of probability values that specify the likeliness the answer satisfies the query. These queries are more expensive to evaluate than their traditional counterparts but are guaranteed to be correct and more informative due to the probabilities accompanying the answers. Although the answer probabilities are useful, for many applications, it is only necessary to know whether the probability exceeds a given threshold; we term these Probabilistic Threshold Queries (PTQ). In this paper we address the efficient computation of these types of queries.
In particular, we develop two index structures and associated algorithms to efficiently answer PTQs. The first index scheme is based on the idea of augmenting an R-tree with uncertainty information. We establish the difficulty of this problem by mapping one-dimensional intervals to a two-dimensional space, and show that the problem of interval indexing with probabilities is significantly harder than interval indexing, which is considered a well-studied problem. To overcome the limitations of this R-tree-based structure, we apply a technique we call variance-based clustering, where data points with similar degrees of uncertainty are clustered together. The resulting index structure can answer the queries for various kinds of uncertainty pdfs in an almost optimal sense. We conduct experiments to validate the superior performance of both indexing schemes.
In an uncertain database, each data item is modeled as a range associated with a probability density function. Previous work on this kind of data has focused on simple queries such as range and nearest-neighbor queries. Queries that join multiple relations have not been addressed in earlier work despite the significance of joins in databases. In this paper, we address probabilistic join over uncertain data, essentially a query that augments the results with probability guarantees to indicate the likelihood of each join tuple being part of the result. We extend the notion of join operators, such as equality and inequality, for uncertain data. We also study the performance of probabilistic join. We observe that a user may only need to know whether the probability of the results exceeds a given threshold, instead of the precise probability value. By incorporating this constraint, it is possible to achieve much better performance. In particular, we develop three sets of optimization techniques, namely item-level, page-level, and index-level pruning, for different join operators. These techniques facilitate pruning with little space and time overhead, and are easily adapted to most join algorithms. We verify the performance of these techniques experimentally.
We introduce a new variant of the popular Burrows-Wheeler transform (BWT) called the Geometric Burrows-Wheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2-dimensional geometry. Using this transform, we can answer many open questions in compressed text indexing: (1) Can compressed data structures be designed in external memory with similar performance as the uncompressed counterparts? (2) Can compressed data structures be designed for position-restricted pattern matching? We also introduce a reverse transform, called Points2Text, which converts a set of points into text. This transform allows us to derive the best known lower bounds in compressed text indexing. We show strong equivalence between data structural problems in geometric range searching and text pattern matching. This provides a way to derive new results in compressed text indexing by translating the results from range searching.
Slides for a talk (Adobe pdf format)
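One instance of the text-to-points idea is position-restricted pattern matching: pairing each suffix-array rank with its text position turns "find occurrences of P starting in [lo, hi]" into an orthogonal range query. The toy Python sketch below uses a quadratic-time suffix-array build and a plain scan where a real index would use an I/O-efficient range structure; it is illustrative only, not the GBWT construction itself.

```python
import bisect

def build_suffix_array(text):
    """Toy O(n^2 log n) construction; real systems use compressed builds."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences_in_window(text, sa, pattern, lo, hi):
    """Occurrences of `pattern` starting in text positions [lo, hi].
    The suffix-array interval is the x-range and [lo, hi] the y-range of
    a 2-d query over the points (suffix rank, position); the scan below
    stands in for an I/O-efficient range structure."""
    m = len(pattern)
    keys = [text[i:i + m] for i in sa]           # sorted length-m prefixes
    sp = bisect.bisect_left(keys, pattern)
    ep = bisect.bisect_right(keys, pattern)
    return sorted(p for p in sa[sp:ep] if lo <= p <= hi)

text = "mississippi"
sa = build_suffix_array(text)
print(occurrences_in_window(text, sa, "si", 0, 10))   # [3, 6]
```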
Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this book we discuss the state of the art in the design and analysis of external memory (or EM) algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory.
For the batched problem of sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss prefetching, distribution, and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices (such as matrix multiplication and transposition), geometric data (such as finding intersections and constructing convex hulls) and graphs (such as list ranking, connected components, topological sorting, and shortest paths). In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide a convenient means in online data structures to make effective use of the data accessed from disk. We also reexamine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length (e.g., text strings), when the internal data representations are compressed, or when the allocated amount of internal memory can change dynamically.
Programming tools and environments are available for simplifying the EM programming task. During the course of the book, we report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than methods currently used in practice.
This book is an expanded version of an earlier survey article.
We introduce a new variant of the popular Burrows-Wheeler transform (BWT), called the Geometric Burrows-Wheeler Transform (GBWT), which converts a text into a set of points in 2-dimensional geometry. We also introduce a reverse transform, called Points2Text, which converts a set of points into text. Using these two transforms, we show strong equivalence between data structural problems in geometric range searching and text pattern matching. This allows us to apply the lower bounds known in the field of orthogonal range searching to the problems in compressed text indexing. In addition, we give the first succinct (compact) index for I/O-efficient pattern matching in external memory, and show how this index can be further improved to achieve higher-order entropy compressed space.
Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are the suffix trees and the suffix arrays, and the most popular external-memory index structure is the string B-tree. However, the practical applicability of these indexes has been limited mainly because of their space consumption and I/O issues. These structures use a lot more space (almost 20 to 50 times more) than the original text data and are often disk-resident.
Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al. (2008) presented a compact text index in the external memory based on the concept of the Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be hard to obtain a good index structure in the external memory.
In this paper, we investigate this issue from a practical point of view. On the positive side we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves space by about an order of magnitude as compared to the standard String B-tree. While saving space, these structures also maintain a comparable I/O efficiency to that of String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures.
Given a set D of d strings of total length n, our task is to report the "most relevant" strings for a given query pattern P. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of "most relevant" is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be a lot more than the required relevant documents.
The first formal framework to study such retrieval problems was given by Muthukrishnan. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking O(n log n) words of space. We study this problem in a slightly different framework of reporting the top k most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives a linear-space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems considered by Muthukrishnan while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
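A naive baseline clarifies the problem statement: score every document by, say, pattern frequency and keep the k best. The point of the framework above is to avoid this full scan over all occurrences. A small Python sketch, with hypothetical toy data:

```python
from collections import Counter

def topk_relevant(docs, pattern, k):
    """Score each document by occurrence count of `pattern` (allowing
    overlaps) and return the k best; data and names are illustrative."""
    score = Counter()
    for d, text in enumerate(docs):
        pos = text.find(pattern)
        while pos != -1:
            score[d] += 1
            pos = text.find(pattern, pos + 1)
    return score.most_common(k)

docs = ["banana band", "bandana", "cabana"]
print(topk_relevant(docs, "ana", 2))   # [(0, 2), (1, 1)]
```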
Slides for CPM '10 keynote talk (Adobe pdf)
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
Background: Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than does direct sequencing. Among many strategies proposed for this alignment process, indexing the reference genome and short read searching over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to catch the massive short reads produced by the next generation high-throughput DNA sequencing technology.
Results: We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new short read aligner named Ψ-RA (PSI-RA: parallel sparse index read aligner). The motivation in using SSAs is the ability to trade memory against time. It is possible to fine-tune the space consumption of the index based on the available memory of the machine and the minimum length of the arriving pattern queries. Although SSAs have been studied before for exact matching of short reads, an elegant way of providing approximate matching capability was missing. We provide this by defining the rightmost mismatch criteria that prioritize the errors towards the end of the reads, where errors are more probable. Ψ-RA supports any number of mismatches in aligning reads. We give comparisons with some of the well-known short read aligners, and show that indexing a genome with SSA is a good alternative to the Burrows-Wheeler transform or seed-based solutions.
Conclusions: Ψ-RA is expected to serve as a valuable tool in the alignment of short reads generated by the next generation high-throughput sequencing technology. Ψ-RA is very fast in exact matching and also supports rightmost approximate matching. The SSA structure that Ψ-RA is built on naturally incorporates the modern multicore architecture and thus further speed-up can be gained. All the information, including the source code of Ψ-RA, can be downloaded at http://www.busillis.com/o_kulekci/PSIRA.zip.
The wavelet tree data structure is a space-efficient technique
for rank and select queries that generalizes from binary characters to
an arbitrary multicharacter alphabet. It has become a key tool in
modern full-text indexing and data compression because of its
capabilities in compressing, indexing, and searching. We present a
comparative study of its practical performance regarding a wide range
of options on the dimensions of different coding schemes and tree
shapes. Our results are both theoretical and experimental:
(1) We show that the run-length coding size of wavelet trees achieves the 0-order empirical entropy size of the original string with leading constant 1, when the string's 0-order empirical entropy is asymptotically less than the logarithm of the alphabet size. This result complements previous work dedicated to analyzing run-length δ-encoded wavelet trees, and it reveals the scenarios in which run-length δ encoding becomes practical.
(2) We introduce a full generic package of wavelet trees for a wide
range of options on the dimensions of coding schemes and tree shapes.
Our experimental study reveals the practical performance of the
various modifications.
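For concreteness, here is a minimal pointer-based wavelet tree in Python supporting rank queries; it uses plain prefix-count arrays where the practical variants studied above would use compressed (e.g., run-length encoded) bitvectors and different tree shapes:

```python
class WaveletTree:
    """Pointer-based wavelet tree over integer symbols; rank(c, i) counts
    occurrences of c among the first i symbols. Plain Python lists stand
    in for the (possibly compressed) bitvectors studied above."""

    def __init__(self, s, lo=None, hi=None):
        if lo is None:                       # root call: s must be nonempty
            lo, hi = min(s), max(s)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not s:
            return
        mid = (lo + hi) // 2
        lefts, rights, ones = [], [], [0]
        for c in s:
            (rights if c > mid else lefts).append(c)
            ones.append(ones[-1] + (c > mid))
        self.ones = ones                     # prefix counts of 1-bits
        self.left = WaveletTree(lefts, lo, mid)
        self.right = WaveletTree(rights, mid + 1, hi)

    def rank(self, c, i):
        if self.left is None:                # leaf: one symbol (or empty)
            return i if c == self.lo else 0
        mid = (self.lo + self.hi) // 2
        if c <= mid:
            return self.left.rank(c, i - self.ones[i])
        return self.right.rank(c, self.ones[i])

wt = WaveletTree([1, 0, 3, 1, 2, 1, 0])
print(wt.rank(1, 6))   # 3: symbol 1 occurs three times in the first six
```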
Let T = T_1 φ T_2 φ ... φ T_d be a text of total length n, where the characters of each T_i are chosen from an alphabet Σ of size σ, and φ denotes a wildcard symbol. The text indexing with wildcards problem is to index T such that when we are given a query pattern P, we can locate the occurrences of P in T efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNPs can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only n H_k + o(n log σ) bits of space, where H_k is the kth-order empirical entropy (k = o(log_σ n)) of T.
Given an array A[1...n] of n distinct elements from the set {1, 2, ..., n}, a range maximum query RMQ(a, b) returns the highest element in A[a...b] along with its position. In this paper, we study a generalization of this classical problem called the Categorical Range Maxima Query (CRMQ) problem, in which each element A[i] in the array has an associated category (color) given by C[i]. A query then asks to report each distinct color c appearing in C[a...b] along with the highest element (and its position) in A[a...b] with color c. Let p_c denote the position of the highest element in A[a...b] with color c. We investigate two variants of this problem: a threshold version and a top-k version. In the threshold version, we only need to output the colors with A[p_c] more than the input threshold τ, whereas the top-k variant asks for the k colors with the highest A[p_c] values. In the word RAM model, we achieve a linear-space structure with O(k) query time that can report colors in sorted order of A[·]. In external memory, we present a data structure that answers queries in optimal O(1 + k/B) I/Os using almost-linear O(n log* n) space, as well as a linear-space data structure with O(log* n + k/B) query I/Os. Here k represents the output size, log* n is the iterated logarithm of n, and B is the block size. CRMQ has applications to document retrieval and categorical range reporting: it gives a one-shot framework to obtain improved results in both of these problems. Our results for CRMQ not only improve the existing best known results for three-sided categorical range reporting but also overcome the hurdle of maintaining color uniqueness in the output set.
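A simple way to see why decreasing-order enumeration solves both variants: split the query range at its maximum with a max-heap; the first time each color surfaces, that element is the color's maximum in [a, b]. The Python sketch below does this with a naive linear-time argmax where the linear-space structure above would use constant-time RMQ; names and data are illustrative. (For the top-k variant, stop after k distinct colors instead of at the threshold.)

```python
import heapq

def categorical_threshold_rmq(A, C, a, b, tau):
    """Report (color, max value, position) for every color in C[a..b]
    whose maximum in A[a..b] exceeds tau, in decreasing order of value.
    A naive argmax stands in for a constant-time RMQ structure."""
    def argmax(l, r):
        return max(range(l, r + 1), key=A.__getitem__)

    seen, out, heap = set(), [], []
    p = argmax(a, b)
    heapq.heappush(heap, (-A[p], p, a, b))
    while heap:
        neg, p, l, r = heapq.heappop(heap)
        if -neg <= tau:
            break                            # all remaining values are smaller
        if C[p] not in seen:                 # first surfacing = color's maximum
            seen.add(C[p])
            out.append((C[p], A[p], p))
        if l < p:
            q = argmax(l, p - 1)
            heapq.heappush(heap, (-A[q], q, l, p - 1))
        if p < r:
            q = argmax(p + 1, r)
            heapq.heappush(heap, (-A[q], q, p + 1, r))
    return out

A = [4, 1, 9, 5, 7, 3]
C = ['r', 'g', 'b', 'r', 'g', 'b']
print(categorical_threshold_rmq(A, C, 1, 5, 2))
# [('b', 9, 2), ('g', 7, 4), ('r', 5, 3)]
```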
Chien et al. [1, 2] introduced the geometric Burrows-Wheeler transform (GBWT) as the first succinct text index for I/O-efficient pattern matching in external memory; it operates by transforming a text T into a point set P in the two-dimensional plane. In this paper we introduce a practical succinct external-memory text index, called mKD-GBWT. We partition P into σ^2 subregions by partitioning the x-axis into σ intervals using the suffix ranges of the characters of T and partitioning the y-axis into σ intervals using the characters of T, where σ is the alphabet size of T. In this way, we can represent a point using fewer bits and perform a query in a reduced region so as to improve the space usage and I/Os of GBWT in practice. In addition, we plug a crit-bit tree into each node of a string B-tree to represent the variable-length strings stored there. Experimental results show that mKD-GBWT provides significant improvement in space usage compared with the state-of-the-art indexing techniques. The source code is available online [3].
The Gromov-Hausdorff distance (d_GH) proves to be a useful distance measure between shapes. In order to approximate d_GH for compact subsets X, Y ⊂ R^d, we look into its relationship with d_{H,iso}, the infimum Hausdorff distance under Euclidean isometries. As already known for dimension d ≥ 2, d_{H,iso} cannot be bounded above by a constant factor times d_GH. For d = 1, however, we prove that d_{H,iso} ≤ (5/4) d_GH. We also show that the bound is tight. In effect, this gives rise to an O(n log n)-time algorithm to approximate d_GH with an approximation factor of 5/4.