 
 
 
 
 
   
We introduce and analyze a new one-pass algorithm for constructing
dynamic Huffman codes and also analyze the one-pass algorithm due to
Faller, Gallager, and Knuth.  In each algorithm, both the sender and the
receiver maintain equivalent dynamically varying Huffman trees, and the
coding is done in real time.  We show that the number of bits used by
the new algorithm to encode a message containing  letters
is
 letters
is  bits more than that used by the conventional two-pass Huffman scheme,
independent of the alphabet size.  This is best possible in the worst
case, for any one-pass Huffman method.  Tight upper and lower bounds are
derived.  Empirical tests show that the encodings produced by the new
algorithm are shorter than those of the other one-pass algorithm and,
except for long messages, are shorter than those of the two-pass method.
The new algorithm is well-suited for online encoding/decoding in data
networks and for file compression.
 bits more than that used by the conventional two-pass Huffman scheme,
independent of the alphabet size.  This is best possible in the worst
case, for any one-pass Huffman method.  Tight upper and lower bounds are
derived.  Empirical tests show that the encodings produced by the new
algorithm are shorter than those of the other one-pass algorithm and,
except for long messages, are shorter than those of the two-pass method.
The new algorithm is well-suited for online encoding/decoding in data
networks and for file compression.
We present a Pascal implementation of the one-pass algorithm for
constructing dynamic Huffman codes that is described and analyzed in a
companion paper [Vitter, 1987].  The program runs in real time; that is,
the processing time for each letter of the message is proportional to
the length of its codeword.  The number of bits used to encode a message
of  letters is less than
 letters is less than  bits more than that used by the
well-known two-pass algorithm.  This is best possible for any one-pass
Huffman scheme.  In practice it uses fewer bits than all other Huffman
schemes.  The algorithm has applications in file compression and network
transmission.
 bits more than that used by the
well-known two-pass algorithm.  This is best possible for any one-pass
Huffman scheme.  In practice it uses fewer bits than all other Huffman
schemes.  The algorithm has applications in file compression and network
transmission.
We present a new method for error modeling applicable to the MLP algorithm for hierarchical lossless image compression. This method, based on a concept called the variability index, provides accurate models for pixel prediction errors without requiring explicit transmission of the models. We also use the variability index to show that prediction errors do not always follow the Laplace distribution, as is commonly assumed; replacing the Laplace distribution with a more general symmetric exponential distribution further improves compression. We describe a new compression measurement called compression gain, and we give experimental results showing that the MLP method using the variability index technique for error modeling gives significantly more compression gain than other methods in the literature.
Caching and prefetching are important mechanisms for speeding up access
time to data on secondary storage.  Recent work in competitive online
algorithms has uncovered several promising new algorithms for caching.
In this paper, we apply a form of the competitive philosophy for the
first time to the problem of prefetching to develop an optimal universal
prefetcher in terms of fault ratio, with particular applications to
large-scale databases and hypertext systems.  Our algorithms for
prefetching are novel in that they are based on data compression
techniques that are both theoretically optimal and good in practice.
Intuitively, in order to compress data effectively, you have to be able
to predict future data well, and thus good data compressors should be
able to predict well for purposes of prefetching. We show for powerful
models such as Markov sources and  th order Markov sources that the
page fault rates incurred by our prefetching algorithms are optimal in
the limit for almost all sequences of page accesses.
th order Markov sources that the
page fault rates incurred by our prefetching algorithms are optimal in
the limit for almost all sequences of page accesses.
Response time delays caused by I/O is a major problem in many systems and database applications. Prefetching and cache-replacement methods are attracting renewed attention because of their success in avoiding costly I/Os. Prefetching can be looked upon as a type of online sequential prediction, where the predictions must be accurate as well as made in a computationally efficient way. Unlike other online problems, prefetching cannot admit a competitive analysis, since the optimal offline prefetcher incurs no cost when it knows the future page requests. Previous analytical work on prefetching by Vitter and Krishnan consisted of modeling the user as a probabilistic Markov source.
In this paper, we look at the much stronger form of worst-case analysis and derive a randomized algorithm that we prove analytically converges almost surely to the optimal fault rate in the worst case for every sequence of page request with respect to the important class of finite state prefetchers. In particular, we make no assumption about how the sequence of page requests is generated. This analysis model can be looked upon as a generalization of the competitive framework, in that it compares an online algorithm in a worst-case manner over all sequences against a powerful yet non-clairvoyant opponent. We simultaneously achieve the computational goal of implementing our prefetcher in optimal constant expected time per prefetched page, using the optimal dynamic discrete random variate generator of Matias, Vitter, and Ni.
We show that high-resolution images can be encoded and decoded efficiently in parallel. We present an algorithm based on the hierarchical MLP method, used either with Huffman coding or with a new variant of arithmetic coding called quasi-arithmetic coding. The coding step can be parallelized, even though the codes for different pixels are of different lengths; parallelization of the prediction and error modeling components is straightforward.
We present new vector quantization algorithms based on the theory developed by the authors. The new approach is to formulate a vector quantization problem as a 0-1 integer linear program. We first solve its relaxed linear program by linear programming techniques. Then we transform the linear program solution into a provably good solution for the vector quantization problem. These methods lead to the first known polynomial-time full-search vector quantization codebook design algorithm and tree pruning algorithm with provable worst-case performance guarantees. We also introduce the notion of pseudo-random pruned tree-structured vector quantizers. Initial experimental results on image compression are very encouraging.
We give a detailed algorithm for fast text compression. Our algorithm, related to the PPM method, simplifies the modeling phase by eliminating the escape mechanism, and speeds up coding by using a combination of quasi-arithmetic coding and Rice coding. We provide details of the use of quasi-arithmetic code tables, and analyze their compression performance. Our Fast PPM method is shown experimentally to be almost twice as fast as the PPMC method, while giving comparable compression.
We present a new method for lossless image compression that gives compression comparable to JPEG lossless mode with about five times the speed. Our method, called FELICS, is based on a novel use of two neighboring pixels for both prediction and error modeling. For coding we use single bits, adjusted binary codes, and Golomb or Rice codes. For the latter we present and analyze a provably good method for estimating the single coding parameter. (Note: This method is the foundation for the subsequently developed state-of-the-art methods now used for lossless image compression.)
We present a method for progressive lossless compression of still grayscale images that combines the speed of our earlier FELICS method with the progressivity of our earlier MLP method. We use MLP's pyramid-based pixel sequence, and image and error modeling and coding based on that of FELICS. In addition, we introduce a new prefix code with some advantages over the previously used Golomb and Rice codes. Our new progressive method gives compression ratios and speeds similar to those of non-progressive FELICS and those of JPEG lossless mode, also a non-progressive method.
The image model in Progressive FELICS is based on a simple function of four nearby pixels. We select two of the four nearest known pixels, using the two with the middle (non-extreme) values. Then we code the pixel's intensity relative to the selected pixels, using single bits, adjusted binary codes, and simple prefix codes like Golomb codes, Rice codes, or the new family of prefix codes introduced here. We estimate the coding parameter adaptively for each context, the context being the absolute value of the difference of the predicting pixels; we adjust the adaptation statistics at the beginning of each level in the progressive pixel sequence.
Arithmetic coding provides an effective mechanism for removing redundancy in the encoding of data. We show how arithmetic coding works and describe an efficient implementation that uses table lookup as a fast alternative to arithmetic operations. The reduced-precision arithmetic has a provably negligible effect on the amount of compression achieved. We can speed up the implementation further by use of parallel processing. We discuss the role of probability models and how they provide probability information to the arithmetic coder. We conclude with perspectives on the comparative advantages and disadvantages of arithmetic coding.
We compare methods for choosing motion vectors for motion-compensated
video compression.  Our primary focus is on videophone and
videoconferencing applications, where very low bit rates are necessary,
where the motion is usually limited, and where the frames must be coded
in the order they are generated.  We provide evidence, using established
benchmark videos of this type, that choosing motion vectors to minimize
codelength subject to (implicit) constraints on quality yields
substantially better rate-distortion tradeoffs than minimizing notions
of prediction error.  We illustrate this point using an algorithm within
the  standard.  We show that using quadtrees to code the motion
vectors in conjunction with explicit codelength minimization yields
further improvement.  We describe a 
dynamic-programming algorithm for choosing a quadtree to minimize the
codelength.
 standard.  We show that using quadtrees to code the motion
vectors in conjunction with explicit codelength minimization yields
further improvement.  We describe a 
dynamic-programming algorithm for choosing a quadtree to minimize the
codelength. 
Motivated by the desire to find text compressors that compress better than existing dictionary methods, but run faster than PPM implementations, we describe methods for text compression using multiple dictionaries, one for each context of preceding characters, where the contexts have varying lengths. The context to be used is determined using an escape mechanism similar to that of PPM methods. We describe modifications of three popular dictionary coders along these lines and experiments evaluating their efficacy using the text files in the Calgary corpus. Our results suggest that modifying LZ77 along these lines yields an improvement in compression of about 4%, that modifying LZFG yields a compression improvement of about 8%, and that modifying LZW in this manner yields an average improvement on the order of 12%.
We present and compare methods for choosing motion vectors for
motion-compensated video coding.  Our primary focus is on videophone and
videoconferencing applications, where very low bit rates are necessary,
where motion is usually limited, and where frames must be coded in the
order they are generated.  We provide evidence, using established
benchmark videos typical of these applications, that choosing motion
vectors explicitly to minimize rate, subject to implicit constraints on
distortion, yields better rate-distortion tradeoffs than minimizing
notions of prediction error.  Minimizing a linear combination of rate
and distortion results in further rate-distortion improvements.  Using a
heuristic function of the prediction error and the motion vector
codelength results in compression performance comparable to the more
computationally intensive coders while running much faster.  We
incorporate these ideas into coders that operate within the  standard.
standard.
We present and compare methods for choosing motion vectors for block-based motion-compensated video coding. The primary focus is on videophone and video-conferencing applications, where low bit rates are necessary, where motion is usually limited, and where the amount of computation is also limited. In a typical block-based motion-compensated video coding system, motion vectors are transmitted along with a lossy encoding of the residuals. As the bit rate decreases, the proportion required to transmit the motion vectors increases. We provide experimental evidence that choosing motion vectors explicitly to minimize rate (including motion vector coding), subject to implicit constraints on distortion, yields better rate-distortion tradeoffs than minimizing some measure of prediction error. Minimizing a combination of rate and distortion yields further improvements. Although these explicit-minimization schemes are computationally intensive, they provide invaluable insight which we use to develop practical algorithms. We show that minimizing a simple heuristic function of the prediction error and the motion vector code-length results in rate-distortion performance comparable to explicit-minimization schemes while being computationally feasible. Experimental results are provided for coders that operate within the H.261 standard.
We consider re-representing the alphabet so that a representation of a character reflects its properties as a predictor of future text. This enables us to use an estimator from a restricted class to map contexts to predictions of upcoming characters. We describe an algorithm that uses this idea in conjunction with neural networks. The performance of this implementation is compared to other compression methods, such as UNIX compress, gzip, PPMC, and an alternative neural network approach.
Video belongs to a class of information called continuous media. Continuous media is characterized by the essentially continuous manner in which the information is presented. This is in contrast to discrete media, in which there is no essential temporal component. Text, images, and graphics are examples of discrete media, while movies, sound, and computer animation are examples of continuous media. Even though a slide show is a time-based presentation of images, it is not a continuous medium since each image is viewed as an individual item. On the other hand, a video clip, while also consisting of a sequence of images, is a continuous medium since each image is perceived in the context of past and future images.
With continuous media, therefore, the temporal dimension becomes important. For example, a video sequence compressed with a constant image quality for every frame is often more desirable than one in which the image quality varies noticeably over time. However, because the compressibility of individual frames varies over time, maintaining a constant image quality results in a variation in coding rate over time. The process of controlling the coding rate to meet the requirements of a transmissions channel or storage device, while maintaining a desired level of quality, is called bit rate control. In this monograph, we focus on the rate control of compressed video. Specifically, we present a new framework for allocating bits to the compression of pictures in a video sequence.
Existing optimal rate control techniques typically regulate the coding rate to minimize a sum-distortion measure. While these techniques can leverage the wealth of tools from least-mean-square optimization theory, they do not guarantee constant-quality video, an objective often mentioned in the literature. In this book, we develop a framework that casts rate control as a resource allocation problem with continuous variables, nonlinear constraints, and a novel lexicographic optimality criterion that is motivated for uniform video quality. With the lexicographic criterion, we propose a new concept of coding efficiency to better reflect the constancy in quality that is generally desired from a video coder.
Rigorous analysis within this framework reveals a set of necessary and sufficient conditions for optimality for coding at both constant and variable bit rates. With these conditions, we are able to construct polynomial-time algorithms for optimal bit rate control. Experimental implementations of these algorithms confirm the theoretical analysis and produce encodings that are more uniform in quality than that achieved with existing rate control methods. As evidence of the generality and flexibility of the framework, we show how to extend the framework to allocate bits among multiple variable bit rate bitstreams that are to be transmitted over a common constant bit rate channel and to encompass the case of discrete variables.
Slides for talk (Adobe pdf format)
The proliferation of online text, such as on the World Wide Web and
  in databases, motivates the need for space-efficient index methods
  that support fast search.  Consider a text  of
 of  binary symbols
  to index.  Given any query pattern
 binary symbols
  to index.  Given any query pattern  of
 of  binary symbols, the
  goal is to search for
 binary symbols, the
  goal is to search for  in
 in  quickly, with
 quickly, with  being fully
  scanned only once, namely, when the index is created. All indexing
  schemes published in the last thirty years support searching in
 being fully
  scanned only once, namely, when the index is created. All indexing
  schemes published in the last thirty years support searching in
   worst-case time and require
 worst-case time and require  memory words (or
 memory words (or
  
 bits), which is significantly larger than the
  text itself. In this paper we provide a breakthrough both in
  searching time and index space under the same model of computation
  as the one adopted in previous work. Based upon new compressed
  representations of suffix arrays and suffix trees, we construct an
  index structure that occupies only
 bits), which is significantly larger than the
  text itself. In this paper we provide a breakthrough both in
  searching time and index space under the same model of computation
  as the one adopted in previous work. Based upon new compressed
  representations of suffix arrays and suffix trees, we construct an
  index structure that occupies only  bits and compares
  favorably with inverted lists in space.  We can search any binary
  pattern
 bits and compares
  favorably with inverted lists in space.  We can search any binary
  pattern  , stored in
, stored in  words, in only
 words, in only  time.
 time.
Specifically, searching takes  time for
 time for  , and
, and
  
 time for
 time for 
 and any fixed
  and any fixed 
 . That is, we achieve optimal
. That is, we achieve optimal
   search time for sufficiently large
 search time for sufficiently large 
 .
  We can list all the
.
  We can list all the  pattern occurrences in optimal
 pattern occurrences in optimal
  
 additional time when
 additional time when 
 or when
  or when 
 ; otherwise, listing takes
; otherwise, listing takes 
  
 additional time.
 additional time.
We present a novel implementation of compressed suffix arrays exhibiting
new tradeoffs between search time and space occupancy for a given text
(or sequence) of  symbols over an alphabet
 symbols over an alphabet  , where each
symbol is encoded by
, where each
symbol is encoded by  bits.  We show that compressed
suffix arrays use just
 bits.  We show that compressed
suffix arrays use just 
 bits, 
while retaining full text indexing functionalities, such as searching
any pattern sequence of length
 bits, 
while retaining full text indexing functionalities, such as searching
any pattern sequence of length  in
 in 
 time.  The term
 
time.  The term 
 denotes the
 denotes the  th-order empirical
entropy of the text, which means that our index is nearly optimal in
space apart from lower-order terms, achieving asymptotically the
empirical entropy of the text (with a multiplicative constant 1). 
If the text is highly compressible so that
th-order empirical
entropy of the text, which means that our index is nearly optimal in
space apart from lower-order terms, achieving asymptotically the
empirical entropy of the text (with a multiplicative constant 1). 
If the text is highly compressible so that  and the alphabet
size is small, we obtain a text index with
 and the alphabet
size is small, we obtain a text index with  search time that
requires only
 search time that
requires only  bits.  We also report further results and tradeoffs
on high-order entropy-compressed text indexes.
 bits.  We also report further results and tradeoffs
on high-order entropy-compressed text indexes.
We report on a new and improved version of high-order entropy-compressed suffix arrays, which has theoretical performance guarantees comparable to previous work, yet represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size -- without requiring a separate instance of the text -- and support fast and powerful searches. To our knowledge, this is the best known method in terms of space for fast searching. We can additionally use a simple notion to encode and decode block-sorting transforms (such as the Burrows-Wheeler transform), achieving a slightly better compression ratio than bzip2. We also provide a compressed representation of suffix trees (and their associated text) in a total space that is comparable to that of the text alone compressed with gzip.
We report on a simple encoding format called wzip for
decompressing block-sorting transforms, such as the Burrows-Wheeler
Transform (BWT). Our compressor uses the simple notions of gamma
encoding and RLE organized with a wavelet tree to achieve a slightly
better compression ration than bzip2 in less time. In fact, our
compression/decompression time is dependent upon  , the empirical
, the empirical
 th order entropy. Another key contribution of our compressor is its
simplicity. Our compressor can also operate as a full-text index with a
small amount of data, while still preserving backward compatibility with
just the compressor.
th order entropy. Another key contribution of our compressor is its
simplicity. Our compressor can also operate as a full-text index with a
small amount of data, while still preserving backward compatibility with
just the compressor.
We propose measures for compressed data
structures, in which space usage is measured in a data-aware manner. In
particular, we consider the fundamental dictionary problem on
set data, where the task is to construct a data structure to
represent a set  of
 of  items out of a universe
 items out of a universe 
 and support various queries on
 and support various queries on  . We use a well-known
data-aware measure for set data called gap to bound the space of
our data structures. We describe a novel dictionary structure taking
. We use a well-known
data-aware measure for set data called gap to bound the space of
our data structures. We describe a novel dictionary structure taking
 bits. Under the RAM
model, our dictionary supports membership, rank, select, and predecessor
queries in nearly optimal time, matching the time bound of Andersson and
Thorup's predecessor structure, while simultaneously improving upon their
space usage. Our dictionary structure uses exactly gap bits in the
leading term (i.e., the constant factor is
 bits. Under the RAM
model, our dictionary supports membership, rank, select, and predecessor
queries in nearly optimal time, matching the time bound of Andersson and
Thorup's predecessor structure, while simultaneously improving upon their
space usage. Our dictionary structure uses exactly gap bits in the
leading term (i.e., the constant factor is  ) and answers queries in
near-optimal time. When seen from the worst case perspective, we present
the first
) and answers queries in
near-optimal time. When seen from the worst case perspective, we present
the first  -bit dictionary structure which supports these
queries in near-optimal time under RAM model. We also build a dictionary
which requires the same space and supports membership, select, and
partial rank queries even more quickly in
-bit dictionary structure which supports these
queries in near-optimal time under RAM model. We also build a dictionary
which requires the same space and supports membership, select, and
partial rank queries even more quickly in  time. To the
best of our knowledge, this is the first of a kind result which achieves
data-aware space usage and retains near-optimal time.
 time. To the
best of our knowledge, this is the first of a kind result which achieves
data-aware space usage and retains near-optimal time.
We present an experimental study of the space-time tradeoffs for the dictionary problem. Our primary goal is to reduce the space requirement for storing a dictionary data structure. Many compression schemes have been developed for dictionaries, which fall generally in the categories of combinatorial encodings and data-aware methods and still support queries efficiently. We show that for many real-world datasets, data-aware methods lead to a worthwhile compression over combinatorial methods. Additionally, we design a new data-aware building block structure called BSGAP that presents improvements over other data-aware methods.
We present a unified algorithmic framework to obtain nearly
optimal space 
bounds for text compression and compressed text indexing,
apart from 
lower-order terms. For a text  of
 of  symbols drawn from
an 
alphabet
 symbols drawn from
an 
alphabet  , our bounds are stated in terms of the
, our bounds are stated in terms of the
 th-order 
empirical entropy of the text,
th-order 
empirical entropy of the text,  . In particular, we
provide a tight 
analysis of the Burrows-Wheeler transform (BWT)
establishing a bound of
. In particular, we
provide a tight 
analysis of the Burrows-Wheeler transform (BWT)
establishing a bound of 
 bits, where
 bits, where  denotes
the 
asymptotic number of bits required to store the empirical
statistical 
model for contexts of order up to
 denotes
the 
asymptotic number of bits required to store the empirical
statistical 
model for contexts of order up to  appearing
in
 appearing
in  . Using the same 
framework, we also obtain an implementation of the
compressed suffix array 
(CSA) which achieves
. Using the same 
framework, we also obtain an implementation of the
compressed suffix array 
(CSA) which achieves 
 bits of space while still retaining competitive
full-text indexing 
functionality.
 bits of space while still retaining competitive
full-text indexing 
functionality.
The novelty of the proposed framework lies in its use of the
finite set model instead of the empirical probability model
(as in previous work), giving us new insight into the design
and analysis of our algorithms. For example, we show that
our analysis gives improved bounds since 
 ,
where
,
where 
 and
 and 
 do not depend on the
text length
 do not depend on the
text length  , while
, while 
 is the modified
 is the modified
 th-order empirical entropy of
th-order empirical entropy of  . Moreover, we show a
strong relationship between a compressed full-text index and
the succinct dictionary problem. We also examine the
importance of lower-order terms, as these can dwarf any
savings achieved by high-order entropy. We report further
results and tradeoffs on high-order entropy-compressed text
indexes in the paper.
. Moreover, we show a
strong relationship between a compressed full-text index and
the succinct dictionary problem. We also examine the
importance of lower-order terms, as these can dwarf any
savings achieved by high-order entropy. We report further
results and tradeoffs on high-order entropy-compressed text
indexes in the paper.
Run-Length-Encoding (RLE) is a data compression technique
that is used in various applications, e.g., biological se-
quence databases, multimedia, and facsimile transmission.
One of the main challenges is how to operate, e.g.,
indexing, searching, and retrieval, on the compressed data
without decompressing it. In this paper, we present the
String B-tree for Compressed sequences, termed the SBC-tree,
for indexing and searching RLE-compressed sequences of
arbitrary length. The SBC-tree is a two-level index
structure based on the well-known String B-tree and a
3-sided range query structure. The SBC-tree supports
substring as well as prefix matching, and range search
operations over RLE-compressed sequences. The SBC-tree has
an optimal external-memory space complexity of  pages,
where
 pages,
where  is the total length of the compressed sequences, and
 is the total length of the compressed sequences, and
 is the disk page size. The insertion and deletion of all
suffixes of a compressed sequence of length m takes
 is the disk page size. The insertion and deletion of all
suffixes of a compressed sequence of length m takes 
 I/O operations. Substring matching, prefix
matching, and range search execute in an optimal
 I/O operations. Substring matching, prefix
matching, and range search execute in an optimal 
 I/O operations, where
 I/O operations, where  is the length of the
compressed query pattern and
 is the length of the
compressed query pattern and  is the query output size. We
present also two variants of the SBC-tree: the SBC-tree that
is based on an R-tree instead of the 3-sided structure, and
the one-level SBC-tree that does not use a two-dimensional
index.
These variants do not have provable worst-case theoretical
bounds for search operations, but perform well in practice.
The SBC-tree index is realized inside PostgreSQL in the
context of a biological protein database
application. Performance results illustrate that using the
SBC-tree to index 
RLE-compressed sequences achieves up to an order of
magnitude reduction in storage, up to 30% reduction in I/Os
for 
the insertion operations, and retains the optimal search
performance achieved by the String B-tree over the uncompressed sequences.
 is the query output size. We
present also two variants of the SBC-tree: the SBC-tree that
is based on an R-tree instead of the 3-sided structure, and
the one-level SBC-tree that does not use a two-dimensional
index.
These variants do not have provable worst-case theoretical
bounds for search operations, but perform well in practice.
The SBC-tree index is realized inside PostgreSQL in the
context of a biological protein database
application. Performance results illustrate that using the
SBC-tree to index 
RLE-compressed sequences achieves up to an order of
magnitude reduction in storage, up to 30% reduction in I/Os
for 
the insertion operations, and retains the optimal search
performance achieved by the String B-tree over the uncompressed sequences.
We consider a central problem in text indexing: Given a
text  over an alphabet
 over an alphabet  , construct a compressed
data structure answering the queries
, construct a compressed
data structure answering the queries 
 ,
,
 , and
, and 
 for a symbol
 for a symbol  .
Many data structures consider these queries for static
text
.
Many data structures consider these queries for static
text  . We consider the dynamic version of the problem,
where we are allowed to insert and delete symbols at
arbitrary positions of
. We consider the dynamic version of the problem,
where we are allowed to insert and delete symbols at
arbitrary positions of  . This problem is a key challenge
in compressed text indexing and has direct application to
dynamic XML indexing structures that answer subpath
queries [XBW].
. This problem is a key challenge
in compressed text indexing and has direct application to
dynamic XML indexing structures that answer subpath
queries [XBW].
We build on the results of [RRR, GMR] and give the best
known query bounds for the dynamic version of this problem,
supporting arbitrary insertions and deletions of symbols
in  . Specifically, with an amortized update time of
. Specifically, with an amortized update time of
 , we suggest how to support
, we suggest how to support
 ,
, 
 , and
, and 
 queries in
 queries in
 time, for any
 time, for any  .
The best previous query times for this problem were
.
The best previous query times for this problem were 
 , given by [Makinen Navarro].  Our bounds
are competitive with state-of-the-art static structures
[GMR]. Some applicable lower bounds for the partial sums
problem [PD] show that our update/query tradeoff is also
nearly optimal. In addition, our space bound is competitive
with the corresponding static structures. For the special
case of bitvectors (i.e.,
, given by [Makinen Navarro].  Our bounds
are competitive with state-of-the-art static structures
[GMR]. Some applicable lower bounds for the partial sums
problem [PD] show that our update/query tradeoff is also
nearly optimal. In addition, our space bound is competitive
with the corresponding static structures. For the special
case of bitvectors (i.e.,  ), we also show the
best tradeoffs for query/update time, improving upon the
results of [Makinen Navarro, Hon, RRR].
), we also show the
best tradeoffs for query/update time, improving upon the
results of [Makinen Navarro, Hon, RRR].
Our focus on fast query/slower update is
well-suited for a query-intensive XML indexing environment.
Using the XBW transform [XBW], we also present a
dynamic data structure that succinctly maintains an ordered
labeled tree  and supports a powerful set of queries
on
 and supports a powerful set of queries
on  .
.
 aw, Poland, July 2007, 
  published in Lecture Notes in Computer Science, 4596
  Springer-Verlag, Berlin, Germany, 521-532.
aw, Poland, July 2007, 
  published in Lecture Notes in Computer Science, 4596
  Springer-Verlag, Berlin, Germany, 521-532.
We present a framework to dynamize succinct data structures,
to encourage their use over non-succinct versions in a wide
variety of important application areas. Our framework can
dynamize most state-of-the-art succinct data structures for
dictionaries, ordinal trees, labeled trees, and text
collections. Of particular note is its direct application to
XML indexing structures that answer 
 queries. Our
framework focuses on achieving information-theoretically
optimal space along with near-optimal update/query bounds.
 queries. Our
framework focuses on achieving information-theoretically
optimal space along with near-optimal update/query bounds.
As the main part of our work, we consider the following
problem central to text indexing: Given a text  over an
alphabet
 over an
alphabet  , construct a compressed data structure
answering the queries
, construct a compressed data structure
answering the queries 
 ,
, 
 , and
, and
 for a symbol
 for a symbol  . Many data
structures consider these queries for static text
. Many data
structures consider these queries for static text  . We
build on these results and give the best known query bounds
for the dynamic version of this problem, supporting
arbitrary insertions and deletions of symbols in
. We
build on these results and give the best known query bounds
for the dynamic version of this problem, supporting
arbitrary insertions and deletions of symbols in  .
.
Specifically, with an amortized update time of
 , any static succinct data structure
, any static succinct data structure  for
 for
 , taking
, taking  time for queries, can be converted by our
framework into a dynamic succinct data structure that
supports
 time for queries, can be converted by our
framework into a dynamic succinct data structure that
supports 
 ,
, 
 , 
and
, 
and 
 queries in
queries in 
 time, for any constant
 time, for any constant
 . When
. When 
 , we achieve
, we achieve
 query times. Our update/query bounds are near-optimal
with respect to the lower bounds.
 query times. Our update/query bounds are near-optimal
with respect to the lower bounds.
We present a unified algorithmic framework to obtain nearly
optimal space bounds for text compression and compressed
text indexing, apart from lower-order terms. For a text  of
 of
 symbols drawn from an alphabet
 symbols drawn from an alphabet  , our bounds are stated in
terms of the
, our bounds are stated in
terms of the  th-order empirical entropy of the text,
th-order empirical entropy of the text,  . In
particular, we provide a tight analysis of the
Burrows-Wheeler transform (BWT) establishing a bound of
. In
particular, we provide a tight analysis of the
Burrows-Wheeler transform (BWT) establishing a bound of 
 bits, where
 bits, where 
 denotes the asymptotical
number of bits required to store the empirical statistical
model for contexts of order
 denotes the asymptotical
number of bits required to store the empirical statistical
model for contexts of order  appearing in
 appearing in  . Using the same
framework, we also obtain an implementation of the
compressed suffix array (CSA) that achieves
. Using the same
framework, we also obtain an implementation of the
compressed suffix array (CSA) that achieves 
 bits of space while still retaining
competitive full-text indexing functionality.
 bits of space while still retaining
competitive full-text indexing functionality.  
The novelty
of the proposed framework lies in its use of the finite set
model instead of the empirical probability model (as in
previous work), giving us new insight into the design and
analysis of our algorithms. For example, we show that our
analysis gives improved bounds since 
 , where
, where 
 and
 and 
 do not depend on
the text length
 do not depend on
the text length  , while
, while 
 is the modified
 is the modified 
 th-order empirical entropy of
th-order empirical entropy of  . Moreover, we show a strong
relationship between a compressed full-text index and the
succinct dictionary problem. We also examine the importance
of lower-order terms, as these can dwarf any savings
achieved by high-order entropy. We report further results
and tradeoffs on high-order entropy-compressed text indexes
in the paper.
. Moreover, we show a strong
relationship between a compressed full-text index and the
succinct dictionary problem. We also examine the importance
of lower-order terms, as these can dwarf any savings
achieved by high-order entropy. We report further results
and tradeoffs on high-order entropy-compressed text indexes
in the paper.
We introduce a new variant of the popular Burrows-Wheeler transform (BWT) called Geometric Burrows-Wheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2-dimensional geometry. Using this transform, we can answer to many open questions in compressed text indexing: (1) Can compressed data structures be designed in external memory with similar performance as the uncompressed counterparts? (2) Can compressed data structures be designed for position restricted pattern matching? We also introduce a reverse transform, called Points2Text, which converts a set of points into text. This transform allows us to derive the best known lower bounds in compressed text indexing. We show strong equivalence between data structural problems in geometric range searching and text pattern matching. This provides a way to derive new results in compressed text indexing by translating the results from range searching.
The past few years have witnessed several exciting results
on compressed representation of a string  that supports
efficient pattern matching, and the space complexity has been
reduced to
 that supports
efficient pattern matching, and the space complexity has been
reduced to 
 bits, where
 bits, where  denotes the
denotes the  th-order empirical entropy of
th-order empirical entropy of  , and
, and  is the
size of the alphabet. In this paper we study compressed
representation for another classical problem of string
indexing, which is called dictionary matching in the
literature. Precisely, a collection
 is the
size of the alphabet. In this paper we study compressed
representation for another classical problem of string
indexing, which is called dictionary matching in the
literature. Precisely, a collection  of strings (called
patterns) of total length
 of strings (called
patterns) of total length  is to be indexed so that given a
text
 is to be indexed so that given a
text  , the occurrences of the patterns in
, the occurrences of the patterns in  can be found
efficiently. In this paper we show how to exploit a sampling
technique to compress the existing O(n)-word index to an
 can be found
efficiently. In this paper we show how to exploit a sampling
technique to compress the existing O(n)-word index to an
 -bit index with only a small sacrifice
in search time.
-bit index with only a small sacrifice
in search time.
Current data structures for searching large string collections are limited in that they either fail to achieve minimum space or they cause too many cache misses. In this paper, we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and storable in compressed space. The widely known frontcoding scheme is one example of linearization; it is at the core of Prefix B-trees and many other disk-conscious compressed indexes for string collections. However, it is largely thought of as a space-effective heuristic without efficient search support.
In this paper, we introduce new insights on front-coding and other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. The second contribution of this paper engineers these linearizations to design a novel dictionary encoding scheme that achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries and obtain a succinct variant, whose space is close to the information-theoretic minimum.
A new trend in the field of pattern matching is to design
indexing data structures 
which take the space very close to that required by the
indexed text (in 
entropy-compressed form) and also simultaneously achieve
good query performance. Two 
popular indexes, namely the FM-index of Ferragina and Manzini and the
CSA of Grossi and Vitter,
achieve this goal by exploiting the Burrows-Wheeler
transform (BWT). 
However, due to the intricate permutation structure of BWT,
no locality 
of reference can be guaranteed when we perform pattern
matching with these indexes. 
Chien et al. gave an alternative text
index which is based 
on sparsifying the traditional suffix tree and maintaining
an auxiliary 2-D range query structure. 
Given a text  of length
 of length  drawn from a
 drawn from a  -sized
alphabet set, 
they achieved an
-sized
alphabet set, 
they achieved an 
 -bit index for
-bit index for  and 
showed that this index can preserve locality in pattern
matching and hence 
is amenable to be used in external-memory settings. 
We improve upon this index and show how to apply entropy
compression to reduce index space. 
Our index takes
 and 
showed that this index can preserve locality in pattern
matching and hence 
is amenable to be used in external-memory settings. 
We improve upon this index and show how to apply entropy
compression to reduce index space. 
Our index takes 
 bits of
space where
 bits of
space where  is the
 is the  th-order 
empirical entropy of the text. 
This is achieved by creating variable length blocks of text
using arithmetic coding.
th-order 
empirical entropy of the text. 
This is achieved by creating variable length blocks of text
using arithmetic coding. 
Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are the suffix trees and the suffix arrays, and the most popular external-memory index structure is the string B-tree. However, the practical applicability of these indexes has been limited mainly because of their space consumption and I/O issues. These structures use a lot more space (almost 20 to 50 times more) than the original text data and are often disk-resident.
Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al (2008) presented a compact text index in the external memory based on the concept of Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be hard to obtain a good index structure in the external memory.
In this paper, we investigate this issue from a practical point of view. On the positive side we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves space by about an order of magnitude as compared to the standard String B-tree. While saving space, these structures also maintain a comparable I/O efficiency to that of String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures.
Slides for CPM '10 keynote talk (Adobe pdf)
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
 -bit Filtering Based Text Decomposition,"
invited paper from ISCIS 2010 in
 The Computer Journal, December 20, 2010.
An extended abstract appears in ``Boosting Pattern Matching Performance via
-bit Filtering Based Text Decomposition,"
invited paper from ISCIS 2010 in
 The Computer Journal, December 20, 2010.
An extended abstract appears in ``Boosting Pattern Matching Performance via  -bit Filtering,"
Proceedings of the 25th International Symposium on Computer and Information Sciences (ISCIS '10), 
London, September 2010,
published in Lecture Notes in Electrical Engineering, 1, 62,
  Springer, Berlin, Germany, 27-32.
-bit Filtering,"
Proceedings of the 25th International Symposium on Computer and Information Sciences (ISCIS '10), 
London, September 2010,
published in Lecture Notes in Electrical Engineering, 1, 62,
  Springer, Berlin, Germany, 27-32.
This study explores an alternative way of storing text files to answer
exact match queries faster. We decompose the original file into two
parts as filter and payload. The filter part contains the most
informative  bits of each byte, and the remaining bits of the bytes
are concatenated in order of appearance to generate the payload. We
refer to this structure as
 bits of each byte, and the remaining bits of the bytes
are concatenated in order of appearance to generate the payload. We
refer to this structure as  -bit filtered format. When an input
pattern is to be searched on the
-bit filtered format. When an input
pattern is to be searched on the  -bit filtered structure, the same
decomposition is performed on the pattern. The
-bit filtered structure, the same
decomposition is performed on the pattern. The  bits from each byte
of the pattern form the pattern filter bit sequence, and the rest is
the payload.  The pattern filter is first scanned on the filter part
of the file. At each match position detected in the filter part, the
pattern payload is verified against the corresponding location in the
payload part of the text.  Thus, instead of searching an
 bits from each byte
of the pattern form the pattern filter bit sequence, and the rest is
the payload.  The pattern filter is first scanned on the filter part
of the file. At each match position detected in the filter part, the
pattern payload is verified against the corresponding location in the
payload part of the text.  Thus, instead of searching an  -byte
pattern on an
-byte
pattern on an  -byte text, the first
-byte text, the first  bits are scanned on
 bits are scanned on  bits, followed by a verification of
 bits, followed by a verification of  bits on
the respective locations of the matching positions.  Experiments
conducted on natural language texts, plain ASCII DNA sequences, and
random byte sequences showed that the search performance with the
proposed scheme is on average two times faster than the tested best
exact pattern matching algorithms. The highest gain is obtained on
plain ASCII DNA sequences.  We also developed an effective bitwise
pattern matching algorithm of possible independent interest within
this study.
 bits on
the respective locations of the matching positions.  Experiments
conducted on natural language texts, plain ASCII DNA sequences, and
random byte sequences showed that the search performance with the
proposed scheme is on average two times faster than the tested best
exact pattern matching algorithms. The highest gain is obtained on
plain ASCII DNA sequences.  We also developed an effective bitwise
pattern matching algorithm of possible independent interest within
this study.
 String Retrieval,''
Journal of the ACM, 61(2), April 2014, 9.1-9.36.
An extended abstract appears in
W.-K. Hon, R. Shah, and J. S. Vitter,
``Space-Efficient Frameworks for Top-
 String Retrieval,''
Journal of the ACM, 61(2), April 2014, 9.1-9.36.
An extended abstract appears in
W.-K. Hon, R. Shah, and J. S. Vitter,
``Space-Efficient Frameworks for Top- String Retrieval
Problems,'' 
  Proceedings of the 50th Annual IEEE Symposium on Foundations of
    Computer Science (FOCS '09), Atlanta, GA, October 2009,
  713-722.
 String Retrieval
Problems,'' 
  Proceedings of the 50th Annual IEEE Symposium on Foundations of
    Computer Science (FOCS '09), Atlanta, GA, October 2009,
  713-722.
Given a set 
 of
 of  strings of
total length
 strings of
total length  , our task is to report the ``most
relevant'' strings for a given query pattern
, our task is to report the ``most
relevant'' strings for a given query pattern  . This
involves somewhat more advanced query functionality than the
usual pattern matching, as some notion of ``most relevant''
is involved.  In information retrieval literature, this task
is best achieved by using inverted indexes. However,
inverted indexes work only for some predefined set of
patterns. In the pattern matching community, the most
popular pattern-matching data structures are suffix trees
and suffix arrays. However, a typical suffix tree search
involves going through all the occurrences of the pattern
over the entire string collection, which might be a lot more
than the required relevant documents.
. This
involves somewhat more advanced query functionality than the
usual pattern matching, as some notion of ``most relevant''
is involved.  In information retrieval literature, this task
is best achieved by using inverted indexes. However,
inverted indexes work only for some predefined set of
patterns. In the pattern matching community, the most
popular pattern-matching data structures are suffix trees
and suffix arrays. However, a typical suffix tree search
involves going through all the occurrences of the pattern
over the entire string collection, which might be a lot more
than the required relevant documents.
The first formal framework to study such kind of retrieval
problems was given by Muthukrishnan. He considered two
metrics for relevance: frequency and proximity. He took a
threshold-based approach on these metrics and gave data
structures taking  words of space. We study
this problem in a slightly different framework of reporting
the top
 words of space. We study
this problem in a slightly different framework of reporting
the top  most relevant documents (in sorted order) under
similar and more general relevance metrics. Our framework
gives linear space data structure with optimal query times
for arbitrary score functions. As a corollary, it improves
the space utilization for the problems considered by
Muthukrishnan while maintaining optimal query performance.
We also develop compressed variants of these data structures
for several specific relevance metrics.
 most relevant documents (in sorted order) under
similar and more general relevance metrics. Our framework
gives linear space data structure with optimal query times
for arbitrary score functions. As a corollary, it improves
the space utilization for the problems considered by
Muthukrishnan while maintaining optimal query performance.
We also develop compressed variants of these data structures
for several specific relevance metrics.
 -RA: a Parallel Sparse Index for Genomic Read Alignment,''
invited paper from BIBM 2010 in
BMC Genomics, 12(Suppl. 2), 2011, S7.
An extended abstract appears in ``
-RA: a Parallel Sparse Index for Genomic Read Alignment,''
invited paper from BIBM 2010 in
BMC Genomics, 12(Suppl. 2), 2011, S7.
An extended abstract appears in `` -RA: A Parallel Sparse Index for Read Alignment on Genomes,''
Proceedings of the IEEE International
Conference on Bioinformatics & Biomedicine (BIBM '10), Hong Kong, December 2010, 663-668.
-RA: A Parallel Sparse Index for Read Alignment on Genomes,''
Proceedings of the IEEE International
Conference on Bioinformatics & Biomedicine (BIBM '10), Hong Kong, December 2010, 663-668.
Background: Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than does direct sequencing. Among many strategies proposed for this alignment process, indexing the reference genome and short read searching over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to catch the massive short reads produced by the next generation high-throughput DNA sequencing technology.
Results: We concentrate on indexing DNA sequences via sparse suffix arrays
(SSAs) and propose a new short read aligner named  -RA (PSI-RA:
parallel sparse index read aligner).  The motivation in using SSAs is
the ability to trade memory against time.  It is possible to fine tune
the space consumption of the index based on the available memory of
the machine and the minimum length of the arriving pattern queries.
Although SSAs have been studied before for exact matching of short
reads, an elegant way of approximate matching capability was missing.
We provide this by defining the rightmost mismatch criteria that
prioritize the errors towards the end of the reads, where errors are
more probable.
-RA (PSI-RA:
parallel sparse index read aligner).  The motivation in using SSAs is
the ability to trade memory against time.  It is possible to fine tune
the space consumption of the index based on the available memory of
the machine and the minimum length of the arriving pattern queries.
Although SSAs have been studied before for exact matching of short
reads, an elegant way of approximate matching capability was missing.
We provide this by defining the rightmost mismatch criteria that
prioritize the errors towards the end of the reads, where errors are
more probable.  -RA supports any number of mismatches in aligning
reads.  We give comparisons with some of the well-known short read
aligners, and show that indexing a genome with SSA is a good
alternative to the Burrows-Wheeler transform or seed-based solutions.
-RA supports any number of mismatches in aligning
reads.  We give comparisons with some of the well-known short read
aligners, and show that indexing a genome with SSA is a good
alternative to the Burrows-Wheeler transform or seed-based solutions.
Conclusions:  -RA is expected to serve as a valuable tool in the alignment of
short reads generated by the next generation high-throughput
sequencing technology.
-RA is expected to serve as a valuable tool in the alignment of
short reads generated by the next generation high-throughput
sequencing technology.   -RA is very fast in exact matching and
also supports rightmost approximate matching. The SSA structure that
-RA is very fast in exact matching and
also supports rightmost approximate matching. The SSA structure that
 -RA is built on naturally incorporates the modern multicore
architecture and thus further speed-up can be gained.  All the
information, including the source code of
-RA is built on naturally incorporates the modern multicore
architecture and thus further speed-up can be gained.  All the
information, including the source code of  -RA, can be downloaded
at http://www.busillis.com/o_kulekci/PSIRA.zip.
-RA, can be downloaded
at http://www.busillis.com/o_kulekci/PSIRA.zip.
Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size with the best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory (actual internal memory usage is less than 6GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
Given a set  of
 of  patterns of total length
 patterns of total length  , the dictionary matching problem 
is to index
, the dictionary matching problem 
is to index  such that for any query text
 such that for any query text  , we can locate the occurrences of any
pattern within
, we can locate the occurrences of any
pattern within  efficiently. This problem can be solved in optimal
 efficiently. This problem can be solved in optimal  time by
the classical AC automaton (Aho and Corasick, 1975) where
 time by
the classical AC automaton (Aho and Corasick, 1975) where  denotes the number of
occurrences. The space requirement is
 denotes the number of
occurrences. The space requirement is  words. In the approximate dictionary match-
ing problem with one error, we consider a substring of
 words. In the approximate dictionary match-
ing problem with one error, we consider a substring of ![$T[i. .j]$](img183.png) an occurrence of
 an occurrence of  whenever
the edit distance between
 whenever
the edit distance between ![$T[i. .j]$](img183.png) and
 and  is at most 1. For this problem, the best known
indexes are by Cole et al. (2004), which requires
 is at most 1. For this problem, the best known
indexes are by Cole et al. (2004), which requires words of space and reports
all occurrences in
 words of space and reports
all occurrences in 
 time, and by Ferragina et al. (1999), which
requires
 time, and by Ferragina et al. (1999), which
requires 
 words of space and reports all occurrences in
 words of space and reports all occurrences in 
 time.
Recently, there have been successes in compressing the dictionary matching index while
keeping the query time optimal (Belazzougui, 2010; Hon et al., 2010). However, a com-
pressed index for approximate dictionary matching problem is still open. In this paper,
we propose the rst such index which requires an optimal
 time.
Recently, there have been successes in compressing the dictionary matching index while
keeping the query time optimal (Belazzougui, 2010; Hon et al., 2010). However, a com-
pressed index for approximate dictionary matching problem is still open. In this paper,
we propose the rst such index which requires an optimal 
 -bit
index space, where
-bit
index space, where  denotes the
 denotes the  th-order empirical entropy of
th-order empirical entropy of  ,
,  and is the size of
alphabet set from which all the characters in
 and is the size of
alphabet set from which all the characters in  and
 and  are drawn. The query time of our
index is
 are drawn. The query time of our
index is 
 .
.
The wavelet tree data structure is a space-efficient technique
for rank and select queries that generalizes from binary characters to
an arbitrary multicharacter alphabet. It has become a key tool in
modern full-text indexing and data compression because of its
capabilities in compressing, indexing, and searching.  We present a
comparative study of its practical performance regarding a wide range
of options on the dimensions of different coding schemes and tree
shapes.  Our results are both theoretical and experimental:
(1) We show that the run-length  coding size of wavelet trees
achieves the 0-order empirical entropy size of the original string
with leading constant 1, when the string's 0-order empirical entropy
is asymptotically less than the logarithm of the alphabet size.  This
result complements the previous works that are dedicated to analyzing
run-length
 coding size of wavelet trees
achieves the 0-order empirical entropy size of the original string
with leading constant 1, when the string's 0-order empirical entropy
is asymptotically less than the logarithm of the alphabet size.  This
result complements the previous works that are dedicated to analyzing
run-length  -encoded wavelet trees. It also reveals the
scenarios when run-length
-encoded wavelet trees. It also reveals the
scenarios when run-length  encoding becomes practical.
(2) We introduce a full generic package of wavelet trees for a wide
range of options on the dimensions of coding schemes and tree shapes.
Our experimental study reveals the practical performance of the
various modifications.
 encoding becomes practical.
(2) We introduce a full generic package of wavelet trees for a wide
range of options on the dimensions of coding schemes and tree shapes.
Our experimental study reveals the practical performance of the
various modifications.
Inverted indexes are the most fundamental and widely used
data structures in information retrieval. For each unique
word occurring in a document collection, the inverted index
stores a list of the documents in which this word occurs.
Compression techniques are often applied to further reduce
the space requirement of these lists. However, the index
has a shortcoming, in that only predefined pattern queries
can be supported efficiently. In terms of string documents
where word boundaries are undefined, if we have to index all
the substrings of a given document, then the storage quickly
becomes quadratic in the data size. Also, if we want to apply
the same type of indexes for querying phrases or sequence of
words, then the inverted index will end up storing redundant
information. In this paper, we show the first set of inverted
indexes which work naturally for strings as well as phrase
searching. The central idea is to exclude document  in the
inverted list of a string
 in the
inverted list of a string  if every occurrence of
 if every occurrence of  in
 in  is
subsumed by another string of which
 is
subsumed by another string of which  is a prefix. With this
we show that our space utilization is close to the optimal.
Techniques from succinct data structures are deployed to
achieve compression while allowing fast access in terms of
frequency and document id based retrieval. Compression
and speed tradeoffs are evaluated for different variants of
the proposed index. For phrase searching, we show that our
indexes compare favorably against a typical inverted index
deploying position-wise intersections. We also show efficient
top-
 is a prefix. With this
we show that our space utilization is close to the optimal.
Techniques from succinct data structures are deployed to
achieve compression while allowing fast access in terms of
frequency and document id based retrieval. Compression
and speed tradeoffs are evaluated for different variants of
the proposed index. For phrase searching, we show that our
indexes compare favorably against a typical inverted index
deploying position-wise intersections. We also show efficient
top- based retrieval under relevance metrics like frequency
and tf-idf.
 based retrieval under relevance metrics like frequency
and tf-idf.
 -RA: a Parallel Sparse Index for Genomic Read Alignment,''
invited paper from BIBM '10 in
BMC Genomics, 12(Suppl. 2), 2011, S7.
A shorter version appears in ``PSI-RA: A Parallel Sparse Index for Read Alignment on Genomes,"
Proceedings of 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM '10), 
Hong Kong, December 2010, 663-668.
-RA: a Parallel Sparse Index for Genomic Read Alignment,''
invited paper from BIBM '10 in
BMC Genomics, 12(Suppl. 2), 2011, S7.
A shorter version appears in ``PSI-RA: A Parallel Sparse Index for Read Alignment on Genomes,"
Proceedings of 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM '10), 
Hong Kong, December 2010, 663-668.
Background: Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than does direct sequencing. Among many strategies proposed for this alignment process, indexing the reference genome and short read searching over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to catch the massive short reads produced by the next generation high-throughput DNA sequencing technology.
Results: We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new 
short read aligner named  -RA (PSI-RA: parallel sparse index read aligner). The motivation in using 
SSAs is the ability to trade memory against time. It is possible to fine tune the space consumption 
of the index based on the available memory of the machine and the minimum length of the arriving 
pattern queries. Although SSAs have been studied before for exact matching of short reads, an 
elegant way of approximate matching capability was missing. We provide this by defining the 
rightmost mismatch criteria that prioritize the errors towards the end of the reads, where errors 
are more probable.
-RA (PSI-RA: parallel sparse index read aligner). The motivation in using 
SSAs is the ability to trade memory against time. It is possible to fine tune the space consumption 
of the index based on the available memory of the machine and the minimum length of the arriving 
pattern queries. Although SSAs have been studied before for exact matching of short reads, an 
elegant way of approximate matching capability was missing. We provide this by defining the 
rightmost mismatch criteria that prioritize the errors towards the end of the reads, where errors 
are more probable.  -RA supports any number of mismatches in aligning reads. We give
comparisons with some of the well-known short read aligners, and show that indexing a genome with 
SSA is a good alternative to the Burrows-Wheeler transform or seed-based solutions.
-RA supports any number of mismatches in aligning reads. We give
comparisons with some of the well-known short read aligners, and show that indexing a genome with 
SSA is a good alternative to the Burrows-Wheeler transform or seed-based solutions.
Conclusions:  -RA is expected to serve as a valuable tool in the alignment of short reads generated 
by the next generation high-throughput sequencing technology.
-RA is expected to serve as a valuable tool in the alignment of short reads generated 
by the next generation high-throughput sequencing technology.  -RA is very fast in exact matching 
and also supports rightmost approximate matching. The SSA structure that
-RA is very fast in exact matching 
and also supports rightmost approximate matching. The SSA structure that  -RA is built on naturally 
incorporates the modern multicore architecture and thus further speed-up can be gained. All the 
information, including the source code of
-RA is built on naturally 
incorporates the modern multicore architecture and thus further speed-up can be gained. All the 
information, including the source code of
 -RA, can be downloaded at http://www.busillis.com/o_kulekci/PSIRA.zip.
-RA, can be downloaded at http://www.busillis.com/o_kulekci/PSIRA.zip.
T.-H. Ku, W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. ``Compressed Text Indexing With Wildcards,'' Journal of Discrete Algorithms, 19, March 2013, 23-29. An extended abstract appears in Proceedings of the 18th International Conference on String Processing and Information Retrieval (SPIRE '11), Pisa, Italy, October 2011, published in Lecture Notes in Computer Science, Springer, Berlin, Germany.
Let 
 be a text of
total length
 be a text of
total length  , where characters of each
, where characters of each  are chosen from an
alphabet
 are chosen from an
alphabet  of size
 of size  , and
, and  denotes a wildcard
symbol.  The text indexing with wildcards problem is to index
 denotes a wildcard
symbol.  The text indexing with wildcards problem is to index  such
that when we are given a query pattern
 such
that when we are given a query pattern  , we can locate the
occurrences of
, we can locate the
occurrences of  in
 in  efficiently.  This problem has been applied
in indexing genomic sequences that contain single-nucleotide
polymorphisms (SNP) because SNP can be modeled as wildcards.  Recently
Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes
for this problem.  In this paper, we present the first compressed
index for this problem, which takes only
 efficiently.  This problem has been applied
in indexing genomic sequences that contain single-nucleotide
polymorphisms (SNP) because SNP can be modeled as wildcards.  Recently
Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes
for this problem.  In this paper, we present the first compressed
index for this problem, which takes only 
 bits space, where
 bits space, where  is the
 is the  th-order
empirical entropy (
th-order
empirical entropy (
 ) of
) of  .
.
Slides for CIKM '12 keynote talk (Adobe pdf)
We describe recent breakthroughs in the field of compressed data structures, in which the data structure is stored in a compressed representation that still allows fast answers to queries. We focus in particular on compressed data structures to support the important application of pattern matching on massive document collections. Given an arbitrary query pattern in textual form, the job of the data structure is to report all the locations where the pattern appears. Another variant is to report all the documents that contain at least one instance of the pattern. We are particularly interested in reporting only the most relevant documents, using a variety of notions of relevance. We discuss recently developed techniques that support fast search in these contexts as well as under additional positional and temporal constraints.
We study the position restricted sub-string searching (PRSS) problem, where the task is to index a text ![$T[0...n-1]$](img197.png) of
 of  characters over an alphabet set
 characters over an alphabet set  of size
 of size  , in order to answer the following:  given a query pattern
, in order to answer the following:  given a query pattern  (of length
 (of length  ) and two indices
) and two indices  and
 and  , report all
, report all  occurrences of
 occurrences of  in
 in ![$T[\ell...r]$](img200.png) . The known indexes take
. The known indexes take  bits or
 bits or 
 bits space, and answer the query in
 bits space, and answer the query in 
 time or in optimal
 time or in optimal 
 time respectively, where
 time respectively, where   is  any positive constant. The main drawback of these indexes is their space requirement of
 is  any positive constant. The main drawback of these indexes is their space requirement of 
 bits, which can be much more than the optimal
 bits, which can be much more than the optimal  bits to store the text
 bits to store the text  . This paper addresses an open question asked by Mäkinen and Navarro [LATIN, 2006], whether it is possible to design a succinct index answering PRSS queries efficiently ? We first study the hardness of this problem and also prove the following result: a succinct index cannot answer PRSS queries optimally at least in the pointer machine model and also not in RAM model unless bounds on well-researched orthogonal range query problem improve. However, for the special case of sufficiently long query patterns, that is for
. This paper addresses an open question asked by Mäkinen and Navarro [LATIN, 2006], whether it is possible to design a succinct index answering PRSS queries efficiently ? We first study the hardness of this problem and also prove the following result: a succinct index cannot answer PRSS queries optimally at least in the pointer machine model and also not in RAM model unless bounds on well-researched orthogonal range query problem improve. However, for the special case of sufficiently long query patterns, that is for 
 , we derive an
, we derive an 
 bits index with optimal query time, where
 bits index with optimal query time, where  and
 and  are the space (in bits) of compressed suffix arrays (with
 are the space (in bits) of compressed suffix arrays (with  time for pattern search) of
 time for pattern search) of  and
 and 
 (the reverse of
 (the reverse of  ) respectively.
The space can be reduced further to
) respectively.
The space can be reduced further to  bits and the resulting query time will be
 bits and the resulting query time will be 
 . 
For the general case, where there is no restriction on pattern length, we obtain an
. 
For the general case, where there is no restriction on pattern length, we obtain an 
 bits index with
 bits index with 
 query time. We use suffix sampling techniques to achieve these space-efficient indexes.
 query time. We use suffix sampling techniques to achieve these space-efficient indexes. 
Let 
 be a given collection of
 be a given collection of  string
documents of total length
 string
documents of total length  .  We consider the problem of indexing
.  We consider the problem of indexing
 such that, whenever two patterns
 such that, whenever two patterns  and
 and  comes as an
online query, we can list all those documents containing
 comes as an
online query, we can list all those documents containing  but not
but not  . Let
. Let  represent the number of such documents. An index
proposed by Fischer et al. (LATIN, 2012) can answer this query in
 represent the number of such documents. An index
proposed by Fischer et al. (LATIN, 2012) can answer this query in
 time. However, its space requirement is
 time. However, its space requirement is
 bits. We propose the first linear-space index for this
problem with a worst case query time of
 bits. We propose the first linear-space index for this
problem with a worst case query time of 
 .
.
 Document Retrieval,''
Proceedings of the 2013 IEEE Data
    Compression Conference (DCC '13), Snowbird, UT,
  March 2013.
 Document Retrieval,''
Proceedings of the 2013 IEEE Data
    Compression Conference (DCC '13), Snowbird, UT,
  March 2013.
Let 
 be a given collection of
 be a given collection of  string
documents of total length
 string
documents of total length  , our task is to index
, our task is to index  , such that
whenever a pattern
, such that
whenever a pattern  (of length
 (of length  ) and an integer
) and an integer  come as a
query, those
 come as a
query, those  documents in which
 documents in which  appears the most number of
times can be listed efficiently. In this paper, we propose a
compressed index taking
 appears the most number of
times can be listed efficiently. In this paper, we propose a
compressed index taking 
 bits of
space, which answers a query with
 bits of
space, which answers a query with 
 per document report time. This improves the
per document report time. This improves the 
 per document report time of the previously
best-known index with (asymptotically) the same space requirements
[Belazzougui and Navarro, SPIRE 2011]. Here,
 per document report time of the previously
best-known index with (asymptotically) the same space requirements
[Belazzougui and Navarro, SPIRE 2011]. Here,  represents the
size (in bits) of the compressed suffix array (CSA) of the text
obtained by concatenating all documents in
 represents the
size (in bits) of the compressed suffix array (CSA) of the text
obtained by concatenating all documents in  , and
, and  is the
time for decoding a suffix array value using the CSA.
 is the
time for decoding a suffix array value using the CSA.
Document retrieval is a special type of pattern matching that is closely
related to information retrieval and web searching.  In this problem, the data
consist of a collection of text documents, and given a query pattern  , we
are required to report all the documents (not all the occurrences) in which this pattern occurs. 
In addition, the notion of relevance is commonly
applied to rank all the documents that satisfy the query, and only those
documents with the highest relevance are returned. Such a concept of relevance
has been central in the effectiveness and usability of present day search
engines like Google, Bing, Yahoo, or Ask. When relevance is considered, the
query has an additional input parameter
, we
are required to report all the documents (not all the occurrences) in which this pattern occurs. 
In addition, the notion of relevance is commonly
applied to rank all the documents that satisfy the query, and only those
documents with the highest relevance are returned. Such a concept of relevance
has been central in the effectiveness and usability of present day search
engines like Google, Bing, Yahoo, or Ask. When relevance is considered, the
query has an additional input parameter  , and the task is to report only the
, and the task is to report only the  documents with the highest relevance to
documents with the highest relevance to  , instead of finding all the
documents that contains
, instead of finding all the
documents that contains  . For example, one such relevance function could be the
frequency of the query pattern in the document. In the information
retrieval literature, this task is best achieved by using inverted indexes.
However, if the query consists of an arbitrary string--which can be a
partial word, multiword phrase, or more generally any sequence of characters--we 
cannot take advantages of the word boundaries and we need a different approach.
. For example, one such relevance function could be the
frequency of the query pattern in the document. In the information
retrieval literature, this task is best achieved by using inverted indexes.
However, if the query consists of an arbitrary string--which can be a
partial word, multiword phrase, or more generally any sequence of characters--we 
cannot take advantages of the word boundaries and we need a different approach.
This leads to one of the active research topics in string matching and text indexing community in recent years, and various aspects of the problem have been studied, such as space-time tradeoffs, practical solutions, multipattern queries, and I/O-efficiency. In this article, we review some of the initial frameworks for designing such indexes and also summarize the developments in this area.
 Document Retrieval in External Memory,''
Proceedings of the 21st Annual European Symposium on Algorithms
    (ESA '13), Sophia-Antipolis, France, September 2013, published in Lecture Notes
  in Computer Science, Springer, Berlin,
  Germany.
 Document Retrieval in External Memory,''
Proceedings of the 21st Annual European Symposium on Algorithms
    (ESA '13), Sophia-Antipolis, France, September 2013, published in Lecture Notes
  in Computer Science, Springer, Berlin,
  Germany.
Let  be a given set of (string) documents of total length
 be a given set of (string) documents of total length
 . The top-
. The top- document retrieval problem is to index
 document retrieval problem is to index  such
that when a pattern
 such
that when a pattern  of length
 of length  , and a parameter
, and a parameter  come as a
query, the index returns those
 come as a
query, the index returns those  documents which are most relevant
to
 documents which are most relevant
to  . Hon et al. [HSV09] proposed a linear space framework to
solve this problem in
. Hon et al. [HSV09] proposed a linear space framework to
solve this problem in 
 time. This query time was
improved to
 time. This query time was
improved to  by Navarro and Nekrich [NN12]. These
results are powerful enough to support arbitrary relevance functions
like frequency, proximity, PageRank, etc. Despite of continued
progress on this problem in terms of theoretical, practical and
compression aspects, any non-trivial bounds in external memory model
have so far been elusive. In this paper, we propose the first external
memory index supporting top-
 by Navarro and Nekrich [NN12]. These
results are powerful enough to support arbitrary relevance functions
like frequency, proximity, PageRank, etc. Despite of continued
progress on this problem in terms of theoretical, practical and
compression aspects, any non-trivial bounds in external memory model
have so far been elusive. In this paper, we propose the first external
memory index supporting top- document retrieval queries (outputs
unsorted) in optimal
 document retrieval queries (outputs
unsorted) in optimal 
 I/Os, where
 I/Os, where  is the
block size. The index space is almost linear
 is the
block size. The index space is almost linear  words,
where
 words,
where  is the iterated logarithm of
 is the iterated logarithm of  . We also improve the
existing internal memory results. Specifically, we propose a linear
space index for retrieving top-
. We also improve the
existing internal memory results. Specifically, we propose a linear
space index for retrieving top- documents in
 documents in  time, once the
locus of the pattern match is given.
 time, once the
locus of the pattern match is given.
Color (or categorical) range reporting is a variant of the orthogonal 
range reporting problem in which every point in the input is assigned a 
color. While the answer to an orthogonal point reporting query 
contains all points in the query range  , the answer to a color 
reporting query contains only  distinct colors of points in
, the answer to a color 
reporting query contains only  distinct colors of points in  .
In this paper we describe an
.
In this paper we describe an  -space data structure that answers 
one-dimensional color reporting queries in optimal
-space data structure that answers 
one-dimensional color reporting queries in optimal  time, where
 time, where 
 is the number of colors in the answer and
 is the number of colors in the answer and  is the number of 
points in the data structure. Our result can be also dynamized and 
extended to the external memory model.
 is the number of 
points in the data structure. Our result can be also dynamized and 
extended to the external memory model.
In recent years, there has been increasing interest in planted  motif search (PMS) with applications to
discovering significant segments in biological sequences. However, there has been little discussion about PMS over large
alphabets. This paper focuses on motif stem search (MSS), which was recently introduced to search motifs on large-alphabet
inputs. A motif stem is an
 motif search (PMS) with applications to
discovering significant segments in biological sequences. However, there has been little discussion about PMS over large
alphabets. This paper focuses on motif stem search (MSS), which was recently introduced to search motifs on large-alphabet
inputs. A motif stem is an  -length string with some wildcards. The goal of the MSS problem is to find a set of stems that
represents a superset of all
-length string with some wildcards. The goal of the MSS problem is to find a set of stems that
represents a superset of all  motifs present in the input sequences, and the superset is expected to be as small as possible.
The three main contributions of this paper are as follows: (1) We build motif stem representation more precisely by using regular
expressions. (2)  We give a method for generating all possible motif stems without redundant wildcards. (3) We propose an
efficient exact algorithm, called StemFinder, for solving the MSS problem. Compared with the previous algorithms, StemFinder
runs much faster and first solves the (17, 8), (19, 9) and (21, 10) challenging instances on protein sequences; moreover,
StemFinder reports fewer stems which represent a smaller superset of all
 motifs present in the input sequences, and the superset is expected to be as small as possible.
The three main contributions of this paper are as follows: (1) We build motif stem representation more precisely by using regular
expressions. (2)  We give a method for generating all possible motif stems without redundant wildcards. (3) We propose an
efficient exact algorithm, called StemFinder, for solving the MSS problem. Compared with the previous algorithms, StemFinder
runs much faster and first solves the (17, 8), (19, 9) and (21, 10) challenging instances on protein sequences; moreover,
StemFinder reports fewer stems which represent a smaller superset of all  motifs. StemFinder is freely available at
http://sites.google.com/site/feqond/stemfinder.
 motifs. StemFinder is freely available at
http://sites.google.com/site/feqond/stemfinder.
In this paper we describe compressed indexes that support pattern matching queries for strings
with wildcards. We present a data structure that uses 
 bits for any
 bits for any  and reports
all occ occurrences of a wildcard string in
 and reports
all occ occurrences of a wildcard string in 
 time, where
 time, where  is the alphabet
size,
 is the alphabet
size,  is the number of alphabet symbols, and
 is the number of alphabet symbols, and  is the number of wildcard symbols in the
query string. We also present an
 is the number of wildcard symbols in the
query string. We also present an  -bit index with
-bit index with 
 query time and
an
 query time and
an 
 -bit index with
-bit index with 
 query time. These are the first
data structures for this problem that need only
 query time. These are the first
data structures for this problem that need only  bits of space.
 bits of space.
In this paper, we develop a simple and practical storage scheme for
compressed suffix arrays (CSA). Our CSA can be constructed in linear time and
needs 
 bits of space simultaneously for any
 bits of space simultaneously for any 
 and any constant
and any constant  , where
, where  denotes the
 denotes the  -th order entropy. We compare the
performance of our method with two established compressed indexing methods,
viz. the FM-index and Sadakane's CSA. Experiments on the Canterbury Corpus and
the Pizza&Chili Corpus show significant advantages of our algorithm over two
other indexes in terms of compression and query time. Our storage scheme
achieves better performance on all types of data present in these two corpora,
except for evenly distributed data, such as DNA. The source code for our CSA is
available online.
-th order entropy. We compare the
performance of our method with two established compressed indexing methods,
viz. the FM-index and Sadakane's CSA. Experiments on the Canterbury Corpus and
the Pizza&Chili Corpus show significant advantages of our algorithm over two
other indexes in terms of compression and query time. Our storage scheme
achieves better performance on all types of data present in these two corpora,
except for evenly distributed data, such as DNA. The source code for our CSA is
available online.
The planted  motif discovery has been successfully
used to locate transcription factor binding sites in dozens
of promoter sequences over the past decade. However, there has
not been enough work done in identifying
 motif discovery has been successfully
used to locate transcription factor binding sites in dozens
of promoter sequences over the past decade. However, there has
not been enough work done in identifying  motifs in the
next-generation sequencing (ChIP-seq) data sets, which contain
thousands of input sequences and thereby bring new challenge
to make a good identification in reasonable time. To cater this
need, we propose a new planted
 motifs in the
next-generation sequencing (ChIP-seq) data sets, which contain
thousands of input sequences and thereby bring new challenge
to make a good identification in reasonable time. To cater this
need, we propose a new planted  motif discovery algorithm
named MCES, which identifies motifs by mining and combining
emerging substrings. Specially, to handle larger data sets, we
design a MapReduce-based strategy to mine emerging substrings
distributedly. Experimental results on the simulated data show
that i) MCES is able to identify
 motif discovery algorithm
named MCES, which identifies motifs by mining and combining
emerging substrings. Specially, to handle larger data sets, we
design a MapReduce-based strategy to mine emerging substrings
distributedly. Experimental results on the simulated data show
that i) MCES is able to identify  motifs efficiently and effectively
in thousands to millions of input sequences, and runs faster
than the state-of-the-art
 motifs efficiently and effectively
in thousands to millions of input sequences, and runs faster
than the state-of-the-art  motif discovery algorithms, such as
F-motif and TraverStringsR; ii) MCES is able to identify motifs
without known lengths, and has a better identification accuracy
than the competing algorithm CisFinder. Also, the validity of
MCES is tested on real data sets. MCES is freely available at
http://sites.google.com/site/feqond/mces.
 motif discovery algorithms, such as
F-motif and TraverStringsR; ii) MCES is able to identify motifs
without known lengths, and has a better identification accuracy
than the competing algorithm CisFinder. Also, the validity of
MCES is tested on real data sets. MCES is freely available at
http://sites.google.com/site/feqond/mces.
In the dynamic indexing problem, we must maintain a changing collection of text documents so that we can e ciently support insertions, deletions, and pattern matching queries. We are especially interested in developing e cient data structures that store and query the documents in compressed form. All previous compressed solutions to this problem rely on answering rank and select queries on a dynamic sequence of symbols. Because of the lower bound in [Fredman and Saks, 1989], answering rank queries presents a bottleneck in compressed dynamic indexing. In this paper we show how this lower bound can be circumvented using our new framework. We demonstrate that the gap between static and dynamic variants of the indexing problem can be almost closed. Our method is based on a novel framework for adding dynamism to static compressed data structures. Our framework also applies more generally to dynamizing other problems. We show, for example, how our framework can be applied to develop compressed representations of dynamic graphs and binary relations.
Graph similarity search has received considerable attention in many applications, such as bioinformatics, data mining, pattern recognition, and social networks. Existing methods for this problem have limited scalability because of the huge amount of memory they consume when handling very large graph databases with millions or billions of graphs.
In this paper, we study the problem of graph similarity search under the graph edit distance constraint. We present a space-efficient index structure based upon the q-gram tree that incorporates succinct data structures and hybrid encoding to achieve improved query time performance with minimal space usage. Specifically, the space usage of our index requires only 5%-15% of the previous state-of-the-art indexing size on the tested data while at the same time achieving 2-3 times acceleration in query time with small data sets. We also boost the query performance by augmenting the global filter with range search, which allows us to perform a query in a reduced region. In addition, we propose two effective filters that combine degree structures and label structures. Extensive experiments demonstrate that our proposed approach is superior in space and competitive in filtering to the state-of-the-art approaches. To the best of our knowledge, our index is the first in-memory index for this problem that successfully scales to cope with the large dataset of 25 million chemical structure graphs from the PubChem dataset.
In this paper we present some practical modifications of the higher-order entropy-compressed text indexing method of Foschini et al. [6] based upon the Burrows-Wheeler transform and the FM-index. Our method, called FM-Adaptive, applies a wavelet tree to the entire BWT. It partitions each bit vector of nodes in the wavelet tree into blocks and applies the hybrid encoding along with run-length Gamma code rather than the fixed-length code of [14] to each block while explores data-aware compression. FM-Adaptive retains the theoretical performance of previous work and introduces some improvements in practice. At the same time, broad experiments indicate that our index achieves superior performance, especially in terms of compression, in comparison to the state-of-the-art indexing techniques. The source code is available online.
Many real-world applications, such as bioinformatics, data mining, pattern recognition, and social network analysis, benefit from efficient solutions for the graph similarity search problem. Existing methods have limited scalability when they handle the large graph databases, for example, those with millions or billions of graphs that cannot fit in main memory. In this paper, we study the problem of graph similarity search under the graph edit distance constraint in external memory. We present an efficient framework for arbitrary q-gram based representations of a graph. Specifically, we propose a q-gram matrix index stored in hybrid layout in external memory to achieve efficient query processing, by converting the q-gram counting filter into a sparse matrix-vector multiplication (SpMV) problem. Furthermore, we also boost the query performance by transforming the global filter to a two-dimensional query rectangle, which allows us to perform a query in a reduced region, significantly reducing the number of query I/Os in practice. Extensive experiments on real datastes confirm that: (1) our method can compete with the state-of-the-art in-memory methods in index size and filtering ability, and outperform them on scalability of coping with the PubChem dataset including 25 million chemical structure graphs. (2) compared with the popular q-gram-based external inverted index, our external index structure needs much fewer number of query I/Os on the PubChem dataset.
Chien et al. [1, 2] introduced the geometric Burrows-Wheeler transform (GBWT) as the
first succinct text index for I/O-efficient pattern matching in external memory; it operates
by transforming a text  into point set
 into point set  in the two-dimensional plane. In this paper
we introduce a practical succinct external memory text index, called mKD-GBWT. We
partition
 in the two-dimensional plane. In this paper
we introduce a practical succinct external memory text index, called mKD-GBWT. We
partition  into
 into  subregions by partitioning the x-axis into
 subregions by partitioning the x-axis into  intervals using the suffix
ranges of characters of
 intervals using the suffix
ranges of characters of  and partitioning the y-axis into
 and partitioning the y-axis into  intervals using characters of
 intervals using characters of  ,
where
,
where  is the alphabet size of
 is the alphabet size of  . In this way, we can represent a point using fewer
bits and perform a query in a reduced region so as to improve the space usage and I/Os
of GBWT in practice. In addition, we plug a crit-bit tree into each node of string B-trees
to represent variable-length strings stored. Experimental results show that mKD-GBWT
provides significant improvement in space usage compared with the state-of-the-art indexing
techniques. The source code is available online [3].
. In this way, we can represent a point using fewer
bits and perform a query in a reduced region so as to improve the space usage and I/Os
of GBWT in practice. In addition, we plug a crit-bit tree into each node of string B-trees
to represent variable-length strings stored. Experimental results show that mKD-GBWT
provides significant improvement in space usage compared with the state-of-the-art indexing
techniques. The source code is available online [3].
 
 
 
 
