and Eliminating Search Algorithm (AESA) makes use of metric
properties of given distance. The algorithm consists of (1) an efficient
elimination rules and (2) an appropriate search for available rules by which maximum
efficiency is maintained. For N prototypes, the it uses precompution of triangular
array of (N2-N)/2 distances between prototypes. By which Nearest Neighbour (NN) Search is carried out
through a very small number of distance computations. Furthermore, for large N,
this number of computations, tends to be independent of N (asymptotic constant
time-complexity). However, the great storage complexity and preprocessing time
(O (N2)), severely limit the practical use of the AESA for large
sets of prototypes.
Reference: E. Vidal,
“An algorithm for finding nearest neighbors in (approximately) constant average
time,” Pattern Recog. Lett., vol. 4, no. 3, pp. 145–157, 1986.
LAESA (Linear AESA) a new version of the AESA, which uses a linear array of
distance, but is strictly based on metric arguments like the original AESA. The
procedure starts by selecting “Base Prototypes” (BP) from the given
set of prototypes which is relatively a small set there by computing the
distances between these BP’s and the completing set of prototypes.
Reference: L. Mico,
J.Oncina, E. Vidal An algorithm for finding nearest neighbours in constant
average time with a linear space complexity Proceedings, 11th IAPR International
Conference on Pattern Recognition. Vol. II. Conference B: Pattern Recognition
Methodoly and Systems Year; 1992
Tree LAESA (TLAESA) is based on the LAESA which reduces its
average time complexity to sublinear. The TLAESA algorithm consists of two
parts: (1) preprocessing and (2) search. In the preprocessing algorithm, two
structures are built: a binary search tree storing all prototypes and a table of
distances. The search algorithm is essentially a traversal of the binary tree
where the table of distances is used in order to avoid the exploration of some branches
of the tree.
Mico, J. Oncina, and R. C. Carrasco, “A fast branch & bound nearest
neighbour classifier in metric spaces,” Pattern Recog. Lett., vol. 17, no. 7,
pp. 731–739, 1996.
(Ak-LAESA) is a fast classifier for general metric
spaces (no vector space required) based on the LAESA algorithm The aim is to
achieve classification rates similar to those of a k-NN classifier, using k
neighbours that may not be the k-NNs and preserving the main properties of
LAESA (distance computations and time and space complexities). It obtains error
rates very close to those of a k-NN classifier, while calculating a much lower
number of distances (the number of distances of the Ak-LAESA algorithm is
exactly the same that the computed by the LAESA algorithm).
modification of the LAESA algorithm for approximated k-NN classification,
Pattern Recognition Letters, Vol. 24, Issue: 1-3, January 2003,
(FQ Trees) are an evolution where the same pivot
is used for all the nodes of the same level of the tree. In this case the pivot
does not need to belong to the subtree. The main goal of this structure is to
minimize the number of element comparisons, as opposed to other data structure
operations (such as tree traversal). This is especially important in
applications where computing distances between two elements is a much more
expensive operation than, say following pointers. This tree structure differs
from other tree structures where the keys on each level of the tree are all the
same, so we have just one key per level. In other words, which comparisons we
make does not depend on the results of previous comparisons up the tree. Thus
the name fixed-queries tree or FQ-tree. A major strength of FQ-trees is that the data
structure and the search algorithms are independent of the distance function,
as long as it satisfies the triangle inequality.
Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixed-queries
trees. In Proc. CPM’94, LNCS 807, pages 198–212, 1994.
(fhq-tree) A variant called FQ Tree where all the
leaves are at the same depth h, regardless of the bucket size
Ricardo Baeza-Yates Gonzalo Navarro “Fast
approximate string matching in a dictionary” Published in: String
Processing and Information Retrieval: A South American Symposium, 1998.
Queries Array (FQA), has two interesting properties. First, it is the first data
structure which is able to achieve a sub linear (in the database size) number
of side computations without using any extra space. Second, it is able to trade
number of pivots k for their precision, so as to optimize the usage of the
available space. FQA
are based in a common idea: k pivots are selected and each object is mapped to
k coordinates which are its distances to the pivots. Later, the query q is also
mapped and if it differs from an object in more than r along some coordinate
then the element is filtered out by the triangle inequality. That is, if for
some pivot pi and some element v of the set it holds |d(q,
pi) ? d(v, pi)| > r, then we
know that d(q, v) > r without need to evaluate d(v, q). The elements that
cannot be filtered out using this rule are directly compared.
Reference: E. Ch´avez, J.L. Marroquin, and G.
Navarro. Fixed queries array: A fast and economical data structure for
proximity searching. Multimedia Tools and Applications (MTAP), 14(2):113–135,
a new pivot-based technique. The main characteristic of this method is that it
guarantees a good pivot selection more efficiently. In addition, SSS adapts
itself to the dimensionality of the metric space we are working with, without
being necessary to specify in advance the number of pivots to use. Furthermore,
SSS is dynamic, that is, it is capable to support object insertions in the
database efficiently, it can work with both continuous and discrete distance
functions, and it is suitable for secondary memory storage.
contribution of the SSS is the use of a new pivot selection strategy which
generates a comparatively small number of good-quality pivots.
The main contribution of SSS is the generating a number of pivots that depends
on the intrinsic dimensionality of the space (something interesting from both
the theoretical and practical points of view.
can be extended to deal with more complex metric spaces where the distances
among subsets of objects depend on specific dimensions that are not relevant
for other set of objects. That is, in some applications areas, objects can be
grouped in different clusters with different associated metric spaces, all of
them nested into the general metric space that explains the distances among
clusters. To deal with these complex spaces we propose the extension of SSS
becoming Sparse Spatial Selection for Nested Metric Spaces (SSSNMS)
Reference: Spatial Selection of Sparse Pivots for Similarity
Search in Metric Spaces, JCS&T Vol. 7 No. 1, April
Tree (VP Tree) Like the
KD-Tree, each VP-Tree node cuts/divides the space. A VP-Tree node employs distance
from a selected vantage point rather than using coordinate values. Near points
make up the left/inside subspace while the right/outside subspace consists of
far points. A binary tree is formed by recursion. Each of the tree nodes
identifies a vantage point, and for its children (left/right), the node
contains bounds associated to subspace by the vantage point. Other forms of the
VP-Tree include additional subspace bounds and may employ buckets near leaf
VPS-Tree an element of S is compared with the vantage point
belonging to each of its ancestors in the tree. This information is also not
captured in the simple tree. It consists of a database element identifier ‘id’,
and a list ‘hist’ of distances from the item to each vantage point tracing back
to the root. A list of these structures is initialized with the history set to
null from the entire database. The algorithm then recursively splits the list
into two lists L and R, while adding to the history of each item. Each resulting
tree node contains a list ‘bnds’ which have a range interval for its
corresponding subspace which is seen by each ancestral vantage point.
VPSB-Tree Here we consider one way to form buckets of leaves
in order to save space while preserving the notion of ancestral history present
in the VPS-Tree. Buckets are formed by collapsing subtrees near leaf
level into a flat structure. Each bucket contains say no element records. Each
record must specify an id, and in addition holds distance entries for every
ancestor. We call the resulting structure a VPSB-Tree.
Reference: P. N.
Yianilos, “Data structures and algorithms for nearest neighbor search in
general metric spaces,” in Proc. Annu. ACM-SIAM Symp. Discrete Algorithms,
1993, pp. 311–321.
(MVP Tree) A distance based index structure called
multi-vantage point (MVP) tree for similarity queries on high-dimensional metric
spaces. The MVP Tree uses more than one vantage point to partition the space
into spherical cuts at each level. It also utilizes the pre-computed (at
construction time) distances between the data points and the vantage points.
Reference: T. Bozkaya and M. Ozsoyoglu. Distance-based
indexing for high-dimensional metric spaces. In Proc. SIGMOD’97, pages 357–368, 1997.
Sigmod Record 26(2).
Files A file data structure designed to
manage a disk allocation storage in terms of fixed size units like disk blocks,
pages or buckets depending on level of description. It is a variant of grid
method where the requirement is cell division lines must
be equidistant. Hashing method for multidimensional points. It is an extension
of extensible hashing.
Multipaging is similar to the grid file. It
uses a directory called axial arrays which is in the form of linear
scales instead of using a grid director. There are two variants of multipaging.
In dynamic multipaging, performance is controlled by setting a bound on the
probe factor which is defined as the average number of pages accessed in a
probe. In static multipaging, performance is controlled by setting a bound in
the load factor, or the average page occupancy.
Reference: A Survey on
Multidimensional Access Methods Hee-Kap Ahn Nikos Mamoulis Ho Min Wong
UU-CS-2001-14 May, 2001
method (Extendible CELL) related to
the grid file. It is a bintree with a directory in the form of an array
providing access by address computation. It can also be viewed as an adaptation
of extendible hashing to multidimensional point data. In contrast to the grid
file, where the partitioning hyperplanes may be spaced arbitrarily, the EXCELL
method decomposes the space regularly
ad all grid cells are of equal size.
File. The basic idea of it is to use a second
grid file to manage the grid directory. The first level is called the root
directory. Entries of the root directory contains pointers to the directory
pages of the lower level, which in turn contain pointers to the data pages. By having
a second level, splits are often confined to the subdirectory regions without
affecting too much of their surroundings.
Grid File It is other hashing method. It tries to
increase space utilization compared to the original grid file by introducing a
second grid file. The relationship between these two grid files is not hierarchical
but somewhat more balanced. Both grid files span the whole space (universe).
The distribution of the data among the two files is performed dynamically.
are one of the first data structures for higher-dimensional data. A quadtree is a
rooted tree in which every internal node has four children. Every node in the quadtree
corresponds to a square. If a node v has children, then their
corresponding squares are the four quadrants of the square of v ¾ hence
the name of the tree. This implies that the squares of the leaves together form
a subdivision of the square of the root. We call this subdivision the quadtree
subdivision. The children of the root are labeled NE, NW, SW, and SE to
indicate to which quadrant they belong to; NE stands for the north-east
quadrant, NW for the north-west quadrant, and so on.
**Octree Quadtree-like data structures which can also
be used to represent images in three or higher dimensions. The octree data
structure is the 3-D analog of the quadtree. Here we start with an image in the
form of a cubical volume and recursively subdivide it into eight congruent
disjoint cubes called octants until blocks are obtained in a uniform color or a
predetermined level of decomposition is reached.
**PR Quadtree is based on a regular
decomposition. The PR quadtree is organized in the same way as the region
quadtree. The only difference is that leaf nodes are either empty or contain a
data point and its coordinates. A quadrant contains, at most, one data point. The
PR quadtree representation can also be adapted to represent a region that
consists of a collection of polygons (termed a polygonal map). The
result is a family of representations referred to collectively as a PM
quadtree. The PM quadtree family
represents regions by specifying their boundaries; this is in contrast to the region quadtree, which is based on a
description of the region’s interior. The PM1 quadtree has
also been adapted to three dimensional images. We term the result a PM octree. The decomposition criteria are
such that no node contains more than one face, edge, or vertex unless all the
faces meet at the same vertex or are adjacent to the same edge.
k-d-tree a binary search
tree that stores points of the k-dimensional space. At each intermediate node,
the k-d-tree divides the k-dimensional space in two parts by a (k-1)-dimensional
hyperplane. The direction of the hyperplane, i.e. the dimension based on which
the division is made, alternates between the k possibilities from one tree level
to the next. Each splitting hyperplane contains at least one point, which is
used as the hyperplane’s representation in the tree.
version proposed is the adaptive k-d-tree. While splitting, the adaptive
k-d-tree chooses a hyperplane that divides the space in two sub-spaces with
equal number of points. The hyperplanes are still parallel to axes, but they do
not contain a point, and they do not have to strictly alternate. Interior nodes
of the tree contain the dimension (e.g. x or y), and the coordinate of the
respective split. All points are stored in the leaves, and a leave is can
contain up to a fixed number of points, if this number is exceeded, a split
properties of both the adaptive k-d-tree and the B-tree. It again uses
hyperplanes to divide the space; this time arbitrarily more than one
hyperplanes divide a tree node (depending on the tree’s storage utilization) in
a corresponding number of disjoint regions. All nodes of the tree correspond to
disk pages. A leaf node stores the data points that are located in the
respective partition the leaf defines. Like the B-tree, the k-d-B-tree is perfectly
balanced, however, it cannot ensure storage utilization.
hierarchical data structures, meant for efficient indexing of multidimensional
objects with spatial extent. R-trees are used to store, instead of the original
space objects, their minimum boundary boxes (MBBs). The MBB of an
n-dimensional object is defined to be the minimum n-dimensional rectangle that
contains the original object. Similar to B-trees, the R-trees are balanced and
they ensure efficient storage utilization. Each R-tree node corresponds to a
disk page and a n-dimensional rectangle. Each non-leaf node contains entries of
the form (ref, rect), where ref is the address of a child node
and rect is the MBB of all entries in that child node. Leaves contain
entries of the same format, where ref points to a database object, and rect
is the MBB of that object.
R+-tree is introduced as
a way to overcome the problem of inefficient searching that arises when sibling
nodes overlap in the R-tree. As a direct solution to these problems they use clipping,
i.e. there is no overlap between intermediate nodes of the tree at the same
level, and objects that intersect more than one MBB at a specific level are
clipped and stored on several different pages. As a result, point queries on
the R+-tree require traversing only one path of the tree. The price to pay is
the increase of storage requirement of the tree.
R*-tree this version
introduces a new insertion policy, that crucially improves the performance of
the tree. The main objective of this policy is to minimize the overlap region
between sibling nodes in the tree. A straightforward advantage of this is the
minimization of the tree paths that are traversed at an object search.