Approximating and Eliminating Search Algorithm (AESA) makes use of the metric properties of the given distance. The algorithm consists of (1) efficient elimination rules and (2) an appropriate search among the available prototypes by which maximum efficiency is maintained. For N prototypes, it precomputes a triangular array of (N²-N)/2 distances between prototypes, by which Nearest Neighbour (NN) search is carried out with a very small number of distance computations. Furthermore, for large N, this number of computations tends to be independent of N (asymptotically constant time complexity). However, the large storage complexity and preprocessing time (O(N²)) severely limit the practical use of the AESA for large sets of prototypes.
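The two AESA steps (approximating and eliminating) can be sketched as follows. This is an illustrative sketch, not Vidal's original formulation; all names are chosen here, and the triangular distance array is stored as a full matrix for simplicity.

```python
import math

def aesa_nn(query, prototypes, dist, D):
    """Nearest-neighbour search with AESA-style elimination.

    D[i][j] holds the precomputed distance between prototypes i and j
    (the (N^2-N)/2 triangular array, stored here as a full matrix).
    """
    alive = set(range(len(prototypes)))
    lower = [0.0] * len(prototypes)          # running lower bounds
    best_i, best_d = None, math.inf
    while alive:
        # approximating step: pick the alive prototype with the
        # smallest lower bound on its distance to the query
        s = min(alive, key=lambda i: lower[i])
        alive.remove(s)
        d_s = dist(query, prototypes[s])     # one real distance computation
        if d_s < best_d:
            best_i, best_d = s, d_s
        # eliminating step: the triangle inequality gives
        # d(q, p) >= |D[p][s] - d(q, s)|, so prune hopeless prototypes
        for p in list(alive):
            lower[p] = max(lower[p], abs(D[p][s] - d_s))
            if lower[p] >= best_d:
                alive.discard(p)
    return best_i, best_d
```

Only prototypes whose lower bound stays below the current best ever have their true distance computed, which is why the number of distance computations stays small.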

Reference: E. Vidal,

“An algorithm for finding nearest neighbors in (approximately) constant average

time,” Pattern Recog. Lett., vol. 4, no. 3, pp. 145–157, 1986.

LAESA (Linear AESA) is a new version of the AESA which uses a linear array of distances, but is strictly based on metric arguments like the original AESA. The procedure starts by selecting a relatively small set of "Base Prototypes" (BPs) from the given set of prototypes, and then computes the distances between these BPs and the complete set of prototypes.

Reference: L. Mico, J. Oncina, and E. Vidal, "An algorithm for finding nearest neighbours in constant average time with a linear space complexity," in Proc. 11th IAPR International Conference on Pattern Recognition, Vol. II, Conference B: Pattern Recognition Methodology and Systems, 1992.

Tree LAESA (TLAESA) is based on the LAESA and reduces its average time complexity to sublinear. The TLAESA algorithm consists of two

parts: (1) preprocessing and (2) search. In the preprocessing algorithm, two

structures are built: a binary search tree storing all prototypes and a table of

distances. The search algorithm is essentially a traversal of the binary tree

where the table of distances is used in order to avoid the exploration of some branches

of the tree.

Reference: L.

Mico, J. Oncina, and R. C. Carrasco, “A fast branch & bound nearest

neighbour classifier in metric spaces,” Pattern Recog. Lett., vol. 17, no. 7,

pp. 731–739, 1996.

Approximating

k-LAESA

(Ak-LAESA) is a fast classifier for general metric

spaces (no vector space required) based on the LAESA algorithm. The aim is to

achieve classification rates similar to those of a k-NN classifier, using k

neighbours that may not be the k-NNs and preserving the main properties of

LAESA (distance computations and time and space complexities). It obtains error

rates very close to those of a k-NN classifier, while calculating a much lower

number of distances (the number of distances of the Ak-LAESA algorithm is

exactly the same as that computed by the LAESA algorithm).

Reference: A modification of the LAESA algorithm for approximated k-NN classification, Pattern Recognition Letters, vol. 24, no. 1-3, pp. 47-53, January 2003.

Fixed

Queries Trees

(FQ Trees) are an evolution where the same pivot

is used for all the nodes of the same level of the tree. In this case the pivot

does not need to belong to the subtree. The main goal of this structure is to

minimize the number of element comparisons, as opposed to other data structure

operations (such as tree traversal). This is especially important in

applications where computing distances between two elements is a much more

expensive operation than, say, following pointers. This tree structure differs from other tree structures in that the keys on each level of the tree are all the same, so there is just one key per level. In other words, which comparisons we make does not depend on the results of previous comparisons up the tree; thus the name fixed-queries tree, or FQ-tree. A major strength of FQ-trees is that the data

structure and the search algorithms are independent of the distance function,

as long as it satisfies the triangle inequality.

Reference: R.

Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixed-queries

trees. In Proc. CPM’94, LNCS 807, pages 198–212, 1994.

Fixed

Height fq-tree

(fhq-tree) A variant of the FQ tree in which all the leaves are at the same depth h, regardless of the bucket size.

Reference: R. Baeza-Yates and G. Navarro, "Fast approximate string matching in a dictionary," in Proc. String Processing and Information Retrieval: A South American Symposium, IEEE, 1998.

Fixed Queries Array (FQA) has two interesting properties. First, it is the first data structure able to achieve a sublinear (in the database size) number of side computations without using any extra space. Second, it is able to trade the number of pivots k for their precision, so as to optimize the usage of the available space. The FQA is based on a common pivot idea: k pivots are selected and each object is mapped to k coordinates, which are its distances to the pivots. Later, the query q is also mapped, and if it differs from an object by more than r along some coordinate, then the element is filtered out by the triangle inequality. That is, if for some pivot pi and some element v of the set it holds that |d(q, pi) - d(v, pi)| > r, then we know that d(q, v) > r without needing to evaluate d(q, v). The elements that cannot be filtered out using this rule are compared directly.
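The pivot-based exclusion rule described above can be sketched in a few lines; this is a minimal illustration of the filtering idea, with all names chosen here, and with the object-to-pivot distances computed inline rather than stored in the actual array layout.

```python
def fqa_range_query(query, r, objects, pivots, dist):
    """Range query with pivot-based filtering (the FQA exclusion rule).

    Each object is mapped to its distances to the k pivots; an object v
    is discarded when |d(q, p_i) - d(v, p_i)| > r for some pivot p_i.
    """
    # mapping step (precomputable): k coordinates per object
    coords = [[dist(v, p) for p in pivots] for v in objects]
    q_coords = [dist(query, p) for p in pivots]
    results = []
    for v, vc in zip(objects, coords):
        # filtering step: triangle inequality, no distance to v computed
        if any(abs(qc - c) > r for qc, c in zip(q_coords, vc)):
            continue
        # survivors are compared directly
        if dist(query, v) <= r:
            results.append(v)
    return results
```

In the real FQA the coordinates are precomputed, quantized, and stored as a sorted array, which is what makes the side computations sublinear without extra space.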

Reference: E. Chávez, J. L. Marroquin, and G.

Navarro. Fixed queries array: A fast and economical data structure for

proximity searching. Multimedia Tools and Applications (MTAP), 14(2):113–135,

2001.

Sparse Spatial Selection (SSS) is a pivot-based technique. The main characteristic of this method is that it guarantees a good pivot selection more efficiently. In addition, SSS adapts itself to the dimensionality of the metric space we are working with, without it being necessary to specify the number of pivots in advance. Furthermore, SSS is dynamic, that is, it is capable of supporting object insertions in the database efficiently; it can work with both continuous and discrete distance functions; and it is suitable for secondary memory storage.

The main contribution of SSS is a new pivot selection strategy that generates a comparatively small number of good-quality pivots, a number that depends on the intrinsic dimensionality of the space (something interesting from both the theoretical and practical points of view).
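The SSS selection strategy can be sketched as follows: an object is promoted to pivot whenever it is far enough (a fraction alpha of the maximum distance M) from every pivot chosen so far. This is an illustrative sketch; the function name and the default alpha value (a commonly cited choice around 0.4) are assumptions made here.

```python
def sss_select_pivots(objects, dist, M, alpha=0.4):
    """Sparse Spatial Selection: an object becomes a new pivot when its
    distance to every pivot chosen so far is at least M * alpha, where M
    is the maximum distance between any two objects in the space.
    Because objects are processed one by one, insertions into the
    database can simply be fed through the same loop (the dynamic
    property of SSS)."""
    pivots = []
    for obj in objects:
        # the first object always qualifies: all() over no pivots is True
        if all(dist(obj, p) >= M * alpha for p in pivots):
            pivots.append(obj)
    return pivots
```

Note that the number of pivots is never fixed in advance: it grows only while new objects keep landing in sparsely covered regions of the space.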

SSS can be extended to deal with more complex metric spaces where the distances among subsets of objects depend on specific dimensions that are not relevant for other sets of objects. That is, in some application areas, objects can be grouped in different clusters with different associated metric spaces, all of them nested into the general metric space that explains the distances among clusters. To deal with these complex spaces, SSS has been extended into Sparse Spatial Selection for Nested Metric Spaces (SSSNMS).

Reference: Spatial Selection of Sparse Pivots for Similarity Search in Metric Spaces, JCS&T, Vol. 7, No. 1, April 2007.

Vantage Point

Tree (VP Tree) Like the k-d tree, each VP-tree node divides the space. A VP-tree node employs the distance from a selected vantage point rather than using coordinate values. Near points make up the left/inside subspace, while the right/outside subspace consists of far points. A binary tree is formed by recursion. Each tree node identifies a vantage point, and for each of its children (left/right), the node contains the distance bounds of the corresponding subspace as seen from the vantage point. Other forms of the VP-tree include additional subspace bounds and may employ buckets near leaf level.
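The recursive construction can be sketched as below: a minimal version that takes the first point as the vantage point and splits the rest at the median distance. The class and function names are chosen here; real implementations select the vantage point more carefully (e.g. by sampling for high spread).

```python
import statistics

class VPNode:
    def __init__(self, vp, mu, inside, outside):
        self.vp = vp          # the vantage point
        self.mu = mu          # median distance: the splitting radius
        self.inside = inside  # subtree with d(vp, x) <= mu (near points)
        self.outside = outside  # subtree with d(vp, x) > mu (far points)

def build_vp_tree(points, dist):
    """Recursive VP-tree construction: pick a vantage point, split the
    remaining points by the median of their distances to it."""
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return VPNode(vp, 0.0, None, None)
    dists = [dist(vp, x) for x in rest]
    mu = statistics.median(dists)
    inside = [x for x, d in zip(rest, dists) if d <= mu]
    outside = [x for x, d in zip(rest, dists) if d > mu]
    return VPNode(vp, mu,
                  build_vp_tree(inside, dist),
                  build_vp_tree(outside, dist))
```

Splitting at the median balances the two subtrees, which is what yields the logarithmic expected depth of the binary tree.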

VPS-Tree In a VPS-tree, an element of S is compared with the vantage point belonging to each of its ancestors in the tree; this information is not captured in the simple VP-tree. Each entry consists of a database element identifier 'id' and a list 'hist' of distances from the item to each vantage point tracing back to the root. A list of these structures is initialized from the entire database with the history set to null. The algorithm then recursively splits the list into two lists L and R, while adding to the history of each item. Each resulting tree node contains a list 'bnds' which holds a range interval for its corresponding subspace as seen by each ancestral vantage point.

VPSB-Tree Here buckets of leaves are formed in order to save space while preserving the notion of ancestral history present in the VPS-tree. Buckets are formed by collapsing subtrees near leaf level into a flat structure. Each bucket contains a number of element records. Each record must specify an id and, in addition, holds distance entries for every ancestor. The resulting structure is called a VPSB-tree.

Reference: P. N.

Yianilos, “Data structures and algorithms for nearest neighbor search in

general metric spaces,” in Proc. Annu. ACM-SIAM Symp. Discrete Algorithms,

1993, pp. 311–321.

Multi-Vantage

Point Tree

(MVP Tree) The multi-vantage point (MVP) tree is a distance-based index structure for similarity queries on high-dimensional metric spaces. The MVP tree uses more than one vantage point to partition the space into spherical cuts at each level. It also utilizes the distances between the data points and the vantage points that are pre-computed at construction time.

Reference: T. Bozkaya and M. Ozsoyoglu. Distance-based

indexing for high-dimensional metric spaces. In Proc. SIGMOD’97, pages 357–368, 1997.

SIGMOD Record 26(2).

Grid Files A file data structure designed to manage disk storage allocation in terms of fixed-size units such as disk blocks, pages, or buckets, depending on the level of description. It is a variant of the grid method that drops the requirement that cell division lines be equidistant. It is a hashing method for multidimensional points and an extension of extendible hashing.

Multipaging is similar to the grid file. It uses a directory called axial arrays, which is in the form of linear scales, instead of using a grid directory. There are two variants of multipaging. In dynamic multipaging, performance is controlled by setting a bound on the probe factor, which is defined as the average number of pages accessed in a probe. In static multipaging, performance is controlled by setting a bound on the load factor, or the average page occupancy.
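The address computation through linear scales that both the grid file and multipaging rely on can be sketched as follows; this is a hypothetical minimal example with names chosen here, covering only the cell-location step (the directory lookup that maps a cell to a disk page is omitted).

```python
import bisect

def locate_cell(point, scales):
    """Locate the grid cell of a point via linear scales: one sorted
    array of partition boundaries per dimension, as in the grid file's
    linear scales and multipaging's axial arrays. Returns the tuple of
    cell indices, one per dimension, found by binary search."""
    return tuple(bisect.bisect_right(scale, coord)
                 for coord, scale in zip(point, scales))

# usage: a 2-D space split at x in {3, 7} and y in {5}
scales = [[3, 7], [5]]
```

Because each dimension only needs its own small sorted array, the directory access costs one binary search per dimension rather than a scan of the full grid.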

Reference: A Survey on Multidimensional Access Methods, Hee-Kap Ahn, Nikos Mamoulis, and Ho Min Wong, Technical Report UU-CS-2001-14, May 2001.

EXCELL method (Extendible CELL) is related to the grid file. It is a bintree with a directory in the form of an array providing access by address computation. It can also be viewed as an adaptation of extendible hashing to multidimensional point data. In contrast to the grid file, where the partitioning hyperplanes may be spaced arbitrarily, the EXCELL method decomposes the space regularly and all grid cells are of equal size.

Two-Level Grid File. The basic idea is to use a second grid file to manage the grid directory. The first level is called the root directory. Entries of the root directory contain pointers to the directory pages of the lower level, which in turn contain pointers to the data pages. By having a second level, splits are often confined to the subdirectory regions without affecting too much of their surroundings.

Twin Grid File It is another hashing method. It tries to increase space utilization compared to the original grid file by introducing a second grid file. The relationship between these two grid files is not hierarchical but somewhat more balanced: both grid files span the whole space (universe). The distribution of the data among the two files is performed dynamically.

Quadtrees

are one of the first data structures for higher-dimensional data. A quadtree is a

rooted tree in which every internal node has four children. Every node in the quadtree

corresponds to a square. If a node v has children, then their

corresponding squares are the four quadrants of the square of v; hence

the name of the tree. This implies that the squares of the leaves together form

a subdivision of the square of the root. We call this subdivision the quadtree

subdivision. The children of the root are labeled NE, NW, SW, and SE to

indicate to which quadrant they belong; NE stands for the north-east

quadrant, NW for the north-west quadrant, and so on.

Octree Octrees are quadtree-like data structures which can also be used to represent images in three or higher dimensions. The octree data structure is the 3-D analog of the quadtree. Here we start with an image in the form of a cubical volume and recursively subdivide it into eight congruent disjoint cubes called octants until blocks of uniform color are obtained or a predetermined level of decomposition is reached.

PR Quadtree is based on a regular

decomposition. The PR quadtree is organized in the same way as the region

quadtree. The only difference is that leaf nodes are either empty or contain a

data point and its coordinates. A quadrant contains, at most, one data point. The

PR quadtree representation can also be adapted to represent a region that

consists of a collection of polygons (termed a polygonal map). The

result is a family of representations referred to collectively as a PM

quadtree. The PM quadtree family

represents regions by specifying their boundaries; this is in contrast to the region quadtree, which is based on a

description of the region’s interior. The PM1 quadtree has

also been adapted to three dimensional images. We term the result a PM octree. The decomposition criteria are

such that no node contains more than one face, edge, or vertex unless all the

faces meet at the same vertex or are adjacent to the same edge.
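The defining PR-quadtree invariant, at most one data point per quadrant, can be sketched with a small insertion routine. This is an illustrative sketch with names chosen here; it assumes distinct points and omits deletion and the polygonal-map (PM) extensions.

```python
class PRQuadtree:
    """PR quadtree over the square [x0, x0+size) x [y0, y0+size):
    each leaf is either empty or holds exactly one data point."""
    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size
        self.point = None       # the single point of a non-empty leaf
        self.children = None    # [NE, NW, SW, SE] once subdivided

    def insert(self, p):
        if self.children is not None:           # internal node: recurse
            self._child_for(p).insert(p)
        elif self.point is None:                # empty leaf: store here
            self.point = p
        else:                                   # occupied leaf: subdivide
            old, self.point = self.point, None
            h = self.size / 2
            self.children = [                   # NE, NW, SW, SE quadrants
                PRQuadtree(self.x0 + h, self.y0 + h, h),
                PRQuadtree(self.x0,     self.y0 + h, h),
                PRQuadtree(self.x0,     self.y0,     h),
                PRQuadtree(self.x0 + h, self.y0,     h),
            ]
            self._child_for(old).insert(old)
            self._child_for(p).insert(p)

    def _child_for(self, p):
        h = self.size / 2
        east = p[0] >= self.x0 + h
        north = p[1] >= self.y0 + h
        if north:
            return self.children[0] if east else self.children[1]
        return self.children[3] if east else self.children[2]
```

Because the decomposition is regular, the quadrant boundaries depend only on the embedding square, not on the inserted points, in contrast to the point quadtree.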

k-d-tree a binary search

tree that stores points of the k-dimensional space. At each intermediate node,

the k-d-tree divides the k-dimensional space into two parts by a (k-1)-dimensional

hyperplane. The direction of the hyperplane, i.e. the dimension based on which

the division is made, alternates between the k possibilities from one tree level

to the next. Each splitting hyperplane contains at least one point, which is

used as the hyperplane’s representation in the tree.
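The alternating-dimension construction can be sketched as below: a minimal recursive build, with names chosen here, that sorts on the current axis and places the median point on the splitting hyperplane.

```python
def build_kd_tree(points, depth=0):
    """k-d tree construction: split on one dimension per level, cycling
    through the k dimensions; the median point on the current axis lies
    on the splitting hyperplane and represents it in the tree."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                    # alternate the split dimension
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        'point': points[mid],           # point on the splitting hyperplane
        'axis': axis,
        'left': build_kd_tree(points[:mid], depth + 1),
        'right': build_kd_tree(points[mid + 1:], depth + 1),
    }
```

The adaptive k-d-tree described next differs in exactly the places this sketch hard-codes: the split dimension is chosen freely rather than cycled, and the splitting hyperplane need not pass through a stored point.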

An improved

version proposed is the adaptive k-d-tree. While splitting, the adaptive

k-d-tree chooses a hyperplane that divides the space in two sub-spaces with

equal number of points. The hyperplanes are still parallel to axes, but they do

not contain a point, and they do not have to strictly alternate. Interior nodes

of the tree contain the dimension (e.g. x or y), and the coordinate of the

respective split. All points are stored in the leaves, and a leaf can contain up to a fixed number of points; if this number is exceeded, a split takes place.

k-d-B-tree combines properties of both the adaptive k-d-tree and the B-tree. It again uses hyperplanes to divide the space; this time, more than one hyperplane may divide a tree node (depending on the tree's storage utilization) into a corresponding number of disjoint regions. All nodes of the tree correspond to disk pages. A leaf node stores the data points that are located in the partition the leaf defines. Like the B-tree, the k-d-B-tree is perfectly balanced; however, it cannot guarantee a minimum storage utilization.

R-trees are

hierarchical data structures, meant for efficient indexing of multidimensional

objects with spatial extent. R-trees are used to store, instead of the original

space objects, their minimum bounding boxes (MBBs). The MBB of an

n-dimensional object is defined to be the minimum n-dimensional rectangle that

contains the original object. Similar to B-trees, the R-trees are balanced and

they ensure efficient storage utilization. Each R-tree node corresponds to a

disk page and an n-dimensional rectangle. Each non-leaf node contains entries of

the form (ref, rect), where ref is the address of a child node

and rect is the MBB of all entries in that child node. Leaves contain

entries of the same format, where ref points to a database object, and rect

is the MBB of that object.
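The two MBB operations that everything in an R-tree reduces to, merging entries into a parent rectangle and testing overlap during a query, can be sketched as follows (an illustrative fragment, with rectangles modeled here as (lower_corner, upper_corner) tuple pairs).

```python
def mbb(rects):
    """Minimum bounding box of a collection of n-dimensional rectangles;
    this is the 'rect' stored in a parent entry for a child node."""
    n = len(rects[0][0])
    lows = tuple(min(r[0][d] for r in rects) for d in range(n))
    highs = tuple(max(r[1][d] for r in rects) for d in range(n))
    return (lows, highs)

def intersects(a, b):
    """True when two MBBs overlap in every dimension: the test that
    decides which subtrees a range query must descend into."""
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d]
               for d in range(len(a[0])))
```

Because sibling MBBs may overlap, a query can satisfy this test for several children at once, which is the source of the multi-path searches that the R+-tree and R*-tree variants below try to reduce.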

R+-tree is introduced as

a way to overcome the problem of inefficient searching that arises when sibling

nodes overlap in the R-tree. As a direct solution to this problem, it uses clipping,

i.e. there is no overlap between intermediate nodes of the tree at the same

level, and objects that intersect more than one MBB at a specific level are

clipped and stored on several different pages. As a result, point queries on

the R+-tree require traversing only one path of the tree. The price to pay is an increased storage requirement for the tree.

R*-tree This version introduces a new insertion policy that crucially improves the performance of the tree. The main objective of this policy is to minimize the overlap between sibling nodes in the tree. A straightforward advantage of this is the minimization of the tree paths that are traversed during an object search.