This is the documentation for Closest 2.0, last updated December 14, 2016.

Introduction

Closest is a C library for k-nearest neighbor search in N-dimensions with arbitrary metric functions.

Closest:
  • is simple to use
  • is flexible (custom distance functions)
  • provides API interfaces for C, C++ and Python
  • written in ISO C99
  • has no external dependencies
  • is small and fast
  • is well documented

Closest is licensed under the GPL-3.0 (see the LICENSE file for details).

Description

Closest implements four different strategies for the k-nearest neighbor search: a cell-based method, a kd-tree method, a dimensional culling method and a relatively efficient brute-force method.

The cell and kd-tree and culling methods work on the assumption that the underlying set is a vector space with a metric function (not necessarily a norm). The brute-force method only assumption is that there is a distance function that is defined in the working set, and in that regard is more general that the other three.

The default metric is the Euclidean norm (or L2), but it can be easily changed. Closest provides in addition, ready to use, the L1, L2, Lp and LM norms. New metric functions are very easy to write. In general, closest methods work fine with distance function that are not metrics in the mathematical sense, as long as they are positive definite: they need not to be necessarily symmetric nor verify the triangle inequality. For example, regarding a non-symmetric distance function, closest will always (and only) call it with the query point as the first argument with the cell, brute-force and culling methods. The kd-tree method will also call the distance function while pruning the search tree, so results might differ from the other methods. If in doubt, check the your solutions for a particular function comparing with the brute-force method.

For vector spaces with norm, which method to use depends mainly on the number of dimensions and the size of the problem. For low dimensions \(N_{D} < 6\) the fastest method is usually the cell method. If the number of points in the set is large, \(N > 10^7\) and \(N_{D} < 6\), or the number of dimensions \(6 < N_{D} \lesssim 20\), then the kd-tree method is the generally the fastest. For a number of dimensions \(N_{D} > 20\) the brute-force method is generally faster than any other. The dimensional culling method is the slowest, always, unless the point set is small and has a very biased distribution which might throw off-track both the cell and kd-tree methods.

The memory footprint, except for the dimensional culling method in which is quite memory hungry, is roughly \(N_{\mathrm{data}}\) doubles and \(N_{\mathrm{data}}\) ints, where \(N_{\mathrm{data}}\) are the \(N_{D}\)-dimensional data against which the searches will be performed.

Download

The latest Closest is version 2.0, and can be downloaded here: closest-2.0.zip.

The source repository is hosted at github.

Licensing

Closest is released under the GPL-3, but if you need it under other licensing terms, please feel free to email me.

Getting started

Installing closest

Closest uses cmake, so it should be easy and fairly automatic to build. For example, to build and install a dynamic version of closest in /usr/local do:

$ mkdir bld
$ cd bld
$ cmake -DCMAKE_INSTALL_PREFIX=/usr/local ../
$ make
$ make install

Using Closest

Closest uses one joint C/C++ header file, closest.h, so it’s enough to put the line

#include <closest.h>

at the beginning of every C or C++ source file that uses Closest. For the Python interface it is enough to

import closest

For the C API there is also just two libraries to link with, libclosest itself and the C math library m. Compile and link a program using closest as follows:

cc -o prog prog.c -lclosest -lm

The C++ API it also needs to be linked with the libclosest_cxx library:

c++ -o prog prog.cpp -lclosest_cxx -lclosest -lm

Tutorials

Simple example

Suppose we want to find the k=5 nearest neighbors to a point x in an array data of 10000 points in d=100 dimensions. Using the cell-based method:

First include the required header

#include<closest.h>

Then make the appropriate calls

double *data;  /* points */
int k = 5;     /* number of nearest neighbors to calculate */
int d = 100;   /* number of dimensions */
int n = 10000; /* number of data points */

/*
   assume data has been alloc'd filled with 10000
   points in d=100 dimensions, each of the 100
   coordinates for a point followed by the next
   until all 10000 are specified.

   x is the point where the nearest neighbors
   query is done.
*/


/* initialize the search */
cell_t *cell = cell_init( d, n, data, -1 );

/* alloc space for the answer */
int i_cell[k];
int d_cell[k];

/*
   cell_knearest returns
     in i_cell, the index for the points in the
                data array of the k-nearest neighbors to x
     in d_cell, the distances for those same points to x
*/
cell_knearest( cell, x, k, i_cell, d_cell );

/*
   any other number of queries for different x could be done at
   this point but we are done, so  free resources
*/
cell_free(cell);

Writing you own distance functions

Writing custom distance functions is very easy. For example, a metric based on the infinity norm

double inf( int nd, double *x, double *y, void *data ) {
   double t = DBL_MIN;
   for( int k=0; k<nd; k++ ) {
      double a = fabs(x[k]-y[k]);
      if( a > t )
         t = a;
   }
   return t;
}

can be plugged in a breeze:

/* init */
cell_t *cell = cell_init( d, n, data );

/* set the custom metric */
cell_set_metric( cell, inf, NULL );

/* search */
cell_knearest( cell, x, k, i_cell, d_cell );

/* clean up */
cell_free(cell);

As stated in the introduction, Closest is not very picky about the kind of distance functions one can plug and get correct results. However, if in doubt, please check the results against the brute-force method solutions.

API Reference

C API

There are four different sets of functions, one for each implemented algorithm, cell, tree, cull-based and brute-force, with a similar interface. To use them closest.h needs to be included:

#include<closest.h>
cell_t *cell_init(int nd, int ni, double *xi, int ncpd)
brut_t *brut_init(int nd, int ni, double *xi)
cull_t *brut_init(int nd, int ni, double *xi, int prob)
tree_t *brut_init(int nd, int ni, double *xi, int ppn)
  • int nd, input: the number of dimensions
  • int ni, input: the number of points
  • double *xi, input: pointer to the points, a [ni x nd] array in row-major order.
  • int ncpd, input: for the cell based method, the number of subdivision cells per dimension, or -1 to let closest decide a reasonable value. Performance can potentially be improved by carefully choosing ncpd.
  • double prob, input: for the cull based method, a probability that defines the initial search radius of the algorithm. 0 < prob < 1. Pass -1.0 to take the default value. Performance can potentially be improved by carefully choosing prob.
  • int ppn, input: what is generally know as bucket size, or the max number of points inside each leaf-node in the tree decomposition. By default 32. Pass -1 to let closest decide. Performance can potentially be improved by carefully choosing ppn.
  • returns: a cell_t, tree_t, cull_t or brut_t pointer

Process the ni points xi in nd-dimensions such that any number of k-nearest-neighbor searches can be performed subsequently against those points. Returns a pointer to a data struct type that needs to be passed to *_knearest set of functions. Once you are done searching, free up memory by calling *_free on the pointer.

int cell_knearest(cell_t *t, int nq, double *xq, int no, int *index, double *distance)
int brut_knearest(brut_t *t, int nq, double *xq, int no, int *index, double *distance)
int cull_knearest(cull_t *t, int nq, double *xq, int no, int *index, double *distance)
int tree_knearest(tree_t *t, int nq, double *xq, int no, int *index, double *distance)
  • cell_t, brut_t, cull_t or tree_t *t, input: the pointer returned by the *_init set of functions
  • int nq, input: the number of query points
  • double *xq, input: a pointer to an [nq x nd] array with the coordinates of the query points
  • int no, input: the number of nearest neighbors to find per query point
  • int *index: on input, a pointer to a [nq x no] array; on output it will contain the indices of the no closest neighbors to the query points, ordered by distance for each.
  • double *distance: on input, a pointer to a [nq x no] array; on output it will contain the distances of the nearest neighbors found.
  • returns: the number of neighbors found, if successful equal to nq x no.

Find the no nearest neighbors to the given query points xq, and return the indices to the original points array in index and the respective distances in distance.

void cell_free(cell_t *t)
void brut_free(brut_t *t)
void cull_free(cull_t *t)
void tree_free(tree_t *t)
  • cell_t, brut_t, cull_t or tree_t *t, input: the pointer returned by the *_init functions

Free the resources allocated by the *_init functions. No further calls to *_knearest or *_set_distfun can be performed.

void cell_set_metric(cell_t *t, metric fun, void *data)
void brut_set_metric(brut_t *t, metric fun, void *data)
void cull_set_metric(cull_t *t, metric fun, void *data)
void tree_set_metric(tree_t *t, metric fun, void *data)
  • cell_t, brut_t, cull_t or tree_t *t, input: the pointer returned by the *_init functions
  • fun, input: a function with the following signature: double fun( int n, double *x, double *y, void *data) that calculates the distance between two given points x and y of dimension n.
  • data, intput: a pointer that will be passed unaltered each time fun is called and that can be used to pass extra data to the distance function.
double L1metric(int n, double *x, double *y, void *data)

Implements the L1, Manhattan or taxicab metric.

double L2metric(int n, double *x, double *y, void *data)

Implements the L2 or Euclidean metric.

double Lpmetric(int n, double *x, double *y, void *data)

Implements the Lp metric. It can be used also as a fractional semimetric with 0 < p < 1. The p parameter shall be passed by the void* data pointer.

double LMmetric(int n, double *x, double *y, void *data)

Implements the LM metric d(x,y)=x M y where M is a nd x nd matrix that has to be passed through the void* data pointer.

C++ API

It is a very thin wrapper to the C API, which makes it ideal to plug in conjunction with you preferred array classes as it is framework-agnostic.

To use them closest.h needs to be included from a cpp file.

#include<closest.h>

All is defined within the closest namespace. The metric functions are the same as in the C interface, and it offers the following classes with the same parameters as the C interface:

class closest::Cell
closest::Cell::Cell(int nd, int ni, double *xi, int npcd)
closest::Cell::knearest(int nq, double *xq, int no, int *idx, double *dst)
closest::Cell::set_metric(metric fun, void *data)
class closest::Brut
closest::Brut::Brut(int nd, int ni, double *xi)
closest::Brut::knearest(int nq, double *xq, int no, int *idx, double *dst)
closest::Brut::set_metric(metric fun, void *data)
class closest::Cull
closest::Cull::Cull(int nd, int ni, double *xi, double prob)
closest::Cull::knearest(int nq, double *xq, int no, int *idx, double *dst)
closest::Cull::set_metric(metric fun, void *data)
class closest::Tree
closest::Tree::Tree(int nd, int ni, double *xi, int npcd)
closest::Tree::knearest(int nq, double *xq, int no, int *idx, double *dst)
closest::Tree::set_metric(metric fun, void *data)

Python API

To use it the closest module needs to be imported:

import closest
class Cell(data, ncpd = -1)
  • data, input: a numpy array with shape (ni,nd), where ni is the number of data points and nd is the number of dimensions.
  • int ncpd, input: the number of subdivision cells per dimension, or -1 to let closest decide a reasonable value. Performance can potentially be improved by carefully choosing ncpd.

Process the points in data such that any number of k-nearest-neighbor searches can be performed subsequently against those points.

Cell.knearest(x, n, metric = None, metric_data = None)
  • x, input: a numpy array with shape (nq,nd), where nq is the number of different query points and nd in the number of dimensions.
  • metric: None (L2 will be used), or a string with ‘L1’, ‘L2’ or ‘Lp’ to use those metrics, or a python function that returns a double and has the signature py_metric( x, y, data ), where x and y are numpy arrays of shape (nd,) being nd the number of dimensions, and data a python object with extra data or None.
  • metric_data: any python object that will be passed unaltered to the metric function

Find the n nearest neighbors to the given query points x, and return a tuple (idx,dst) with the indices to the original points array in idx and the respective distances in dst.

The brute-force, cull and tree interfaces are very similar, with the following signatures:

class Brut(data)
Brut.knearest(x, n, metric = None, metric_data = None)
class Cull(data, prob = -1.0)
Cull.knearest(x, n, metric = None, metric_data = None)
class Tree(data, ppn = -1)
Tree.knearest(x, n, metric = None, metric_data = None)