Instance selection of linear complexity for big data

Arnaiz González, Álvar; Diez Pastor, José Francisco; Rodríguez Diez, Juan José; García Osorio, César

doi:10.1016/j.knosys.2016.05.056

Please use this identifier to cite or link to this item: http://hdl.handle.net/10259/4221

Title

Instance selection of linear complexity for big data

Authors

Arnaiz González, Álvar

Diez Pastor, José Francisco

Rodríguez Diez, Juan José

García Osorio, César

Published in

Knowledge-Based Systems. 2016. V. 107, p. 83–95

Publisher

Elsevier

Publication date

2016-09

ISSN

0950-7051

DOI

10.1016/j.knosys.2016.05.056

Abstract

Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n^2), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances).
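As a rough illustration of how locality-sensitive hashing can give instance selection that is linear in the data set size, the sketch below hashes each instance once and keeps only the first instance seen for each (bucket, class) pair. It is not taken from the paper and should not be read as the authors' algorithms. The function names (`make_hash`, `lsh_select`), the random-projection hash family, and the parameters `n_hashes`, `width` and `seed` are all assumptions made for illustration.

```python
import math
import random


def make_hash(dim, width, rng):
    """Build one random-projection hash h(x) = floor((a.x + b) / width).

    Assumption: a Gaussian random-projection family; the paper may use another.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(x):
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / width)

    return h


def lsh_select(instances, labels, n_hashes=4, width=1.0, seed=0):
    """Return the indices of the instances that are kept.

    Each instance is hashed once, so the cost grows linearly with the
    number of instances (for a fixed dimension and number of hashes).
    """
    rng = random.Random(seed)
    dim = len(instances[0])
    hashes = [make_hash(dim, width, rng) for _ in range(n_hashes)]

    seen = set()   # (bucket signature, class label) pairs already represented
    kept = []      # indices of the selected instances, in input order
    for i, (x, y) in enumerate(zip(instances, labels)):
        key = (tuple(h(x) for h in hashes), y)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept
```

For example, `lsh_select([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0, 0, 1])` keeps indices 0 and 2: the second instance falls in the same bucket as the first and shares its class, while the third shares the bucket but has a different class. A larger `width` merges more instances per bucket and therefore removes more of them.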

Keywords

Nearest neighbor

Data reduction

Instance selection

Hashing

Big data

Subject

Computer science

URI

http://hdl.handle.net/10259/4221

Publisher's version