Instance selection of linear complexity for big data

Arnaiz González, Álvar; Diez Pastor, José Francisco; Rodríguez Diez, Juan José; García Osorio, César

doi:10.1016/j.knosys.2016.05.056

dc.contributor.author	Arnaiz González, Álvar
dc.contributor.author	Diez Pastor, José Francisco
dc.contributor.author	Rodríguez Diez, Juan José
dc.contributor.author	García Osorio, César
dc.date.accessioned	2016-09-01T09:42:59Z
dc.date.available	2016-09-01T09:42:59Z
dc.date.issued	2016-09
dc.identifier.issn	0950-7051
dc.identifier.uri	http://hdl.handle.net/10259/4221
dc.description.abstract	Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n^2), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best-known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances).	en
dc.description.sponsorship	Supported by the Research Projects TIN 2011-24046 and TIN 2015-67534-P from the Spanish Ministry of Economy and Competitiveness.	en
dc.format.mimetype	application/pdf
dc.language.iso	eng	es
dc.publisher	Elsevier	en
dc.relation.ispartof	Knowledge-Based Systems. 2016. V. 107, p. 83–95	en
dc.rights	Attribution 4.0 International
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	Nearest neighbor	en
dc.subject	Data reduction	en
dc.subject	Instance selection	en
dc.subject	Hashing	en
dc.subject	Big data	en
dc.subject.other	Informática	es
dc.subject.other	Computer science	en
dc.title	Instance selection of linear complexity for big data	en
dc.type	info:eu-repo/semantics/article
dc.rights.accessRights	info:eu-repo/semantics/openAccess
dc.relation.publisherversion	http://dx.doi.org/10.1016/j.knosys.2016.05.056
dc.identifier.doi	10.1016/j.knosys.2016.05.056
dc.relation.projectID	info:eu-repo/grantAgreement/MINECO/TIN 2011-24046
dc.relation.projectID	info:eu-repo/grantAgreement/MINECO/TIN 2015-67534-P
dc.type.hasVersion	info:eu-repo/semantics/publishedVersion	en

Files in this item

Name: Arnaiz-KBS_2016.pdf
Size: 1.129 MB
Format: Adobe PDF
