Approx-SMOTE: Fast SMOTE for Big Data on Apache Spark

Juez Gil, Mario; Arnaiz González, Álvar; Rodríguez Diez, Juan José; López Nozal, Carlos; García Osorio, César

doi:10.1016/j.neucom.2021.08.086

dc.contributor.author	Juez Gil, Mario
dc.contributor.author	Arnaiz González, Álvar
dc.contributor.author	Rodríguez Diez, Juan José
dc.contributor.author	López Nozal, Carlos
dc.contributor.author	García Osorio, César
dc.date.accessioned	2021-11-23T08:25:06Z
dc.date.available	2021-11-23T08:25:06Z
dc.date.issued	2021-11
dc.identifier.issn	0925-2312
dc.identifier.uri	http://hdl.handle.net/10259/6206
dc.description.abstract	One of the main goals of Big Data research is to find new data mining methods that can process large amounts of data in acceptable times. In Big Data classification, as in traditional classification, class imbalance is a common problem that must be addressed, and in the Big Data setting the solution must also run in an acceptable execution time. In this paper we present Approx-SMOTE, a parallel implementation of the SMOTE algorithm for the Apache Spark framework. The key difference from the original SMOTE, besides parallelism, is that it uses an approximated version of k-Nearest Neighbor, which makes it highly scalable. Although an implementation of SMOTE for Big Data already exists (SMOTE-BD), it uses an exact Nearest Neighbor search, which does not make it entirely scalable. Approx-SMOTE, on the other hand, achieves run times up to 30 times faster without sacrificing the improved classification performance offered by the original SMOTE.	es
dc.description.sponsorship	This work was supported by the “La Caixa” Foundation under agreement LCF/PR/PR18/51130007, by the Junta de Castilla y León under project BU055P20, and by the Ministry of Science and Innovation of Spain under project PID2020-119894GB-I00, co-financed through European Union FEDER funds. It was also supported by the Consejería de Educación of the Junta de Castilla y León and the European Social Fund through a pre-doctoral grant (EDU/1100/2017). This material is based upon work supported by Google Cloud.	es
dc.format.mimetype	application/pdf
dc.language.iso	eng	es
dc.publisher	Elsevier	es
dc.relation.ispartof	Neurocomputing. 2021, V. 464, p. 432-437	es
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	SMOTE	es
dc.subject	Imbalance	es
dc.subject	Spark	es
dc.subject	Big data	es
dc.subject	Data mining	es
dc.subject.other	Informática	es
dc.subject.other	Computer science	es
dc.title	Approx-SMOTE: Fast SMOTE for Big Data on Apache Spark	es
dc.type	info:eu-repo/semantics/article	es
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es
dc.relation.publisherversion	https://doi.org/10.1016/j.neucom.2021.08.086	es
dc.identifier.doi	10.1016/j.neucom.2021.08.086
dc.relation.projectID	info:eu-repo/grantAgreement/Fundación Bancaria Caixa d'Estalvis i Pensions de Barcelona//LCF%2FPR%2FPR18%2F51130007	es
dc.relation.projectID	info:eu-repo/grantAgreement/Junta de Castilla y León//BU055P20//Métodos y Aplicaciones Industriales del Aprendizaje Semisupervisado	es
dc.relation.projectID	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-119894GB-I00/ES/APRENDIZAJE AUTOMATICO CON DATOS ESCASAMENTE ETIQUETADOS PARA LA INDUSTRIA 4.0	es
dc.type.hasVersion	info:eu-repo/semantics/publishedVersion	es

Files in this item

Name: Juez-neurocomputing_2021.pdf
Size: 1.019Mb
Format: Adobe PDF
