Approx-SMOTE: Fast SMOTE for Big Data on Apache Spark

Juez Gil, Mario; Arnaiz González, Álvar; Rodríguez Diez, Juan José; López Nozal, Carlos; García Osorio, César

doi:10.1016/j.neucom.2021.08.086

Please use this identifier to cite or link to this item: http://hdl.handle.net/10259/6206

Title

Approx-SMOTE: Fast SMOTE for Big Data on Apache Spark

Authors

Juez Gil, Mario

Arnaiz González, Álvar

Rodríguez Diez, Juan José

López Nozal, Carlos

García Osorio, César

Published in

Neurocomputing. 2021, V. 464, p. 432-437

Publisher

Elsevier

Publication date

2021-11

ISSN

0925-2312

DOI

10.1016/j.neucom.2021.08.086

Abstract

One of the main goals of Big Data research is to find new data mining methods that can process large amounts of data in acceptable time. In Big Data classification, as in traditional classification, class imbalance is a common problem that must be addressed, with the added requirement that the solution run in an acceptable execution time. In this paper we present Approx-SMOTE, a parallel implementation of the SMOTE algorithm for the Apache Spark framework. The key difference from the original SMOTE, besides parallelism, is that it uses an approximate version of k-Nearest Neighbor search, which makes it highly scalable. Although an implementation of SMOTE for Big Data already exists (SMOTE-BD), it relies on an exact Nearest Neighbor search, which limits its scalability. Approx-SMOTE, on the other hand, achieves run times up to 30 times faster without sacrificing the improved classification performance offered by the original SMOTE.
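
As an illustration of the SMOTE step the abstract refers to, the following is a minimal single-machine sketch in Python. It is not the authors' Spark implementation: the function names (`k_nearest`, `smote`) and the `neighbor_search` parameter are hypothetical, and the neighbor search shown here is an exact brute-force one. The point it tries to make is that the neighbor search is the expensive, replaceable part; an approximate search could be passed in its place while the interpolation stays the same.

```python
import math
import random


def k_nearest(points, index, k):
    """Exact brute-force search: indices of the k points closest to points[index]."""
    target = points[index]
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(target, points[i]))
    return others[:k]


def smote(minority, n_synthetic, k=5, neighbor_search=k_nearest, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbors.

    neighbor_search is a hypothetical plug-in point: any function with the
    same signature as k_nearest (exact or approximate) can be supplied.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))                  # pick a minority sample
        j = rng.choice(neighbor_search(minority, i, k))   # pick one of its neighbors
        gap = rng.random()                                # position on the segment, in [0, 1)
        base, neighbor = minority[i], minority[j]
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6, k=2)
```

Each synthetic point lies on the segment between a minority sample and one of its neighbors, so it stays inside the region already occupied by the minority class. With brute-force search, every query compares against all minority samples, which is the cost an approximate search aims to reduce.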

Keywords

SMOTE

Imbalance

Spark

Big data

Data mining

Subject

Computer science

URI

http://hdl.handle.net/10259/6206

Publisher's version