SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data

Basgall, María José; Hasperué, Waldo; Naiouf, Marcelo; Fernández, Alberto; Herrera, Francisco

Conference paper

Open Access

Publication date

2018

Place of development

Instituto de Investigación en Informática

Book/Report

Actas JCC&BD 2018

Event name

VI Jornadas de Cloud Computing & Big Data (JCC&BD) (La Plata, 2018)

Language

English

Subject

Computer and Information Sciences

Extent

p. 23-28

HDL 11746/8512

HANDLE 10915/69676

Downloads

Full document (903.82 KB)

Abstract

The volume of data in today’s applications has changed the way Machine Learning problems are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be met through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples across classes. In this work we present SMOTE-BD, a fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
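The SMOTE step the abstract refers to (new synthetic instances built from the neighborhood of each minority example) can be illustrated with a minimal single-machine sketch. This is not the SMOTE-BD Spark implementation, whose distributed design is not described on this page. The function name `smote`, its parameters, and the toy data are assumptions made for illustration, and the base example is picked at random for brevity instead of looping over every minority example.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Hypothetical helper: create n_synthetic points by interpolating
    between a minority example and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    # Exact (brute-force) k-nearest-neighbour search inside the minority class.
    neighbours = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # base example (random pick: a simplification)
        j = rng.choice(neighbours[i])      # one of its k nearest neighbours
        gap = rng.random()                 # position along the segment, in [0, 1)
        x, y = minority[i], minority[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(x, y)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Each synthetic point lies on the segment between two real minority examples, so it stays within the region the minority class already occupies. The brute-force neighbour search is quadratic in the number of minority examples, which is the kind of cost that motivates a distributed design for large datasets.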

Keywords

big data, imbalanced classification, preprocessing, SMOTE, Spark

This work is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (BY-NC-SA 4.0)
