Web scraping by end users
Abstract
Scraping is a topic studied from various perspectives, encompassing automatic and AI-based approaches and a wide range of programming libraries that expedite development. As the volume of available web content grows, it becomes increasingly challenging to anticipate end-user requirements regarding what, how, and when to extract data from the web. This challenge is compounded when integrating data from multiple websites, particularly when websites’ search engines dynamically retrieve data that is unavailable via permanent links. Complex scraping processes such as these are difficult to develop using general-purpose programming languages and challenging to automate with AI-based approaches. Controllability is a crucial aspect of scraping: that is, how end users can make decisions during the scraper specification process, understand the information sources, and see how the data are ultimately extracted, compiled, and formatted for output. In response, our study presents an innovative end-user approach for specifying scrapers that focuses on seamlessly integrating data from multiple sources. Through this approach and its supporting toolset, we aim to provide users with greater control and transparency over the extraction, integration, and formatting of data, thereby addressing key concerns in web scraping. Evaluations of the approach and toolset yielded promising results.
