Flexible Detection of Similar DOM elements

cic.institucionOrigen: Laboratorio de Investigación y Formación en Informática Avanzada (LIFIA)
cic.isFulltext: Yes
cic.isPeerReviewed: Yes
cic.lugarDesarrollo: Laboratorio de Investigación y Formación en Informática Avanzada (LIFIA)
cic.parentType: Conference object
cic.version: Accepted
dc.date.accessioned: 2023-11-06T15:08:58Z
dc.date.available: 2023-11-06T15:08:58Z
dc.identifier.uri: https://digital.cic.gba.gob.ar/handle/11746/12098
dc.title: Flexible Detection of Similar DOM elements (en)
dc.type: Conference document
dcterms.abstract: Different research fields related to the web require detecting similarity between DOM elements. In the field of information extraction, many approaches emerged to extract structured data from web documents, most of which require comparing sample documents to extract their underlying structure. Other fields of applicability like web augmentation or transcoding also require analyzing structural similarity, but on UI components with smaller structures than full documents, making them unsuitable for the algorithms generally used in information extraction. Instead, these approaches tend to rely on the DOM elements’ location, but this does not resist structural changes in the document, and cannot locate similar elements placed in different positions. In this paper we present two flexible algorithms to measure similarity between DOM elements by using a mixed approach that considers both elements’ location and inner structure, together with a wrapper induction technique. We evaluated our algorithms with respect to other known approaches in the literature by comparing how they cluster a dataset of 1200+ DOM elements, using a manual clustering as ground truth. Results show that both proposed algorithms outperform all baseline ones. The proposed algorithms run in linear time, so they are faster than most approaches that analyze structural similarity. (en)
dcterms.creator.author: Grigera, Julián
dcterms.creator.author: Gardey, Juan Cruz
dcterms.creator.author: Rossi, Gustavo Héctor
dcterms.creator.author: Garrido, Alejandra
dcterms.identifier.other: DOI: 10.1007/978-3-031-24197-0_10
dcterms.identifier.other: ISBN: 978-3-031-24197-0
dcterms.isPartOf.item: Web Information Systems and Technologies. WEBIST WEBIST 2020 2021
dcterms.isPartOf.series: 16th International Conference (WEBIST 2020) (held virtually, November 3-5, 2020) and 17th International Conference (WEBIST 2021) (held virtually, October 26-28, 2021)
dcterms.issued: 2021
dcterms.language: English
dcterms.license: Attribution-NonCommercial-ShareAlike 4.0 International (BY-NC-SA 4.0)
dcterms.subject: Information Extraction (en)
dcterms.subject: Web Adaptation (en)
dcterms.subject: DOM (es)
dcterms.subject.materia: Computer and Information Sciences

Files

Original bundle

Name: Flexible_Detection_of_Similar_DOM_elements.pdf-PDFA.pdf
Size: 1.32 MB
Format: Adobe Portable Document Format
Description: Full document
