Flexible Detection of Similar DOM elements
cic.institucionOrigen | Laboratorio de Investigación y Formación en Informática Avanzada (LIFIA) | |
cic.isFulltext | Yes | |
cic.isPeerReviewed | Yes | |
cic.lugarDesarrollo | Laboratorio de Investigación y Formación en Informática Avanzada (LIFIA) | |
cic.parentType | Conference object | |
cic.version | Accepted | |
dc.date.accessioned | 2023-11-06T15:08:58Z | |
dc.date.available | 2023-11-06T15:08:58Z | |
dc.identifier.uri | https://digital.cic.gba.gob.ar/handle/11746/12098 | |
dc.title | Flexible Detection of Similar DOM elements | en |
dc.type | Conference document | |
dcterms.abstract | Different research fields related to the web require detecting similarity between DOM elements. In the field of information extraction, many approaches emerged to extract structured data from web documents, most of which require comparing sample documents to extract their underlying structure. Other fields of applicability like web augmentation or transcoding also require analyzing structural similarity, but on UI components with smaller structures than full documents, making them unsuitable for the algorithms generally used in information extraction. Instead, these approaches tend to rely on the DOM elements’ location, but this does not resist structural changes in the document, and cannot locate similar elements placed in different positions. In this paper we present two flexible algorithms to measure similarity between DOM elements by using a mixed approach that considers both elements’ location and inner structure, together with a wrapper induction technique. We evaluated our algorithms with respect to other known approaches in the literature by comparing how they cluster a dataset of 1200+ DOM elements, using a manual clustering as ground truth. Results show that both proposed algorithms outperform all baseline ones. The proposed algorithms run in linear time, so they are faster than most approaches that analyze structural similarity. | en |
dcterms.creator.author | Grigera, Julián | |
dcterms.creator.author | Gardey, Juan Cruz | |
dcterms.creator.author | Rossi, Gustavo Héctor | |
dcterms.creator.author | Garrido, Alejandra | |
dcterms.identifier.other | DOI: 10.1007/978-3-031-24197-0_10 | |
dcterms.identifier.other | ISBN: 978-3-031-24197-0 | |
dcterms.isPartOf.item | Web Information Systems and Technologies. WEBIST 2020, WEBIST 2021 | |
dcterms.isPartOf.series | 16th International Conference (WEBIST 2020) (held virtually, November 3 to 5, 2020) and 17th International Conference (WEBIST 2021) (held virtually, October 26 to 28, 2021) | |
dcterms.issued | 2021 | |
dcterms.language | English | |
dcterms.license | Attribution-NonCommercial-ShareAlike 4.0 International (BY-NC-SA 4.0) | |
dcterms.subject | Information Extraction | en |
dcterms.subject | Web Adaptation | en |
dcterms.subject | DOM | es |
dcterms.subject.materia | Computer and Information Sciences |
Files
Original bundle
- Name:
- Flexible_Detection_of_Similar_DOM_elements.pdf-PDFA.pdf
- Size:
- 1.32 MB
- Format:
- Adobe Portable Document Format
- Description:
- Full document
License bundle
- Name:
- license.txt
- Size:
- 3.46 KB
- Format:
- Item-specific license agreed upon to submission