Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets

cic.institucionOrigen: Laboratorio de Investigación y Formación en Informática Avanzada (LIFIA)
cic.isFulltext: Yes
cic.isPeerReviewed: Yes
cic.lugarDesarrollo: Laboratorio de Investigación y Formación en Informática Avanzada
cic.parentType: Conference object
cic.version: Accepted
dc.date.accessioned: 2025-02-18T13:01:23Z
dc.date.available: 2025-02-18T13:01:23Z
dc.identifier.uri: https://digital.cic.gba.gob.ar/handle/11746/12412
dc.title: Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets (en)
dc.type: Conference document
dcterms.abstract: This article describes several automatic text analysis tasks that apply Natural Language Processing techniques to a corpus of Latin texts from the 1st century BC and the 1st century AD. The motivation is to understand a historical literary trend centered on the theme of love, which spans from antiquity through the medieval period. The authors analyzed are Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, who represent the literary movement of the neoterics and form the group of poets to be identified, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with remarkably distinct styles, who serve as control samples. The purpose of this preliminary, exploratory study is to investigate the potential of document clustering on this corpus and the features best suited to it. Clustering was carried out with the K-Means method on fixed ranges of character n-grams and word n-grams, and the Silhouette Index was used to determine the optimal number of clusters. Using the optimal clusters as labels, decision trees were trained for each n-gram range to identify the features with the highest Information Gain and Information Gain Ratio. The trees were trained with the Entropy criterion, and Feature Importance was calculated. Results vary with the text preprocessing technique: simple stopword filtering yields better Silhouette scores, with one or two features showing potential classification value for the decision trees. TF-IDF weighting yields Silhouette indices closer to zero, but with a more balanced distribution of Importance across features. (en)
dcterms.creator.author: Nusch, Carlos Javier
dcterms.creator.author: Del Rio Riande, Gimena
dcterms.creator.author: Cagnina, Leticia Cecilia
dcterms.creator.author: Errecalde, Marcelo Luis
dcterms.creator.author: Antonelli, Leandro
dcterms.isPartOf.issue: Decisioning 2024
dcterms.isPartOf.series: Knowledge Discovery and Decision Making
dcterms.issued: 2024-06
dcterms.language: English
dcterms.license: Attribution-NonCommercial-ShareAlike 4.0 International (BY-NC-SA 4.0)
dcterms.subject: Latin Elegiac Poets (en)
dcterms.subject: Document Clustering (en)
dcterms.subject: K Means (en)
dcterms.subject: Silhouette Coefficient (en)
dcterms.subject: Decision Trees (en)
dcterms.subject: Feature Importance (en)
dcterms.subject: Information Gain Ratio (en)
dcterms.subject.materia: Computer and Information Sciences
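
The abstract in this record mentions character n-grams over fixed ranges and the ranking of features by Information Gain and Information Gain Ratio. As a rough illustration only, the sketch below shows what those two quantities look like for a single boolean feature (an n-gram being present or absent in a document). It is not the authors' code: the function names `char_ngrams`, `entropy`, and `gain_and_ratio` are hypothetical, the toy labels are made up for the example, and the paper's actual pipeline (K-Means, Silhouette Index, decision trees) is not reproduced here.

```python
from collections import Counter
from math import log2


def char_ngrams(text, n_min, n_max):
    """Count character n-grams of every length in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_and_ratio(present, labels):
    """Information Gain and Gain Ratio of a boolean feature against labels.

    `present[i]` says whether the feature occurs in document i;
    `labels[i]` is the (assumed) cluster label of document i.
    """
    total = len(labels)
    groups = {}
    for flag, label in zip(present, labels):
        groups.setdefault(flag, []).append(label)
    # Entropy of the labels that remains after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # Split information: entropy of the feature's own value distribution.
    split_info = entropy(present)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Toy usage: four hypothetical documents, two hypothetical cluster labels.
bigrams = char_ngrams("amo", 2, 3)  # counts for "am", "mo", "amo"
gain, ratio = gain_and_ratio([True, True, False, False], [0, 0, 1, 1])
```

In this toy case the feature separates the two labels perfectly, so both values reach their maximum of 1 bit. A feature present in every document would have zero split information, which the sketch maps to a ratio of 0.0 by convention.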

Files

Original bundle

Name:
20_InitialExplorationForClustering_Decisioning2024.pdf-PDFA.pdf
Size:
755.19 KB
Format:
Adobe Portable Document Format
Description:
Full document

License bundle

Name:
license.txt
Size:
3.46 KB
Format:
Item-specific license agreed upon to submission