Tad Gonsalves, Hu Hang, Yoshimi Hiroyasu
Lengua y Sociedad 23(2) 2024年12月30日
The rapid globalization and growing need for cross-language communication necessitate modern, real-time corpora to aid language learners. Traditional methods for creating such corpora, especially in Spanish, are inadequate due to their inability to process the vast and unstructured data available online. This study explores Artificial Intelligence (AI) methodologies for automatic Spanish document acquisition from the web, pre-processing and classifying them in order to build a vast and flexible corpus for Spanish learning. The research applies web crawling using the Scrapy framework to collect data, which is then cleaned and classified using advanced Natural Language Processing (NLP) models. Specifically, the study employs BERT (Bidirectional Encoder Representations from Transformers) and its enhanced variant RoBERTa to achieve document classification. Through a combination of data augmentation techniques and deep learning models, the study achieves high accuracy in classifying Spanish-language texts, demonstrating the potential for using AI to overcome the limitations of traditional corpus-building approaches.