Title: Improving Large-scale Language Models and Resources for Filipino
Abstract: In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that improves on smaller existing pretraining datasets for the language in both scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained on small corpora. Our new RoBERTa models show significant improvements over existing Filipino models on three benchmark datasets, with an average gain of 4.47% in test accuracy across classification tasks of varying difficulty.
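The abstract names the RoBERTa pretraining technique, i.e., masked language modeling with dynamic masking over a large unlabeled corpus. Below is a minimal sketch of what such pretraining might look like with the Hugging Face Transformers library; the tokenizer path, corpus file name, and training hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Sketch: RoBERTa-style masked language model pretraining (assumed setup).
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Hypothetical paths: a tokenizer trained on the pretraining corpus and
# the corpus itself as plain text, one document or line per example.
tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer-tlunified")
dataset = load_dataset("text", data_files={"train": "tlunified.txt"})["train"]

def tokenize(batch):
    # Truncate to the model's maximum sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# RoBERTa-base-sized model trained from scratch; 514 position slots
# account for RoBERTa's padding-offset position ids.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# Dynamic masking as in RoBERTa: masked positions are re-sampled
# (15% of tokens) every time a batch is formed.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="./roberta-filipino",   # illustrative output path
    per_device_train_batch_size=8,     # assumed hyperparameters
    num_train_epochs=1,
)

Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized,
).train()
```

The resulting checkpoint can then be fine-tuned on downstream classification benchmarks in the usual way; the specific hyperparameter choices above are placeholders rather than the paper's reported settings.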
Year: 2022
Venue: International Conference on Language Resources and Evaluation (LREC)
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 2

Name                       Order  Citations  PageRank
Jan Christian Blaise Cruz  1      0          1.35
Charibeth Cheng            2      0          0.68