Title
From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA
Abstract
Text-based Visual Question Answering (Text-VQA) is a question-answering task that requires understanding scene text, which is usually recognized by Optical Character Recognition (OCR) systems. However, the text produced by OCR systems often contains spelling errors, such as "pepsi" being recognized as "peosi". These OCR errors are one of the major challenges for Text-VQA systems. To address this, we propose a novel Text-VQA method that alleviates OCR errors via OCR token evolution. First, we artificially create misspelled OCR tokens at training time to make the system more robust to OCR errors. Specifically, we propose an OCR Token-Word Contrastive (TWC) learning task, which pre-trains word representations by augmenting OCR tokens via the Levenshtein distance between the OCR tokens and words in a dictionary. Second, assuming that the majority of characters in a misspelled OCR token are still correct, we propose a multimodal transformer that is fine-tuned to predict the answer using character-based word embeddings. In particular, we introduce a vocabulary predictor with character-level semantic matching, which enables the model to recover the correct word from the vocabulary even when the OCR tokens are misspelled. A variety of experimental evaluations show that our method outperforms state-of-the-art methods on both the TextVQA and ST-VQA datasets. The code will be released at https://github.com/xiaojino/TWA.
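The abstract pairs OCR tokens with dictionary words by Levenshtein distance. The following is a minimal sketch of that pairing step only, not the authors' implementation: the helper names `levenshtein` and `nearest_words`, the toy dictionary, and the `max_distance` threshold are all assumptions made for illustration.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute, each cost 1)."""
    # Keep the shorter string in the inner loop to limit the row size.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_words(token: str, dictionary: List[str],
                  max_distance: int = 1) -> List[Tuple[int, str]]:
    """Hypothetical helper: dictionary words within max_distance of the token."""
    scored = [(levenshtein(token, word), word) for word in dictionary]
    return sorted((d, w) for d, w in scored if d <= max_distance)


# Toy dictionary (assumed); "peosi" is the misrecognized token from the abstract.
dictionary = ["pepsi", "pizza", "cola", "peso"]
print(nearest_words("peosi", dictionary))  # [(1, 'pepsi')]
```

In a contrastive setup such as the one the abstract outlines, pairs found this way could plausibly serve as positives (a noisy token and its nearby word), but how the paper actually builds and uses its pairs is not specified in this record.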
Year: 2022
DOI: 10.1145/3503161.3547977
Venue: International Multimedia Conference
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 6
Name | Order | Citations | PageRank
Zan-Xia Jin | 1 | 0 | 0.34
Mike Zheng Shou | 2 | 0 | 0.34
Fang Zhou | 3 | 0 | 0.34
Satoshi Tsutsui | 4 | 0 | 0.34
Jingyan Qin | 5 | 0 | 0.34
Xu-Cheng Yin | 6 | 533 | 44.83