From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA - Citegraph

Paper Info

Title
From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA

Abstract
Text-based Visual Question Answering (Text-VQA) is a question-answering task that requires understanding scene text, which is usually recognized by Optical Character Recognition (OCR) systems. However, the text produced by OCR systems often includes spelling errors, such as "pepsi" being recognized as "peosi". These OCR errors are one of the major challenges for Text-VQA systems. To address this, we propose a novel Text-VQA method that alleviates OCR errors via OCR token evolution. First, we artificially create misspelled OCR tokens at training time to make the system more robust to OCR errors. Specifically, we propose an OCR Token-Word Contrastive (TWC) learning task, which pre-trains word representations by augmenting OCR tokens via the Levenshtein distance between the OCR tokens and words in a dictionary. Second, assuming that the majority of characters in misspelled OCR tokens are still correct, we propose a multimodal transformer that is fine-tuned to predict the answer using character-based word embeddings. Specifically, we introduce a vocabulary predictor with character-level semantic matching, which enables the model to recover the correct word from the vocabulary even when the OCR tokens are misspelled. A variety of experimental evaluations show that our method outperforms state-of-the-art methods on both the TextVQA and ST-VQA datasets. The code will be released at https://github.com/xiaojino/TWA.
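
The abstract relies on the Levenshtein distance between OCR tokens and dictionary words. The sketch below is not the authors' implementation; it only illustrates how an edit distance can link a misspelled token such as "peosi" to a nearby dictionary word. The function names `levenshtein` and `nearest_words`, the toy vocabulary, and the `max_distance` threshold are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def nearest_words(token, vocabulary, max_distance=1):
    """Hypothetical helper: dictionary words within max_distance edits of token."""
    scored = [(levenshtein(token.lower(), word.lower()), word) for word in vocabulary]
    return [word for distance, word in sorted(scored) if distance <= max_distance]


# Toy vocabulary, assumed for illustration only.
vocabulary = ["pepsi", "coke", "pepper", "sprite"]
print(nearest_words("peosi", vocabulary))  # ['pepsi']
```

In a training-time augmentation setting, the same distance could be used in the opposite direction, pairing a clean word with synthetic misspellings that lie a small number of edits away; how the paper constructs those pairs is not specified in this record.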

Year	DOI	Venue
2022	10.1145/3503161.3547977	International Multimedia Conference
DocType	Citations	PageRank
Conference	0	0.34
References	Authors
0	6

Authors (6 rows)

Name	Order	Citations	PageRank
Zan-Xia Jin	1	0	0.34
Mike Zheng Shou	2	0	0.34
Fang Zhou	3	0	0.34
Satoshi Tsutsui	4	0	0.34
Jingyan Qin	5	0	0.34
Xu-Cheng Yin	6	533	44.83
