Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching - Citegraph

Paper Info

Title
Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

Abstract
Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has been largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications like news recommendation, related article recommendation and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. We propose a transformer based hierarchical encoder to capture the document structure information. In order to better capture sentence level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark data sets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT. Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048. We will open source a Wikipedia based benchmark data set, code and a pre-trained model to accelerate future research on long-form document matching.

Year	DOI	Venue
2020	10.1145/3340531.3411908	CIKM '20: The 29th ACM International Conference on Information and Knowledge Management Virtual Event Ireland October, 2020
DocType	ISBN	Citations
Conference	978-1-4503-6859-9	3
PageRank	References	Authors
0.42	3	5

Authors (5 rows)

Cited by (3 rows)

References (3 rows)

Name	Order	Citations	PageRank
Liu Yang	1	380	18.54
Mingyang Zhang	2	104	10.61
Cheng Li	3	126	7.81
Michael Bendersky	4	986	48.69
Marc A. Najork	5	2538	278.16

1