Abstract |
---|
ClueWeb22, the newest iteration of the ClueWeb line of datasets, is the result of more than a year of collaboration between industry and academia. Its design is influenced by the research needs of the academic community and the real-world needs of large-scale industry systems. Compared with earlier ClueWeb datasets, the ClueWeb22 corpus is larger, more varied, and contains higher-quality documents. Its core is raw HTML, but it includes clean text versions of documents to lower the barrier to entry. Several aspects of ClueWeb22 are available to the research community for the first time at this scale, including visual representations of rendered web pages, parsed structured information from the HTML document, and the alignment of document distributions (domains, languages, and topics) to commercial web search. This talk shares the design and construction of ClueWeb22 and discusses its new features. We believe this newer, larger, and richer ClueWeb corpus will enable and support a broad range of research in IR, NLP, and deep learning. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1145/3477495.3536321 | SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Arnold Overwijk | 1 | 16 | 2.19 |
Chen-Yan Xiong | 2 | 405 | 30.82 |
James P. Callan | 3 | 6237 | 833.28 |