Research Report: Building a Wide Reach Corpus for Secure Parser Development - Citegraph

Paper Info

Title
Research Report: Building a Wide Reach Corpus for Secure Parser Development

Abstract
Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the Language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we provide the details of several methods we have used to gather approximately 30 million files, extract features, and make these features amenable to search and use in analytics. Additionally, we document the opportunities and limitations of some popular open-source datasets and annotation tools, which will benefit researchers who need to efficiently gather a large file corpus for LangSec parser development.

Year	DOI	Venue
2020	10.1109/SPW50608.2020.00066	2020 IEEE Security and Privacy Workshops (SPW)
Keywords	DocType	ISBN
LangSec,language-theoretic security,file corpus creation,file forensics,text extraction,parser resources	Conference	978-1-7281-9347-2
Citations	PageRank	References
2	0.41	7
Authors
9

Authors (9 rows)

Name	Order	Citations	PageRank
Tim Allison	1	4	1.17
Wayne Burke	2	4	1.17
Valentino Constantinou	3	2	0.41
Edwin Goh	4	2	0.41
Chris A. Mattmann	5	200	25.39
Anastasija Mensikova	6	2	0.41
Philip Southam	7	4	1.17
Ryan Stonebraker	8	4	1.17
Virisha Timmaraju	9	2	0.41
