Title
FinDX: A Versatile, Low-Resource Approach to Financial Website Classification
Abstract
The World Wide Web provides an excellent platform for investors to discover new partnership opportunities with a variety of companies. Analysts can categorize websites according to their business domains to retain relevant investment opportunities. Classifying websites manually is too expensive and time-consuming; thus, automatic classification tools are necessary. In this paper, we present FinDX (Financial Data Exploration), a tool for automatic website content classification for the financial technology (fintech) domain. At the core of our system is a keyword-based web crawler that extracts text from the landing page and relevant subpages, such as the About or Product pages of company websites. After cleaning the text and filtering it using part-of-speech tagging, we use a Linear Support Vector Machine (SVM) or Multilayer Perceptron (MLP) to classify a company website as fintech or non-fintech. FinDX achieves high binary classification accuracy on two different datasets of business websites, attaining a maximal F-score of 96%. In addition, our flexible tool is easily adaptable to any business domain and is not resource-expensive. This makes FinDX ideal for use in startup environments.
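The first stages described in the abstract (keyword-based subpage selection and term weighting) can be illustrated with a minimal, standard-library-only sketch. Everything named here is an assumption for illustration, not the authors' implementation: the keyword list `SUBPAGE_KEYWORDS`, the helpers `LinkCollector`, `relevant_subpages`, `tokenize`, and `tfidf`, and the choice of `tf * log(N / df)` as the TF-IDF variant. Part-of-speech filtering and the SVM/MLP classifier are omitted.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list; the abstract names only About and Product pages.
SUBPAGE_KEYWORDS = ("about", "product")


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def relevant_subpages(base_url, html, keywords=SUBPAGE_KEYWORDS):
    """Return absolute URLs whose href or anchor text contains a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    urls = []
    for href, text in parser.links:
        haystack = (href + " " + text).lower()
        if any(k in haystack for k in keywords):
            url = urljoin(base_url, href)
            if url not in urls:  # keep first occurrence, preserve order
                urls.append(url)
    return urls


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        vectors.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors
```

In a fuller pipeline, the text fetched from the selected subpages would be cleaned and filtered by part of speech before weighting, and the resulting vectors would be passed to a linear SVM or MLP classifier, as the abstract describes.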
Year: 2019
DOI: 10.1109/BigData47090.2019.9006368
Venue: 2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)
Keywords: Web Content Classification, Machine Learning, Linear Support Vector Machine, Bag-of-Words, Term Frequency Inverse-Document Frequency, Financial Technology
Field: Bag-of-words model, Landing page, Binary classification, tf–idf, Computer science, Support vector machine, Business domain, FinTech, Finance, Web crawler
DocType: Conference
ISSN: 2639-1589
Citations: 0
PageRank: 0.34
References: 0
Authors: 3
Name              Order  Citations  PageRank
Alissa Ostapenko  1      0          0.34
Rodica Neamtu     2      9          4.26
Frazer Anderson   3      0          0.34