Learning to tokenize web domains - Citegraph

Paper Info

Title
Learning to tokenize web domains

Abstract
Domain Match (DM) is an Internet monetization product offered by web companies such as Yahoo!. It displays ads and search results when a user requests a webpage from a domain that does not exist or has no content. The product earns a significant amount of advertising revenue for major Internet companies and receives millions of queries per day. DM works by tokenizing the input domains and sub-folders into keywords and then displaying ads and search results retrieved for those keywords. In this poster, we describe a machine-learning-based solution that automatically learns to tokenize new domains, given a training dataset of domains and their tokenizations. We use positional frequency and parts of speech as features for scoring tokens. Token scores are combined using various scoring models. We compare two ways of training the models: simple gain-function-based training and large margin training. Experimental results are encouraging.
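To make the tokenization step concrete, here is a minimal sketch of splitting a domain label into keywords by choosing the highest-scoring segmentation. It is not the authors' model: the abstract's positional-frequency and part-of-speech features and the learned scoring models are replaced here by a plain log-frequency score, and the vocabulary `TOY_COUNTS`, the unknown-token penalty, and the function names are all hypothetical.

```python
import math

# Hypothetical toy vocabulary with made-up counts (not taken from the paper).
TOY_COUNTS = {
    "cheap": 50, "flights": 40, "flight": 30,
    "a": 20, "s": 5, "ap": 2, "che": 1, "p": 1,
}


def token_score(token, counts, total):
    """Log relative frequency; unknown tokens get a length-scaled penalty."""
    if token in counts:
        return math.log(counts[token] / total)
    return -10.0 * len(token)  # assumed penalty, chosen only for illustration


def tokenize_domain(label, counts=TOY_COUNTS, max_len=12):
    """Split a domain label (no dots, no TLD) into its best-scoring tokens."""
    total = sum(counts.values())
    n = len(label)
    # best[i] = (score of the best segmentation of label[:i], start of last token)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j][0] + token_score(label[j:i], counts, total)
            if score > best[i][0]:
                best[i] = (score, j)
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        j = best[i][1]
        tokens.append(label[j:i])
        i = j
    return tokens[::-1]


print(tokenize_domain("cheapflights"))
```

With this toy vocabulary, the label `cheapflights` is split into `cheap` and `flights`. A learned model of the kind the abstract describes would replace `token_score` with a function of the token's features and trained weights.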

Year	DOI	Venue
2011	10.1145/1963192.1963258	World Wide Web Conference Series
Keywords	Field	DocType
large margin training,internet monetization product,important product,scoring token,domain tokenization,various scoring model,advertising revenue,internet monetization,web domain,training dataset,domain match,large margin learning,search result,machine learning,part of speech	Revenue,Data mining,World Wide Web,Web page,Computer science,Monetization,Part of speech,Gain function,Artificial intelligence,Machine learning,The Internet	Conference
Citations	PageRank	References
1	0.37	2
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Sriram Srinivasan	1	379	27.92
Sourangshu Bhattacharya	2	94	14.00
