Tweet Segmentation and Its Application to Named Entity Recognition - Citegraph

Paper Info

Title
Tweet Segmentation and Its Application to Named Entity Recognition

Abstract
Twitter has attracted millions of users to share and disseminate the most up-to-date information, resulting in large volumes of data produced every day. However, many applications in Information Retrieval (IR) and Natural Language Processing (NLP) suffer severely from the noisy and short nature of tweets. In this paper, we propose a novel framework for tweet segmentation in a batch mode, called HybridSeg. By splitting tweets into meaningful segments, the semantic or context information is well preserved and easily extracted by the downstream applications. HybridSeg finds the optimal segmentation of a tweet by maximizing the sum of the stickiness scores of its candidate segments. The stickiness score considers the probability of a segment being a phrase in English (i.e., global context) and the probability of a segment being a phrase within the batch of tweets (i.e., local context). For the latter, we propose and evaluate two models to derive local context by considering the linguistic features and term-dependency in a batch of tweets, respectively. HybridSeg is also designed to iteratively learn from confident segments as pseudo feedback. Experiments on two tweet data sets show that tweet segmentation quality is significantly improved by learning both global and local contexts compared with using global context alone. Through analysis and comparison, we show that local linguistic features are more reliable for learning local context than term-dependency. As an application, we show that high accuracy is achieved in named entity recognition by applying segment-based part-of-speech (POS) tagging.
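
The abstract describes segmentation as choosing the split of a tweet that maximizes the sum of per-segment stickiness scores. A minimal sketch of that objective, assuming a dynamic-programming search over contiguous segments, is given below. The names `segment`, `toy_stickiness`, `TOY_SCORES`, and the `max_len` cap are hypothetical, and the toy scores are invented for illustration; they are not the paper's stickiness function, which combines global and local context.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[str, ...]


def segment(tokens: Sequence[str],
            stickiness: Callable[[Segment], float],
            max_len: int = 3) -> List[Segment]:
    """Split tokens into contiguous segments with the highest total score."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: top total score for tokens[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        # Try every segment of at most max_len tokens that ends at `end`.
        for start in range(max(0, end - max_len), end):
            score = best[start] + stickiness(tuple(tokens[start:end]))
            if score > best[end]:
                best[end], back[end] = score, start
    # Walk the back-pointers from the end to recover the chosen segments.
    segments: List[Segment] = []
    i = n
    while i > 0:
        segments.append(tuple(tokens[back[i]:i]))
        i = back[i]
    return segments[::-1]


# Hypothetical scores for illustration only (not taken from the paper).
TOY_SCORES = {
    ("new", "york"): 1.5,
    ("new", "york", "city"): 3.0,
}


def toy_stickiness(seg: Segment) -> float:
    """Toy stand-in: known phrases score high, unknown multi-word spans are penalized."""
    if seg in TOY_SCORES:
        return TOY_SCORES[seg]
    return 0.1 if len(seg) == 1 else -1.0


print(segment("i love new york city".split(), toy_stickiness))
# [('i',), ('love',), ('new', 'york', 'city')]
```

With these toy scores the three-word phrase is kept together because its score exceeds the total of any split of the same words. Swapping in a different scoring function changes the result without changing the search.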

Year	2015
DOI	10.1109/TKDE.2014.2327042
Venue	IEEE Trans. Knowl. Data Eng.
Keywords	wikipedia, hybridseg framework, context learning, nlp, named entity recognition, stickiness score, linguistic processing, context information, twitter stream, information retrieval, ir, pos tagging, segment-based part-of-speech tagging, linguistic features, natural language processing, social networking (online), tweet segmentation, semantic information, pragmatics, encyclopedias, electronic publishing, internet, semantics
Field	Data mining, Pragmatics, Deep linguistic processing, Computer science, Phrase, Dissemination, Artificial intelligence, Natural language processing, Encyclopedia, Information retrieval, Segmentation, Named-entity recognition, Semantics, Machine learning
DocType	Journal
Volume	27
Issue	2
ISSN	1041-4347
Citations	18
PageRank	0.72
References	35
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Chenliang Li	1	590	39.20
Aixin Sun	2	3071	156.89
Jianshu Weng	3	1609	83.04
Qi He	4	2326	132.92
