Title
Broad coverage paragraph segmentation across languages and domains
Abstract
This article considers the problem of automatic paragraph segmentation. The task is relevant for speech-to-text applications, whose output transcripts usually contain neither punctuation nor paragraph indentation and are therefore difficult to read and process. Text-to-text generation applications (e.g., summarization) could also benefit from an automatic paragraph segmentation mechanism that indicates topic shifts and provides visual targets to the reader. We present a paragraph segmentation model that exploits a variety of knowledge sources (including textual cues and syntactic and discourse-related information) and evaluate its performance in different languages and domains. Our experiments demonstrate that the proposed approach significantly outperforms our baselines and in many cases comes to within a few percent of human performance. Finally, we integrate our method with a single-document summarizer and show that it is useful for structuring the output of automatically generated text.
Year: 2006
DOI: 10.1145/1149290.1151098
Venue: TSLP
Keywords
Field
DocType
text-to-text generation application,output transcripts,paragraph indentation,segmentation,discourse-related information,broad coverage paragraph segmentation,paragraph breaks,paragraph segmentation model,summarisation,human performance,automatic paragraph segmentation mechanism,automatic paragraph segmentation,knowledge source,different language,speech to text,summarization,machine learning
Automatic summarization,Indentation,Computer science,Segmentation,Exploit,Speech recognition,Paragraph,Artificial intelligence,Natural language processing,Structuring,Syntax,Punctuation
Journal
Volume: 3
Issue: 2
ISSN: 1550-4875
Citations: 9
PageRank: 0.70
References: 27
Authors
2
Name | Order | Citations | PageRank
Caroline Sporleder | 1 | 453 | 31.84
Mirella Lapata | 2 | 5973 | 369.52