Broad coverage paragraph segmentation across languages and domains - Citegraph

Paper Info

Title
Broad coverage paragraph segmentation across languages and domains

Abstract
This article considers the problem of automatic paragraph segmentation. The task is relevant for speech-to-text applications whose output transcripts do not usually contain punctuation or paragraph indentation and are naturally difficult to read and process. Text-to-text generation applications (e.g., summarization) could also benefit from an automatic paragraph segmentation mechanism which indicates topic shifts and provides visual targets to the reader. We present a paragraph segmentation model which exploits a variety of knowledge sources (including textual cues, syntactic and discourse-related information) and evaluate its performance in different languages and domains. Our experiments demonstrate that the proposed approach significantly outperforms our baselines and in many cases comes to within a few percent of human performance. Finally, we integrate our method with a single document summarizer and show that it is useful for structuring the output of automatically generated text.

Year	DOI	Venue
2006	10.1145/1149290.1151098	TSLP
Keywords	Field	DocType
text-to-text generation application,output transcripts,paragraph indentation,segmentation,discourse-related information,broad coverage paragraph segmentation,paragraph breaks,paragraph segmentation model,summarisation,human performance,automatic paragraph segmentation mechanism,automatic paragraph segmentation,knowledge source,different language,speech to text,summarization,machine learning	Automatic summarization,Indentation,Computer science,Segmentation,Exploit,Speech recognition,Paragraph,Artificial intelligence,Natural language processing,Structuring,Syntax,Punctuation	Journal
Volume	Issue	ISSN
3	2	1550-4875
Citations	PageRank	References
9	0.70	27
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Caroline Sporleder	1	453	31.84
Mirella Lapata	2	5973	369.52
