Abstract | ||
---|---|---|
This article considers the problem of automatic paragraph segmentation. The task is relevant for speech-to-text applications whose output transcripts do not usually contain punctuation or paragraph indentation and are naturally difficult to read and process. Text-to-text generation applications (e.g., summarization) could also benefit from an automatic paragraph segmentation mechanism which indicates topic shifts and provides visual targets to the reader. We present a paragraph segmentation model which exploits a variety of knowledge sources (including textual cues, syntactic and discourse-related information) and evaluate its performance in different languages and domains. Our experiments demonstrate that the proposed approach significantly outperforms our baselines and in many cases comes to within a few percent of human performance. Finally, we integrate our method with a single-document summarizer and show that it is useful for structuring the output of automatically generated text. |
Year | DOI | Venue |
---|---|---|
2006 | 10.1145/1149290.1151098 | TSLP |
Keywords | Field | DocType |
text-to-text generation application,output transcripts,paragraph indentation,segmentation,discourse-related information,broad coverage paragraph segmentation,paragraph breaks,paragraph segmentation model,summarisation,human performance,automatic paragraph segmentation mechanism,automatic paragraph segmentation,knowledge source,different language,speech to text,summarization,machine learning | Automatic summarization,Indentation,Computer science,Segmentation,Exploit,Speech recognition,Paragraph,Artificial intelligence,Natural language processing,Structuring,Syntax,Punctuation | Journal |
Volume | Issue | ISSN |
3 | 2 | 1550-4875 |
Citations | PageRank | References |
9 | 0.70 | 27 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Caroline Sporleder | 1 | 453 | 31.84 |
Mirella Lapata | 2 | 5973 | 369.52 |