Training Data

The MARSEC database of spoken British English [9] is used for training and testing. It consists of stories recorded from BBC Radio 4, including extracts from talk shows, news and weather reports, and continuity announcements. The corpus is labelled with part of speech tags and two levels of break. For the following tests, we split the database into a training set of 30 stories (31,707 words, 6,346 breaks) and a test set of 10 stories (7,662 words, 1,404 breaks).
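As a rough check on how comparable the two sets are, the break density of each can be computed directly from the counts given above. The following is only a sketch: the names SPLITS and break_rate are illustrative, and it assumes each break is counted against the word it follows, which the text does not state explicitly.

```python
# Counts as reported in the text for the MARSEC split.
SPLITS = {
    "train": {"stories": 30, "words": 31707, "breaks": 6346},
    "test": {"stories": 10, "words": 7662, "breaks": 1404},
}


def break_rate(words, breaks):
    """Fraction of words followed by a break (assumes one break slot per word)."""
    return breaks / words


for name, counts in SPLITS.items():
    rate = break_rate(counts["words"], counts["breaks"])
    words_per_break = counts["words"] / counts["breaks"]
    print(f"{name}: {rate:.1%} of words are followed by a break "
          f"(one break every {words_per_break:.1f} words)")
```

Under that assumption the training set has a break after about 20.0% of words and the test set after about 18.3%, so breaks are the minority class in both sets.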

Although this database has its own tagset, we did not use it in our tests, for two reasons. First, the MARSEC tagset differs from the WSJ Penn Treebank tagset and the mapping between them appears non-trivial. Second, the MARSEC database itself does not contain enough data to train a POS tagger. The MARSEC data was therefore re-tagged using our tagger with the WSJ tagset.

Alan W Black
Tue Jul 1 17:09:00 BST 1997