Registration and E-Mail List
To participate in the evaluation, please register here.
If you plan to participate in the evaluation campaign, please subscribe to this e-mail list to receive up-to-date information about the campaign.
Permissible Training Data
Training of MT systems and of language models for ASR is restricted to the data supplied by the organizers or listed below. No training data are distributed for ASR acoustic modeling. For German, participants may use any publicly available data recorded before July 17th, 2012. For English, the data must have been recorded before December 31st, 2010.
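As an illustration of the acoustic-data rule above, the following minimal sketch checks a recording date against the two cut-off dates. The names `CUTOFFS` and `is_permissible` are hypothetical, "before" is read here as strictly before (an assumption), and the "publicly available" condition is not checked.

```python
from datetime import date

# Cut-off dates taken from the rule above; the language codes are an
# illustrative choice, not part of the campaign specification.
CUTOFFS = {
    "de": date(2012, 7, 17),   # German: recorded before July 17th, 2012
    "en": date(2010, 12, 31),  # English: recorded before December 31st, 2010
}


def is_permissible(language: str, recorded_on: date) -> bool:
    """Return True if a recording predates the cut-off for its language.

    Assumption: "before" means strictly earlier than the cut-off date.
    Raises KeyError for languages without a stated cut-off.
    """
    return recorded_on < CUTOFFS[language]
```

For example, `is_permissible("en", date(2011, 1, 1))` is false under this reading, while `is_permissible("de", date(2012, 7, 16))` is true.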
Tracks
The IWSLT 2013 Evaluation Campaign will focus on the translation of TED Talks, a collection of public speeches on a variety of topics. The evaluation campaign will include the following tracks:
ASR track: automatic transcription of talks from audio to text
- Languages: English and German
- Input format: unsegmented SPHERE
- Output format: CTM, no case, no punctuation, UTF8
- English development data: dev2010, tst2010 and dev2012. Note: for the ASR track, only a UEM file similar to dev2010.uem and dev2012.uem will be provided, i.e. without sentence segmentation. Automatic segmentation of the data is therefore mandatory.
- German development data: dev2012. Note: for the ASR track, only a UEM file similar to dev2012.de-en.de.uem will be provided, i.e. without sentence segmentation. Automatic segmentation of the data is therefore mandatory.
- Evaluation Data:
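The ASR output format listed above (CTM, no case, no punctuation, UTF8) can be sketched as follows. This is a minimal illustration, not the campaign's reference tooling: it assumes the common CTM column order (recording, channel, start time, duration, word), and the names `Word`, `normalize`, `to_ctm` and the recording id are hypothetical. The normalization only strips ASCII punctuation at token edges; the exact normalization expected by the organizers is not specified here.

```python
import string
from typing import Iterable, NamedTuple


class Word(NamedTuple):
    start: float     # start time in seconds from the beginning of the recording
    duration: float  # duration in seconds
    text: str        # recognized token, possibly cased and punctuated


def normalize(token: str) -> str:
    """Lowercase a token and strip ASCII punctuation from both ends."""
    return token.strip(string.punctuation).lower()


def to_ctm(recording: str, words: Iterable[Word], channel: str = "1") -> str:
    """Format words as CTM lines: recording, channel, start, duration, word."""
    lines = []
    for w in words:
        token = normalize(w.text)
        if not token:
            continue  # skip tokens that were punctuation only
        lines.append(f"{recording} {channel} {w.start:.2f} {w.duration:.2f} {token}")
    return "".join(line + "\n" for line in lines)


def write_ctm(path: str, recording: str, words: Iterable[Word]) -> None:
    """Write the CTM text to disk as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(to_ctm(recording, words))
```

Calling `to_ctm("talk001", [Word(0.50, 0.31, "Thank"), Word(0.81, 0.22, "you,")])` yields one line per word, lowercased and without the trailing comma.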
SLT track: speech translation of talks from audio (or ASR output) to text
- Input format: segmented SPHERE, or ASR output
- Directions:
- Official: English -> French, German -> English, English -> German
- Optional: English -> Spanish, Portuguese (B), Italian, Chinese, Polish, Slovenian, Arabic, Persian
- Output format: NIST XML format, true case with punctuation
- ASR development data for English: dev2010, tst2010 and dev2012. Note: for the SLT track, only a UEM file similar to dev2010-manualSegmentation.uem and dev2012-manualSegmentation.uem will be provided. The output must follow this segmentation.
- Automatic Transcripts of English Data: tst2010, dev2010 and dev2012.
- Automatic Transcripts of German data: dev2012
- MT Training and Development Data
- Evaluation Data (reference ASR output and manual segmentation; for audio, see the ASR track):
MT track: text translation of talks for two language pairs plus eleven optional language pairs:
- Input format: NIST XML format, true case with punctuation
- Output format: NIST XML format, true case with punctuation
- Directions:
- Official: English -> French, German -> English, English -> German
- Optional: English <-> Arabic, Spanish, Portuguese (B), Italian, Chinese, Polish, Persian, Slovenian, Turkish, Dutch, Romanian, Russian
- Training and development data
- Evaluation Data
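The NIST XML output required for the SLT and MT tracks can be sketched as below. This is a hedged illustration only: the element and attribute names (`mteval`, `tstset`, `doc`, `seg`, `setid`, `srclang`, `trglang`, `sysid`, `docid`) follow the commonly used NIST MT evaluation layout, and the function name `build_tstset` and all identifier values are hypothetical. The attributes actually required should be checked against the supplied development files.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_tstset(set_id: str, src_lang: str, trg_lang: str, sys_id: str,
                 docs: Dict[str, List[str]]) -> str:
    """Wrap translated segments in a NIST-style XML test set.

    `docs` maps a document id to its translated segments, in source order.
    Segment ids are numbered from 1 within each document (an assumption).
    """
    root = ET.Element("mteval")
    tstset = ET.SubElement(root, "tstset", setid=set_id, srclang=src_lang,
                           trglang=trg_lang, sysid=sys_id)
    for doc_id, segments in docs.items():
        doc = ET.SubElement(tstset, "doc", docid=doc_id)
        for index, text in enumerate(segments, start=1):
            seg = ET.SubElement(doc, "seg", id=str(index))
            seg.text = text  # true case with punctuation; ElementTree escapes &, <, >
    return ET.tostring(root, encoding="unicode")
```

Building the tree with `xml.etree.ElementTree` rather than string concatenation keeps the output well-formed even when a translation contains characters such as `&` or `<`.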
Evaluation methods:
- ASR track: word or character error rate
- SLT/MT: BLEU, NIST, METEOR, TER (all directions)
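For orientation, the word and character error rates named above are both edit distances normalized by the reference length. The sketch below illustrates the metric only; the official scoring tools and their text normalization are not described here, and the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance divided by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (a simplifying assumption)."""
    return error_rate(reference.replace(" ", ""), hypothesis.replace(" ", ""))
```

Because insertions count as errors, either rate can exceed 1.0 when the hypothesis is much longer than the reference.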
Training of MT systems and of language models for ASR is restricted to the data supplied by the organizers and the other permissible data listed below. Supplied training, development, and test data will be available on the workshop’s webpage.
Data Permissible for MT Model and ASR Language Model Training
Provided Data
Other Permissible Data
Parallel:
- All data from the WMT 2013 web page
- MultiUN
- SETimes parallel corpus: ro-en, tr-en
Monolingual:
LDC (German)
- LDC2002S24, 1997 HUB5 German Evaluation
- LDC2003T03, 1997 HUB5 German Transcripts
- LDC2006S31, 2003 NIST Language Recognition Evaluation
- LDC2009S04, 2007 NIST Language Recognition Evaluation Test Set
- LDC2012T03, 2009 CoNLL Shared Task Part 1
- LDC96S51, CALLFRIEND German
- LDC97L18, CALLHOME German Lexicon
- LDC97S43, CALLHOME German Speech
- LDC97T15, CALLHOME German Transcripts
- LDC96L14, CELEX2
- LDC2006S35, CSLU: Multilanguage Telephone Speech Version 1.2
- LDC94T5, ECI Multilingual Text
- LDC95T11, European Language Newspaper Text
- LDC2006S13, N4 NATO Native and Non-Native Speech
- LDC94S17, OGI Multilanguage Corpus
- LDC2009T25, Web 1T 5-gram, 10 European Languages Version 1
Miscellaneous:
- SFB 588 (German)
- German Political Speeches Corpus (German)
- Leipzig Corpora Collection (German)
- News Commentary v7 from WMT 2012 (English, French)
- News Crawl from WMT 2012 (English, French)
- Europarl v7 (English, French)
- LDC2011T07 English Gigaword Fifth Edition (English)
- LDC2009T28 French Gigaword Second Edition (French)
- Google Books Ngrams (English, Chinese, French, German, Italian, Russian, Spanish)
- LDC2012T13 English Web Treebank
- LDC2011T03 OntoNotes 4.0
- LDC99T42 Treebank-3
- Gold Alignment for the Europarl German-English Corpus (kindly supplied by RWTH)
- Arabic Gigaword Fifth Edition
Important Dates:
Workshop
- February 2013: Call for Participation
- Sept-Nov 2013: Early and late registration
- Dec 5-6, 2013: Workshop
Scientific Papers
- Sept 29, 2013: Paper Submission due
- November 12, 2013: Review Feedback
- November 18, 2013: Camera-ready paper due
Evaluation Campaign
- June 8, 2013: Release of TRAIN and DEV data
- Sep 2-8, 2013: Dry run ASR track
- Sep 9-15, 2013: Test period ASR track
- Sep 23-29, 2013: Test period SLT track (official directions)
- Oct 7-13, 2013: Test period MT track (official directions)
- Oct 7-20, 2013: Test period MT/SLT track (optional directions)
- Nov 3, 2013: System description paper due
- Nov 19, 2013: Review feedback
- Nov 25, 2013: Camera-ready paper due
Contact, Evaluation Chair
- Marcello Federico, FBK, federico ∂ fbk eu
- Sebastian Stüker, KIT, sebastian stueker ∂ kit edu