Design, creation, and analysis of Czech corpora for structural metadata extraction from speech

Kolář, Jáchym
December 2011
Language Resources & Evaluation;Dec2011, Vol. 45 Issue 4, p439
Academic Journal
Structural metadata extraction (MDE) research aims to develop techniques for automatic conversion of raw speech recognition output to forms that are more useful to humans and downstream automatic processes. The MDE annotation includes inserting boundaries of sentence-like units into the flow of speech, labeling non-content words like filled pauses and discourse markers for optional removal, and identifying sections of disfluent speech. This paper describes the design, creation, and analysis of data resources for structural MDE from spoken Czech. The annotation is based on the LDC's MDE annotation standard for English, with changes applied to accommodate specific phenomena of Czech. In addition to the necessary language-dependent modifications, we further proposed and applied several language-independent modifications slightly refining the original annotation scheme. We created two Czech MDE speech corpora: one in the domain of broadcast news and the other in the domain of broadcast conversations. Both corpora have already been published at LDC. The analysis section of this paper presents a variety of statistics about fillers, edit disfluencies, and sentence-like units. The two Czech corpora are compared not only with each other, but also with statistics relating to the available English MDE corpora. We also report statistics indicating that edit disfluencies have a different part-of-speech (POS) distribution from the overall POS distribution. The findings from the corpus analysis should help guide strategies for developing automatic MDE systems.


Related Articles

  • Phonetically rich and balanced text and speech corpora for Arabic language. Abushariah, Mohammad; Ainon, Raja; Zainuddin, Roziati; Elshafei, Moustafa; Khalifa, Othman // Language Resources & Evaluation;Dec2012, Vol. 46 Issue 4, p601 

    This paper describes the preparation, recording, analyzing, and evaluation of a new speech corpus for Modern Standard Arabic (MSA). The speech corpus contains a total of 415 sentences recorded by 40 (20 male and 20 female) Arabic native speakers from 11 different Arab countries representing...

  • On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling. Hämäläinen, Annika; Boves, Lou; De Veth, Johan; Ten Bosch, Louis // EURASIP Journal on Audio Speech & Music Processing;2007, Vol. 2007, Special section p1 

    Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones that conventional automatic speech recognisers use. However, the impressive speech recognition results obtained with...

  • VoCMex: a voice corpus in Mexican Spanish for research in speaker recognition. Olguín-Espinoza, José-Martín; Mayorga-Ortiz, Pedro; Hidalgo-Silva, Hugo; Vizcarra-Corral, Luis; Mendiola-Cárdenas, Mónica-Livier // International Journal of Speech Technology;Sep2013, Vol. 16 Issue 3, p295 

    Voice corpus is an essential element for automatic speaker recognition systems. In order for a corpus to be useful in recognition tasks, it must contain recordings from several speakers pronouncing phonetically balanced utterances; recorded through several sessions using different recording...

  • Improving the Accuracy of English-Arabic Statistical Sentence Alignment.  // International Arab Journal of Information Technology (IAJIT);Apr2011, Vol. 8 Issue 2, p171 

    No abstract available.

  • Exploring Translation Behavior Regarding Sentence Length and Sentence Constituents: A Descriptive Study Based on Chinese-Japanese Bi-directional Parallel Corpus. Min-chun Teng // Compilation & Translation Review;Mar2011, Vol. 4 Issue 1, p99 

    This research investigates the features of translated text and non-translated texts through analyzing the sentence length and constituent format in Chinese to Japanese and Japanese to Chinese translation. It uses a corpus-based methodology and sets up a model to describe translators' linguistic...

  • Hidden Markov Models for Automatic Speech Recognition. Aymen, Mbarki; Abdelaziz, Ammari; Halim, Sghaier; Maaref, Hassen // Journal of Mechanics Engineering & Automation;2011, Vol. 1 Issue 1, p68 

    In this paper the authors look into the problem of Hidden Markov Models (HMM): the evaluation, the decoding and the learning problem. The authors have explored an approach to increase the effectiveness of HMM in the speech recognition field. Although hidden Markov modeling has significantly...

  • On Developing an Automatic Speech Recognition System for Standard Arabic Language. Walha, R.; Drira, F.; El-Abed, H.; Alimi, A. M. // World Academy of Science, Engineering & Technology;Oct2012, Issue 70, p317 

    The Automatic Speech Recognition (ASR) applied to Arabic language is a challenging task. This is mainly related to the language specificities which make the researchers facing multiple difficulties such as the insufficient linguistic resources and the very limited number of available transcribed...

  • Speech Corpora as Facilities of Creation and Storage of Exemplary Speech Signals. Prodeus, A. M. // Naukovi visti NTUU - KPI;2013, Vol. 87 Issue 1, p13 

    Speech corpora are an important constituent of modern investigators' toolkit in such areas as speech correction, designing and testing elements of telecommunication systems and systems of automatic speech recognition. In this paper, we search for elements of construction technology of the sound...

  • CATEGORIZATION OF UNORGANIZED TEXT CORPORA FOR BETTER DOMAIN-SPECIFIC LANGUAGE MODELING. STAS, Jan; ZLACKY, Daniel; HLADEK, Daniel; JUHAR, Jozef // Advances in Electrical & Electronic Engineering;2013 Special Issue, Vol. 11 Issue 5, p398 

    This paper describes the process of categorization of unorganized text data gathered from the Internet to the in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. An algorithm for text categorization and topic detection based on the most frequent...

