An annotated corpus for the analysis of VP ellipsis

Bos, Johan; Spenader, Jennifer
December 2011
Language Resources & Evaluation;Dec2011, Vol. 45 Issue 4, p463
Academic Journal
Verb Phrase Ellipsis (VPE) has been studied in great depth in theoretical linguistics, but empirical studies of VPE are rare. We extend the few previous corpus studies with an annotated corpus of VPE in all 25 sections of the Wall Street Journal corpus (WSJ) distributed with the Penn Treebank. We annotated the raw files using a stand-off annotation scheme that codes the auxiliary verb triggering the elided verb phrase, the start and end of the antecedent, the syntactic type of the antecedent (VP, TV, NP, PP or AP), and the type of syntactic pattern between the source and target clauses of the VPE and its antecedent. We found 487 instances of VPE (including predicative ellipsis, antecedent-contained deletion, comparative constructions, and pseudo-gapping) plus 67 cases of related phenomena such as do so anaphora. Inter-annotator agreement was high, with an average F-score of 0.97 for three annotators on one section of the WSJ. Our annotation is theory-neutral and has better coverage than earlier efforts that relied on automatic methods: for example, simply searching the parsed version of the Penn Treebank for empty VPs achieves high precision (0.95) but low recall (0.58) when compared with our manual annotation. The distribution of VPE source-target patterns deviates markedly from the standard examples found in the theoretical linguistics literature on VPE, once more underlining the value of corpus studies. The resulting corpus will be useful for studying VPE phenomena as well as for evaluating natural language processing systems equipped with ellipsis resolution algorithms, and we propose evaluation measures for VPE detection and VPE antecedent selection. The stand-off annotation is freely available for research purposes.
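
The precision, recall and F-score figures quoted in the abstract can be related with a short sketch. It assumes the standard balanced F-score (the harmonic mean of precision and recall) and scores detection by comparing sets of items. The abstract does not spell out the measures the authors propose, so the function names and the (file, offset) identifiers below are hypothetical illustrations and not the paper's definitions.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_scores(gold: set, predicted: set) -> tuple:
    """Return (precision, recall, F-score) of predicted items against gold items.

    Items are assumed to be hashable identifiers of an ellipsis site,
    for example a hypothetical (file name, token offset) pair.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Precision and recall quoted in the abstract for the empty-VP treebank search.
baseline_f = f_score(0.95, 0.58)
print(round(baseline_f, 2))
```

Under this assumption, the automatic search's high precision is pulled down by its low recall, which is the coverage gap the manual annotation is meant to close.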


Related Articles

  • Intended boundaries detection in topic change tracking for text segmentation. Labadié, Alexandre; Prince, Violaine // International Journal of Speech Technology;Dec2008, Vol. 11 Issue 3/4, p167 

    This paper presents a topical text segmentation method based on intended boundaries detection and compares it to a well-known default boundaries detection method, c99. We compared the two methods by running them on two different corpora of French texts and results are evaluated by two different...

  • Automatic transformation from TIDES to TimeML annotation. Saquete, Estela; Pustejovsky, James // Language Resources & Evaluation;Dec2011, Vol. 45 Issue 4, p495 

    Until recently, most systems performing temporal extraction and reasoning from text have focused on recognizing and normalizing temporal expressions alone, for which the TIDES annotation scheme has been adopted. Temporal awareness of a text, however, involves not only identifying the temporal...

  • Metaphor Identification in Large Texts Corpora. Neuman, Yair; Assaf, Dan; Cohen, Yohai; Last, Mark; Argamon, Shlomo; Howard, Newton; Frieder, Ophir // PLoS ONE;Apr2013, Vol. 8 Issue 4, p1 

    Identifying metaphorical language-use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two...


    Parallel corpus has recently become an indispensable resource in multilingual natural language processing. Manual preparation of a bilingual corpus is a laborious task. Therefore methods for the automated creation of parallel corpus are currently a topic of concern for many researchers. A number...

  • Resources for Turkish morphological processing. Sak, Haşim; Güngör, Tunga; Saraçlar, Murat // Language Resources & Evaluation;May2011, Vol. 45 Issue 2, p249 

    We present a set of language resources and tools (a morphological parser, a morphological disambiguator, and a text corpus) for exploiting Turkish morphology in natural language processing applications. The morphological parser is a state-of-the-art finite-state transducer-based implementation of...

  • AUTOMATED ARABIC ANTONYM EXTRACTION USING A CORPUS ANALYSIS TOOL. ALDHUBAYI, LULUH; ALYAHYA, MAHA // Journal of Theoretical & Applied Information Technology;12/31/2014, Vol. 70 Issue 3, p422 

    The automatic extraction of semantic relations between words from textual corpora is an extremely challenging task. The increasing need for language resources supporting Natural language processing (NLP) applications has encouraged the development of automated methods for the extraction of...

  • Grammatical Relation Extraction in Arabic Language. Hammadi, Othman Ibrahim; Aziz, Mohd Juzaiddin Ab // Journal of Computer Science;2012, Vol. 8 Issue 6, p891 

    Problem statement: Grammatical Relation (GR) can be defined as a linguistic relation established by grammar, where linguistic relation is an association among the linguistic forms or constituents. Fundamentally the GR determines grammatical behaviors such as: placement of a word in a clause,...

  • My brave old world. Teubert, Wolfgang // International Journal of Corpus Linguistics;2010, Vol. 15 Issue 3, p395 

    The article discusses the use of corpus linguistics in controlling a discourse which is to be interpreted by people. It says that discourse is not being preferred by natural language processing (NLP) experts or language engineers in conceptual language models but rather the statistical models....

  • Efficient corpus development for lexicography: building the New Corpus for Ireland. Kilgarriff, Adam; Rundell, Michael; Dhonnchadha, Elaine Uí // Language Resources & Evaluation;May2006, Vol. 40 Issue 2, p127 

    In a 12-month project we have developed a new, register-diverse, 55-million-word bilingual corpus—the New Corpus for Ireland (NCI)—to support the creation of a new English-to-Irish dictionary. The paper describes the strategies we employed, and the solutions to problems...

