An annotated corpus for the analysis of VP ellipsis

Bos, Johan; Spenader, Jennifer
December 2011
Language Resources & Evaluation;Dec2011, Vol. 45 Issue 4, p463
Academic Journal
Verb Phrase Ellipsis (VPE) has been studied in great depth in theoretical linguistics, but empirical studies of VPE are rare. We extend the few previous corpus studies with an annotated corpus of VPE in all 25 sections of the Wall Street Journal corpus (WSJ) distributed with the Penn Treebank. We annotated the raw files using a stand-off annotation scheme that codes the auxiliary verb triggering the elided verb phrase, the start and end of the antecedent, the syntactic type of the antecedent (VP, TV, NP, PP or AP), and the type of syntactic pattern between the source and target clauses of the VPE and its antecedent. We found 487 instances of VPE (including predicative ellipsis, antecedent-contained deletion, comparative constructions, and pseudo-gapping) plus 67 cases of related phenomena such as do so anaphora. Inter-annotator agreement was high, with an average F-score of 0.97 across three annotators on one section of the WSJ. Our annotation is theory-neutral and has better coverage than earlier efforts that relied on automatic methods. For example, simply searching the parsed version of the Penn Treebank for empty VPs achieves high precision (0.95) but low recall (0.58) when compared with our manual annotation. The distribution of VPE source-target patterns deviates markedly from the standard examples found in the theoretical linguistics literature on VPE, once more underlining the value of corpus studies. The resulting corpus will be useful for studying VPE phenomena as well as for evaluating natural language processing systems equipped with ellipsis resolution algorithms, and we propose evaluation measures for VPE detection and VPE antecedent selection. The stand-off annotation is freely available for research purposes.
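
The abstract names the fields of the stand-off scheme and reports precision and recall for automatic VPE detection, but it does not give the file format or the exact evaluation measures the authors propose. The following Python sketch is therefore only an illustration under stated assumptions: the class `VPEAnnotation`, its field names, the use of character offsets, and the function `detection_scores` are all hypothetical, and exact-match precision, recall and balanced F-score on trigger spans stand in for whatever measures the paper actually defines.

```python
from dataclasses import dataclass

# Antecedent types listed in the abstract.
ANTECEDENT_TYPES = {"VP", "TV", "NP", "PP", "AP"}


@dataclass(frozen=True)
class VPEAnnotation:
    """One stand-off record. Field names and offset units are assumptions."""
    file_id: str           # hypothetical identifier of the raw WSJ file
    trigger_start: int     # span of the auxiliary that triggers the ellipsis
    trigger_end: int
    antecedent_start: int  # span of the antecedent
    antecedent_end: int
    antecedent_type: str   # one of ANTECEDENT_TYPES
    pattern: str           # label for the source-target syntactic pattern

    def __post_init__(self):
        if self.antecedent_type not in ANTECEDENT_TYPES:
            raise ValueError(f"unknown antecedent type: {self.antecedent_type}")


def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_scores(gold, predicted):
    """Exact-match precision, recall and F-score on trigger spans."""
    gold_keys = {(a.file_id, a.trigger_start, a.trigger_end) for a in gold}
    pred_keys = {(a.file_id, a.trigger_start, a.trigger_end) for a in predicted}
    true_positives = len(gold_keys & pred_keys)
    precision = true_positives / len(pred_keys) if pred_keys else 0.0
    recall = true_positives / len(gold_keys) if gold_keys else 0.0
    return precision, recall, f_score(precision, recall)
```

Under the assumption of a balanced F-score, `f_score(0.95, 0.58)` evaluates to roughly 0.72 for the empty-VP search baseline, compared with the 0.97 average reported for inter-annotator agreement.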


Related Articles

  • Intended boundaries detection in topic change tracking for text segmentation. Labadié, Alexandre; Prince, Violaine // International Journal of Speech Technology;Dec2008, Vol. 11 Issue 3/4, p167 

    This paper presents a topical text segmentation method based on intended boundaries detection and compares it to a well known default boundaries detection method, c99. We compared the two methods by running them on two different corpora of French texts and results are evaluated by two different...

  • Automatic transformation from TIDES to TimeML annotation. Saquete, Estela; Pustejovsky, James // Language Resources & Evaluation;Dec2011, Vol. 45 Issue 4, p495 

    Until recently, most systems performing temporal extraction and reasoning from text have focused on recognizing and normalizing temporal expressions alone, for which the TIDES annotation scheme has been adopted. Temporal awareness of a text, however, involves not only identifying the temporal...


    Parallel corpora have recently become an indispensable resource in multilingual natural language processing. Manual preparation of a bilingual corpus is a laborious task. Therefore, methods for the automated creation of parallel corpora are currently a topic of concern for many researchers. A number...

  • Metaphor Identification in Large Texts Corpora. Neuman, Yair; Assaf, Dan; Cohen, Yohai; Last, Mark; Argamon, Shlomo; Howard, Newton; Frieder, Ophir // PLoS ONE;Apr2013, Vol. 8 Issue 4, p1 

    Identifying metaphorical language-use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two...

  • Resources for Turkish morphological processing. Sak, Haşim; Güngör, Tunga; Saraçlar, Murat // Language Resources & Evaluation;May2011, Vol. 45 Issue 2, p249 

    We present a set of language resources and tools (a morphological parser, a morphological disambiguator, and a text corpus) for exploiting Turkish morphology in natural language processing applications. The morphological parser is a state-of-the-art finite-state transducer-based implementation of...

  • AUTOMATED ARABIC ANTONYM EXTRACTION USING A CORPUS ANALYSIS TOOL. ALDHUBAYI, LULUH; ALYAHYA, MAHA // Journal of Theoretical & Applied Information Technology;12/31/2014, Vol. 70 Issue 3, p422 

    The automatic extraction of semantic relations between words from textual corpora is an extremely challenging task. The increasing need for language resources supporting Natural language processing (NLP) applications has encouraged the development of automated methods for the extraction of...

  • Using Tectogrammatical Annotation for Studying Actors and Actions in Sallust's Bellum Catilinae. Saavedra, Berta González; Passarotti, Marco // Prague Bulletin of Mathematical Linguistics;Oct2018, Vol. 111 Issue 1, p5 

    In the context of the Index Thomisticus Treebank project, we have enhanced the full text of Bellum Catilinae by Sallust with semantic annotation. The annotation style resembles the one used for the so called "tectogrammatical" layer of the Prague Dependency Treebank. By exploiting the results of...

  • National ink.  // Business News New Jersey;05/19/97, Vol. 10 Issue 16, p3 

    Reports on the publishing of an account of Operation Cable Trap, a sting operation which helped trap a cartel of cable thieves in 1996, by the `Wall Street Journal' newspaper. Details on operation; How the federal government's case was put together.

  • Search for the Relation of Form and Function Using the ForFun Database. Mikulová, Marie; Bejček, Eduard; Hajičová, Eva; Panevová, Jarmila // Prague Bulletin of Mathematical Linguistics;Apr2018, Vol. 110 Issue 1, p71 

    The aim of the contribution is to introduce a database of linguistic forms and their functions built with the use of the multi-layer annotated corpora of Czech, the Prague Dependency Treebanks. The purpose of the Prague Database of Forms and Functions (ForFun) is to help the linguists to study...

