Romanian Journal of Information Technology and Automatic Control / Vol. 25, No. 3, 2015


Accelerating the Developping of a Digital Corpus Annotated with Dependencies Using Resources and Tools Dedicated to Other Languages

Elena IRIMIA

Abstract:

Syntactically annotated corpora are fundamental for any language’s survival in the digital universe. We developed a small corpus (5000 sentences) in a short period of time (12 months) and with a limited workforce; it is nevertheless meant to serve as a base for developing further resources and tools to support syntactic analysis for Romanian in the NLP group at ICIA. The sentences selected for annotation represent different genres and domains, vary in length (between 10 and 40 words), have high syntactic complexity, and contain verbs that are frequently used in Romanian. Through careful selection, we aimed to ensure the stylistic and syntactic diversity and the linguistic representativeness of the resulting corpus.

Keywords:
corpus, dependency grammar, parsing, treebank.

CITE THIS PAPER AS:
Elena IRIMIA, "Accelerating the Developping of a Digital Corpus Annotated with Dependencies Using Resources and Tools Dedicated to Other Languages", Romanian Journal of Information Technology and Automatic Control, ISSN 1220-1758, vol. 25(3), pp. 5-16, 2015.