Art. 01 – Vol. 25 – No. 3 – 2015
Elena IRIMIA
elena@racai.ro
“Mihai Drăgănescu” Research Institute for Artificial Intelligence, Romanian Academy
Abstract: Syntactically annotated corpora are fundamental for any language’s survival in the digital universe. We developed a small corpus (5000 sentences) in a short period of time (12 months) and with a limited workforce; it is meant to serve as a base for developing further resources and tools to support syntactic analysis for Romanian in the NLP group at ICIA. The sentences selected for annotation represent different genres and domains, have different lengths (between 10 and 40 words), have high syntactic complexity and contain verbs that are frequently used in Romanian. Through careful selection, we aimed to ensure the stylistic and syntactic diversity and the linguistic representativeness of the resulting corpus.
Keywords: corpus, dependency grammar, parsing, treebank.