Art. 01 – Vol. 25 – No. 3 – 2015

Accelerating the Development of a Digital Corpus Annotated with Dependencies Using Resources and Tools Dedicated to Other Languages

Elena IRIMIA
elena@racai.ro

“Mihai Drăgănescu” Research Institute for Artificial Intelligence, Romanian Academy

Abstract: Syntactically annotated corpora are fundamental for any language’s survival in the digital universe. We developed a small corpus (5000 sentences) in a short period of time (12 months) and with a limited workforce; it is meant to serve as a base for developing further resources and tools to support syntactic analysis of Romanian in the NLP group at ICIA. The sentences selected for annotation represent different genres and domains, vary in length (between 10 and 40 words), have high syntactic complexity, and contain verbs that are frequently used in Romanian. Through careful selection, we aimed to ensure the stylistic and syntactic diversity and the linguistic representativeness of the resulting corpus.

Keywords: corpus, dependency grammar, parsing, treebank.

This work is licensed under a Creative Commons Attribution 4.0 International License.