2022.1.5

ČASOPIS PRO MODERNÍ FILOLOGII 2022 (104) 1

Hodnoty slovesných morfologických kategorií v korpusu SYN2020 — atribut verbtag

VALUES OF VERBAL MORPHOLOGICAL CATEGORIES IN THE SYN2020 CORPUS — THE VERBTAG ATTRIBUTE

Tomáš Jelínek — Vladimír Petkevič — Hana Skoumalová

 


 ABSTRACT (en)

The paper describes the verbtag attribute, which lets users of the SYN2020 corpus of contemporary Czech (and of the subsequent corpora SYNv9 and SYNv10) search for all values of the morphological categories of verbs: not only those contained in the tag attribute, but also those associated mainly with multiword participial verb predicates, which are prevalent in Czech. The verbtag attribute indicates whether the verb that forms or co-forms the verbal meaning is auxiliary or autosemantic, and it records the verb's mood, diathesis, person, number and tense. The annotation applies both to verb predicates expressed in a single word (e.g., the 1st person present indicative: Čtu rád detektivní příběhy. ‘I like to read detective stories.’) and, especially, to verb predicates expressed in multiple words (e.g., the present conditional of the 1st person singular: Pak bych mu s chutí nabídla výhodnou smlouvu. ‘Then I would gladly offer him a good deal.’). The authors introduce the motivation for and the concept of the verbtag annotation, describe the relevant morphological categories and their values in detail, and show through examples how various multiword structures expressing verbal meaning are annotated in the verbtag attribute. They also offer users a guide to the verbal morphosyntax captured in the verbtag attribute and to the possibilities for efficiently searching for and retrieving morphological and morphosyntactic data. The paper shows which multiword verbal complexes are straightforward to annotate, and it identifies more complex cases (e.g., coordination of participles) that are difficult to annotate automatically and/or whose annotation is unclear from the standpoint of an adequate theoretical approach. The authors also present the method used for annotating multiword verbal complexes and its current success rate.

 KEYWORDS (cz)

atribut verbtag, morfologie českých sloves, morfologické kategorie a hodnoty, automatické značkování, korpus SYN2020

 KEYWORDS (en)

verbtag attribute, morphology of Czech verbs, morphological categories and values, automatic annotation, SYN2020 corpus

 DOI

https://doi.org/10.14712/23366591.2022.1.5

 REFERENCES

Апресян, Ю., Д. — Богуславский, И., М. — Иомдин, Б., Л., и др. (2005): Синтаксически и семантически аннотированный корпус русского языка: современное состояние и перспективы // Национальный корпус русского языка: 2003—2005. М.: Индрик, s. 193–214. https://ruscorpora.ru/new/sbornik2005/12apresyan.pdf

Bejček, E. et al. (2011): Prague Dependency Treebank 2.5, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL). Prague: Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8.

Bejček, E. — Panevová, J. — Popelka, J. — Straňák, P. — Ševčíková, M. — Štěpánek, J. — Žabokrtský, Z. (2012): Prague Dependency Treebank 2.5 — a revisited version of PDT 2.0. In: Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012). Mumbai, India: Coling 2012 Organizing Committee, s. 231–246.

Dönicke, T. (2020): Clause-Level Tense, Mood, Voice and Modality Tagging for German. TLT. Göttingen: University of Göttingen, Centre for Digital Humanities Papendiek 16, 37073. https://aclanthology.org/2020.tlt-1.1.pdf

Jelínek, T. — Petkevič, V. (2011): Systém jazykového značkování současné psané češtiny. In: Korpusová lingvistika Praha 2011, sv. 3: Gramatika a značkování korpusů. Praha: Nakladatelství Lidové noviny / Ústav českého národního korpusu, s. 154–170.

Jelínek, T. — Křivan, J. — Petkevič, V. — Skoumalová, H. — Šindlerová, J. (2021): SYN2020: A New Corpus of Czech With an Innovated Annotation. In: K. Ekštein — F. Pártl — M. Konopík (eds.), Proceedings of the Text, Speech and Dialogue 24th International conference TSD 2021. Olomouc, Czech Republic, September 6–9, 2021. LNAI 12848. Springer Nature Switzerland AG 2021, s. 48–59. https://doi.org/10.1007/978-3-030-83527-9

Květoň, P. (2006): Rule-based Morphological Disambiguation. Ph.D. thesis. Praha: MFF UK.

Patejuk, A. — Przepiórkowski, A. (2014): Synergistic development of grammatical resources: A valence dictionary, an LFG grammar, and an LFG structure bank for Polish. In: V. Henrich — E. Hinrichs — D. de Kok — P. Osenova — A. Przepiórkowski (eds.), Proceedings of the Thirteenth International Workshop on Treebanks and Linguistic Theories (TLT 13). Tübingen: Department of Linguistics (SfS), University of Tübingen, s. 113–126.

Petkevič, V. (2014): Problémy automatické morfologické disambiguace češtiny. Naše řeč, 97, 4–5, s. 194–207.

Petkevič, V. — Rosen, A. — Skoumalová, H. — Vítovec, P. (2015): Analytic Morphology — Merging the Paradigmatic and Syntagmatic Perspective in a Treebank. In: J. Piskorski — L. Pivovarova — J. Šnajder — H. Tanev — R. Yangarber (eds.), The 5th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2015). Hissar, Bulgaria, s. 9–16. http://bsnlp-2015.cs.helsinki.fi/.

Ramm, A. — Loáiciga, S. — Friedrich, A. — Fraser, A. (2017): Annotating tense, mood and voice for English, French and German. Proceedings of ACL 2017, System Demonstrations. Vancouver: Association for Computational Linguistics. https://www.cis.uni-muenchen.de/~fraser/pubs/ramm_acldemo2017.pdf

Skoumalová, H. (2021): Etalon: manuálně anotovaný synchronní korpus českých textů. Praha: Ústav Českého národního korpusu FF UK. https://www.korpus.cz and http://hdl.handle.net/11234/1-3698

Straka, M. — Straková, J. — Hajič, J. (2019): Czech text processing with contextual embeddings: Pos tagging, lemmatization, parsing and NER. In: International Conference on Text, Speech, and Dialogue, Ljubljana: Springer, s. 137–150.

Štěpánková, B. — Mikulová, M. — Hajič, J. (2020): The MorfFlex Dictionary of Czech as a Source of Linguistic Data. In: Euralex XIX Proceedings Book: Lexicography for inclusion. European Association for Lexicography, s. 387–391.

 Corpora

Křen, M. — Cvrček, V. — Henyš, J. — Hnátková, M. — Jelínek, T. — Kocek, J. — Kováříková, D. — Křivan, J. — Milička, J. — Petkevič, V. — Procházka, P. — Skoumalová, H. — Šindlerová, J. — Škrabal, M. (2020): SYN2020: reprezentativní korpus psané češtiny. Praha: Ústav Českého národního korpusu FF UK. https://www.korpus.cz

Křen, M. — Cvrček, V. — Henyš, J. — Hnátková, M. — Jelínek, T. — Kocek, J. — Kováříková, D. — Křivan, J. — Milička, J. — Petkevič, V. — Procházka, P. — Skoumalová, H. — Šindlerová, J. — Škrabal, M. (2021): Korpus SYN, verze 9 z 5. 12. 2021. Praha: Ústav Českého národního korpusu FF UK. https://www.korpus.cz
