Example of a dependency tree in the Spoken Slovenian Treebank.

Over the past three decades, the unitary approach to the study of language, whereby speech and writing are seen as two ends of the same continuum, has motivated an unprecedented increase in corpus linguistic research aimed at describing speech-specific syntactic phenomena that have been ignored or insufficiently addressed by traditional grammatical frameworks. However, this trend is significantly less pronounced in Slovenian linguistics, where research on the syntactic characteristics of spoken Slovenian is still scarce and has mostly focused on top-down investigations of individual syntactic phenomena, based on qualitative analyses of small amounts of unrepresentative data.

To bridge this gap and establish the necessary empirical foundations for future grammatical descriptions of spoken Slovenian, this project will systematically investigate the methodological potential of syntactically annotated corpora, i.e. treebanks, for linguistic research on spoken Slovenian by (1) establishing a coherent framework for dependency annotation of spoken Slovenian, (2) providing a high-quality treebank of spoken Slovenian, and (3) developing a methodology for its bottom-up linguistic analysis, while (4) promoting the use of syntactically annotated corpora in linguistics in general.

Specifically, we will significantly improve the current version of the Spoken Slovenian Treebank (SST), the only syntactically annotated corpus of spoken Slovenian to date, in terms of size, documentation, and annotation quality. In turn, the treebank will be used to perform a pioneering bottom-up, statistics-driven identification of speech-specific patterns in spoken Slovenian by comparing it to the reference treebank of written Slovenian. We expect the results to empirically confirm the known, prototypical, cognitively most salient speech-specific syntactic phenomena on the one hand, and to lead to the potential discovery of previously unidentified, statistically salient patterns of spoken language use on the other.
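As a rough illustration of what such a bottom-up comparison might look like, the sketch below counts dependency relation labels in two treebanks and ranks them by a smoothed log ratio of their relative frequencies. It is only a minimal sketch, not the project's actual methodology: it assumes the treebanks are available in the CoNLL-U format, the function names (`toy_conllu`, `deprel_counts`, `log_ratio`) and the smoothing value are hypothetical, and the two tiny samples are invented toy data rather than excerpts from SST or the written reference treebank.

```python
import math
from collections import Counter


def toy_conllu(tokens):
    """Build a minimal CoNLL-U sentence from (form, head, deprel) triples.

    Hypothetical helper for the toy data below; unused columns are set to "_".
    """
    rows = []
    for i, (form, head, deprel) in enumerate(tokens, start=1):
        rows.append("\t".join(
            [str(i), form, "_", "_", "_", "_", str(head), deprel, "_", "_"]
        ))
    return "\n".join(rows)


def deprel_counts(conllu_text):
    """Count dependency relation labels (8th CoNLL-U column, DEPREL)."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


def log_ratio(spoken, written, smoothing=0.5):
    """Smoothed log2 ratio of relative label frequencies (spoken vs. written).

    Positive scores mark labels that are relatively more frequent in speech.
    The additive smoothing constant is an arbitrary assumption.
    """
    labels = set(spoken) | set(written)
    n_spoken = sum(spoken.values()) + smoothing * len(labels)
    n_written = sum(written.values()) + smoothing * len(labels)
    scores = {}
    for label in labels:
        p_spoken = (spoken[label] + smoothing) / n_spoken
        p_written = (written[label] + smoothing) / n_written
        scores[label] = math.log2(p_spoken / p_written)
    return scores


# Invented toy sentences (NOT real treebank data), using UD-style labels.
spoken_sample = toy_conllu([
    ("eee", 5, "discourse"), ("ja", 5, "discourse"),
    ("to", 5, "nsubj"), ("je", 5, "cop"), ("dobro", 0, "root"),
])
written_sample = toy_conllu([
    ("To", 3, "nsubj"), ("je", 3, "cop"), ("dobro", 0, "root"), (".", 3, "punct"),
])

scores = log_ratio(deprel_counts(spoken_sample), deprel_counts(written_sample))
for label, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}\t{score:+.2f}")
```

On real data, a comparison of this kind would need a proper significance test and would likely consider larger structures (such as subtrees) rather than single labels; the log ratio is used here only because it is simple to read.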

Thus, the project will make several important contributions not only to Slovenian linguistics, by providing new resources, methods, and analyses for the study of spoken Slovenian, but also to corpus linguistics more broadly, by providing new insights into the underexploited methodological potential of syntactically parsed corpora in studies of spoken language and language variation.