SST Treebank of Spoken Slovenian

The Spoken Slovenian Treebank (SST) is the first syntactically parsed corpus of spoken Slovenian. It was designed as a representative sample of the Gos reference corpus of spoken Slovenian, for use in linguistic and computational research on Slovenian speech. It contains manually morphosyntactically annotated transcripts of spontaneous speech in various everyday situations, ranging from academic lectures to private conversations between friends, and is planned to be substantially extended within the SPOT project. The SST treebank adheres to Universal Dependencies, a cross-linguistically harmonized annotation scheme, which facilitates direct comparison with the numerous written and spoken corpora that adopt the same scheme across over 140 languages worldwide.

SSJ Treebank of Written Slovenian

The SSJ treebank, named after the original project of the same name, is the largest manually parsed corpus of Slovenian to date. It contains morphosyntactically annotated sentences sourced from fiction, non-fiction, journalistic and encyclopedic texts. In addition to its use in the development of language technology, such as software for automated grammatical annotation, the SSJ treebank is increasingly used for monolingual and contrastive linguistic research. It adheres to the cross-linguistically harmonized Universal Dependencies annotation scheme and, as part of the SUK reference training corpus for Slovene, also contains linguistic annotation on several other levels. Within SPOT, the SSJ treebank serves as a reference corpus for automatic detection of speech-specific syntactic patterns in the identically annotated Spoken Slovenian Treebank (SST).
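
Because both treebanks share the same annotation scheme, a simple comparison of dependency relation frequencies can illustrate the idea of detecting speech-specific patterns. The following is a minimal sketch, not the method used within SPOT. It assumes the annotations are available in CoNLL-U, the standard Universal Dependencies file format; the function names, the `min_ratio` threshold and the two toy sentences are invented for illustration and are not taken from SST or SSJ.

```python
from collections import Counter


def deprel_counts(conllu_text):
    """Count dependency relation labels (8th CoNLL-U column, DEPREL)."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword token ranges and empty nodes
        counts[cols[7]] += 1
    return counts


def relative_freq(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {rel: n / total for rel, n in counts.items()} if total else {}


def overrepresented(spoken, written, min_ratio=2.0):
    """Relations at least min_ratio times more frequent in the spoken
    sample than in the written one (or absent from the written one)."""
    sp, wr = relative_freq(spoken), relative_freq(written)
    found = []
    for rel, freq in sp.items():
        base = wr.get(rel, 0.0)
        if base == 0.0 or freq / base >= min_ratio:
            found.append(rel)
    return sorted(found)


def make_conllu(rows):
    """Helper: join column tuples with tabs to form CoNLL-U token lines."""
    return "\n".join("\t".join(row) for row in rows)


# Invented toy sentences, for illustration only.
SPOKEN_SAMPLE = make_conllu([
    ("1", "eee", "eee", "INTJ", "_", "_", "4", "discourse", "_", "_"),
    ("2", "to", "ta", "PRON", "_", "_", "4", "nsubj", "_", "_"),
    ("3", "je", "biti", "AUX", "_", "_", "4", "cop", "_", "_"),
    ("4", "dobro", "dober", "ADJ", "_", "_", "0", "root", "_", "_"),
])
WRITTEN_SAMPLE = make_conllu([
    ("1", "To", "ta", "PRON", "_", "_", "3", "nsubj", "_", "_"),
    ("2", "je", "biti", "AUX", "_", "_", "3", "cop", "_", "_"),
    ("3", "dobro", "dober", "ADJ", "_", "_", "0", "root", "_", "_"),
])

print(overrepresented(deprel_counts(SPOKEN_SAMPLE),
                      deprel_counts(WRITTEN_SAMPLE)))
```

On real data the two inputs would be the full contents of the treebank files rather than toy strings, and a plain frequency ratio would likely need to be replaced by a proper statistical test before drawing any linguistic conclusions.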