Corpora - SPOT

SST Treebank of Spoken Slovenian

The Spoken Slovenian Treebank (SST) is the only syntactically annotated corpus of spoken Slovenian currently available. It was developed as a representative subset of the Gos reference corpus and is intended to support both linguistic and computational research on Slovenian speech. The SST contains manually annotated transcripts of spontaneous speech from a variety of everyday contexts, ranging from academic lectures to informal conversations among friends, and was significantly expanded and refined as part of the SPOT project. Because it is annotated according to the cross-linguistically harmonized Universal Dependencies scheme, the SST can be compared directly with written and spoken corpora in over 160 languages worldwide. The SST is also the backbone of the recently developed ROG training corpus of spoken Slovene, which includes additional annotation layers such as prosody, disfluency, and dialogue acts.

Browse SST in Drevesnik
Browse SST in Grew-match
SST at UD GitHub repository
SST/ROG at CLARIN.SI repository

SSJ Treebank of Written Slovenian

The SSJ treebank, named after the project in which it originated, is the largest manually parsed corpus of Slovenian to date. It contains morphosyntactically annotated sentences sourced from fiction, non-fiction, journalistic, and encyclopedic texts. In addition to its use in developing language technology, such as software for automated grammatical annotation, the SSJ treebank is increasingly used for monolingual and contrastive linguistic research. It adheres to the cross-linguistically harmonized Universal Dependencies scheme and, as part of the SUK reference training corpus for Slovene, also contains linguistic annotation on several other levels. Within SPOT, the SSJ treebank serves as a reference corpus for the automatic detection of speech-specific syntactic patterns in the identically annotated Spoken Slovenian Treebank (SST).

Browse SSJ in Drevesnik
Browse SSJ in Grew-match
SSJ at UD GitHub repository
SSJ/SUK at CLARIN.SI repository
