
Resource and Knowledge Centres

– Swedish in a Multilingual Setting, SMS

The CLARIN Knowledge Centre for Swedish in a Multilingual Setting (CLARIN-SMS) offers expertise in linguistic processing of text, especially for Swedish and/or when multiple languages are involved. In addition, CLARIN-SMS offers expertise in the application of language technology to Swedish Sign Language.

CLARIN-SMS is primarily directed at researchers in the humanities and social sciences with a need for analysis, annotation or mining of Swedish or multilingual text, and additionally at researchers with a need for corpora or tools for Swedish Sign Language.

CLARIN-SMS makes corpora and tools for linguistic processing available to researchers in the humanities and social sciences. The resources include monolingual (mainly Swedish) and multilingual corpora across several domains, and tools for basic processing of text, including tokenization, morphological analysis, part-of-speech tagging, syntactic parsing, and named entity recognition. CLARIN-SMS offers special expertise in the following areas:

  • Processing of parallel and comparable corpora, including alignment and machine translation
  • Cross-linguistically consistent annotation within the framework of Universal Dependencies
  • Computation and evaluation of measures of text complexity
  • Language technology for Swedish Sign Language
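As an illustration of what a measure of text complexity can look like, the sketch below computes LIX, a classic readability index for Swedish: the average sentence length plus the percentage of words longer than six letters. This page does not say which measures the CLARIN-SMS tools compute, so LIX is only an assumed example. The function name `lix` and the simplified sentence splitting (on `.`, `!` and `?` only) are choices made for this sketch.

```python
import re


def lix(text: str) -> float:
    """Return the LIX readability index for a text.

    LIX = (words / sentences) + 100 * (long words / words),
    where a long word has more than six letters.
    """
    # Simplification: sentences are delimited by '.', '!' or '?' only.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of Unicode letters count as words, so å, ä and ö are handled.
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)


if __name__ == "__main__":
    sample = "Jag läser. Universitetet publicerar forskningsrapporter."
    print(f"LIX = {lix(sample):.1f}")
```

Higher values indicate harder text. A production tool would normally use a proper tokenizer and sentence splitter rather than regular expressions.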

Tools and Resources

  • Sapis - StilLett API Service
    A RESTful web service with tools for measuring text complexity and for text simplification. A Sapis user manual is available.

  • LinES: Linköping English-Swedish Parallel Treebank
    A parallel corpus of about 4,000 original English sentences from various sources, paired with their Swedish translations.

  • Swectors
    A set of static Swedish word vectors and the code used for generating them.

  • A Gold Standard Word Alignment for English-Swedish
    Gold-standard word alignments for 1,164 English-Swedish sentence pairs, intended for testing word alignment software. The source data come from Europarl v.2.

  • Svensk Diakronisk korpus (Swedish Diachronic Corpus)
    A corpus of texts spanning the period from Old Swedish to the present day. It covers a wide variety of text types and is freely available for download and search.
    Contact person: Eva Pettersson, Uppsala University

  • SweGram
    SWEGRAM is a tool for text analysis in Swedish and English. Users can upload one or several texts and annotate them at different linguistic levels with morphological and syntactic information. The annotated texts can then be used to extract statistics on text properties such as text length, number of words, readability measures, and part-of-speech distribution.
    Contact person: Beáta Megyesi, Stockholm University

  • Universal Dependencies
    Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 300 contributors producing nearly 200 treebanks in over 100 languages.
    Contact person: Joakim Nivre, Uppsala University

  • SOU corpus
    This repository contains cleaned and further processed versions of Swedish Government Official Reports (Statens offentliga utredningar, SOU). The documents are based on HTML versions from Riksdagens öppna data (http://data.riksdagen.se/) and cover the years 1994 to 2020.
    Contact person: Sara Stymne, Uppsala University

  • Swedish Causality Datasets
    Three data sets of Swedish text annotated for the presence of causality. The sets are annotated with two different tasks in mind, namely causality recognition and causality ranking with respect to a query prompt containing at least a cause or an effect.
    Contact person: Sara Stymne, Uppsala University
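
Universal Dependencies treebanks, mentioned in the list above, are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. The following is a minimal sketch of reading such data and counting part-of-speech tags. The helper names `read_conllu` and `upos_counts` are hypothetical, and the sample sentence is hand-made for illustration, not taken from any released treebank.

```python
from collections import Counter

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(FIELDS, cols))
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in token["id"] or "." in token["id"]:
            continue
        current.append(token)
    if current:                       # final sentence without trailing blank line
        sentences.append(current)
    return sentences


def upos_counts(sentences):
    """Count universal part-of-speech tags over all tokens."""
    return Counter(tok["upos"] for sent in sentences for tok in sent)


# Hand-made sample sentence, for illustration only.
SAMPLE = "\n".join([
    "# text = Hunden sover.",
    "\t".join(["1", "Hunden", "hund", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "sover", "sova", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    parsed = read_conllu(SAMPLE)
    print(len(parsed), "sentence(s)")
    print(dict(upos_counts(parsed)))
```

For real work, an established CoNLL-U library is likely a better choice than a hand-written reader; the sketch only shows the shape of the data.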


Helpdesk Contact

Eva Pettersson, eva.pettersson@lingfil.uu.se

