Swe-CLARIN

Resource and Knowledge Centres: Swedish in a Multilingual Setting (SMS)

The CLARIN Knowledge Centre for Swedish in a Multilingual Setting (CLARIN-SMS) offers expertise in linguistic processing of text, especially for Swedish and/or when multiple languages are involved. In addition, CLARIN-SMS offers expertise in the application of language technology to Swedish Sign Language.

CLARIN-SMS is primarily directed at researchers in the humanities and social sciences with a need for analysis, annotation or mining of Swedish or multilingual text, and additionally at researchers with a need for corpora or tools for Swedish Sign Language.

CLARIN-SMS makes tools for linguistic processing and corpora available for research in the humanities and social sciences. The resources include monolingual (mainly Swedish) and multilingual corpora across several domains, and tools for the basic processing of text, including tokenisation, morphological analysis, part-of-speech tagging, syntactic parsing, and named entity recognition.

Main Areas of Expertise

CLARIN-SMS offers special expertise for the following groups:

Researchers exploring Swedish texts: support for creating and processing Swedish texts with a variety of computational methods, such as linguistic annotation at different levels or sentiment analysis.

Researchers doing comparative analyses: support for creating and processing parallel and comparable corpora, including alignment and machine translation, as well as cross-linguistically consistent annotation within the Universal Dependencies framework, which makes comparative analyses easier.

Researchers interested in education and content accessibility: support for computing and evaluating measures of text complexity.

Researchers and users of Swedish Sign Language (SSL): support for creating lexicons and corpora for SSL, and for annotating SSL (including glosses, part-of-speech tagging and syntactic structure).

The support is provided by several partners participating in the CLARIN-SMS distributed Knowledge Centre:

Linköping University, Department of Computer and Information Science
Stockholm University, Department of Linguistics
Uppsala University, Department of Linguistics and Philology.

Although each CLARIN-SMS node works as a separate unit and promotes its services and resources in its own ways (promotion tours at universities, web pages presenting projects and resources, and presentations at CLARIN-related events), the K-centre is a common resource. CLARIN-SMS is a vibrant community. In accordance with CLARIN’s general mission of creating and promoting language resources, the nodes have carried out a variety of activities, including the development of tools and resources for language analysis, both multilingual and Swedish-only.

An Active Research Hub

A number of activities focus especially on promoting the use of language technology in the social sciences and humanities (SSH). For instance, one project analyses the development of the concept of 'handicapped' from a Swedish parliamentary perspective. In this project, we help researchers process and analyse the Swedish Government’s official reports from the early 1900s to the present day with a variety of SweClarin resources and language technology tools, such as the SPARV pipeline.

Another example is the analysis of the minutes of the Swedish National Bank (Sw. Riksbanken), where we compare minutes from the period when they were anonymous with minutes from the period when they were not. One goal of this study is to see whether we can identify individual speakers in the anonymous minutes. Another goal is to provide the National Bank with information about potential differences and similarities in argumentation between the two types of minutes. To this end, we use a variety of SweClarin resources, such as the sensaldo-v02 sentiment lexicon or the SPARV pipeline for parsing, in combination with, for instance, topic and sentiment analysis models.

A further example is a project conducted in cooperation with management researchers, in which we analyse Swedish companies’ adherence to and adoption of the information security standard ISO 27001. The aim of the project is to examine the communicative constitution of preventive innovation in organisations. For this project, we helped create a corpus and analyse it from multiple interdisciplinary perspectives using SweClarin tools and resources, such as the sensaldo-v02 sentiment lexicon or the SPARV pipeline for parsing, as well as other language technology tools, including word clouds.

Tools and Resources

Sapis - StilLett API Service
A RESTful web service providing tools for measuring text complexity and for text simplification.
Contact person: Arne Jönsson, Linköping University

LinES: Linköping English-Swedish Parallel Treebank
A parallel corpus with some 4000 English original sentences from different sources and their Swedish translations.
Contact person: Lars Ahrenberg, Linköping University

Swectors
A set of static Swedish word vectors and the code used for generating them.

A Gold Standard Word Alignment for English-Swedish
Gold alignments for 1164 English-Swedish sentence pairs for the purpose of testing word alignment software. Source data from Europarl v.2.
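
As an illustration of how gold alignments are typically used to test word alignment software, the sketch below computes precision, recall and alignment error rate (AER) for one sentence pair. It is a minimal sketch under stated assumptions: links are represented as (source index, target index) pairs, and the split into "sure" and "possible" gold links is an assumption made for the example, not a description of how this particular gold standard is annotated. The function name `alignment_scores` is hypothetical.

```python
def alignment_scores(predicted, sure, possible=None):
    """Return (precision, recall, AER) for one set of alignment links.

    Each link is a (source_index, target_index) tuple.
    `sure` holds unambiguous gold links; `possible` holds optional
    gold links. Sure links always count as possible as well.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible or ()) | s
    if not a or not s:
        raise ValueError("need at least one predicted and one sure link")

    precision = len(a & p) / len(a)   # predicted links that are acceptable
    recall = len(a & s) / len(s)      # sure links that were found
    aer = 1 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, aer


# Toy example: three predicted links, three sure links, one possible link.
predicted = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (1, 1), (2, 2)}
possible = {(2, 3)}
precision, recall, aer = alignment_scores(predicted, sure, possible)
print(f"P={precision:.3f} R={recall:.3f} AER={aer:.3f}")
```

If the gold data contains only one kind of link, passing it as `sure` and omitting `possible` reduces precision and recall to their usual definitions.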

Svensk Diakronisk korpus (Swedish Diachronic Corpus)
A corpus of texts covering the period from Old Swedish to the present day and spanning a wide variety of text types. It is freely available for download and search.
Contact person: Eva Pettersson, Uppsala University

SweGram
SWEGRAM aims to provide a tool for text analysis in Swedish and English. You can upload one or several texts and annotate them at different linguistic levels with morphological and syntactic information. The annotated texts can then be used to extract statistics about text properties such as text length, number of words, readability measures, parts of speech, and much more.
Contact person: Beáta Megyesi, Stockholm University
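
To make the notion of a readability measure concrete, the sketch below computes LIX, a readability index widely used for Swedish: the average sentence length plus the percentage of words longer than six letters. This is an illustrative sketch only. It is not SWEGRAM code, the choice of LIX is an assumption made for the example, and the regular-expression tokenisation is a deliberate simplification of what a real pipeline would do.

```python
import re


def lix(text):
    """Compute the LIX readability index for a plain-text string.

    LIX = (words / sentences) + 100 * (long words / words),
    where a long word has more than six letters.
    """
    # Simplified sentence splitting on ., !, ? and : (an assumption).
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    # Words are runs of letters, including å, ä, ö and other Unicode letters.
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


sample = "Katten sover. Regeringen presenterade utredningen."
print(round(lix(sample), 1))  # 5 words, 2 sentences, 3 long words -> 62.5
```

Higher values indicate text that is harder to read. Scores from short samples like this one are not meaningful and serve only to show the arithmetic.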

Universal Dependencies
Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 600 contributors producing nearly 300 treebanks in over 150 languages.
Contact person: Joakim Nivre, Uppsala University
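
UD treebanks are distributed in the CoNLL-U format, with one token per line and ten tab-separated fields. The sketch below shows how such data can be read with plain Python and used to count part-of-speech tags. It is a minimal sketch: the three-word Swedish sentence is invented for the example and not taken from any treebank, and the helper names `read_conllu` and `upos_counts` are hypothetical.

```python
from collections import Counter

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Invented sample sentence in CoNLL-U format (not from a real treebank).
SAMPLE = """\
# text = Hon läser boken.
1\tHon\thon\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tläser\tläsa\tVERB\t_\t_\t0\troot\t_\t_
3\tboken\tbok\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_conllu(text):
    """Yield each sentence as a list of token dictionaries."""
    sentence = []
    for line in text.splitlines():
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        else:
            token = dict(zip(FIELDS, line.split("\t")))
            # Skip multiword-token ranges (1-2) and empty nodes (1.1).
            if "-" in token["id"] or "." in token["id"]:
                continue
            sentence.append(token)
    if sentence:
        yield sentence


def upos_counts(sentences):
    """Count universal part-of-speech tags over all tokens."""
    return Counter(tok["upos"] for sent in sentences for tok in sent)


sentences = list(read_conllu(SAMPLE))
print(upos_counts(sentences))
```

Because the same ten columns and the same tag inventory are used for every language, the same few lines work unchanged on any UD treebank, which is what makes cross-linguistic comparison straightforward.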

SOU corpus
This repository contains cleaned and further processed versions of Swedish Government Official Reports (Statens offentliga utredningar, SOU). The documents are based on HTML versions from Riksdagens öppna data (http://data.riksdagen.se/) and cover the years 1994 to 2020.
Contact person: Sara Stymne, Uppsala University

Swedish Causality Datasets
Three data sets of Swedish text annotated for the presence of causality. The sets are annotated with two different tasks in mind, namely causality recognition and causality ranking with respect to a query prompt containing at least a cause or an effect.
Contact person: Sara Stymne, Uppsala University

Efficient Low-Memory Aligner
Tools for automatic word alignment of parallel texts.
Contact person: Robert Östling, Stockholm University

Tools for processing of massively parallel texts along with data for 1295 languages
Data: https://zenodo.org/records/7506220
Tools: https://github.com/robertostling/parallel-tools
Contact person: Robert Östling, Stockholm University

Datasets and tools for annotation for the purpose of evaluation of grammatical error correction
Contact person: Robert Östling, Stockholm University

Participants

Helpdesk Contact

Arne Jönsson, arne.jonsson@liu.se

Publications

Lars Ahrenberg (2015). Converting an English–Swedish Parallel Treebank to Universal Dependencies. In Proceedings of the Third International Conference on Dependency Linguistics (DepLing 2015), Association for Computational Linguistics, pages 10–19. ACL Anthology W15-2103.

Lars Ahrenberg, Henrik Danielsson, Staffan Bengtsson, Hampus Arvå, Lotta Holme and Arne Jönsson (2020). Studying Disability Related Terms with Swe-Clarin Resources. In Selected Papers from the CLARIN Annual Conference 2019, DOI: https://doi.org/10.3384/ecp2020172.

Lars Ahrenberg, Daniel Holmer, Stefan Holmlid and Arne Jönsson (2023). Analysing changes in official use of the design concept using SweCLARIN resources. In Selected papers from the CLARIN Annual Conference 2022. DOI: https://doi.org/10.3384/ecp198.

Bodil Axelsson, Daniel Holmer, Lars Ahrenberg and Arne Jönsson (2021). Studying Emerging New Contexts for Museum Digitisations on Pinterest. In Selected Papers from the CLARIN Annual Conference 2020. DOI: https://doi.org/10.3384/ecp180.

Daniel Holmer, Lars Ahrenberg, Julius Monsen, Arne Jönsson, Mikael Apel, and Marianna Blix Grimaldi (2023). Who said what? Speaker Identification from Anonymous Minutes of Meetings. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa).

Oskar Jerdhaf, Marina Santini, Peter Lundberg, Anette Karlsson and Arne Jönsson (2021). Focused Terminology Extraction for CPSs: The Case of "Implant Terms" in Electronic Medical Records. In Proceedings of the IEEE International Conference on Communications Workshop on Communication, Computing, and Networking in Cyber-Physical Systems (IEEE CCN-CPS 2021), Montreal, Canada.

Arne Jönsson, Subhomoy Bandyopadhyay, Svjetlana Pantic Dragisic and Andrea Fried (2024). Analyses of information security standards on data crawled from company web sites using SweClarin resources. In Selected papers from the 2023 CLARIN Annual Conference. DOI: https://doi.org/10.3384/ecp210.

Marco Kuhlmann and Stephan Oepen (2016). Towards a Catalogue of Linguistic Graph Banks. In Computational Linguistics, 42, 4, 819–827. ISSN 0891-2017, E-ISSN 1530-9312.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman (2016). Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).

Agata Savary, Sara Stymne, Verginica Barbu Mititelu, Nathan Schneider, Carlos Ramisch and Joakim Nivre (2023). PARSEME Meets Universal Dependencies: Getting on the Same Page in Representing Multiword Expressions. In Northern European Journal of Language Technology, 9.

Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, and Sara Stymne (2018). 82 treebanks, 34 models: Universal Dependency Parsing with Multi-Treebank Models. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 113–123.

Leonie Weissweiler, Nina Böbel, Kirian Guiller, Santiago Herrera, Wesley Scivetti, Arthur Lorenzi, Nurit Melnik, Archna Bhatia, Hinrich Schütze, Lori Levin, Amir Zeldes, Joakim Nivre, William Croft and Nathan Schneider (2024). UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16919–16932, Torino, Italia. ELRA and ICCL.

Robert Östling (2018). Part of Speech Tagging: Shallow or Deep Learning? In Northern European Journal of Language Technology, Volume 5, Article 1.

Robert Östling, Carl Börstell, Moa Gärdenfors and Mats Wirén (2017). Universal Dependencies for Swedish Sign Language. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 303–308. Linköping.

Robert Östling, Katarina Gillholm, Muratan Kurfalı, Marie Mattson and Mats Wirén (2024). Evaluation of really good grammatical error correction. In Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N., editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 6582–6593, Torino, Italia. ELRA and ICCL.

Robert Östling and Muratan Kurfalı (2023). Language Embeddings Sometimes Contain Typological Generalizations. In Computational Linguistics, 49(4):1003–1051.

Robert Östling and Jörg Tiedemann (2016). Efficient word alignment with Markov Chain Monte Carlo. In Prague Bulletin of Mathematical Linguistics, 106:125–146.
