
Resource and Knowledge Centres

– Swedish in a Multilingual Setting, SMS

The CLARIN Knowledge Centre for Swedish in a Multilingual Setting (CLARIN-SMS) offers expertise in linguistic processing of text, especially for Swedish and/or when multiple languages are involved. In addition, CLARIN-SMS offers expertise in the application of language technology to Swedish Sign Language.

CLARIN-SMS is primarily directed at researchers in the humanities and social sciences with a need for analysis, annotation or mining of Swedish or multilingual text, and additionally at researchers with a need for corpora or tools for Swedish Sign Language.

CLARIN-SMS makes tools for linguistic processing, as well as corpora, available for research in the humanities and social sciences. The resources include monolingual (mainly Swedish) and multilingual corpora across several domains, and tools for the basic processing of text, including tokenisation, morphological analysis, part-of-speech tagging, syntactic parsing, and named entity recognition.

Main Areas of Expertise

CLARIN-SMS offers special expertise to the following groups:

For researchers interested in exploring Swedish texts, it supports the creation and processing of Swedish texts with a variety of computational methods, such as linguistic annotation at different levels or sentiment analysis.

For researchers interested in comparative analyses, it supports the creation and processing of parallel and comparable corpora, including alignment and machine translation, as well as cross-linguistically consistent annotation within the framework of Universal Dependencies, which allows for easy comparative analyses.

For researchers interested in education and content accessibility, it supports the computation and evaluation of measures of text complexity.

For researchers and users of Swedish Sign Language (SSL), it supports the creation of lexicons and corpora for SSL, and the annotation of SSL (including glosses, part-of-speech tagging and syntactic structure).

The support is provided by several partners participating in the CLARIN-SMS distributed Knowledge Centre:

  • Linköping University, Department of Computer and Information Science
  • Stockholm University, Department of Linguistics
  • Uppsala University, Department of Linguistics and Philology

Each CLARIN-SMS node works as a separate unit and promotes its services and resources in various ways, including promotion tours at universities, web pages presenting projects and resources, and presentations at CLARIN-related events. The K-centre itself, however, is a common resource. CLARIN-SMS is a vibrant community and, in accordance with CLARIN’s general mission of creating and promoting language resources, the respective nodes have carried out a variety of activities, including tool and resource development for language analysis, both multilingual and Swedish-only.

An Active Research Hub

A number of activities focus specifically on promoting the use of language technology in the social sciences and humanities (SSH). For instance, one project analyses the development of the concept of 'handicapped' from a Swedish parliamentary perspective. In this project, we help researchers process and analyse the Swedish Government’s official reports from the early 1900s to the present day with a variety of SweClarin resources and language technology tools, such as the SPARV pipeline.

Another example is the analysis of the minutes of the Swedish National Bank (Sw. Riksbanken), where we compare minutes from the period when they were anonymous with minutes from the period when they were not. One goal of this study is to see whether individual speakers can be identified in the anonymous minutes. Another goal is to provide the National Bank with information about potential differences and similarities in argumentation between the two types of minutes. To this end, we use a variety of SweClarin resources, such as the sensaldo-v02 sentiment lexicon and the SPARV pipeline for parsing, in combination with, for instance, topic and sentiment analysis models.

A further example is a project carried out in cooperation with management researchers, in which we analyse Swedish companies’ adherence to and adoption of the information security standard ISO 27001. The aim of the project is to examine the communicative constitution of preventive innovation in organisations. For this project, we helped create a corpus and analyse it from multiple interdisciplinary perspectives using SweClarin tools and resources, such as the sensaldo-v02 sentiment lexicon and the SPARV pipeline for parsing, as well as other language technology tools, including word clouds.

Helpdesk Contact

Arne Jönsson, arne.jonsson@liu.se

Publications

  • Lars Ahrenberg (2015). Converting an English–Swedish Parallel Treebank to Universal Dependencies. In Proceedings of the Third International Conference on Dependency Linguistics (DepLing 2015), Association for Computational Linguistics, pages 10–19. ACL Anthology W15-2103.

  • Lars Ahrenberg, Henrik Danielsson, Staffan Bengtsson, Hampus Arvå, Lotta Holme and Arne Jönsson (2020). Studying Disability Related Terms with Swe-Clarin Resources. In Selected Papers from the CLARIN Annual Conference 2019, DOI: https://doi.org/10.3384/ecp2020172.

  • Lars Ahrenberg, Daniel Holmer, Stefan Holmlid and Arne Jönsson (2023). Analysing changes in official use of the design concept using SweCLARIN resources. In Selected papers from the CLARIN Annual Conference 2022. DOI: https://doi.org/10.3384/ecp198.

  • Bodil Axelsson, Daniel Holmer, Lars Ahrenberg and Arne Jönsson (2021). Studying Emerging New Contexts for Museum Digitisations on Pinterest. In Selected Papers from the CLARIN Annual Conference 2020. DOI: https://doi.org/10.3384/ecp180.

  • Daniel Holmer, Lars Ahrenberg, Julius Monsen, Arne Jönsson, Mikael Apel, and Marianna Blix Grimaldi (2023). Who said what? Speaker Identification from Anonymous Minutes of Meetings. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa).

  • Oskar Jerdhaf, Marina Santini, Peter Lundberg, Anette Karlsson and Arne Jönsson (2021). Focused Terminology Extraction for CPSs: The Case of "Implant Terms" in Electronic Medical Records. In Proceedings of the IEEE International Conference on Communications Workshop on Communication, Computing, and Networking in Cyber-Physical Systems (IEEE CCN-CPS 2021), Montreal, Canada.

  • Arne Jönsson, Subhomoy Bandyopadhyay, Svjetlana Pantic Dragisic and Andrea Fried (2024). Analyses of information security standards on data crawled from company web sites using SweClarin resources. In Selected papers from the 2023 CLARIN Annual Conference. DOI: https://doi.org/10.3384/ecp210.

  • Marco Kuhlmann and Stephan Oepen (2016). Towards a Catalogue of Linguistic Graph Banks. In Computational Linguistics, 42, 4, 819–827. ISSN 0891-2017, E-ISSN 1530-9312.

  • Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman (2016). Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).

  • Agata Savary, Sara Stymne, Verginica Barbu Mititelu, Nathan Schneider, Carlos Ramisch and Joakim Nivre (2023). PARSEME Meets Universal Dependencies: Getting on the Same Page in Representing Multiword Expressions. In Northern European Journal of Language Technology, 9.

  • Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, and Sara Stymne (2018). 82 treebanks, 34 models: Universal Dependency Parsing with Multi-Treebank Models. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 113–123.

  • Leonie Weissweiler, Nina Böbel, Kirian Guiller, Santiago Herrera, Wesley Scivetti, Arthur Lorenzi, Nurit Melnik, Archna Bhatia, Hinrich Schütze, Lori Levin, Amir Zeldes, Joakim Nivre, William Croft and Nathan Schneider (2024). UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16919–16932, Torino, Italia. ELRA and ICCL.

  • Robert Östling (2018). Part of Speech Tagging: Shallow or Deep Learning? In Northern European Journal of Language Technology, Volume 5, Article 1.

  • Robert Östling, Carl Börstell, Moa Gärdenfors and Mats Wirén (2017). Universal Dependencies for Swedish Sign Language. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 303–308. Linköping.

  • Robert Östling, Katarina Gillholm, Muratan Kurfalı, Marie Mattson and Mats Wirén (2024). Evaluation of really good grammatical error correction. In Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N., editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 6582–6593, Torino, Italia. ELRA and ICCL.

  • Robert Östling and Muratan Kurfalı (2023). Language Embeddings Sometimes Contain Typological Generalizations. In Computational Linguistics, 49(4):1003–1051.

  • Robert Östling and Jörg Tiedemann (2016). Efficient word alignment with Markov Chain Monte Carlo. In Prague Bulletin of Mathematical Linguistics, 106:125–146.


CLARIN ERIC
Språkbanken
Swedish Research Council
Språkbanken Text, GU
Språkbanken Tal, KTH
Språkbanken Sam, Isof
Computational linguistics, UU
Department of Linguistics, SU
GRIDH, GU
Humanities Lab, LU
Humlab, UmU
National Library of Sweden
Computer and Information Science, LiU
Swedish National Archive
info@sweclarin.se