Jul 11, 2024

Launching Col·lectivaT Tech Lab


At Col·lectivaT, we are passionate about exploring new ideas, especially when it comes to language technology. Artificial Intelligence (AI) offers infinite possibilities, and since our inception, we have focused on how recent technologies can empower the marginalized instead of exacerbating their position. Our work has spanned various languages, including Catalan, Tamazight, Judeo-Spanish, Aranese, and Galician, producing pioneering projects like Catotron, automatic content readers, and multilingual message dissemination during the COVID-19 pandemic. Our journey continues to evolve, driving us toward new horizons. That’s why we’ve decided that, from now on, we will share our vision, findings, and prototypes on language technology with the wider community under the banner of Col·lectivaT Tech Lab.

Under this name, we will share our experiments and comments on the latest advancements in the field of language technology and also make visible the work we develop with the professionals and initiatives we collaborate with. Col·lectivaT Tech Lab represents our experimental side and the network we are connected to beyond our client and collaborator base.

This connection with the technological community, together with the significant influence of AI advancements, helps us formulate new projects and build fruitful relationships with initiatives, companies, universities, and language activists from around the world. These alliances are based on the shared vision of a technology that serves people, communities, languages, and cultures, and flourishes freely in a rapidly digitizing world.

A summary of the work of Col·lectivaT’s technology axis

As we prepare our first publication about the promises of large language models in education, we want to highlight what Col·lectivaT Tech does and our alignment with AI and ethical practices. We are excited about AI’s potential to automate work for welfare, democratize access to information, and bridge the digital divide for marginalized communities. At the same time, we are well aware of the ethical pitfalls and stand against closed practices that exploit others’ work without permission. One of the dangers AI poses is that it follows the global trend of prioritizing certain languages and leaving many others behind. This digital disparity can further entrench linguistic hierarchies, marginalizing already vulnerable language communities.

At Col·lectivaT, we work diligently to bridge this gap by promoting the digitalization and technological advancement of all languages, ensuring that no language is left behind in the digital age. Our commitment is evident in our efforts to create and support open-source language technology. In this blog post, we highlight a few selected projects from our portfolio.

Catalan’s Digital Evolution

Since Col·lectivaT’s inception in 2017, we have seen the promises that AI entails and have acted to make language technology more inclusive and just, starting with our adopted language, Catalan. In 2018, we created the first large multi-speaker speech dataset for Catalan and then extended it to obtain a total of 560 hours of data. It has been a privilege to see how far Catalan has come since then in Mozilla’s Common Voice initiative, thanks to immense participation from a nation that puts its language first.

Image: Logo of Catotron, the first modern open source text to speech system for Catalan

Thanks to open academic work and support from the Department of Culture of Catalonia, we were able to create Catotron, the first modern text-to-speech application for Catalan, in 2019, with other commercial and open-source alternatives emerging since. One of our latest Lab initiatives is the automatic web content reader enVeu, which you can see at the top of the Catalan version of this article.

Especially for marginalized languages, private initiatives or even academia alone are not enough to build a solid technological foundation for their full incorporation into the digital world. We believe that open-source and public initiatives are essential for creating sovereign solutions, as demonstrated by projects like Bhashini from India, Proyecto Ilenia, and AINA. However, it is crucial for these initiatives to avoid monopolizing the narrative and instead support a diverse ecosystem of researchers, technology producers, grassroots community groups, and language activists.

What About the World’s 7000 Languages?

We can confidently say that Catalan has secured itself a solid future in terms of technological support thanks to its significant strides in recent years. However, what about languages with little digital presence, no official status, or those facing endangerment? Studies show that a significant percentage of languages are not fairly represented online, affecting their survival. This is not merely a digitization issue but a continuation of centuries-old colonization and centralization policies, as we stated in our collaborative paper with the African NLP community Masakhane, “Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages.”

Digitizing an Endangered Language: Judeo-Spanish

In 2020, we took on the challenging task of digitizing Judeo-Spanish, a language that connects to the migrated lands of a majority of our members. Judeo-Spanish, also known as Ladino, is the language of the Sephardic Jews expelled from the Iberian Peninsula during the Spanish Inquisition. Many of these Jews settled in the Ottoman Empire and have succeeded in keeping their culture and language alive to this day in modern Turkey.

In collaboration with researchers and the Sephardic community in Istanbul, we created and published several datasets on a dedicated portal. Most excitingly, we developed the first machine translation application that translates between Judeo-Spanish and Spanish, Turkish, and English. By creating a text-to-speech application, we also made the translations audible, enabling users to practice pronunciation. Seferad Translate has been used thousands of times since its launch, helping to preserve and revitalize this endangered language.

Digitizing Tamazight: The Awal Project

Image: The Awal project, digitizing Tamazight

Our most recent and currently active project, Awal, aims to digitize the Tamazight language, preserving and promoting it in the digital space by developing innovative tools to facilitate its use and dissemination. This project involves active participation from the Tamazight-speaking community in both Catalonia and Morocco, who contribute to creating a comprehensive database through translations and voice recordings. The overarching goal is to address the digital divide by providing linguistic and technical support, thereby ensuring Tamazight’s presence in the digital world. Through our newly launched web portal, we have already collected more than 5,500 translations and 2 hours of speech data, which will power the creation of open-source machine translation and speech recognition tools that serve Tamazight-speaking people.

Pioneering Language Data Collection for Aranese: The Araina Project

Image: From the Araina project, a girl recording her voice on her computer for Aranese

We are proud to have been pioneers in language data collection for Aranese. The Araina Project, launched in collaboration with local institutions, aims to preserve and revitalize the Aranese language, a variety of Occitan spoken in the Val d’Aran in Catalonia. The project has not only been a stepping stone in digitizing the language but has also raised awareness of language technology in the community through the celebration of a Voice Marathon in Vielha.

Promises of Generative AI

Large language models have been evolving steadily over the past few years and have recently redefined the AI landscape. They have brought significant breakthroughs in areas such as automating work processes, improving information access, and enhancing creative outputs. While there are concerns, we believe that, like any technology, Generative AI can either serve to reinforce hierarchies and centralize power or democratize access to knowledge and tools. Our stance is firmly in favor of the latter: we advocate for open tools and methodologies, which, unfortunately, the majority of big tech companies don’t follow. For example, OpenAI, the creator of ChatGPT, doesn’t specify how it obtains its data, which could indicate the use of copyrighted creative work and user data. Open-source alternatives, which we opt for, can pave the way for more accountable and privacy-oriented solutions.

Large language models (LLMs), today’s most prominent example of generative AI, have significant potential for social uses across various domains such as healthcare, legal aid, and disaster response. For example, LLMs can provide vital information and support in medical settings, offer legal guidance to those who cannot afford it, and assist in coordinating responses during emergencies by understanding and processing large amounts of data quickly.

One of the most promising areas for LLMs is education. They are already helping people practice and learn new languages by providing interactive, real-time feedback and conversation practice. LLMs have the potential to guide children as well as adults in their learning journeys, making education fun and engaging through interactive storytelling, personalized quizzes, and educational games. These models can tailor educational content to individual learning styles and needs, making it easier for students to grasp complex concepts and retain information.

We look forward to sharing our experiments with LLMs soon. Until then, please feel free to write to us if you’re interested in collaborating. You can reach us at info@collectivat.cat. We look forward to hearing from you!