As part of our mission, we provide open data and resources for speech technologies, specifically automatic speech recognition (ASR), text-to-speech synthesis (TTS) and machine translation (MT), in the languages we work with. You can find a detailed list here, with short explanations and references to further information. Some of these resources are also available on Col·lectivaT’s page on Hugging Face.
Name | Language | Type | License | Download |
---|---|---|---|---|
TV3Parla v0.3 | Catalan | acoustic model | GNU AGPL-3.0 | link |
TV3Parla+ParlamentParla v0.2 | Catalan | acoustic model | GNU AGPL-3.0 | link |
TV3Parla Corpus v0.3 | Catalan | audio corpus | CC-BY-NC 4.0 | link |
ParlamentParla Corpus v2.0 | Catalan | audio corpus | CC-BY 4.0 | link |
Catotron - Ona | Catalan | TTS model | CC-BY 4.0 | link |
Catotron - Pau | Catalan | TTS model | CC-BY 4.0 | link |
UPC FestCat Ona - optimized | Catalan | TTS audio corpus | CC BY-SA 3.0 ES | link |
UPC FestCat Pau - optimized | Catalan | TTS audio corpus | CC BY-SA 3.0 ES | link |
OpenSubtitles LM v1.0 | Catalan | language model | CC-BY 4.0 | link |
Tamazight monolingual and parallel texts | Tamazight | text data | CC-BY 2.0 | link |
Araina text corpus | Occitan Aranese | text data | CC-0 1.0 | link |
Şalom articles | Judeospanish | text data | CC-BY 4.0 | link |
Una Fraza al diya | Judeospanish | text data | CC-BY 4.0 | link |
Over the course of various projects, we have gathered publicly available speech data and converted it into acoustic training corpora. These data sets are available for download under various open licenses.
The TV3Parla corpus includes 240 hours of Catalan speech from broadcast material. The details of segmentation, data processing and model training are explained in Külebi, Öktem; 2018. The content is owned by Corporació Catalana de Mitjans Audiovisuals, SA (CCMA); we processed their material and hereby make it available under their terms of use.
The corpus can be reached here under a CC BY-NC 4.0 license.
This project was supported by the Softcatalà Association.
We gathered the ParlamentParla audio corpus from the recordings and transcripts of the Catalan Parliament (Parlament de Catalunya) plenary sessions that took place between 2007 and 2018. We aligned the transcriptions with their respective recordings and segmented them so that they are suitable for ASR development. The content belongs to the Catalan Parliament, and the data is released in accordance with their terms of use.
The version 0.3 corpus includes per-intervention full text aligned with audio links.
As of version 1.0, the corpus is available in two parts: 90 hours of clean segments and 230 hours of other-quality segments.
As of version 2.0, the corpus is extended and separated into 211 hours of clean segments and 400 hours of other-quality segments. Furthermore, each speech segment is tagged with its speaker, and each speaker with their gender.
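The split sizes in the version notes can be tallied with a short sketch. The hour counts are taken from the text; the dictionary layout and the names `SPLITS`, `total_hours` and `clean_share` are illustrative assumptions and not part of the corpus distribution.

```python
# Hours per split, as stated in the version notes.
# The data structure itself is a hypothetical illustration.
SPLITS = {
    "v1.0": {"clean": 90, "other": 230},
    "v2.0": {"clean": 211, "other": 400},
}


def total_hours(version: str) -> int:
    """Sum the clean and other-quality hours for one corpus version."""
    return sum(SPLITS[version].values())


def clean_share(version: str) -> float:
    """Fraction of the version's audio that falls in the clean split."""
    return SPLITS[version]["clean"] / total_hours(version)


for version in SPLITS:
    print(f"{version}: {total_hours(version)} h total, {clean_share(version):.0%} clean")
```

Under these figures, version 1.0 totals 320 hours and version 2.0 totals 611 hours.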
Preparation of this corpus was partly supported by the Department of Culture of the Catalan autonomous government, and the v2.0 was supported by the Barcelona Supercomputing Center, within the framework of the project AINA of the Department of Digital Policies.
The FestCat corpus was developed in 2007 by the TALP Research Center of the Polytechnic University of Catalonia for building open source TTS systems for Catalan. We reprocessed this corpus, optimizing it for building our neural-network-based TTS system, Catotron. Long segments were either split or discarded so that no audio segment exceeds 12 seconds. The male voice corpus Pau contains 6 hours 54 minutes of audio and the female voice corpus Ona contains 6 hours 12 minutes. Both are released under the Attribution-ShareAlike 3.0 Spain (CC BY-SA 3.0 ES) license.
Preparation of this corpus was supported by the Department of Culture of the Catalan autonomous government.
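The 12-second limit can be illustrated with a minimal sketch. The text does not say how split points were chosen or when a segment was discarded instead, so the equal-length split below is an assumption made for illustration; the `(start, end)` tuple representation and the function name `limit_segment_length` are likewise hypothetical.

```python
import math
from typing import List, Tuple

# (start_seconds, end_seconds); a hypothetical segment representation.
Segment = Tuple[float, float]

MAX_SECONDS = 12.0  # maximum audio length stated in the text


def limit_segment_length(
    segments: List[Segment],
    max_seconds: float = MAX_SECONDS,
    split_long: bool = True,
) -> List[Segment]:
    """Return segments that are no longer than max_seconds.

    Long segments are cut into equal parts when split_long is True
    (an assumed strategy) and dropped otherwise.
    """
    result: List[Segment] = []
    for start, end in segments:
        duration = end - start
        if duration <= max_seconds:
            result.append((start, end))
        elif split_long:
            # Smallest number of equal parts that keeps each part within the limit.
            parts = math.ceil(duration / max_seconds)
            step = duration / parts
            for i in range(parts):
                result.append((start + i * step, start + (i + 1) * step))
        # Otherwise the long segment is discarded.
    return result
```

For example, `limit_segment_length([(0.0, 5.0), (10.0, 40.0)])` keeps the first segment and cuts the 30-second one into three 10-second parts, while passing `split_long=False` drops it. A real pipeline would more plausibly cut at pauses than at fixed offsets.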
These are the ASR models that we trained with the CMUSphinx speech recognition toolkit, using the aforementioned corpora. We continue to maintain and improve the models in our repository. You can find installation and configuration guides, including tutorials on basic use cases, in the wiki.
sphinxtrain 5pre-alpha continuous model
sphinxtrain 5pre-alpha continuous model
For more information, you can refer to our paper published in Iberspeech 2018.
Catotron is the first free, open speech synthesis system based on neural networks. Col·lectivaT has led the development with funding from the Department of Culture of the Catalan autonomous government and with the participation of researchers from the Natural Language Processing research group (TALN) of Pompeu Fabra University and the Language and Speech Technologies and Applications Center of the Polytechnic University of Catalonia (UPC-TALP).
For more information, you can refer to our paper published in Interspeech 2020.
The preparation of this page was supported by the Culture Department of the Catalan autonomous government.