Resources for Speech Technologies

As part of our mission, we provide the public with open data and resources for speech technologies. You can find a detailed list here, with short explanations and references for further information. Please also visit our wiki for more details on speech technologies and the models we maintain.


Name                                Language  Type            License       Download
TV3Parla v0.3                       Catalan   acoustic model  GNU AGPL-3.0  link
TV3Parla+ParlamentParla v0.2        Catalan   acoustic model  GNU AGPL-3.0  link
TV3Parla Corpus v0.3                Catalan   audio corpus    CC-BY-NC 4.0  link
ParlamentParla Corpus - clean v1.0  Catalan   audio corpus    CC-BY 4.0     link
ParlamentParla Corpus - other v1.0  Catalan   audio corpus    CC-BY 4.0     link
ParlamentParla Corpus - old v0.3    Catalan   audio corpus    CC-BY 4.0     link
OpenSubtitles LM v1.0               Catalan   language model  CC-BY 4.0     link



Acoustic corpora

For the two projects we have completed, we gathered publicly available speech data and converted it into acoustic training corpora. These data sets are available for download under various open licenses.

  • TV3Parla

    This corpus includes 240 hours of Catalan speech from broadcast material. The details of segmentation, data processing and model training are explained in Külebi and Öktem (2018). The content is owned by Corporació Catalana de Mitjans Audiovisuals, SA (CCMA); we processed their material and hereby make it available under their terms of use.

    The corpus is available here under a CC BY-NC 4.0 license.

    You can cite the data using the BibTeX entry:

    @inproceedings{Külebi2018,
      author={Baybars Külebi and Alp Öktem},
      title={Building an Open Source Automatic Speech Recognition System for Catalan},
      year=2018,
      booktitle={Proc. IberSPEECH 2018},
      pages={25--29},
      doi={10.21437/IberSPEECH.2018-6},
      url={http://dx.doi.org/10.21437/IberSPEECH.2018-6}
    }
    


    This project was supported by the Softcatalà Association.

  • ParlamentParla

    We have gathered this audio corpus from the recordings and transcripts of the plenary sessions of the Catalan Parliament (Parlament de Catalunya). We aligned the transcriptions with the recordings and extracted the cleanest 320 hours to train speech models. The content belongs to the Catalan Parliament and the data is released in accordance with their terms of use.

    As of version 1.0, the corpus is available in two parts: 90 hours of clean segments and 230 hours of other-quality segments, both under a CC BY 4.0 license. In addition to the segmented audio files and the transcriptions, the v0.3 corpus also includes, for each intervention, links between the full text and the audio. In the near future we will also publish the parliamentary sessions in structured form (session id, speaker, intervention text, intervention duration, etc.).

    This project was supported by the Department of Culture of the Catalan autonomous government.
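
The hour counts quoted above can be checked after download by summing the durations of the audio segments. The following is a minimal sketch, assuming the segments are stored as WAV files under a local directory; the `total_hours` helper and the directory layout are hypothetical and not part of the corpus distribution.

```python
import wave
from pathlib import Path


def total_hours(corpus_dir: str) -> float:
    """Sum the durations of all .wav files under corpus_dir, in hours."""
    seconds = 0.0
    for path in Path(corpus_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as wav:
            # duration of one file = number of frames / sampling rate
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 3600.0
```

Calling `total_hours("path/to/corpus")` on an unpacked corpus directory (the path is a placeholder) should give a figure close to the hours stated for that corpus, if the WAV assumption holds.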

ASR models

These are the ASR models that we trained using the corpora described above. So far we have used the CMUSphinx speech recognition toolkit, which is the result of over 20 years of research at Carnegie Mellon University. Although the current state of the art is hybrid Hidden Markov Model (HMM) and Neural Network (NN) systems, such as those built with Kaldi, the pocketsphinx tool is still a leading choice for offline decoding in resource-limited environments such as hand-held devices. We continue to maintain and improve the models in our repository. You can find installation and configuration guides, including tutorials on basic use cases, in the wiki.
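
For readers who want to try one of the models offline, decoding a single recording with the CMUSphinx command-line tools can be sketched as follows. This assumes the older `pocketsphinx_continuous` executable (as shipped with the 5prealpha releases) is installed and on the PATH; the function names and all paths are placeholders, and the wiki remains the reference for the actual installation and configuration steps.

```python
import subprocess


def build_decode_command(hmm_dir, lm_path, dict_path, wav_path):
    """Assemble a pocketsphinx_continuous call for a single audio file."""
    return [
        "pocketsphinx_continuous",
        "-hmm", hmm_dir,      # directory of the acoustic model
        "-lm", lm_path,       # language model file
        "-dict", dict_path,   # pronunciation dictionary
        "-infile", wav_path,  # recording to decode
    ]


def decode(hmm_dir, lm_path, dict_path, wav_path):
    """Run the decoder and return whatever it writes to standard output."""
    cmd = build_decode_command(hmm_dir, lm_path, dict_path, wav_path)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

The command is built separately from the call that runs it, so the arguments can be inspected or logged before any decoding takes place.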

Here is a list of the latest CMUSphinx models:

The preparation of this page was supported by the Department of Culture of the Catalan autonomous government.