Dec 5, 2019
Here you can find all the information related to Catotron, the free and open text-to-speech system for Catalan trained with deep neural networks. Our Interspeech 2020 conference paper detailing the implementation can be accessed here. The open-source code is available on GitHub here and here, and the models can be downloaded here.
You can try the Ona voice via our demo page.
Thanks to deep learning methods, text-to-speech (TTS) systems have advanced considerably in recent years. The most important change has been the introduction of vocoders (voice encoders) trained with neural networks. The first example of this technology was WaveNet, published in 2016 and developed by DeepMind, a company owned by Google.
The vocoders currently used in TTS architectures such as Tacotron and Tacotron2 are trained with neural networks, and various open-source implementations of these technologies are available. They represent the state of the art in TTS and produce the best results to date in terms of the intelligibility and naturalness of the synthesized speech.
Unfortunately, training these systems requires access to large amounts of data and computational power. That is why there have been hardly any models published under open licenses for languages other than English. Thanks to a project funded by the Ministry of Culture of the Catalan Government (Departament de Cultura), we trained TTS models for Catalan with neural networks and published them under open licenses. Here we present the results, our experience, and the details of how to integrate this technology.
To build Catotron we used NVIDIA's Tacotron2 and Waveglow repositories, which are published on GitHub under open-source licences. One of the most important results of the project is the code itself, namely our fork of Tacotron2: it has been modified for the Catalan language and is indispensable for using the Catalan models.
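Adapting a Tacotron2 front-end to a new language mostly means changing the text-processing step that runs before the network sees the input. As a rough illustration only, the sketch below shows the kind of normalization a Catalan front-end needs; the function names, abbreviation list, and number table are illustrative assumptions, not the actual code of the Catotron fork.

```python
import re

# Illustrative subset: single digits spelled out in Catalan.
CATALAN_NUMBERS = {
    "0": "zero", "1": "u", "2": "dos", "3": "tres", "4": "quatre",
    "5": "cinc", "6": "sis", "7": "set", "8": "vuit", "9": "nou",
}

# Illustrative subset of common Catalan abbreviations.
ABBREVIATIONS = {
    "sr.": "senyor",
    "sra.": "senyora",
    "dr.": "doctor",
}

def normalize_catalan(text: str) -> str:
    """Lowercase, expand common abbreviations, and spell out digits."""
    text = text.lower()
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Spell out isolated digits one by one (a real front-end would also
    # handle multi-digit numbers, dates, and ordinals).
    text = re.sub(r"\d", lambda m: " " + CATALAN_NUMBERS[m.group(0)] + " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_catalan("El Sr. Puig té 3 gats."))
# → el senyor puig té tres gats.
```

A real TTS front-end would run a normalization like this on every input sentence, so that the acoustic model only ever sees fully spelled-out, lowercase text.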
Here we present some example recordings. The chosen phrases belong to the validation data set and were therefore not used to train the Catotron models.
We also ran experiments with the ParlamentParla data set and built a model of Artur Mas, the person with the most recorded hours in the data set. We used this test to estimate how much data is needed to train a model. Due to privacy concerns we do not publish this model, but we present some inference examples. The phrases belong to the validation recordings and were therefore not used to train the model.
The same phrases read by each of the model voices:
| Festcat Ona | Festcat Pau | Artur Mas |
For now, non-developers can synthesize speech only via the Google Colaboratory interface. With this online IPython notebook, Google's GPUs can be used free of charge. In a few weeks we will publish a demo web page where visitors can synthesize short sentences.
How to train new voices: The published models can already be adapted to synthesize any voice. This is done via transfer learning, starting from the published models and using recordings of the new speaker. Our catotron-transfer-learning.ipynb example gives a step-by-step guide: in this particular guide, the Ona model is adapted to the voice of Pau, using Google's computing resources.
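The transfer-learning recipe above boils down to warm-starting a new model from a pretrained checkpoint and fine-tuning it on the new speaker's recordings. The sketch below illustrates the idea with plain PyTorch, under the assumption that the model is a standard `nn.Module`; `TinyTacotron` is a stand-in, not the real Tacotron2 architecture, and the freezing choice is only one possible strategy.

```python
import torch
import torch.nn as nn

class TinyTacotron(nn.Module):
    """Toy stand-in for a seq2seq TTS model (not the real Tacotron2)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 16)   # stands in for the text encoder
        self.decoder = nn.Linear(16, 80)  # stands in for the mel decoder

    def forward(self, x):
        return self.decoder(torch.tanh(self.encoder(x)))

# 1. Save a checkpoint of the base voice (here: a freshly built model;
#    in practice this would be the downloaded Ona checkpoint).
base = TinyTacotron()
torch.save(base.state_dict(), "ona_checkpoint.pt")

# 2. Warm-start a new model from the base checkpoint.
new_voice = TinyTacotron()
new_voice.load_state_dict(torch.load("ona_checkpoint.pt"))

# 3. Optionally freeze the text encoder so only the decoder adapts
#    to the new speaker's recordings.
for p in new_voice.encoder.parameters():
    p.requires_grad = False

# 4. Fine-tune only the trainable parameters on the new speaker's data.
optimizer = torch.optim.Adam(
    (p for p in new_voice.parameters() if p.requires_grad), lr=1e-4
)
```

The key point is step 2: starting from weights that already know how to map text to speech means far less data is needed for the new voice than for training from scratch.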
These resources were developed thanks to the project «Síntesi de la parla contra la bretxa digital» (Speech synthesis against the digital divide), subsidised by the Department of Culture. Part of the funding comes from financing administered by the inheritance board of the Generalitat de Catalunya.