WANG Guo-liang, CHEN Meng-nan, CHEN Lei. An end-to-end Chinese speech synthesis scheme based on Tacotron 2[J]. Journal of East China Normal University (Natural Sciences), 2019, (4): 111-119. doi: 10.3969/j.issn.1000-5641.2019.04.011

An end-to-end Chinese speech synthesis scheme based on Tacotron 2

doi: 10.3969/j.issn.1000-5641.2019.04.011
  • Received Date: 2018-10-28
  • Publish Date: 2019-07-25
  • Tacotron 2, a disruptively designed end-to-end speech synthesis system, is currently available only for English. This paper implements several improvements to Tacotron 2 and presents a Chinese speech synthesis scheme, including: a pre-processing module that converts Chinese characters into phonetic transcriptions, addressing the challenges that Chinese characters do not correspond directly to their pronunciations, carry multiple tones, and include polyphonic words; a pre-trained decoder that achieves better sound quality from a smaller corpus, given the scarcity of existing Chinese training corpora; a strategy that weights the cross-entropy loss and uses a multi-layer perceptron, instead of a linear transformation, to predict stop tokens, solving the sudden-pause problem in Chinese speech synthesis; and a multi-head attention mechanism that further improves Chinese speech quality. Experimental comparisons of the Mel spectrogram and the Mel cepstrum distance (MCD) show that our work is effective and adapts Tacotron 2 to the requirements of Chinese speech synthesis.
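To make the pre-processing step concrete, the sketch below converts Chinese characters into tone-numbered pinyin, which makes tones and polyphonic readings explicit for the model. The abstract does not name a specific converter, so the open-source pypinyin library and the tone-number (TONE3) style are illustrative assumptions, not the paper's actual pipeline.

    # Minimal grapheme-to-phoneme sketch (assumption: pypinyin stands in
    # for the paper's unspecified Chinese-to-phonetic converter).
    from pypinyin import pinyin, Style

    def to_phonetic(text: str) -> str:
        # Style.TONE3 appends the tone as a digit (e.g. "zhong1 guo2"),
        # so tone and syllable are both visible to the synthesis model.
        syllables = pinyin(text, style=Style.TONE3, errors="ignore")
        return " ".join(s[0] for s in syllables)

    print(to_phonetic("中国"))  # -> "zhong1 guo2"

The stop-token fix can likewise be sketched in a few lines of PyTorch: a small multi-layer perceptron replaces Tacotron 2's single linear projection, and the binary cross-entropy loss is weighted toward the rare positive (stop) frames. The hidden size and the weight value below are illustrative assumptions rather than values reported in the paper.

    import torch
    import torch.nn as nn

    class StopTokenPredictor(nn.Module):
        """MLP stop-token head in place of a single linear projection."""
        def __init__(self, decoder_dim: int = 1024, hidden_dim: int = 256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(decoder_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, decoder_state: torch.Tensor) -> torch.Tensor:
            return self.mlp(decoder_state).squeeze(-1)  # one logit per frame

    # Weighted cross-entropy: stop frames are vastly outnumbered by
    # non-stop frames, so the positive class is up-weighted
    # (pos_weight=8.0 is an illustrative assumption).
    stop_loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([8.0]))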
