

An end-to-end Chinese speech synthesis scheme based on Tacotron 2

WANG Guo-liang, CHEN Meng-nan, CHEN Lei

Citation: WANG Guo-liang, CHEN Meng-nan, CHEN Lei. An end-to-end Chinese speech synthesis scheme based on Tacotron 2[J]. Journal of East China Normal University (Natural Sciences), 2019, (4): 111-119. doi: 10.3969/j.issn.1000-5641.2019.04.011


doi: 10.3969/j.issn.1000-5641.2019.04.011
Article information
    About the author:

    WANG Guo-liang, male, master's degree, senior engineer; long engaged in the construction and management of electric power informatization

    Corresponding author:

    CHEN Lei, female, Ph.D., associate professor; her research focuses on computer networks and intelligent systems. E-mail: lchen@cs.ecnu.edu.cn

  • CLC number: TP391


  • Abstract: Tacotron 2, an end-to-end speech synthesis system of disruptive design, currently handles English only. This work improves Tacotron 2 in several respects to obtain a Chinese speech synthesis scheme. First, since Chinese characters are not phonetic and exhibit tone sandhi and polyphony, a preprocessing module converts Chinese text into phonetic symbols. Second, given the shortage of Chinese training corpora, a pre-trained decoder is used, yielding good audio quality from relatively little data. Third, to address abrupt pauses in synthesized Chinese speech, the stop token is predicted with a weighted cross-entropy loss and a multilayer perceptron in place of a linear transformation, which brings a marked improvement. Finally, adding a multi-head attention mechanism further raises the quality of the synthesized Chinese speech. Comparative experiments on Mel spectrograms, Mel-cepstral distortion (MCD), and other measures confirm the effectiveness of the scheme: it adapts Tacotron 2 well to the requirements of Chinese speech synthesis.
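Two of the modifications above are concrete enough to sketch in code. First, the preprocessing module that turns Chinese characters into phonetic symbols: the abstract does not name a grapheme-to-phoneme tool, so the use of pypinyin below, and the tone-numbered output style, are assumptions for illustration only.

```python
# A minimal sketch of the preprocessing step, assuming pypinyin as the
# grapheme-to-phoneme tool (the paper's actual tool is not named here).
from pypinyin import Style, lazy_pinyin

def text_to_phonetic(text: str) -> str:
    # Style.TONE3 appends the tone digit to each syllable ("yu3"), which
    # preserves tone information that raw Chinese characters do not expose.
    return " ".join(lazy_pinyin(text, style=Style.TONE3))

print(text_to_phonetic("语音合成"))  # -> yu3 yin1 he2 cheng2
```

Second, the stop-token strategy: a multilayer perceptron replaces the single linear projection, and the cross-entropy loss is weighted toward the rare "stop" frames. The PyTorch framing, the layer sizes, and the weight value below are illustrative assumptions, not values taken from the paper.

```python
# A hedged sketch of the stop-token head: MLP instead of a linear layer,
# trained with a positively weighted binary cross-entropy loss.
import torch
import torch.nn as nn

class StopTokenPredictor(nn.Module):
    def __init__(self, d_in: int = 1024, d_hidden: int = 256):  # sizes assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, decoder_state: torch.Tensor) -> torch.Tensor:
        # One stop logit per decoder frame.
        return self.mlp(decoder_state).squeeze(-1)

# Each utterance ends in a single "stop" frame among hundreds of "continue"
# frames, so the positive class is up-weighted; 8.0 is a placeholder value.
stop_loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(8.0))
```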
  • Fig. 1 Architecture of Tacotron 2

    Fig. 2 System architecture of the Transformer

    Fig. 3 Proposed Chinese end-to-end speech synthesis scheme

    Fig. 4 Mel spectrograms after 100,000 training steps, with and without pre-training

    Tab. 1 Comparison of MCD among the original Tacotron 2 and variants with 2-head and 4-head attention

    Model            MCD
    Tacotron 2-Base  20.03
    2-head           18.1
    4-head           17.26

    Tab. 2 Comparison of MCD between Chinese Tacotron 2 and English Tacotron 2

    Model               MCD
    Chinese Tacotron 2  17.11
    English Tacotron 2  18.06 [21]
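The MCD values in Tables 1 and 2 follow the mel-cepstral distance measure of Kubichek [20]. Below is a minimal sketch of the metric, assuming the reference and synthesized mel-cepstra have already been extracted and time-aligned (for example by dynamic time warping); the alignment step and the choice of coefficients are assumptions, not details from the paper.

```python
# Minimal mel-cepstral distortion (MCD) sketch over two aligned sequences.
import numpy as np

def mcd(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """ref_mcep, syn_mcep: shape (frames, coeffs), 0th coefficient excluded."""
    diff = ref_mcep - syn_mcep
    # Per frame: (10 / ln 10) * sqrt(2 * sum_i (c_i - c'_i)^2), in dB.
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * (diff ** 2).sum(axis=1))
    return float(per_frame.mean())
```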
  • [1] MOHAMMADI S H, KAIN A. An overview of voice conversion systems[J]. Speech Communication, 2017, 88: 65-82. doi: 10.1016/j.specom.2017.01.008
    [2] GONZALVO X, TAZARI S, CHAN C A, et al. Recent advances in Google real-time HMM-driven unit selection synthesizer[C]//Interspeech 2016. 2016: 2238-2242.
    [3] ZEN H, AGIOMYRGIANNAKIS Y, EGBERTS N, et al. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices[C]//Interspeech 2016. 2016: 2273-2277.
    [4] TAYLOR P. Text-to-Speech Synthesis[M]. Cambridge: Cambridge University Press, 2009.
    [5] WANG Y, SKERRY-RYAN R J, STANTON D, et al. Tacotron: Towards end-to-end speech synthesis[J]. arXiv preprint arXiv:1703.10135, 2017.
    [6] SHEN J, PANG R, WEISS R J, et al. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 4779-4783.
    [7] VAN DEN OORD A, DIELEMAN S, ZEN H, et al. WaveNet: A generative model for raw audio[J]. arXiv preprint arXiv:1609.03499, 2016.
    [8] OORD A, LI Y, BABUSCHKIN I, et al. Parallel WaveNet: Fast high-fidelity speech synthesis[J]. arXiv preprint arXiv:1711.10433, 2017.
    [9] ARIK S O, CHRZANOWSKI M, COATES A, et al. Deep Voice: Real-time neural text-to-speech[J]. arXiv preprint arXiv:1702.07825, 2017.
    [10] ARIK S, DIAMOS G, GIBIANSKY A, et al. Deep Voice 2: Multi-speaker neural text-to-speech[J]. arXiv preprint arXiv:1705.08947, 2017.
    [11] PING W, PENG K, CHEN J. ClariNet: Parallel wave generation in end-to-end text-to-speech[J]. arXiv preprint arXiv:1807.07281, 2018.
    [12] PRENGER R, VALLE R, CATANZARO B. WaveGlow: A flow-based generative network for speech synthesis[J]. arXiv preprint arXiv:1811.00002, 2018.
    [13] OORD A, KALCHBRENNER N, KAVUKCUOGLU K. Pixel recurrent neural networks[J]. arXiv preprint arXiv:1601.06759, 2016.
    [14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//31st Annual Conference on Neural Information Processing Systems. NIPS, 2017: 5998-6008.
    [15] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//28th Annual Conference on Neural Information Processing Systems. NIPS, 2014: 3104-3112.
    [16] FREEMAN P, VILLEGAS E, KAMALU J. Storytime: End-to-end neural networks for audiobooks[R/OL]. [2018-08-28]. http://web.stanford.edu/class/cs224s/reports/PierceFreeman.pdf.
    [17] GRIFFIN D, LIM J. Signal estimation from modified short-time Fourier transform[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(2): 236-243. doi: 10.1109/TASSP.1984.1164317
    [18] WANG D, ZHANG X W. THCHS-30: A free Chinese speech corpus[J]. arXiv preprint arXiv:1512.01882, 2015.
    [19] CHUNG Y A, WANG Y, HSU W N, et al. Semi-supervised training for improving data efficiency in end-to-end speech synthesis[J]. arXiv preprint arXiv:1808.10128, 2018.
    [20] KUBICHEK R. Mel-cepstral distance measure for objective speech quality assessment[C]//IEEE Pacific Rim Conference on Communications, Computers and Signal Processing. IEEE, 1993: 125-128.
Publication history
  • Received: 2018-10-28
  • Published: 2019-07-25
