Citation: WANG Guo-liang, CHEN Meng-nan, CHEN Lei. An end-to-end Chinese speech synthesis scheme based on Tacotron 2[J]. Journal of East China Normal University (Natural Sciences), 2019, (4): 111-119. doi: 10.3969/j.issn.1000-5641.2019.04.011
[1] MOHAMMADI S H, KAIN A. An overview of voice conversion systems[J]. Speech Communication, 2017, 88: 65-82. doi: 10.1016/j.specom.2017.01.008
[2] GONZALVO X, TAZARI S, CHAN C A, et al. Recent advances in Google real-time HMM-driven unit selection synthesizer[C]//Interspeech 2016. 2016: 2238-2242.
[3] ZEN H, AGIOMYRGIANNAKIS Y, EGBERTS N, et al. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices[C]//Interspeech 2016. 2016: 2273-2277.
[4] TAYLOR P. Text-to-Speech Synthesis[M]. Cambridge: Cambridge University Press, 2009.
[5] WANG Y, SKERRY-RYAN R J, STANTON D, et al. Tacotron: Towards end-to-end speech synthesis[J]. arXiv preprint arXiv:1703.10135, 2017.
[6] SHEN J, PANG R, WEISS R J, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 4779-4783.
[7] VAN DEN OORD A, DIELEMAN S, ZEN H, et al. WaveNet: A generative model for raw audio[J]. arXiv preprint arXiv:1609.03499, 2016.
[8] VAN DEN OORD A, LI Y, BABUSCHKIN I, et al. Parallel WaveNet: Fast high-fidelity speech synthesis[J]. arXiv preprint arXiv:1711.10433, 2017.
[9] ARIK S O, CHRZANOWSKI M, COATES A, et al. Deep Voice: Real-time neural text-to-speech[J]. arXiv preprint arXiv:1702.07825, 2017.
[10] ARIK S O, DIAMOS G, GIBIANSKY A, et al. Deep Voice 2: Multi-speaker neural text-to-speech[J]. arXiv preprint arXiv:1705.08947, 2017.
[11] PING W, PENG K, CHEN J. ClariNet: Parallel wave generation in end-to-end text-to-speech[J]. arXiv preprint arXiv:1807.07281, 2018.
[12] PRENGER R, VALLE R, CATANZARO B. WaveGlow: A flow-based generative network for speech synthesis[J]. arXiv preprint arXiv:1811.00002, 2018.
[13] VAN DEN OORD A, KALCHBRENNER N, KAVUKCUOGLU K. Pixel recurrent neural networks[J]. arXiv preprint arXiv:1601.06759, 2016.
[14] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//31st Annual Conference on Neural Information Processing Systems. NIPS, 2017: 5998-6008.
[15] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//28th Annual Conference on Neural Information Processing Systems. NIPS, 2014: 3104-3112.
[16] FREEMAN P, VILLEGAS E, KAMALU J. Storytime: End to end neural networks for audiobooks[R/OL]. [2018-08-28]. http://web.stanford.edu/class/cs224s/reports/PierceFreeman.pdf.
[17] GRIFFIN D, LIM J. Signal estimation from modified short-time Fourier transform[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(2): 236-243. doi: 10.1109/TASSP.1984.1164317
[18] WANG D, ZHANG X W. THCHS-30: A free Chinese speech corpus[J]. arXiv preprint arXiv:1512.01882, 2015.
[19] CHUNG Y A, WANG Y, HSU W N, et al. Semi-supervised training for improving data efficiency in end-to-end speech synthesis[J]. arXiv preprint arXiv:1808.10128, 2018.
[20] KUBICHEK R. Mel-cepstral distance measure for objective speech quality assessment[C]//Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing. IEEE, 1993: 125-128.