ECE 598: Speech Synthesis

Richard Sproat

Fall 2004

MW 2-3:20, Everitt 168

Office Hours: Wednesdays 10-11, Beckman 2057


Overview

This course will cover all aspects of Text-to-Speech synthesis (TTS), the technology of getting computers to talk like humans. After a general introduction to the field and its history, we will examine in depth each of the areas of linguistics and speech technology involved in building a TTS system. TTS is an inherently multidisciplinary area, and understanding it involves areas ranging from articulatory and acoustic phonetics, through the linguistic analysis of text and prosody, to signal processing.

Syllabus

Each entry gives the week and dates, the topic, the readings (assigned reading first, then reading for discussion with its discussant), and any homework.

Week 1 (8/25)
Topic: History, Overview, Festival
Readings: Dutoit, 1 (no discussion reading)
Homework: Homework 1

Week 2 (8/30, 9/1)
Topic: The Vocal Tract (Acoustic Phonetics) (See UCLA pages on Sounds of World's Languages)
Readings: Fant, 1970, 1.4--2.4 (pp. 63--161); Klatt, 1987 (discussant: Hahn Koo)

Week 3 (9/8)
Topic: Text Normalization; Morphological Analysis and Word Pronunciation; Syntactic Analysis
Readings: Dutoit, 2, 3, 4, 5; Sproat 2.5, 3, 4; Sproat et al. 2001; Fant, 1970, 1.4 (pp. 63--90) (discussant: Sandeep Phatak)
Homework: Homework 2

Week 4 (9/13, 9/15)
Topic: More on text analysis
Readings: assigned reading as in week 3; Yarowsky, 1996, chapters 1-6 (discussant: Cecilia Alm)

Week 5 (9/20, 9/22)
Topic: Even more on text analysis
Readings: assigned reading as in week 4; Golding & Rosenbloom, 1996; Jannedy & Möbius, 1997; Marchand & Damper, 2000 (discussant: Liam Moran)
Homework: Homework 3

Week 6 (9/27, 9/29)
Topic: Theories and Models of Intonation and Prosody (Warning 20Mb)
Note: Guest lecture on 9/29 by Chilin Shih, FLB G96
Readings: Dutoit 6; Sproat 6; Silverman et al. 1992; van Santen & Möbius, 2000; Taylor, 2000; Kochanski & Shih, 2003. No discussion this week.

Week 7 (10/4, 10/6)
Topic: Finite-state approaches to multilingual text processing
Note: Project proposals due
Readings: Liberman & Pierrehumbert 1984 (discussant: Taejin Yoon)
Homework: Homework 4

Week 8 (10/11, 10/13)
Topic: Accent and Phrasing Prediction; Segmental Duration
Readings: Sproat 5; Bachenko & Fitzpatrick, 1990; Wang & Hirschberg, 1992; Taylor & Black, 1998 (discussant: Jason Strohmaier)
Homework: Homework 5

Week 9 (10/18, 10/20)
Topic: Linear Predictive Coding; Formant Synthesizers; The Klatt Synthesizer
Readings: Dutoit 8; Klatt, 1980. No discussion this week.

Week 10 (10/25, 10/27)
Topic: Time-domain models; Fixed-unit systems; Unit-Selection Approaches (CHATR and progeny); Large Numbers of Rare Events (LNRE) Distributions
Readings: Dutoit, 9, 10; Sproat 7; Möbius (2001); Boersma, 1998, Chs. 1--3 (discussant: Andreas Ehmann)

Week 11 (11/1, 11/3)
Topic: Unit-Selection Approaches Continued, LNRE
Note: Project progress oral reports

Week 12 (11/8, 11/10)
Topic: "Limited Domain" Applications: talking clock; Evaluation Methods; Document Structure: email readers, reading for the blind
Readings: Sproat 8, 9; Sproat, Hu and Chen, 1998; Raman, 1994; van Santen, 1993; van Santen et al., 1998; SpeechWorks, 2002 (discussant: Alla Rozovskaya)

Week 13 (11/15, 11/17)
Topic: TTS and the WWW, Markup for TTS, SABLE and the W3C; Concept-to-Speech Systems
Readings: Sproat, 9; Sproat et al. 1998; Sproat and Raman (no date); SABLE specification; W3C's Speech Synthesis Markup Language Version 1.0; Davis and Hirschberg 1988. No discussion this week.

Week 14 (11/29, 12/1)
Topic: Other issues: Visual TTS, Future Directions and Wrap-up
Readings: Ezzat and Poggio, 1998; Graf and Cosatto, 2001; Cassell et al. 1994; Sproat, Ostendorf and Hunt, 1999
Note: Project final presentations.

Week 15 (12/6, 12/8)
Note: Project final presentations.

Course Requirements

There are two texts for this course, cited in the syllabus as "Dutoit" and "Sproat".
Dutoit's book is a good overall introduction to text-to-speech synthesis. Dutoit works mostly at the signal processing end of TTS, so his coverage of signal processing approaches is particularly thorough.

The volume I edited is an overview of one particular system, the Bell Labs Multilingual TTS system, which was quite influential in its day. It includes extended coverage of the approach to text normalization using finite-state transducers.

We will also be using the Festival TTS System as well as the AT&T lextools and fsm packages. Documentation on the Festival system is available online, as is a description of how to use the lextools system in TTS (look for the link to the paper).

The requirements for a grade in this course are:

Online Resources Relating to Speech Synthesis

The following are some of the online demos and other resources on text-to-speech synthesis. With two exceptions, I am restricting this list to sites where you can type your own text: there are many sites available that have canned examples, but for reasons discussed in the course, these are of limited utility. A Google search on "text-to-speech synthesis" will yield many hits.

Bibliography

  1. Baayen, H. 2000. Word Frequency Distributions. Kluwer. Dordrecht.
  2. Bacchiani, M. 1999. Speech Recognition Design Based on Automatically Derived Units. Boston University Ph.D. Dissertation. PDF.
  3. Bachenko, Joan and Fitzpatrick, Eileen. "A computational grammar of discourse-neutral prosodic phrasing in English." Computational Linguistics, 16(3):155--170, 1990. PDF
  4. Black, A. and Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. Proceedings of Eurospeech 97, vol. 2, pp. 601-604, Rhodes, Greece.
  5. Black, A. and Lenzo, K. (2001): "Optimal Data Selection for Unit Selection Synthesis". Proceedings of the 4th ISCA Tutorial and Research Workshop on Speech Synthesis (Blair Atholl, UK). PDF
  6. Boersma, Paul. 1998. Functional Phonology: Formalizing the Interactions between Articulatory and Perceptual Drives. Ph.D. Dissertation, University of Amsterdam. Available here.
  7. Breen, A. and Jackson, P. 1998. Non-uniform Unit Selection and the Similarity Metric within BT's Laureate TTS System. Third ESCA/COCOSDA Workshop on Speech Synthesis. November 26-29, 1998 Jenolan Caves House, Blue Mountains, NSW, Australia
  8. Mark Beutnagel and Alistair Conkie, Interaction of units in a unit selection database. Proceedings of the European Conference on Speech Communication and Technology (Budapest, Hungary), 1999, vol. 3, pp. 1063-1066.
  9. Cassell, J.; Pelachaud, C.; Badler, N.; Steedman, M.; Achorn, B.; Becket, T.; Douville, B.; Prevost, S.; and Stone, M. 1994. "Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents." In Proceedings of ACM SIGGRAPH '94. Citeseer.
  10. Conkie, A. and Isard, S., "Optimal coupling of diphones," Progress in Speech Synthesis, Springer, 1997, pp. 293-304.
  11. Cruttenden, A. (1986). Intonation. Cambridge: CUP.
  12. Davis, J. and Hirschberg, J. 1988. "Assigning intonational features in synthesized spoken directions" ACL 26 187 - 193. PDF.
  13. Donovan, R. (1996). Trainable Speech Synthesis. Cambridge University Ph.D. dissertation. PDF
  14. R. Donovan and P. Woodland, "Improvements in an HMM-based speech synthesiser," in Eurospeech95, Madrid, Spain, 1995, vol. 1, pp. 573--576.
  15. Dusterhoff, K. E., Black, A. W., and Taylor, P. (1999). Using decision trees within the tilt intonation model to predict f0 contours. In Eurospeech 1999. PDF.
  16. Ezzat, T. and Poggio, T. "MikeTalk: A Talking Facial Display Based on Morphing Visemes", Proceedings of the Computer Animation Conference, Philadelphia, Pennsylvania, June 1998. PDF
  17. Fant, Gunnar. Acoustic Theory of Speech Production. 1970. Mouton, The Hague.
  18. Fitzpatrick, Donal. 1999. Towards Accessible Technical Documents: Production of Speech and Braille Output from Formatted Documents. Ph.D. thesis, Dublin City University.
  19. Fujisaki, H. (1981). "A note on the physiological and physical basis for the phrase and accent components in the voice fundamental frequency contour." In Fujimura, O., editor, Vocal Fold Physiology: Voice Production, Mechanisms and Functions, pages 347-355. Raven, New York.
  20. Fujisaki, H. (1983). Dynamic characteristics of voice fundamental frequency in speech and singing. In MacNeilage, P. F., editor, The Production of Speech, pages 39-55. Springer-Verlag.
  21. Golding, A. R. and Rosenbloom, P. S. A comparison of Anapron with seven other name-pronunciation systems. Journal of the American Voice Input/Output Society, 14, 1993. http://citeseer.ist.psu.edu/golding96comparison.html
  22. Graf, Hans-Peter and Cosatto, Eric. 2001. "Sample-based Synthesis of Talking Heads". PDF
  23. 't Hart, J., & Cohen, A. (1973). Intonation by rule: a perceptual quest. Journal of Phonetics, 1, 309-327.
  24. 't Hart, J., & Collier, R. (1975). Integrating different levels of intonation analysis. Journal of Phonetics, 3, 235-55.
  25. Hirschberg, J. 1993. "Pitch Accent in Context: Predicting Intonational Prominence from Text". Artificial Intelligence. PDF
  26. Hunt, M., Zwierynski, D. and Carr, R. "Issues in high quality LPC analysis and synthesis", Eurospeech89, vol. 2, pp 348-351, Paris, France. 1989.
  27. Hunt, A. and Black, A. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of ICASSP 96, vol 1, pp 373-376, Atlanta, Georgia. PS. PDF.
  28. Jannedy, S. and Möbius, B. "Name Pronunciation in German Text-to-Speech Synthesis". Proceedings of the Fifth Conference on Applied Natural Language Processing, Association for Computational Linguistics, pp. 49-56. March 31-April 3, 1997. Washington, DC. http://acl.ldc.upenn.edu/A/A97/A97-1009.pdf
  29. Klatt, D.H., "Software for a Cascade/Parallel Formant Synthesizer", Journal of the Acoustical Society of America, 67.3, 971-995, March 1980. Online version
  30. Klatt, D.H., "Review of Text-to-Speech Conversion for English", Journal of the Acoustical Society of America, 82.3, 737-793, Sept 1987. Online version
  31. Kochanski, G. P. and Shih, C. (2003). Prosody modeling with soft templates. Speech Communication. 39, 311--352 PDF.
  32. Ladd, D. R. (1996). Intonational Phonology. Cambridge University Press, Cambridge.
  33. Liberman, M. Y. and Pierrehumbert, J. B. (1984). "Intonational invariance under changes in pitch range and length." In Aronoff, M. and Oehrle, R., editors, Language Sound Structure, pages 157-233. M.I.T. Press, Cambridge, Massachusetts.
  34. Malfrere, F. and Dutoit, T. 1997. High-Quality Speech Synthesis for Phonetic Speech Segmentation. In Proceedings of Eurospeech97, Rhodes, Greece,
  35. Marchand, M. and Damper, R. "A multistrategy approach to improving pronunciation by analogy". Computational Linguistics, 26(2):195--219, June 2000. http://acl.ldc.upenn.edu/J/J00/J00-2003.pdf
  36. Bernd Möbius. (2001): "Rare events and closed domains: two delicate concepts in speech synthesis". Proceedings of the 4th ISCA Tutorial and Research Workshop on Speech Synthesis (Blair Atholl, UK), 41-46 PDF. (Also in International Journal of Speech Technology, 6(1), 57-71, 2003)
  37. Öhman, S. (1967). "Word and sentence intonation, a quantitative model." Technical report, Department of Speech Communication, Royal Institute of Technology (KTH).
  38. Shimei Pan and Kathleen McKeown (1997), "Integrating Language Generation with Speech Synthesis in a Concept to Speech system". In the Proceedings of ACL/EACL'97 Concept to Speech workshop, pages 23-28, Madrid, Spain. PS
  39. Shimei Pan and Kathy McKeown (1999), "Word Informativeness and Automatic Pitch Accent Modeling". In the Proceedings of EMNLP/VLC'99, College Park, Maryland. PS
  40. de Pijper, J. R. (1983). Modelling British English Intonation. Foris Publications, Dordrecht, Holland.
  41. Raman, T.V. 1994. Audio System for Technical Readings. Ph.D. Thesis. Cornell University. PDF.
  42. Sagisaka, Y.; Campbell, N.; Higuchi, N. Computing Prosody, Springer, 1997.
  43. Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, vol. 3, pp. 1315-1318, June 2000. PDF
  44. van Santen, J. P. H., Perceptual experiments for diagnostic testing of text-to-speech systems, Computer Speech and Language, 1993, 7:49--100. PDF
  45. van Santen, J.P.H., "Assignment of segmental duration in text-to-speech synthesis", Computer Speech and Language, 8, 95-128, 1994.
  46. van Santen, Jan; Pols, Louis; Abe, Masanobu; Kahn, Dan; Keller, Eric; and Vonwiller, Julie. 1998. "Report on the Third ESCA TTS Workshop Evaluation Procedure". Proceedings of the Third ESCA TTS Workshop. PDF
  47. van Santen, J., and Möbius, B. 2000. "A quantitative model of F0 generation and alignment". In Antonis Botinis (ed.) Intonation - Analysis, Modelling and Technology (Kluwer, Dordrecht), 269-288, Prepublication version.
  48. Shih, C; Kochanski, G. 2001. "Synthesis of Prosodic Styles". Proceedings of Fourth ISCA Workshop on Speech Synthesis, Scotland. PDF.
  49. Shih, C; Kochanski, G. 2003. "Modeling of Vocal Styles Using Portable Features and Placement Rules." International Journal of Speech Technology. 6:393-408.
  50. Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., and Hirschberg, J. (1992). "ToBI: A standard for labeling English prosody." Proceedings of the International Conference on Spoken Language Processing, volume 2. PDF
  51. SpeechWorks. 2002. "Evaluating TTS Systems". Whitepaper. SpeechWorks International. PDF
  52. Sproat, R. W., "English noun-phrase accent prediction for text-to-speech," Computer Speech & Language, 1994.
  53. Sproat, R; Hunt, A; Ostendorf, M; Taylor, P; Black, A; Lenzo, K; Edgington, M. "SABLE: A Standard for TTS Markup", International Conference on Spoken Language Processing, 1998. Online version. PDF
  54. Sproat, R.; Hu, J.; Chen, H. 1998. "EMU: An E-mail Preprocessor for Text-to-Speech," IEEE Signal Processing Society Workshop on Multimedia Signal Processing, Los Angeles, CA. PDF; HTML version.
  55. Sproat, R; Ostendorf, M; Hunt, A. 1999. "The Need for Increased Speech Synthesis Research: Report of the 1998 NSF Workshop for Discussing Research Priorities and Evaluation Strategies in Speech Synthesis". PDF
  56. Sproat, R; Black, A; Chen, S; Kumar, S; Ostendorf, M; Richards, C. "Normalization of non-standard words." Computer Speech and Language, 15(3), 287-333, 2001. PDF
  57. Sproat, R; Raman, T.V., no date. "SABLE: An XML-based Aural Display List for the WWW". Online Version.
  58. R. Sproat, P. Taylor, M. Tanenblatt and A. Isard, 1997. "A markup language for text-to-speech synthesis," Proceedings of EUROSPEECH 97, Rhodes, Greece. PDF
  59. Taylor, P. A. 2000. "Analysis and synthesis of intonation using the Tilt model." Journal of Acoustical Society of America, 107(3):1697-1714. Prepublication version
  60. P. Taylor, "Concept-to-Speech synthesis by phonological structure matching," Philosophical Transactions of the Royal Society, Series A. 356(1769):1403--1416, 2000. http://citeseer.ist.psu.edu/taylor00concepttospeech.html
  61. Taylor, P.; Black, A. "Assigning phrase breaks from part-of-speech sequences," Computer Speech and Language, vol. 12, pp. 99-117, 1998. PDF.
  62. Paul Taylor and Alan W Black (1999). Speech Synthesis by Phonological Structure Matching. Eurospeech99. PS
  63. P. A. Taylor and A. Isard, "SSML: A speech synthesis markup language," Speech Communication, no. 21, pp. 123-133. 1997. http://citeseer.ist.psu.edu/taylor97ssml.html
  64. Yarowsky, D. Three Machine Learning Algorithms for Lexical Ambiguity Resolution, Ph.D. Dissertation, U. of Pennsylvania, 1996
  65. Wang, M. and Hirschberg, J., 1992. "Automatic Classification of Intonational Phrase Boundaries," Computer Speech and Language, 6:175-196. PDF.
  66. Wouters, J.; Rundle, B.; Macon, M. 1999. "Authoring tools for Speech Synthesis using the SABLE Markup Standard." EUROSPEECH 99. PDF