A key problem with text-to-speech (TTS) services like Amazon Polly is the foreign accent with which they render certain words. One such word is Cebu, which these systems often pronounce as “say bu.”
TTS services have improved dramatically, from the robotic voices of years past to something much closer to human speech. The voices are now good enough to use for voice-overs in news reports and marketing materials. A workflow highlighted by Amazon Polly, for example, suggests automating the rendering of articles into audio clips via RSS feeds.
The standard way TTS renders text to speech is concatenation: stitching together small units of recorded speech to produce words and then sentences. Services like Amazon Polly, however, now offer a neural option that produces even higher quality voices using AI and machine learning.
You can then tweak the voices further using Speech Synthesis Markup Language, or SSML.
Marlen and I did a demo of SSML and Amazon Polly for her class on New Media Platforms and Literacies with the College of Communication, Art and Design (CCAD) in UP Cebu last week and the students were amazed at the ability to program the voice output.
Marlen and I have handled several learning sessions through our New Media Bootcamp, where journalists and communications students and teachers learn technical skills such as coding. In our experience, the best approach with non-techie students, especially writers, is to play up the “markup” character of these languages. Languages like HTML and SSML are, at their core, about marking up documents and text.
SSML with Amazon Polly is literally just marking words up to tell Polly how to read them.
The code used to generate the voice in the video above:
<speak>
<prosody rate="medium">
How do you make AI text to speech services like Amazon Polly say <phoneme alphabet="ipa" ph="seˈbu">Cebu</phoneme> instead of Cebu?
You do it by using SSML or speech synthesis markup language and specifying that it reads <phoneme alphabet="ipa" ph="seˈbu">Cebu</phoneme> using a phonetic pronunciation such as <sub alias="I P A">IPA</sub> or International Phonetic Alphabet.
</prosody>
</speak>
To tell Polly to use the IPA pronunciation of a word like Cebu, use the tag <phoneme alphabet="ipa" ph="seˈbu">Cebu</phoneme>. It tells Polly to pronounce the enclosed word “Cebu” using the IPA transcription “seˈbu.” The other alphabet that Polly supports is x-sampa, for the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA).
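If you have many words to mark up, a small helper can write the tag for you. The following is only a sketch in Python, not the way the demo was done: the function name `phoneme_tag` and the `SUPPORTED_ALPHABETS` set are made up for illustration, and it relies only on Python's standard library.

```python
from xml.sax.saxutils import escape, quoteattr

# The two phonetic alphabets mentioned above for Polly.
SUPPORTED_ALPHABETS = {"ipa", "x-sampa"}


def phoneme_tag(word, ph, alphabet="ipa"):
    """Return an SSML <phoneme> element that pronounces `word` as `ph`.

    Hypothetical helper for illustration; it only builds a string.
    """
    if alphabet not in SUPPORTED_ALPHABETS:
        raise ValueError(f"unsupported alphabet: {alphabet!r}")
    # quoteattr() adds the surrounding straight quotes and escapes the value;
    # escape() protects &, < and > in the visible word.
    return f'<phoneme alphabet="{alphabet}" ph={quoteattr(ph)}>{escape(word)}</phoneme>'


print(phoneme_tag("Cebu", "seˈbu"))
# prints: <phoneme alphabet="ipa" ph="seˈbu">Cebu</phoneme>
```

Note that the attribute values use straight quotes. Curly quotes pasted from a word processor are not valid in markup.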
The other tag useful for journalists is the one that activates the Newscaster speaking style. It works only with neural voices, and only with certain ones: Matthew or Joanna for American English, Amy for British English, and Lupe for US Spanish.
The tag to use for the Newscaster style is <amazon:domain name="news">News text here</amazon:domain>.
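For those who would rather script this than paste into a console, here is a rough Python sketch that wraps text in the Newscaster tag and collects the settings that go with it. The function name `newscaster_request` is hypothetical, and the voice list is simply the one given above. The dictionary keys are meant to match the parameter names of the `synthesize_speech` call in boto3, the AWS SDK for Python, which we did not use in class.

```python
from xml.sax.saxutils import escape

# Voices listed above as supporting the Newscaster style.
NEWSCASTER_VOICES = {"Matthew", "Joanna", "Amy", "Lupe"}


def newscaster_request(text, voice="Joanna"):
    """Build the settings for a Newscaster-style synthesis request.

    Hypothetical helper: it only prepares a dict and calls no AWS service.
    """
    if voice not in NEWSCASTER_VOICES:
        raise ValueError(f"{voice!r} is not in the Newscaster voice list")
    ssml = f'<speak><amazon:domain name="news">{escape(text)}</amazon:domain></speak>'
    return {
        "Engine": "neural",    # Newscaster style needs the neural engine
        "VoiceId": voice,
        "TextType": "ssml",    # the text contains SSML tags, not plain text
        "Text": ssml,
        "OutputFormat": "mp3",
    }


request = newscaster_request("News text here", voice="Matthew")
print(request["Text"])
```

Assuming boto3 is installed and AWS credentials are configured, passing the result along as `boto3.client("polly").synthesize_speech(**request)` should return a response whose `AudioStream` holds the MP3 audio.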
The code used to generate the audio report above:
<speak>
<amazon:domain name="news">
This news report is from the Monday, October 17th, 2022 edition of SunStar <phoneme alphabet="ipa" ph="seˈbu">Cebu</phoneme>.
<p>A 19-YEAR-OLD nursing student died after he reportedly jumped from the 10th floor of a university building in <phoneme alphabet="ipa" ph="mændawi">Mandaue</phoneme> City on Monday morning, October 17, 2022.</p>
<p>Police Major John Libres, chief of the Opao Police Station of Mandaue City Police Office, said the victim (name withheld) was a first year student of the University of <phoneme alphabet="ipa" ph="seˈbu">Cebu</phoneme> <phoneme alphabet="ipa" ph="lapuˈlapu">Lapu-Lapu</phoneme> <phoneme alphabet="ipa" ph="mændawi">Mandaue</phoneme>.</p>
<p>Libres said investigation is ongoing and that they have yet to confirm whether or not the victim committed suicide.</p>
<p>Based on the closed circuit television or CCTV footage obtained by the police at the ninth floor, the victim reportedly went to the 10th floor where he allegedly jumped off the building around 9 a.m.</p>
</amazon:domain>
</speak>
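Marking up a full report by hand gets tedious, so the same structure could be generated from plain paragraphs. The sketch below is one possible approach in Python, not what we did in class. The names `PRONUNCIATIONS`, `mark_up_paragraph` and `build_news_ssml` are invented for illustration, and the IPA transcriptions are copied from the markup above. Unlike hand-written markup, it tags every occurrence of a listed name.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# IPA transcriptions copied from the report markup above.
PRONUNCIATIONS = {
    "Cebu": "seˈbu",
    "Mandaue": "mændawi",
    "Lapu-Lapu": "lapuˈlapu",
}


def mark_up_paragraph(text, pronunciations):
    """Escape `text` and wrap each listed name in a <phoneme> tag."""
    if not pronunciations:
        return escape(text)
    # Try longer names first so they win over shorter names they contain.
    names = sorted(pronunciations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    # With one capturing group, split() alternates plain text and matched names.
    pieces = []
    for index, part in enumerate(pattern.split(text)):
        if index % 2:  # odd positions hold the matched names
            ipa = quoteattr(pronunciations[part])
            pieces.append(f'<phoneme alphabet="ipa" ph={ipa}>{escape(part)}</phoneme>')
        else:          # even positions hold ordinary text
            pieces.append(escape(part))
    return "".join(pieces)


def build_news_ssml(intro, paragraphs, pronunciations=PRONUNCIATIONS):
    """Assemble a Newscaster-style SSML document from plain-text paragraphs."""
    body = mark_up_paragraph(intro, pronunciations)
    body += "".join(
        f"<p>{mark_up_paragraph(p, pronunciations)}</p>" for p in paragraphs
    )
    return f'<speak><amazon:domain name="news">{body}</amazon:domain></speak>'


print(build_news_ssml("From SunStar Cebu.", ["A report from Mandaue City."]))
```

Keeping the pronunciations in one table means a newsroom only has to work out each transcription once.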
It was refreshing to watch the communications students programming Polly and producing a news report. Quite a few were able to do so on the spot.
An oft-repeated cliche is that journalism or communication students pick their course because of an aversion to math and anything technical. It is a cliche that should be set aside. Tech is at the core of any task or job nowadays, more so in depleted newsrooms such as those we’re seeing in community media.
There are fewer reporters and editors in today’s newsrooms, and yet the demand for new media products is higher than ever. With TTS and SSML, journalists have the option to use services like Amazon Polly for their audio and video news products instead of spending even more time recording voice-overs.