Erudite talking robots
- Rockets and Robots

- Sep 19, 2019
Updated: Sep 20, 2019
At work, we were using AWS’s Polly service to create text-to-speech audio files for our training programs. One of the inputs to a Polly API call is the text string that you want spoken. While the early tests with simple strings were fine as far as modern robot voices go, I hate “Hello World” tests and was looking for something more interesting to listen to as I experimented with the different voices available. Well, my robot friend, Ladvien (a real person with whom I build robots, not an actual robot), happens to be working on a neural network that will write in the style of H.P. Lovecraft. Wonderful! A Lovecraft passage is much more interesting to listen to than Hello World. So I grabbed a random Lovecraft passage to give to Polly.
“At this horror I sank nearly to the lichened earth, transfixed with a dread not of this nor any world, but only of the mad spaces between the stars. Out of the unimaginable blackness beyond the gangrenous glare of that cold flame, out of the Tartarean leagues through which that oily river rolled uncanny, unheard, and unsuspected, there flopped rhythmically a horde of tame, trained, hybrid winged things that no sound eye could ever wholly grasp, or sound brain ever wholly remember. They were not altogether crows, nor moles, nor buzzards, nor ants, nor vampire bats, nor decomposed human beings; but something I cannot and must not recall.”
The results left a lot to be desired.
Educational, testing data (i.e., fair use)
I was using the standard Polly engine (default) and a basic text string as input parameters. I could have tried to spruce things up with SSML (Speech Synthesis Markup Language), which I’ve used in a separate project to control how Alexa pronounced things, but that wasn’t a viable solution. The problems here were too great to make SSML alone a valid option.
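For reference, here is a minimal sketch of what that kind of request can look like from Python. The post doesn't say which SDK was used, so boto3 is an assumption here, and the helper names `build_standard_request` and `synthesize_to_file` are made up for illustration. The parameter names (`Engine`, `OutputFormat`, `Text`, `TextType`, `VoiceId`) are the ones Polly's SynthesizeSpeech call accepts.

```python
from typing import Any, Dict


def build_standard_request(text: str, voice_id: str = "Joanna") -> Dict[str, Any]:
    """Build keyword arguments for a plain-text, standard-engine Polly request."""
    return {
        "Engine": "standard",     # the default engine described in the post
        "OutputFormat": "mp3",
        "Text": text,
        "TextType": "text",       # plain string, no SSML markup
        "VoiceId": voice_id,
    }


def synthesize_to_file(client: Any, request: Dict[str, Any], path: str) -> int:
    """Send the request through a Polly client and save the audio.

    `client` is assumed to be a boto3 Polly client, i.e. boto3.client("polly").
    Returns the number of audio bytes written.
    """
    response = client.synthesize_speech(**request)
    audio = response["AudioStream"].read()
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

With boto3 installed and AWS credentials configured, a call along the lines of `synthesize_to_file(boto3.client("polly"), build_standard_request(passage), "lovecraft_standard.mp3")` would produce the audio file, where `passage` holds the Lovecraft text.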
Polly has a newer feature that uses a neural network to generate speech (NTTS, neural text-to-speech), including a fairly simple SSML wrapper to output text as a newscaster. New features mean updating SDKs, which means I had time to think…
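As a rough illustration of how simple that wrapper is, the sketch below builds a neural request with and without the newscaster style. The `amazon:domain name="news"` tag is Polly's SSML for the newscaster style; the function names are hypothetical, and I'm assuming the passage is plain text that needs XML escaping before it goes inside the markup.

```python
from typing import Any, Dict
from xml.sax.saxutils import escape


def newscaster_ssml(text: str) -> str:
    """Wrap plain text in the SSML tags for Polly's newscaster speaking style."""
    # escape() replaces &, < and > so the passage cannot break the markup.
    return '<speak><amazon:domain name="news">' + escape(text) + "</amazon:domain></speak>"


def build_neural_request(
    text: str, voice_id: str = "Joanna", newscaster: bool = False
) -> Dict[str, Any]:
    """Build keyword arguments for a neural-engine Polly request."""
    request: Dict[str, Any] = {
        "Engine": "neural",
        "OutputFormat": "mp3",
        "VoiceId": voice_id,
    }
    if newscaster:
        request["Text"] = newscaster_ssml(text)
        request["TextType"] = "ssml"
    else:
        request["Text"] = text
        request["TextType"] = "text"
    return request
```

The resulting dictionary can be passed to a boto3 Polly client as `client.synthesize_speech(**request)`. Not every voice supports the neural engine or the newscaster style, so check the voice list for your region before relying on a particular `voice_id`.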
Important note:
At the time of this writing, neural Polly costs four times as much as standard Polly ($16 per million characters versus $4 per million characters). If you’re doing short, occasional passages, you probably won’t notice the difference between a $0.01 charge and a $0.04 charge. If you’re doing a lot of TTS (the 1M characters mentioned above is about 23 hours of audio), those differences are going to add up. It might be worth it, though: do your customers really want to listen to a crappy robot voice?
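To put numbers on that, here is a small back-of-the-envelope calculator. The two rates and the 23-hours-per-million-characters figure come from the note above; the function names and the 2,500-character example length are assumptions chosen only for illustration.

```python
STANDARD_USD_PER_MILLION = 4.00   # standard engine rate quoted above
NEURAL_USD_PER_MILLION = 16.00    # neural engine rate quoted above
HOURS_PER_MILLION_CHARS = 23.0    # rough audio length quoted above


def polly_cost_usd(characters: int, usd_per_million: float) -> float:
    """Cost in dollars for synthesizing the given number of characters."""
    return characters / 1_000_000 * usd_per_million


def estimated_audio_hours(characters: int) -> float:
    """Very rough audio duration, scaled from the 23-hour figure."""
    return characters / 1_000_000 * HOURS_PER_MILLION_CHARS


if __name__ == "__main__":
    for chars in (2_500, 1_000_000, 50_000_000):
        standard = polly_cost_usd(chars, STANDARD_USD_PER_MILLION)
        neural = polly_cost_usd(chars, NEURAL_USD_PER_MILLION)
        hours = estimated_audio_hours(chars)
        print(f"{chars:>11,} chars  ~{hours:8.2f} h  "
              f"standard ${standard:,.2f}  neural ${neural:,.2f}")
```

A 2,500-character passage works out to about a cent on the standard engine and about four cents on the neural one, while 50 million characters is $200 versus $800.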
One of the characteristics of Lovecraft’s writing style, particularly to the modern reader, is the use of words and phrases that are as alien to our ears as the creatures in his tales. It’s much easier for a neural network to do something properly when it has a good training set to work from. Pass in a cliché and the bot should do a pretty good job of parroting that phrase, much like people do. What happens, though, when an author uses words, or invents words, that the neural net has no training data on?
Another project Ladvien and I are working on is a Lego-sorting bot (or bots). One of the main parts of that is robot vision to accurately identify pieces for sorting. We were having trouble with training data, so I built a separate machine to scan Lego parts for use as training data. What would the equivalent of that be for an NTTS reading Lovecraft, or other passages whose prose has largely been lost to time?
Crowdsourcing? Can we get erudite, cunning linguists to read ancient passages in their best voices to provide training data to robots? Would they be willing to do so, or are they likely to be luddites? There were many who pushed back against e-readers, nostalgic for the texture and aroma of the printed page. I’m sure we can find a value proposition to get them onboard.
I wonder if anyone has worked on a nerd cologne/perfume that smells of printed books?
The SDK is done updating. Let’s see how the NTTS handles Lovecraft.
Interesting... the neural version doesn't sound too bad to me. Sure, it could be improved, but for out-of-the-box, it's good enough. Newscaster - which is supposed to be a more advanced NTTS engine - sounds confused, as if it isn't certain whether the now uncommon words are being pronounced correctly (lichened?). It reminds me of reading with my kids, them looking at me quizzically as they guess at a pronunciation.
I'm not certain about the details of the training data they used for the newscaster version, but let's think about it for a moment. Newscasters read off of teleprompters. The writing tends to be aimed at the general populace, so they aren't likely to pronounce arcane words. When the story content leaves the writers no choice but to include such a word, the newscasters may in fact struggle when they come upon it.
While it might be entertaining to watch newscaster blooper reels to illustrate the point, Key & Peele will do.
Language warning
Back to business: Polly has 8 American English voices. Joanna is the default for a reason - the others sound a bit antiquated. They do, however, have Ivy and Justin, which sound like children. So let's have some fun with those and have them read Lovecraft, or Stephen King, to us.
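If you want to try this yourself, a sketch like the one below prepares one request per voice so the same passage can be rendered by each. `Ivy` and `Justin` are real Polly voice IDs; the helper name, the output filenames, and the choice of the standard engine are assumptions for illustration.

```python
from typing import Any, Dict, Iterable


def build_voice_requests(
    text: str,
    voice_ids: Iterable[str] = ("Ivy", "Justin"),
    engine: str = "standard",
) -> Dict[str, Dict[str, Any]]:
    """Map an output filename to the Polly request for each voice."""
    requests: Dict[str, Dict[str, Any]] = {}
    for voice_id in voice_ids:
        filename = f"lovecraft_{voice_id.lower()}.mp3"  # hypothetical naming scheme
        requests[filename] = {
            "Engine": engine,
            "OutputFormat": "mp3",
            "Text": text,
            "TextType": "text",
            "VoiceId": voice_id,
        }
    return requests
```

Each request can then be sent with a boto3 Polly client via `client.synthesize_speech(**request)`, writing `response["AudioStream"].read()` to the matching filename.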

