Is Voice the Interface of the Metaverse?
Updated: May 27
Is voice the interface of the future, in the real world and in the Metaverse? If you attended Voice Summit 2022 earlier this month, you would certainly believe so.
Synthetic voice simulation, voice cloning, and voice-based Natural Language Processing and analysis have matured to the point where voice-driven human-computer interaction is moving beyond the lowest end of the interaction spectrum – basic IVR – into highly interactive, complex business interactions. Synthetic voice-based bots can today replace humans in a broad range of applications, all the way to real-world eCommerce transactions. The prevalence of voice-based digital assistants (Alexa, Siri, and Google, to name a few) and multi-purpose smart speakers has brought voice interfaces into the average person's home, workplace, and automobile. Advances in AI and Machine Learning are bringing conversational transactions that can accurately mimic human interactions, and even emotions, into the mainstream. And we are just beginning to scratch the surface of what is possible with a voice-based computer interface. More importantly, the Metaverse will soon change all that.
For many of my generation, especially those who grew up outside the developed world where Tandy computers were being sold at the local Radio Shack, the first exposure to computers came from the greatest science fiction show ever – Star Trek (The Original Series)*. I still remember watching my first episode of Star Trek: early in the episode, Captain Kirk addresses the computer by speaking to it! And she responds vocally too, and they have a complex conversation! For me, the expectation of computers was set then and there. Imagine my surprise when I saw a human in the real world interact with a computer for the first time (I was not allowed to touch the computer; I even had to remove my shoes outside in order to enter the computer room). I watched the human (my school teacher) type into the computer using a keyboard (not even a mouse yet). I know how Scotty felt in Star Trek IV: The Voyage Home when, having traveled back in time, he has to interact with a present-day computer, and he picks up the mouse and speaks into it, mistaking it for a microphone.
Now, over four decades later, we can actually talk to a computer like Captain Kirk did – and get the computer to respond. IBM's Watson beat human champions at Jeopardy! years ago, and its Debate Bot can hold her own (yes, her – it uses a female voice) against some of the world's best debaters. I use Siri to interact with multiple applications on my phone while driving. Multiple shopping sites let me add items to my orders just by talking to my smart speaker – no need to wash my hands while cooking to pick up a phone. Celebrities and brand ambassadors can create endorsement voiceovers without even waking up, using licensed voice cloning tech. We are taking 'make money while you sleep' to the next level.
Voice Summit 2022, one of the largest voice tech conferences on the planet, created and hosted by my good friend of over a decade Pete Erickson, had all of this tech and more on full display: synthetic and ethical voice cloning, emotion recognition by AI-powered voice bots, full conversational User Experience Design for complex business transactions, and scale – scale to do more, faster and better. The voice interface world is already here.
That being said, several serious challenges remain, and until they are properly addressed we will not see voice become the predominant interface. Most of these are partially solved or close to being solved. For some, the tech to solve them exists, but regulations and standards have not yet caught up to allow it to be properly implemented. Here are a few I mulled over as I heard speaker after speaker talk about what is available today, and the promise of tomorrow, in a voice-centric world:
Identity: Biometrics using voice are not new. Voice imprints can be recorded at high enough fidelity to accurately identify the speaker. But is a voice imprint reliable enough to be a single-factor identity mechanism? Obviously not. Is it reliable enough to be one factor in multi-factor authentication? Not really, in my opinion, for several reasons. First and foremost are, of course, deepfakes and voice cloning. If celebrity voices can be cloned under license for endorsements, they can certainly be cloned unethically. Gathering voice samples of regular people is not difficult, and once you have enough samples, entire sentences can be cloned with commercially available software. Even if one figures out a way to detect whether a voice is synthetic or real – this technology exists, but it is being beaten as cloning tech improves – the challenge remains that the human voice is subject to too much variance. Our voice can change due to illness, or even stress. Try logging in to your airline app while running late for a flight and yelling into it! Will the system recognize 'late for a flight' me from the 'I just had a cup of relaxing chamomile tea before I recorded this voice sample' me? And the human voice changes with age too. An extreme example shared by a speaker at the conference showed how former President Obama's voice changed as he (rapidly) aged during the very first years of his presidency, actually causing voice recognition to fail!
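The variance problem can be made concrete with a toy speaker-verification check. This is a minimal, hypothetical sketch, not any vendor's actual system: real systems compare embedding vectors produced by a trained speaker-encoder model against an enrolled voiceprint; here random vectors stand in for those embeddings, and the threshold is purely illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voice-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(enrolled: np.ndarray, sample: np.ndarray,
                   threshold: float = 0.75) -> bool:
    """Accept the voice factor only if the new sample's embedding is
    close enough to the enrolled voiceprint. The threshold is an
    illustrative assumption, not a standard value."""
    return cosine_similarity(enrolled, sample) >= threshold

# Toy embeddings standing in for the output of a real speaker-encoder model.
rng = np.random.default_rng(42)
enrolled = rng.normal(size=256)
# Small drift: same speaker on a calm day.
same_speaker = enrolled + rng.normal(scale=0.2, size=256)
# Large drift: illness, stress, or yelling while late for a flight.
stressed_speaker = enrolled + rng.normal(scale=2.0, size=256)

print(verify_speaker(enrolled, same_speaker))
print(verify_speaker(enrolled, stressed_speaker))
```

The same mechanism that correctly accepts the calm sample rejects the stressed one, which is exactly why a single similarity threshold is a weak authentication factor on its own.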
Data Loss Prevention: 'Please speak or type your account #' is a phrase we have been hearing on IVR systems for years. This works if I am in a secure environment where no one can hear me, or if I am on a device with a keyboard where I can type in my account #. But what happens if I am interacting with a smart speaker or a kiosk with no physical input device? Do I speak my # out loud for the person behind me to hear, and maybe even record? What happens when the voice-interfaced program wants to share information with me that I do not want anyone else around me to hear – transactions, balances, medical record information? How do we keep PII or PHI data from leaking into the environment around us as we interact, when others can inadvertently hear both us and the voice agent? There is also the issue of Data Loss Prevention for a conversation that has already occurred. I may converse from a secure environment and share freely with the voice agent, and the voice agent with me. But how is that conversation stored and secured? Is the sensitive data obfuscated from the recording as it is stored? Does the obfuscation happen at the Edge, ensuring that my sensitive information never leaves the point of interaction, or does it happen at the server end? Is the metadata from the conversation being used to train the voice AI? And if so, what steps are being taken to prevent reverse engineering the metadata to recover my identity? Who is my conversation recording shared with, and how?
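Edge-side obfuscation of a transcript might look something like the following sketch. It is a deliberately simplified assumption of how such a step could work: the pattern names and regexes are hypothetical, and a production system would use a trained PII/PHI detector rather than regexes alone, but it shows the principle of redacting sensitive tokens before anything leaves the point of interaction.

```python
import re

# Hypothetical patterns for illustration only; real PII detection is far
# more involved than two regexes.
PII_PATTERNS = {
    "ACCOUNT": re.compile(r"\b\d{8,12}\b"),           # bare account numbers
    "SSN":     re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN format
}

def redact_transcript(text: str) -> str:
    """Obfuscate sensitive tokens on the edge device, so only the
    redacted transcript is ever stored or sent to the server."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

raw = "My account number is 4417123456 and my SSN is 078-05-1120."
print(redact_transcript(raw))
# My account number is [ACCOUNT REDACTED] and my SSN is [SSN REDACTED].
```

Whether this runs at the edge or server-side, and whether the raw audio is retained alongside the redacted text, are exactly the policy questions raised above.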
Regulations and Standards: Who owns voice data? Is a recording of my interaction with a voice agent considered my data, whose storage, use, and sharing I control? Where does it get stored, and how is it secured? Does it get the same regulatory treatment as other sensitive information? Under some regulatory regimes, like GDPR, the answer is yes: voice is personal data and gets the same treatment as any other data of mine, including the 'right to be forgotten'. But not all regulatory regimes are created equal, as we well know. Voice recordings made in the United States, which has no data residency requirements, can live on an offshore server and be subject to that country's regulations and laws (note: this is not unique to voice in the United States; it applies to all personal data). Furthermore, regulations and standards need to be developed to define how voice can be used. What are the minimum requirements for using a biometric voice imprint as even one factor in multi-factor authentication (MFA)? What are the penalties for voice cloning without permission? What are the standards for adding 'watermarks' to ethically cloned voice? Can an NFT be made of a voice sample – a VFT?
Ethics: And finally we come to ethics. What are the ethics of using cloned or synthetic voice when interacting with clients or users? As voice interfaces and the AI behind them get better at interaction – especially at detecting and replicating emotions, coming ever closer to passing the Turing test – do we need to disclose that the voice one is interacting with is not a real human? Does a celebrity endorsing a product using a licensed voice clone need to disclose which products she endorsed by actually showing up to the studio, and which were cloned without her even being on the same continent? What about purely synthetic voices – who gets the Grammy for a song sung by a synth? The algorithm creator, the developers, or the 'producer' who thought it up?
The Metaverse is coming
That brings me to the #1 takeaway I walked away with from Voice Summit 2022: the Metaverse is coming. Say what one may of Mark Zuckerberg's recent pitch of the Metaverse and of Facebook's parent company changing its name to Meta, the Metaverse is coming, and its interface will be VR/AR goggles and, you guessed it, voice! Once inside the Metaverse, whether in full VR à la 'Ready Player One' or in a hybrid Metaverse using an AR interface, as I interact with the virtual world around me, the last thing I am likely to do is walk up to a keyboard and type. The whole idea is to break free from the console, the desktop/laptop, and handheld devices – to use all the senses and move freely in the meta-universe. The primary interaction will hence be via voice. Interactions with other citizens (not just players) of the meta-universe, and with the agents and bots in this virtual world, will have to be via voice. We need to be prepared for this: from consumers who are leery of interacting via voice, to companies that want to leverage the Metaverse to engage with customers virtually. Virtual assets, virtual identities, and virtual currencies all exist today. Fiat money is traded every day to acquire and trade virtual assets – from avatars to crypto. These will be the norm in the Metaverse and will require all companies and brands to establish their virtual and voice-based identities. Who will be the voice of your brand in the Metaverse – a celebrity or a synth? This is not unlike developing a style guide for your brand or company, as you do in the physical world today, and it will be imperative for establishing an identity in the Metaverse. And the race to land-grab presence in the Metaverse has already begun.
by Sanjeev Sharma