• Jaime González Gasque

Why Speech Tech Is Key to a Well-Rounded Metaverse


Silicon Valley’s tech giants are betting big on the metaverse, but what exactly is the metaverse? Depends on who you ask, but to put it simply, the metaverse is envisioned as a next-generation internet where the physical and virtual worlds seamlessly blend to create immersive experiences. Just like the present internet is not any single thing but an agglomeration of several technologies, this futuristic metaverse is also a tapestry of several technologies and fields. The metaverse uses augmented reality, virtual reality, video games, artificial intelligence, and blockchain technology to create 3-D virtual worlds where users can play, learn, work, socialize, etc.

How’s this different from what we already have? There are several differences, but I’ll highlight a few that are relevant from a speech technology perspective:

  • Metaverse experiences are meant to be multisensory, more immersive, and ultimately more satisfying for users.

  • Users can control metaverse experiences with more intuitive interfaces, not just the default keyboard or a point-and-click device like the mouse.

  • Users waltz in and out of these persistent virtual worlds using digital avatars. Digital avatars can be more interesting than today's flat 2-D user profiles made up of pictures and text. They provide an opportunity for creative self-expression and make the experience more fun.

In all the above cases, speech technologies have an important role to play.


The Role of Speech Tech in the Metaverse

Gaming will be a big part of the metaverse, and adding voice to games has long been a quest for game developers. With integrated voice controls, the game flow is more natural. Gamers can control in-game action and characters by simply using their voices. The learning curve for new users is reduced as well, as voice controls can be more intuitive.


But developing games is already a huge and costly endeavor. Adding voice controls that work well for global audiences adds to this complexity, and voice has not become mainstream in games. However, advances in speech technology enabled by artificial intelligence make adding voice elements to games easier than ever before. For example, Facebook/Meta is enhancing the speech recognition capabilities of its Oculus virtual reality headsets. More in-game voice elements and even voice-based games in the metaverse are something to look forward to.


Digital avatars will be a key element of the metaverse, and as avatars hang out and interact with other avatars, text-based communication alone won’t suffice; there will be a need for voice communications. A range of speech technologies, including automatic speech recognition (speech-to-text), text-to-speech, and machine translation, must be deployed in the background to enable smooth voice interactions. A word of caution here: Today’s social media networks employ a variety of content moderation tools to flag abusive content or filter out content that violates the platform’s safety and harassment prevention policies. These content moderation tools are primarily for text and image content, but we’ll need similar tools for real-time conversations happening in the metaverse.
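
To make that background pipeline concrete, here is a minimal sketch in Python of how one spoken utterance might travel from one avatar to another: recognize the speech, check the transcript against a moderation filter, translate it, and synthesize it for the listener. Every name in it (relay_utterance, passes_moderation, BLOCKED_TERMS, and the fake_* stand-ins) is a hypothetical placeholder rather than any vendor's API, and the keyword filter is a toy substitute for a real moderation model.

```python
from typing import Callable, Optional

# Hypothetical stage signatures; a real system would plug in actual
# speech recognition, translation, and speech synthesis services here.
Recognizer = Callable[[bytes], str]        # audio -> transcript
Translator = Callable[[str, str], str]     # (text, target language) -> text
Synthesizer = Callable[[str], bytes]       # text -> audio

# Toy stand-in for a moderation model (assumption, for illustration only).
BLOCKED_TERMS = {"insult"}


def passes_moderation(text: str) -> bool:
    """Return True if no blocked term appears in the transcript."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return words.isdisjoint(BLOCKED_TERMS)


def relay_utterance(
    audio: bytes,
    target_lang: str,
    recognize: Recognizer,
    translate: Translator,
    synthesize: Synthesizer,
) -> Optional[bytes]:
    """Run one utterance through the pipeline; None means it was blocked."""
    transcript = recognize(audio)
    if not passes_moderation(transcript):
        return None
    translated = translate(transcript, target_lang)
    return synthesize(translated)


# Fake stages so the sketch runs without any speech services installed.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")
```

With the fake stages plugged in, relay_utterance(b"hello there", "es", fake_recognize, fake_translate, fake_synthesize) passes the utterance through, while one containing a blocked term is dropped before it reaches the listener. Running moderation on the transcript is only one possible design; a production system might also need to analyze the audio itself.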


The sale of in-game goods and items like skins used to personalize user avatars already rakes in billions of dollars each year. Crypto enthusiasts are betting that non-fungible tokens (or NFTs, digital goods whose provenance can be verified on the blockchain) will further increase the commerce around such digital goodies. A thriving market for NFT-based digital avatars may or may not take off, as some large game developers don’t seem enthused by the idea, but either way a digital avatar needs its own voice for personalization. Synthetic voice tools have matured in recent years, and users will be able to easily add custom voices to their metaverse avatars based on their preferences. For example, Nvidia offers a 3-D avatar creation and personalization tool that includes speech recognition and synthetic speech.

The first movies were silent films. They did not have synchronized recorded sound, and they had no audible dialogue. “Talkies” came much later, as technology matured and audience expectations evolved. Similarly, the adoption of speech technology in the metaverse will happen gradually as the metaverse itself matures over time.


Kashyap Kompella is CEO of rpa2ai Research, a global AI industry analyst firm, and co-author of Practical Artificial Intelligence: An Enterprise Playbook.
