5 Common Challenges for Agencies and Enterprises

Shaun Withers




As head of partnerships at Jargon, I speak with many teams at large brands and digital agencies that are pushing the envelope with the multimodal experiences they build on Alexa. They are bringing use cases that go well beyond the simple commands and FAQ voice interactions that are popular today. Their goal is to design products that put the user in control and take full advantage of the unique benefits of voice-first interfaces. To accomplish this, their voice design and development teams need sophisticated, purpose-built tools that can handle these advanced use cases.

Beyond building a robust interaction model that understands the user’s intents, product teams must consider the quality of the content that’s being delivered to the user. Content (the response given back to the user) can come in the form of any combination of text-to-speech, audio recording, and visuals. With the multitude of devices and user environments available, delivering relevant, efficient, and fresh content is no easy task.

As a quick fix, many teams resort to spreadsheets or other general-purpose tools to manage multimodal content. This creates a number of challenges throughout the development lifecycle. While there are multiple approaches to managing multimodal content effectively, one is to use purpose-built tools that are designed for the medium and allow for flexibility.

Common Multimodal Content Challenges

Multimodal content is very different from traditional web and mobile content. Responses need to address the user’s specific situation and environment. Because of this, the logic to assemble the right response becomes complicated very quickly. Here are a few common multimodal content challenges:

  1. Managing a large number of responses
    Depending on the use case, the number of responses associated with each conversational state can become quite large. Oftentimes, it’s helpful to add variety to each response so the experience sounds fresh and natural. For example, a simple greeting - if identical in every session - can be off-putting. But with just a bit of variety, the greeting can delight the user. Authoring, editing, and using these responses become an overwhelming challenge without the right structure.
  2. Dynamically assembling the right response
    Each response typically includes a few components such as a speech component, re-prompt, and a visual. Depending on the user’s request and the environment they’re in, a different assembly of these components might be necessary. If I ask for my flight details while driving to the airport, I want to hear the critical information about my flight while I keep my eyes on the road. If I’m waiting in line at security and I want to upgrade my seat, I want to see a visual of the seats available before I confirm. Stitching together these different response components requires repetitive, nontrivial development work.
  3. Handing off content
    After designing conversation flows using voice design tools like Skill Flow Builder, Voiceflow, and Botmock, voice designers need a mechanism for handing off content to developers and iterating on the content after the hand-off takes place. Other stakeholders, such as marketing, legal, upper management, and clients, may also want to edit or approve the content. With this many collaborators, version control and error prevention become essential.
  4. Maintaining consistency
    Lots of voice content is reusable, like certain names and entities, brand messaging, common phrases, and prompts. If this reusable content is scattered throughout hundreds of responses, maintaining consistency becomes a tedious task. Having this reusable content centrally managed where a single change can replace all references is a significant time saver.
  5. Achieving global scale
    If localization isn’t a consideration from the beginning and factored into the structure of the content, it becomes very challenging to update and maintain content across multiple languages while sharing the same app logic. Language complexities such as pluralization, gender, and cultural/local references become complicated content puzzles. Starting with solid development and content best practices pays large dividends over the life of the voice experience.
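
To make the first two challenges concrete, here is a minimal Python sketch of adding variety to a greeting and assembling a response for the user's situation. Everything in it is a hypothetical illustration rather than part of the Alexa Skills Kit or Jargon: the names `GREETING_VARIANTS`, `Response`, and `assemble_flight_response`, the `has_screen` and `hands_free` flags, and the placeholder flight details.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Hypothetical variants for one conversational state; the wording is illustrative.
GREETING_VARIANTS = [
    "Welcome back!",
    "Good to hear from you again.",
    "Hi there, what can I do for you?",
]


@dataclass
class Response:
    speech: str
    reprompt: str
    visual: Optional[str] = None  # e.g. the name of a seat-map template


def pick_variant(variants, rng=random):
    """Choose one variant at random so repeated sessions do not sound identical."""
    return rng.choice(variants)


def assemble_flight_response(has_screen: bool, hands_free: bool, rng=random) -> Response:
    """Assemble speech, re-prompt, and an optional visual for the user's situation."""
    greeting = pick_variant(GREETING_VARIANTS, rng)
    if hands_free or not has_screen:
        # Driving or no screen: speak the critical details and send no visual.
        # The departure time and gate are placeholder data.
        return Response(
            speech=f"{greeting} Your flight departs at 5:40 PM from gate B12.",
            reprompt="Would you like to hear that again?",
        )
    # A screen is available and the user can look at it: keep speech short, show seats.
    return Response(
        speech=f"{greeting} Here are the seats available for an upgrade.",
        reprompt="Which seat would you like?",
        visual="seat_map",
    )
```

Even this toy version has two branches for a single intent, which hints at how quickly hand-written assembly logic grows once more devices, states, and variants are involved.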

Effective Multimodal Content Management

The most effective way to manage multimodal content is to separate the content logic - how response components are assembled - from the core business logic of the application. This involves building a library of potential responses in a content management system (CMS) that can be pulled into the application when triggered by an intent. Using a CMS for multimodal content allows teams to focus on the quality of the content while the CMS does the heavy lifting of stitching together the different components. Here are some of the benefits of using a CMS built for voice:

  1. Collaboration and version control
    Managing content in a single hub gives the team a single source of truth. Non-technical team members can update content easily without the risk of inadvertently breaking the application. Content validation and version control provide built-in safeguards for the content. Approvals from other stakeholders can be managed within the platform to ensure the content is ready for public eyes and ears.
  2. Assembly of multimodal assets
    Video, audio, images, and card content can be essential components of voice experiences and should be managed as such. A voice CMS provides a place where assets are managed and can be referenced by multiple responses. For example, if a new video is available to replace an existing video, it’s ideal to have a single location in the CMS to swap it out, reflecting the change across the application, wherever it is referenced.
  3. Voice editing tools and simulation
    There are several elements that can be added to voice responses to enrich the sound. With easy-to-use tooling in a CMS for voice, audio clips and dynamic variables can be inserted into text-to-speech content and marked up with SSML tags to direct the pronunciation of the synthetic voice. A simulator can play back the response for an immediate quality check.
  4. Multilingual support
    The ICU MessageFormat syntax should be used to handle pluralization and gender for variables. It’s also best to keep track of currencies and other locale-specific metrics, and to take note of local or cultural references that don’t easily apply across geographies. If the content is structured correctly, the same application logic can be used across multiple languages managed in the CMS.
  5. Live updates
    To speed up iteration cycles, content changes that do not impact the core business logic can be released directly to the application without the need to involve the development team. This is commonplace for web and mobile content and essential to keep the voice experience updated and relevant.
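
As a rough illustration of keeping content separate from application logic, the sketch below stores responses in a locale-keyed library, reuses a shared brand name, picks a plural form, and wraps the result in SSML. It is a simplified stand-in, not Jargon's API and not a real ICU implementation: `CONTENT`, `plural_category`, `render`, and the brand "Acme Air" are all hypothetical. In ICU MessageFormat, the English entry would be written as a single message such as `{count, plural, one {# seat left} other {# seats left}}`.

```python
# Hypothetical content library, keyed by locale and then by response name.
# In practice this would live in a CMS rather than in application code.
CONTENT = {
    "en-US": {
        "brand": "Acme Air",  # shared snippet, defined once per locale
        "seats_left": {
            "one": "{brand} has {count} seat left.",
            "other": "{brand} has {count} seats left.",
        },
    },
    "de-DE": {
        "brand": "Acme Air",
        "seats_left": {
            "one": "{brand} hat noch {count} Platz frei.",
            "other": "{brand} hat noch {count} Plätze frei.",
        },
    },
}


def plural_category(count: int) -> str:
    """Simplified rule that fits English and German integers: 1 is 'one', else 'other'.

    Real ICU plural rules differ by language and include more categories.
    """
    return "one" if count == 1 else "other"


def render(locale: str, key: str, count: int) -> str:
    """Look up a response by locale and key, then fill in shared and dynamic values."""
    entry = CONTENT[locale]
    template = entry[key][plural_category(count)]
    text = template.format(brand=entry["brand"], count=count)
    # Wrap in SSML and add a short pause after the sentence.
    return f'<speak>{text}<break time="300ms"/></speak>'
```

The application code only calls `render("de-DE", "seats_left", 2)`; the wording, the brand name, and the plural forms can all change in the content library without touching that call.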

We’ve observed that large voice design and development teams choose a best-of-breed approach to the tools they use. Some of Jargon’s customers use voice design tools such as Skill Flow Builder, Voiceflow, and Botmock to map out the flow of their voice experience. They then author, maintain, and deliver the content through Jargon.

Click here to get started today on Jargon.

This post was originally published by Shaun Withers on the Alexa Skills Kit Blog.






About Jargon

Jargon empowers teams to author, deliver, and optimize voice experiences. Join our mailing list to stay current with Jargon news and product updates!
