An improved approach to choose NLU tools. • Nieves Ábalos

🇪🇸 Spanish version of this post: “Un enfoque mejorado para elegir herramientas NLU”.

To develop conversational interfaces (chatbots, virtual assistants…), we need to understand what our users say or write so that we can answer them accordingly. That is why NLU (natural language understanding) tools such as api.ai and wit.ai are commonly used in these interfaces.

As part of our research at BEEVA Labs, we’ve developed an improved benchmark to help us decide which tool fits best when developing conversational interfaces, because other benchmarks didn’t help us make that decision (especially for Spanish).

The first version of our NLU benchmark analyzes high-level features of api.ai and wit.ai, for both English and Spanish. We have included Spanish because it’s difficult to find a benchmark that evaluates NLU tools for that language. This is really useful if you need to choose an NLU tool for Spanish, as some functionality differs depending on the language.

In our backlog is a second version that includes metrics such as the number of training samples, precision, recall and F1 score. There is already work in this field (although not for Spanish) by Intento and Snips (read this post).

Our benchmark explained

Based on our experience, we have divided a set of high-level characteristics into the following areas:

basics (such as pricing, whether it’s open-source, language support, etc.),
transactional vs conversational degree,
intents (the ability to recognize what the user wants),
entities (the ability to recognize keywords in the utterance),
dialogue management (DM),
response generation (RG),
automatic speech recognition (ASR),
text-to-speech (TTS), and
external services integration.

An interface is transactional if it has the ability to perform tasks or services for somebody. It’s goal-oriented, and its users use it to make requests. Two example sentences for ordering a pizza, both with a high transactional degree: “/order favorite-pizza” (command-line style) and “Alexa ask Domino’s to order my favorite pizza”.

An interface is conversational when its purpose is to create a natural dialog with users, an informal exchange of information and thoughts through words. It should provide functionality for creating these conversations, such as smalltalk (hello, thank you, bye, personal details and hobbies…), context management (is the answer “yes” related to the previous sentence?), or memory (if the user has already said where they live and where they want to travel, don’t keep asking where they live).

The complete list of characteristics is explained at the end of this post.

Results

You can find the results of this benchmark in this GitHub repository (last updated on 25/08/2017) or in the following image.

Results of our benchmark (updated 25/08/2017)

Conclusions

For developing a conversational interface, api.ai offers more conversational features (including response generation and dialogue management) than wit.ai, which has been basically transactional since the deprecation of its Bot Engine and now focuses its development on a comprehensive understanding of entities and intents.

As for understanding Spanish, api.ai’s behavior (in beta) differs slightly from its behavior in English: the number of built-in intents is different and there is no smalltalk support by default, unlike wit.ai.

Also, regarding the channels where users can interact with your conversational interface, api.ai is supported by the Google ecosystem (for example, Actions on Google), and wit.ai focuses its support on Facebook Messenger. That doesn’t mean you can’t use other tools (such as Botkit) to support any channel and make requests via API to wit.ai or api.ai.
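As a rough illustration of making a request via API, the sketch below builds (without sending) a request to wit.ai’s message endpoint using only the Python standard library. The token, the example utterance and the version date are placeholder assumptions, and the helper name `build_wit_request` is ours; check the current wit.ai HTTP API documentation before relying on any of these details.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint for analyzing a single utterance with wit.ai.
WIT_MESSAGE_URL = "https://api.wit.ai/message"


def build_wit_request(token: str, utterance: str, version: str = "20170825") -> Request:
    """Build (but do not send) a GET request asking wit.ai to analyze an utterance."""
    # "v" pins the API version by date; "q" carries the user's text.
    query = urlencode({"v": version, "q": utterance})
    return Request(
        f"{WIT_MESSAGE_URL}?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )


# Placeholder token: replace it with the access token of your own wit.ai app.
request = build_wit_request("MY_SERVER_ACCESS_TOKEN", "order my favorite pizza")
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it; the JSON response is where the recognized intents and entities would be found.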

Future work

The second part of this benchmark will include metrics such as the number of training samples, precision, recall and F1 score, and we’ll publish a second post with it (stay tuned!).
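To make those metrics concrete, here is a minimal sketch of how precision, recall and F1 score could be computed for one intent from a list of expected and predicted labels. The intent names and the tiny test set are invented for illustration; they are not results from our benchmark.

```python
from typing import List, Tuple


def precision_recall_f1(expected: List[str], predicted: List[str], intent: str) -> Tuple[float, float, float]:
    """Score one intent by comparing expected and predicted labels pairwise."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e == intent and p == intent)  # true positives
    fp = sum(1 for e, p in pairs if e != intent and p == intent)  # false positives
    fn = sum(1 for e, p in pairs if e == intent and p != intent)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: what each utterance should be vs. what a tool returned.
expected = ["order_pizza", "order_pizza", "greeting", "order_pizza"]
predicted = ["order_pizza", "greeting", "greeting", "order_pizza"]
precision, recall, f1 = precision_recall_f1(expected, predicted, "order_pizza")
print(precision, recall, f1)
```

In this made-up example the tool never predicts `order_pizza` wrongly but misses one of the three pizza orders, so precision is perfect while recall is two thirds.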

Our benchmark explained: the complete list

1. Basics

Free / Pricing: Is it free to use? Pricing plans? Restricted to a number of API requests?
Open-source
On-device / Cloud-based: ASR / NLU on device is private and secure
Language support (en-us/en-gb/es-es/…): How many languages does it support at the moment?

2. Transactional vs Conversational

Transactional degree: Is it goal-oriented? Select a degree from 1 (minimum = only to chat and talk) to 5 (maximum = command line, e.g. /order pizza)
Conversational degree: Does it provide functions to create conversations with users? From 1 (minimum) to 5 (maximum = has smalltalk, context and memory)

3. Intents

Smalltalk support (e.g.: hello, bye, thanks): It allows you to easily import a lot of predefined answers for simple questions and phrases like “Hi!”, “How are you?”, “Are you a robot?”, “What’s your hobby?”, “How old are you?”
Customizable smalltalk: You can easily open any intent, change response and add more training data
Built-in intents (e.g.: getWeather intent): These are intents for common actions that you can choose to implement without providing any sample utterances
Custom intents: You can create your own intents and train them by providing sample utterances
Fallback intents: Fallback intents are triggered if a user’s input is not matched by any of the regular intents
Follow-up intents: Natural human dialogs are filled with follow-ups and confirmations. These intents make these types of natural conversation flows easy to build and customize
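The relationship between custom intents, sample utterances and a fallback intent can be sketched with a toy matcher. This is not how api.ai or wit.ai work internally (they rely on trained models); it is only a word-overlap heuristic, and every intent name, sample and threshold below is a made-up assumption.

```python
import re
from typing import Dict, List, Set

# Hypothetical custom intents, each "trained" with a few sample utterances.
SAMPLES: Dict[str, List[str]] = {
    "order_pizza": ["order my favorite pizza", "I want a pizza"],
    "smalltalk.greeting": ["hello", "hi there"],
}
FALLBACK_INTENT = "fallback"


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def classify(utterance: str, threshold: float = 0.5) -> str:
    """Return the intent whose sample overlaps most with the utterance (Jaccard similarity)."""
    words = tokens(utterance)
    best_intent, best_score = FALLBACK_INTENT, 0.0
    for intent, samples in SAMPLES.items():
        for sample in samples:
            sample_words = tokens(sample)
            union = words | sample_words
            score = len(words & sample_words) / len(union) if union else 0.0
            if score > best_score:
                best_intent, best_score = intent, score
    # The fallback intent is triggered when no regular intent matches well enough.
    return best_intent if best_score >= threshold else FALLBACK_INTENT


print(classify("Please order my favorite pizza"))
print(classify("What is the weather like?"))
```

The second utterance shares no words with any sample, so it ends up in the fallback intent, which is where a bot would typically answer something like “Sorry, I didn’t get that”.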

4. Entities

Pre-built / Built-in entities (e.g.: $date) / Slot types: System entities that make it easier to handle the most common concepts
Lookup strategies: Can you select how your entity is recognized in the text (lookup strategy)?
Free text entities: When you need to extract a substring of the message, and this substring does not belong to a predefined list of possible values
Custom entities / Developer entities / User entities: Closed entities, defined by the developer or custom for the user
Allow automated expansion / Freetext & Keywords: This feature of mapping entities allows an agent to recognize values that have not been explicitly listed in the entity
Lists of values in entities: The ability to recognize an enumeration of different values of an entity as a list. E.g.: “I want some oranges, a few apples and two apricots” is recognized as the list [oranges, apples, apricots]
Quantities / Plural vs Singular Detection
Synonyms: Mapping of synonyms to a reference value. For example, a food type entity could have an entry with a reference value of “vegetarian” with synonyms of “veg” and “veggie”
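Two of the entity features above, synonyms and lists of values, can be illustrated with a small sketch. It assumes exact keyword lookup over a closed set of values, which is far simpler than what a real NLU tool does; the entity definitions and function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical custom entity: reference value -> synonyms.
FOOD_TYPE: Dict[str, List[str]] = {
    "vegetarian": ["veg", "veggie"],
}
# Hypothetical closed entity listing the values it accepts.
FRUITS: Set[str] = {"oranges", "apples", "apricots"}


def resolve_synonym(word: str, entity: Dict[str, List[str]]) -> Optional[str]:
    """Map a word to its reference value, or return None if the entity does not know it."""
    word = word.lower()
    for reference, synonyms in entity.items():
        if word == reference or word in synonyms:
            return reference
    return None


def extract_list(utterance: str, values: Set[str]) -> List[str]:
    """Return every known value in the order it appears in the utterance."""
    return [w for w in re.findall(r"[a-z]+", utterance.lower()) if w in values]


print(resolve_synonym("veggie", FOOD_TYPE))
print(extract_list("I want some oranges, a few apples and two apricots", FRUITS))
```

Automated expansion and free-text entities go beyond this kind of lookup, since they have to recognize values that were never listed.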

5. Dialogue Management

Slot-filling support: Ask for the specific pieces of information (slots) that are required to fulfill the action
Confirmation (yes/no) support
Context support: Contexts pass on information from previous conversations or external sources (e.g., user profile, device information, etc.). They can be used to manage conversation flow
Sessions: Time in which the system remembers the data provided
Fulfillment: Ability to execute the action or answer the query asked by the user, either inside the tool or outside it by providing the URL where the fulfillment is done
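Slot filling and sessions can be sketched as follows: the session remembers what the user has already provided, and the dialogue manager asks for the first slot that is still missing. The slot names and prompts are invented for a pizza-ordering example and do not come from any specific tool.

```python
from typing import Dict, Optional

# Hypothetical slots required to fulfill an "order_pizza" action, with their prompts.
REQUIRED_SLOTS: Dict[str, str] = {
    "size": "What size would you like?",
    "topping": "Which topping do you want?",
    "address": "Where should we deliver it?",
}


def next_prompt(session: Dict[str, str]) -> Optional[str]:
    """Return the question for the first missing slot, or None when all slots are filled."""
    for slot, prompt in REQUIRED_SLOTS.items():
        if slot not in session:
            return prompt
    return None


session: Dict[str, str] = {}   # data remembered while the session lasts
session["size"] = "large"      # e.g. extracted from "a large pizza, please"
print(next_prompt(session))    # the topping is the next missing slot
session.update(topping="mushrooms", address="221B Baker Street")
print(next_prompt(session))    # nothing left to ask: ready for fulfillment
```

Once `next_prompt` returns `None`, the collected slots can be handed to whatever service performs the fulfillment.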

6. Response Generation

Response generation capabilities: Is it able to answer the user directly with information about their request or query?

7. ASR (Automatic Speech Recognizer)

Provides ASR: Is able to recognize voice and transform it into text

8. TTS (Text to Speech)

Provides TTS: Transforms text to speech

9. External services

Channel support: FB Messenger, Slack, etc.
Integrations / 3rd-party services support: E.g.: web demo / Actions on Google / Amazon Alexa / Twitter, etc.

Originally published at labs.beeva.com on October 2, 2017.