This post is part of a series on creating a natural language processor in C#. The other entries in this series are:
Writing a Natural Language Parser in C# Part 1–Why?
Writing a Natural Language Parser in C# Part 2 – Architecture
Writing a Natural Language Parser in C# Part 3–CommandProcessor and ConversationContext
Writing a Natural Language Parser in C# Part 5 - Questions and Rules
This week, I’d like to look deeper into how the speech processor tokenizes the incoming command.
When I first started thinking about creating a speech processor for my smart home software, I did quite a bit of research online to see what the state of this technology was and what people were doing with it. I found several pieces of software, such as SharpNLP and Antelope, that can take a sentence and break it down into its constituent phrases and words. They can identify each word’s part of speech, its definition, and its synonyms. The output from such a tool might look something like this:
Let/VB 's/PRP see/VB how/WRB tokenization/NN works/VBZ in/IN SmartNLP/NNP ./.
I was amazed at all this until I tried to figure out how to apply this technology to my problem of communicating with my machine. I found that even with all this information about what was said, I still couldn’t understand it well enough to act on it. The approach I settled on instead converts the input into tokens that carry a different set of properties. Once the input is tokenized, it is no longer a string but a collection of objects that can be processed to discover what is being communicated.
Let’s look at how the system tokenizes its input.
Anatomy of a Token
A token has three members: a collection of phrases, a value, and a method that takes the input and returns results. A skeleton would look like this:
Token
- [DataContract]
- public class Token
- {
- protected List<string> Words;
-
- [DataMember]
- public object Value { get; set; }
-
- public virtual IEnumerable<TokenResult> Parse(string input, Guid userId)
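- {
- // Body omitted in this skeleton. The base implementation locates the
- // phrases in Words and returns a TokenResult for each one found.
- }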
- }
Here, the list of strings, Words, holds the words or phrases that the class will locate and generate a result for. The Value property holds the value that has been parsed. Depending on the token, this could be a string, a DateTime or a number. Last, the Parse method does the actual parsing. All tokens inherit from the base Token class, whose Parse implementation provides the basic behavior: it locates the phrases in the Words collection and returns a TokenResult for each one it finds.
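The base Parse implementation is not shown in this post, so here is a minimal sketch of what phrase matching of this kind might look like. The Token and TokenResult classes below are trimmed-down stand-ins, and the whole-word boundary check and case-insensitive comparison are assumptions rather than details confirmed by the original code.

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down stand-in for the TokenResult class described in this post.
public class TokenResult
{
    public object Value { get; set; }
    public string TokenType { get; set; }
    public int Start { get; set; }
    public int Length { get; set; }
}

// Stand-in base class; the matching rules here are assumptions.
public class Token
{
    protected List<string> Words = new List<string>();

    public virtual IEnumerable<TokenResult> Parse(string input, Guid userId)
    {
        foreach (var phrase in Words)
        {
            if (string.IsNullOrEmpty(phrase)) continue;

            int from = 0;
            while (from < input.Length)
            {
                int at = input.IndexOf(phrase, from, StringComparison.OrdinalIgnoreCase);
                if (at < 0) break;

                // Only accept matches that start and end on a word boundary,
                // so "list" does not match inside "lists".
                int end = at + phrase.Length;
                bool startsOnBoundary = at == 0 || input[at - 1] == ' ';
                bool endsOnBoundary = end == input.Length || input[end] == ' ';

                if (startsOnBoundary && endsOnBoundary)
                {
                    yield return new TokenResult
                    {
                        Value = phrase,
                        TokenType = GetType().ToString(),
                        Start = at,
                        Length = phrase.Length
                    };
                }

                from = at + 1;
            }
        }
    }
}

// Hypothetical token used only to exercise the sketch.
public class ListToken : Token
{
    public ListToken()
    {
        Words = new List<string> { "list", "show", "what are my" };
    }
}
```

With this sketch, calling new ListToken().Parse("list reminders", Guid.Empty) yields a single result covering the word “list” at the start of the input.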
Some of the tokens are very simply implemented and others are quite complicated depending on what is being parsed. For example, the token that parses the user’s request for information is TokenList, as in “list reminders”. Its entire implementation looks like this:
TokenList
- [DataContract]
- [Export(typeof(IParseToken))]
- public class TokenList : Token, IParseToken
- {
- public TokenList()
- {
- Words = new List<string> { "list", "show", "get", "lists", "what are my", "whats" };
- }
- }
This class takes advantage of the base class’s Parse implementation. Also, notice that the Words list contains synonyms for “list”. If the user says any of these phrases, it will be parsed as a TokenList.
As an example of a more complicated Token, consider a token that parses a DateTime. Here, we would override the base class’s implementation of Parse and look for portions of the input that could be parsed as a DateTime. This can get quite complicated when you consider that the user could say something like “remind me to call bob next saturday”. This token would need to recognize that “next saturday” specifies a date and then calculate what that date is.
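To give a feel for the calculation involved, here is a small sketch that finds “next <weekday>” in the input and resolves it to a date. The RelativeDates class and its method names are hypothetical, and the sketch assumes “next saturday” means the first Saturday strictly after the current day; the real token may interpret it differently and certainly handles many more forms.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class RelativeDates
{
    // Returns the first date strictly after 'from' that falls on 'target'.
    public static DateTime NextDayOfWeek(DateTime from, DayOfWeek target)
    {
        int daysAhead = ((int)target - (int)from.DayOfWeek + 7) % 7;
        if (daysAhead == 0) daysAhead = 7; // same weekday: jump a full week
        return from.Date.AddDays(daysAhead);
    }

    // Looks for "next <weekday>" in the input. On success, reports where the
    // phrase sits in the input and the date it refers to.
    public static bool TryFindNextDay(string input, DateTime now,
        out int start, out int length, out DateTime date)
    {
        foreach (DayOfWeek day in Enum.GetValues(typeof(DayOfWeek)))
        {
            string phrase = "next " + day.ToString();
            int at = input.IndexOf(phrase, StringComparison.OrdinalIgnoreCase);
            if (at >= 0)
            {
                start = at;
                length = phrase.Length;
                date = NextDayOfWeek(now, day);
                return true;
            }
        }

        start = -1;
        length = 0;
        date = default(DateTime);
        return false;
    }
}
```

The start and length values are what a token would copy into its TokenResult, and the computed date would become the result’s Value.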
TokenResult
The Token classes all return an instance of TokenResult for each value they parse out. The TokenResult class is listed below.
TokenResult
- [DataContract]
- [KnownType("GetKnownTypes")]
- public class TokenResult
- {
- [DataMember]
- public object Value { get; set; }
-
- [DataMember]
- public string TokenType { get; set; }
-
- [DataMember]
- public int Start { get; set; }
-
- [DataMember]
- public int Length { get; set; }
-
- [DataMember]
- public Token Token { get; set; }
-
- private static IEnumerable<Type> GetKnownTypes()
- {
- return new List<Type>
- {
- typeof (Token),
- typeof (TokenInt),
- typeof (TokenLong),
- typeof (TokenNumeric),
- typeof (TokenPercentage),
- typeof (TokenQuotedPhrase),
- typeof (TokenResult),
- typeof (Tokens.Nouns.TokenNoun),
- typeof (Tokens.Nouns.TokenToDo),
- typeof (Tokens.Nouns.TokenEmail),
- typeof (Tokens.Nouns.TokenSms),
- typeof (Tokens.Nouns.TokenWeather),
- typeof (Tokens.Nouns.TokenNews),
- typeof (Tokens.Nouns.TokenIm),
- typeof(Tokens.Nouns.TokenNeither),
- typeof(Tokens.Nouns.TokenYesNo),
- typeof(TokenReminder),
- typeof(TokenDefinedList),
- typeof(TokenNamed),
- //typeof (Tokens.Nouns.TokenDevice),
- //typeof (Tokens.Nouns.TokenRoom),
- typeof (Tokens.Nouns.TokenState),
- //typeof (Tokens.Nouns.TokenStructure),
- //typeof (Tokens.Nouns.TokenZone),
- typeof (Tokens.Nouns.TokenDim),
- typeof (Tokens.Prepositions.TokenPreposition),
- typeof (Tokens.Temporal.TokenDeterminateSeries),
- typeof (Tokens.Temporal.TokenExactTime),
- typeof (Tokens.Temporal.TokenIndeterminateSeries),
- typeof (Tokens.Temporal.TokenTemporal),
- typeof(Tokens.Temporal.TemporalParts.TokenDayOfWeek),
- typeof(Tokens.Temporal.TemporalParts.TokenApril),
- typeof(Tokens.Temporal.TemporalParts.TokenAugust),
- typeof(Tokens.Temporal.TemporalParts.TokenDayAfterTomorrow),
- typeof(Tokens.Temporal.TemporalParts.TokenDayBeforeYesterday),
- typeof(Tokens.Temporal.TemporalParts.TokenDecember),
- typeof(Tokens.Temporal.TemporalParts.TokenEach),
- typeof(Tokens.Temporal.TemporalParts.TokenEighteenth),
- typeof(Tokens.Temporal.TemporalParts.TokenEighth),
- typeof(Tokens.Temporal.TemporalParts.TokenEleventh),
- typeof(Tokens.Temporal.TemporalParts.TokenFebruary),
- typeof(Tokens.Temporal.TemporalParts.TokenFifteenth),
- typeof(Tokens.Temporal.TemporalParts.TokenFifth),
- typeof(Tokens.Temporal.TemporalParts.TokenFirst),
- typeof(Tokens.Temporal.TemporalParts.TokenForteenth),
- typeof(Tokens.Temporal.TemporalParts.TokenForth),
- typeof(Tokens.Temporal.TemporalParts.TokenFriday),
- typeof(Tokens.Temporal.TemporalParts.TokenInt),
- typeof(Tokens.Temporal.TemporalParts.TokenJanuary),
- typeof(Tokens.Temporal.TemporalParts.TokenJuly),
- typeof(Tokens.Temporal.TemporalParts.TokenJune),
- typeof(Tokens.Temporal.TemporalParts.TokenLong),
- typeof(Tokens.Temporal.TemporalParts.TokenMarch),
- typeof(Tokens.Temporal.TemporalParts.TokenMay),
- typeof(Tokens.Temporal.TemporalParts.TokenMonday),
- typeof(Tokens.Temporal.TemporalParts.TokenMonth),
- typeof(Tokens.Temporal.TemporalParts.TokenNinteenth),
- typeof(Tokens.Temporal.TemporalParts.TokenNinth),
- typeof(Tokens.Temporal.TemporalParts.TokenNovember),
- typeof(Tokens.Temporal.TemporalParts.TokenNumeric),
- typeof(Tokens.Temporal.TemporalParts.TokenOctober),
- typeof(Tokens.Temporal.TemporalParts.TokenOrdinal),
- typeof(Tokens.Temporal.TemporalParts.TokenOther),
- typeof(Tokens.Temporal.TemporalParts.TokenPercentage),
- typeof(Tokens.Temporal.TemporalParts.TokenRelativeTemporalOrdinal),
- typeof(Tokens.Temporal.TemporalParts.TokenSaturday),
- typeof(Tokens.Temporal.TemporalParts.TokenSecond),
- typeof(Tokens.Temporal.TemporalParts.TokenSeptember),
- typeof(Tokens.Temporal.TemporalParts.TokenSeventeenth),
- typeof(Tokens.Temporal.TemporalParts.TokenSeventh),
- typeof(Tokens.Temporal.TemporalParts.TokenSixteenth),
- typeof(Tokens.Temporal.TemporalParts.TokenSixth),
- typeof(Tokens.Temporal.TemporalParts.TokenSpecifiedDate),
- typeof(Tokens.Temporal.TemporalParts.TokenSunday),
- typeof(Tokens.Temporal.TemporalParts.TokenTenth),
- typeof(Tokens.Temporal.TemporalParts.TokenThird),
- typeof(Tokens.Temporal.TemporalParts.TokenThirteenth),
- typeof(Tokens.Temporal.TemporalParts.TokenThirtieth),
- typeof(Tokens.Temporal.TemporalParts.TokenThirtyFirst),
- typeof(Tokens.Temporal.TemporalParts.TokenThursday),
- typeof(Tokens.Temporal.TemporalParts.TokenTime),
- typeof(Tokens.Temporal.TemporalParts.TokenToday),
- typeof(Tokens.Temporal.TemporalParts.TokenTomorrow),
- typeof(Tokens.Temporal.TemporalParts.TokenTuesday),
- typeof(Tokens.Temporal.TemporalParts.TokenTwelth),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentieth),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentyEighth),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentyFifth),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentyFirst),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentyFourth),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentyNinth),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentySecond),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentySeventh),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentySixth),
- typeof(Tokens.Temporal.TemporalParts.TokenTwentyThird),
- typeof(Tokens.Temporal.TemporalParts.TokenWednesday),
- typeof(Tokens.Temporal.TemporalParts.TokenYesterday),
- typeof (Tokens.Verbs.TokenCreate),
- typeof (Tokens.Verbs.TokenDelete),
- typeof (Tokens.Verbs.TokenList),
- typeof (Tokens.Verbs.TokenRemind),
- typeof (Tokens.Verbs.TokenReset),
- typeof (Tokens.Verbs.TokenWhatIs),
- typeof(Tokens.Verbs.TokenWhereIs),
- typeof(Tokens.Verbs.TokenWhoIs),
- typeof (Tokens.Verbs.TokenWhoSang),
- typeof (Tokens.Verbs.TokenWhoWasIn),
- typeof (Tokens.Verbs.TokenRemindMeTo),
- typeof (Tokens.Verbs.TokenRemindMeAt),
- typeof(System.Type),
- typeof(Questions.Question)
- //typeof(StructuredSpeech2.House.Structure.Device),
- //typeof(StructuredSpeech2.House.Structure.House),
- //typeof(StructuredSpeech2.House.Structure.Room),
- //typeof(StructuredSpeech2.House.Structure.X10Device),
- //typeof(StructuredSpeech2.House.Structure.Zone),
- //typeof(StructuredSpeech2.House.Devices.X10LampDevice),
- //typeof(StructuredSpeech2.Tokens.Verbs.TokenTurn),
- //typeof(StructuredSpeech2.Tokens.Nouns.TokenDeviceList)
- };
- }
- }
This class wraps a token instance and additionally holds the start position and length of the parsed value. You’ll also notice the KnownType attribute and the GetKnownTypes method, which support serializing tokens. The application often stores tokens and token results in the database, and this code allows the types to be serialized and persisted.
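The following sketch shows why the known-types list matters. It assumes DataContractSerializer is the serializer in use (the post only shows the DataContract attributes, so that is a guess), and the TokenStore class is hypothetical. Because the Token property is declared as the base type, the serializer has to be told which derived types it may encounter.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using System.Text;

[DataContract]
public class Token
{
    [DataMember]
    public object Value { get; set; }
}

[DataContract]
public class TokenList : Token { }

[DataContract]
[KnownType("GetKnownTypes")]
public class TokenResult
{
    [DataMember] public int Start { get; set; }
    [DataMember] public int Length { get; set; }
    [DataMember] public Token Token { get; set; }

    // Tells the serializer which Token subclasses may appear in the Token property.
    private static IEnumerable<Type> GetKnownTypes()
    {
        return new List<Type> { typeof(TokenList) };
    }
}

// Hypothetical persistence helper used only for this illustration.
public static class TokenStore
{
    public static string Save(TokenResult result)
    {
        var serializer = new DataContractSerializer(typeof(TokenResult));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, result);
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    public static TokenResult Load(string xml)
    {
        var serializer = new DataContractSerializer(typeof(TokenResult));
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
        {
            return (TokenResult)serializer.ReadObject(stream);
        }
    }
}
```

Without TokenList in the known-types list, serializing a TokenResult whose Token is a TokenList would fail, because the serializer only knows about the declared Token type.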
Recall that the CommandProcessor class calls into the TokenManager which calls into each token, in turn, and compiles all the results into buckets.
TokenManager
The TokenManager holds a collection of tokens and gives each one a shot at parsing the input. It then uses the start position and length properties on the results to organize them into a dictionary that can be used to find a matching rule to execute. The TokenManager class is listed below.
TokenManager
- [Export]
- public class TokenManager
- {
- [ImportMany(typeof(IParseToken))]
- private List<IParseToken> Tokens { get; set;}
-
- public Dictionary<int, List<TokenResult>> TokenizeInput(
- string input, Guid userId)
- {
- var results = new List<TokenResult>();
-
- foreach (var token in Tokens)
- {
- // Catch per token so one failing token does not discard the others' results.
- try
- {
- results.AddRange(token.Parse(input, userId));
- }
- catch (Exception e)
- {
- Logger.Log(e.Message);
- }
- }
-
-
- CreateQuotedPhraseTokens(results, input);
-
- //arrange all token results by their start positions
- var buckets = new Dictionary<int, List<TokenResult>>();
-
- foreach (var result in results.OrderBy(r => r.Start))
- {
- if (!buckets.ContainsKey(result.Start))
- {
- buckets[result.Start] = new List<TokenResult>();
- }
-
- buckets[result.Start].Add(result);
- }
-
- return buckets;
- }
-
- private void CreateQuotedPhraseTokens(
- List<TokenResult> results, string input)
- {
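- // Nothing to do for empty input; this also avoids indexing past the end below.
- if (string.IsNullOrEmpty(input)) return;
-
- int index = 0;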
- List<WordInfo> words = new List<WordInfo>();
- string accumulator = "";
-
- for (index = 0; index < input.Length - 1; index++)
- {
- if (input[index] == ' ')
- {
- words.Add(new WordInfo
- {
- Found = false,
- Length = accumulator.Length,
- Start = index - accumulator.Length,
- Value = accumulator
- });
-
- accumulator = "";
- continue;
- }
-
- accumulator += input[index];
- }
-
- accumulator += input[index];
-
- words.Add(new WordInfo
- {
- Found = false,
- Length = accumulator.Length,
- Start = (index + 1) - accumulator.Length,
- Value = accumulator
- });
-
- accumulator = "";
-
- foreach (var word in words)
- {
- // A word is covered when it lies entirely inside some existing result.
- var match = results.FirstOrDefault(r =>
- word.Start >= r.Start &&
- (word.Start + word.Length) <= (r.Start + r.Length));
-
- if (match != null)
- {
- if (accumulator.Length > 0)
- {
- results.Add(new TokenResult
- {
- Length = accumulator.Trim().Length,
- Start = word.Start - 1 - accumulator.Trim().Length,
- Token = new TokenQuotedPhrase { Value = accumulator.Trim() },
- TokenType = typeof(TokenQuotedPhrase).ToString(),
- Value = accumulator.Trim()
- });
- accumulator = "";
- }
- }
- else
- {
- accumulator += word.Value + " ";
- }
- }
-
- if (accumulator.Length > 0)
- {
- results.Add(new TokenResult
- {
- Length = accumulator.Trim().Length,
- // The phrase runs to the end of the input, so no separator is subtracted here.
- Start = input.Length - accumulator.Trim().Length,
- Token = new TokenQuotedPhrase { Value = accumulator.Trim() },
- TokenType = typeof(TokenQuotedPhrase).ToString(),
- Value = accumulator.Trim()
- });
- }
- }
- }
The ImportMany attribute on the Tokens property (lines 4 and 5 of the listing) shows that we’re using MEF to load all the tokens into a collection. The TokenizeInput method loops through the tokens and passes the input to each. It then calls the CreateQuotedPhraseTokens method, which I’ll discuss shortly. Finally, the results are grouped by start position into a dictionary.
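The bucketing step can also be expressed with LINQ. The sketch below is an alternative formulation, not the code used in the project: it uses a trimmed-down TokenResult and a hypothetical Bucketing class, and it returns a SortedDictionary so that iterating the buckets in start order is guaranteed rather than incidental.

```csharp
using System.Collections.Generic;
using System.Linq;

// Trimmed-down stand-in for the TokenResult class described in this post.
public class TokenResult
{
    public string TokenType { get; set; }
    public int Start { get; set; }
    public int Length { get; set; }
}

// Hypothetical helper used only for this illustration.
public static class Bucketing
{
    // Groups results by the position where they start in the input.
    public static SortedDictionary<int, List<TokenResult>> ByStart(
        IEnumerable<TokenResult> results)
    {
        var buckets = new SortedDictionary<int, List<TokenResult>>();

        foreach (var group in results.GroupBy(r => r.Start))
        {
            buckets[group.Key] = group.ToList();
        }

        return buckets;
    }
}
```

If two tokens both match at position 0 and a third matches at position 5, the result has two buckets: key 0 holding two results and key 5 holding one. Several results in one bucket mean the same stretch of input has competing interpretations.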
It’s quite possible the user will sometimes say words or phrases that we have no token for. In fact, there are situations where we expect the user to do this. For example, when the user asks the system to create a reminder, they will say something like, “Remind me to cook the golden goose next Friday”. We can parse out enough of the input to determine that the user would like a reminder created and when they would like to be reminded. We don’t have tokens, however, to represent the “cook the golden goose” portion of the input. For this reason, after all the token classes have parsed out their results from the input, the TokenManager tokenizes the “left out” portions of the input as a TokenQuotedPhrase type. This allows us to use these values when locating rules to execute, and inside those rules we can use that portion of the input as data.
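The idea of collecting the “left out” text can be sketched in a simpler form than the word-by-word accumulator in CreateQuotedPhraseTokens. The version below is a different technique: it marks which characters are covered by existing results and reports each uncovered run as a phrase. The Span and LeftOvers types are hypothetical, and unlike the original this sketch does not check that a word is wholly inside a result.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal span type standing in for TokenResult.
public class Span
{
    public int Start { get; set; }
    public int Length { get; set; }
    public string Value { get; set; }
}

public static class LeftOvers
{
    // Returns the stretches of input that no existing result covers,
    // with surrounding spaces trimmed off.
    public static List<Span> FindUnmatched(string input, IEnumerable<Span> matched)
    {
        var covered = new bool[input.Length];
        foreach (var m in matched)
        {
            int end = Math.Min(input.Length, m.Start + m.Length);
            for (int i = Math.Max(0, m.Start); i < end; i++)
            {
                covered[i] = true;
            }
        }

        var gaps = new List<Span>();
        int pos = 0;
        while (pos < input.Length)
        {
            // Skip covered characters and leading spaces.
            if (covered[pos] || input[pos] == ' ') { pos++; continue; }

            int start = pos;
            int endExclusive = pos;
            while (pos < input.Length && !covered[pos])
            {
                // Track the last non-space character so trailing spaces are dropped.
                if (input[pos] != ' ') endExclusive = pos + 1;
                pos++;
            }

            gaps.Add(new Span
            {
                Start = start,
                Length = endExclusive - start,
                Value = input.Substring(start, endExclusive - start)
            });
        }

        return gaps;
    }
}
```

For “Remind me to cook the golden goose next Friday”, with “Remind me to” and “next Friday” already matched, the only gap is “cook the golden goose”, starting at position 13 with a length of 21.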
The tokenization of the input is an important part of understanding what the user is asking for. It allows us to work with the input as a collection of objects as opposed to dealing with a string. The last part of the process is matching the tokens to a rule. We’ll look at how this is done next time.