Writing a Natural Language Parser in C# Part 2 - Architecture
March 18, 2012 at 11:16 AM
This post is part of a series on creating a natural language processor in C#. The other entries in this series are:
Writing a Natural Language Parser in C# Part 1–Why?
Writing a Natural Language Parser in C# Part 3–CommandProcessor and ConversationContext
Writing a Natural Language Parser in C# Part 4–Tokens
Writing a Natural Language Parser in C# Part 5 - Questions and Rules
In our last post we discussed why one might be interested in building and using a natural language processor in a business or home project. This week, I’d like to look at the architecture of the processor I’ve written. This will be a high-level look at the various pieces of the system, the flow of processing a single sentence and how each piece contributes to the process.
An Activity Diagram
Below is an activity diagram depicting the interaction of the major parts of the system and how they work together to process a sentence sent to the system by the user.
The steps involved in this process are as follows:
- The string the user submits, along with a user ID and other bits of information, is sent to the Command Processor.
- The command processor creates an instance of a conversation context. This is a very important part of the system as a whole. It keeps track of which user made the request and which method was used to communicate the request (email, IM, etc.), and it holds a history of all statements that have occurred in both directions for the duration of the conversation.
- The system contains a collection of Tokens that know how to inspect the input for certain strings or conditions. They generate a TokenResult instance for each interesting piece of the statement and record where the interesting part begins in the string and how long it is. The token result also records a strong type that indicates the kind of thing that was found, such as a certain phrase, a date or the name of something it knows about.
- After all the tokens have had a chance to process the string, the resulting TokenResults all exist in a single collection. This collection is then organized into buckets where each bucket corresponds to a start position in the string. For example, all interesting things found to have begun at position zero would be in a bucket together. How can there be more than one token result for the same phrase? Well, consider a string that contains the numeral “1”. This could represent an integer, a long, a decimal, an ordinal (think “first” as in the first day of the month or first day of the week), etc.
- Rules are methods that return void and have any number of parameters which are each a type of token. These rules are all defined in classes that are marked with a particular interface and are obtained via reflection. After all the token result instances have been organized into buckets, each of these methods is inspected and its parameters are compared to the contents of each bucket, in order, to see if a match for the parameter is found in the corresponding bucket.
- Once a matching rule is found, it is executed by passing in the conversation context and all the matching tokens. The context can be used from within the method to inspect the history of the conversation and also to send responses back to the user.
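The bucketing step described above can be sketched roughly as follows. This is only an illustration under assumptions: the post names TokenResult but does not show its members, so the Start, Length and Value properties, the TokenInt and TokenOrdinal types and the Bucketizer helper are all hypothetical names invented here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of a token result: where the match starts, how long it
// is, and a strongly typed value describing what was found.
public sealed class TokenResult
{
    public TokenResult(int start, int length, object value)
    {
        Start = start;
        Length = length;
        Value = value;
    }

    public int Start { get; }
    public int Length { get; }
    public object Value { get; }
}

// Hypothetical token types, invented for this illustration.
public sealed class TokenInt { public int Number; }
public sealed class TokenOrdinal { public int Rank; }

public static class Bucketizer
{
    // Groups results by start position, so that each bucket holds every
    // interpretation found at the same place in the input string.
    // A SortedDictionary keeps the buckets in string order.
    public static SortedDictionary<int, List<TokenResult>> ToBuckets(
        IEnumerable<TokenResult> results)
    {
        var buckets = new SortedDictionary<int, List<TokenResult>>();
        foreach (var result in results)
        {
            if (!buckets.TryGetValue(result.Start, out var bucket))
            {
                bucket = new List<TokenResult>();
                buckets[result.Start] = bucket;
            }
            bucket.Add(result);
        }
        return buckets;
    }
}
```

With this sketch, a numeral “1” at position zero that two different tokens recognize (once as a TokenInt and once as a TokenOrdinal) produces two TokenResult instances that land in the same bucket, which is exactly the situation the “1” example describes.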
Let’s look at a simple example. Even if this example does not make complete sense right now, it will become clear as we look into each piece of the system in more detail.
Let’s say the user sends in a request, “What’s the weather”. Once the tokens have all had a chance to look at this string, we will have a collection of token results. In fact, we will have two results. The first will say that “What’s the” can be tokenized into a TokenList beginning at position 0 in our string. The second will say that the remainder of the string can be tokenized into a TokenWeather beginning at position 10 (I know this seems like an incorrect position, but I will explain in a future post).
Now, these two tokens will be put into a couple of buckets and the system will begin to go through all the defined rules. One of these rules has a signature like this:
- public static void GetWeather(ConversationContext cContext, TokenList list, TokenWeather weather)
Since TokenList matches what’s in our first bucket and TokenWeather matches what’s in the second, this will be the matching rule and will be executed.
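To make the matching step more concrete, here is a minimal sketch of how a rule like GetWeather could be discovered and invoked via reflection. It is not the actual implementation. The IRuleContainer interface name, the RuleMatcher class, the Say method on the context and the empty token classes are assumptions made for illustration, and the sketch simplifies things by requiring exactly one rule parameter per bucket.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Stand-in for the real conversation context; Say is a hypothetical method.
public sealed class ConversationContext
{
    public List<string> Responses { get; } = new List<string>();
    public void Say(string message) { Responses.Add(message); }
}

// Empty stand-ins for the two token types used in the example.
public sealed class TokenList { }
public sealed class TokenWeather { }

// Marker interface for classes that contain rules (name assumed).
public interface IRuleContainer { }

public sealed class WeatherRules : IRuleContainer
{
    public static void GetWeather(ConversationContext cContext, TokenList list, TokenWeather weather)
    {
        cContext.Say("Looking up the weather...");
    }
}

public static class RuleMatcher
{
    // Tries each public static void method on ruleClass. A method matches when
    // its first parameter is the context and every following parameter type is
    // found in the bucket at the same position. The first match is invoked.
    public static bool TryRun(Type ruleClass, ConversationContext context, IList<List<object>> buckets)
    {
        foreach (MethodInfo method in ruleClass.GetMethods(BindingFlags.Public | BindingFlags.Static))
        {
            ParameterInfo[] parameters = method.GetParameters();
            if (method.ReturnType != typeof(void)) continue;
            if (parameters.Length != buckets.Count + 1) continue;
            if (parameters[0].ParameterType != typeof(ConversationContext)) continue;

            var args = new object[parameters.Length];
            args[0] = context;
            bool matched = true;
            for (int i = 0; i < buckets.Count; i++)
            {
                Type wanted = parameters[i + 1].ParameterType;
                object candidate = buckets[i].FirstOrDefault(v => wanted.IsInstanceOfType(v));
                if (candidate == null) { matched = false; break; }
                args[i + 1] = candidate;
            }
            if (!matched) continue;

            method.Invoke(null, args); // static method, so no target instance
            return true;
        }
        return false;
    }
}
```

Calling RuleMatcher.TryRun(typeof(WeatherRules), context, buckets) with a first bucket holding a TokenList and a second bucket holding a TokenWeather would find GetWeather and execute it. If the buckets were in the opposite order, no rule in this class would match and the call would return false.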
It should be noted that the entire process diagrammed above executes on a single thread. However, much like a web server, each request is handled on a separate thread that is initiated by classes that listen on different protocols.