Part 2 in the C#/NLP Series
In the spirit of continuing things, here’s part 2. This has been written in dribs and drabs over the past two weeks, so please bear with me if it’s a little jumpy…
So before I get started on coding, I’ll need to define some terms and phrases that will help you better understand what’s going on, and what we’re up to. I’m going to loosely define/explain some high-level NLP concepts, then discuss some programming topics you’ll need to brush up on to continue.
NOTE: Rather than write out fresh descriptions and examples, I’ve thrown together a list of necessary concepts/terminology, quickly acquired [read: pilfered] other people’s definitions, and pushed them in.
Some fundamental grammar concepts:
Noun
refers to people, places, things, or concepts, e.g. woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb.
Verb
a word that describes events and actions, e.g. fall, eat. In the context of a sentence, verbs typically express a relation involving the referents of one or more noun phrases.
Adjective
describes a noun, and can be used as a modifier (e.g. large in the large pizza) or in a predicate (e.g. the pizza is large). English adjectives can have internal structure (e.g. fall+ing in the falling stocks).
Adverb
modifies a verb to specify the time, manner, place or direction of the event described by the verb (e.g. quickly in the stocks fell quickly). Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).
Preposition
a word governing, and usually preceding, a noun or pronoun and expressing a relation to another word or element in the clause, e.g. on in “the man on the moon”
Conjunction
a part of speech that connects two words, sentences, phrases or clauses together, e.g. and in Mary and Tom
Determiner
a noun-modifier that expresses the reference of a noun or noun-phrase in the context, rather than attributes expressed by adjectives
NLP High-Level Terminology
So starting off, we’ll need a body of text to work on. We’ll call our body of text a ‘Corpus’. A Corpus should consist of 1+ Sentences.
We determine where a Sentence ends via ‘Sentence Segmentation’. Sentence Segmentation is the process of looking at the punctuation encountered while processing a Corpus and deciding whether it terminates the preceding Sentence.
Some examples of terminating punctuation are ‘!’, ‘.’, ‘?’
We can divide sentences into tokens (via tokenisation) so we can ‘POS Tag’ them.
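As a rough illustration of tokenisation, here is a minimal sketch. It assumes a token is either a run of word characters/apostrophes or a single punctuation mark; the class name TokeniserSketch and that rule are placeholders for illustration, not the approach the series will necessarily settle on.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokeniserSketch
{
    // Assumed rule (illustrative only): a token is a run of word
    // characters/apostrophes, or one non-space punctuation character.
    public static string[] Tokenise(string sentence)
    {
        return Regex.Matches(sentence, @"[\w']+|[^\w\s]")
                    .Cast<Match>()
                    .Select(match => match.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Prints each token on its own line; the comma and full stop
        // come out as separate tokens.
        foreach (string token in Tokenise("After the collapse, we fled."))
            Console.WriteLine(token);
    }
}
```

Keeping punctuation as its own token means a later stage can still see where clauses and sentences end.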
Part of Speech (POS) Tagging is the process of marking up the words in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. [http://en.wikipedia.org/wiki/Part-of-speech_tagging]
For Example:
"at" is a preposition. Therefore we'd assign the POS Tag "P" to "at". ("at", "P")
This is fine in the case of prepositions, but some words, like “collapse”, can have multiple POS Tags (e.g. noun or verb) depending on context. For those, we’ll need to look at adjacent and related words to tag them correctly.
For Example:
"After the collapse, we fled." - [Noun]
"He proceeded to collapse onto the ground." - [Verb]
Part of this can be achieved through ‘Hidden Markov Models’ (HMM). With HMM, scores are assigned to possible tags based on adjacency to other tags and tag combinations.
Consider the following:
Take part of a sentence, e.g. ‘the fight’. We know that “fight” can be a verb or a noun.
By applying HMM theory to our Corpus, we might determine that the probability of a noun following a determiner (the) is 40%, an adjective 30% and a number 20%. This way we can weigh the choices available and decide which tag to apply to the ambiguous token.
So in this case, we can (correctly) assume that “fight” should be tagged as a noun.
[ ["the", "DET"] ["fight", "N"] ]
In this case we checked a bigram, but we could easily apply this to trigrams, etc…
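To make the bigram idea concrete, here is a minimal sketch that counts which tag follows which in an already-tagged sequence and uses those counts to choose between candidate tags. It covers only the tag-to-tag (transition) part of an HMM, not a full tagger. The class name, method names and tiny tag sequence are made up for illustration, and it ignores sentence boundaries for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BigramTagSketch
{
    // Count how often each tag is followed by each other tag.
    public static Dictionary<string, Dictionary<string, int>> CountTransitions(IList<string> tags)
    {
        var counts = new Dictionary<string, Dictionary<string, int>>();
        for (int i = 1; i < tags.Count; i++)
        {
            Dictionary<string, int> following;
            if (!counts.TryGetValue(tags[i - 1], out following))
            {
                following = new Dictionary<string, int>();
                counts[tags[i - 1]] = following;
            }
            int seen;
            following.TryGetValue(tags[i], out seen); // seen stays 0 if absent
            following[tags[i]] = seen + 1;
        }
        return counts;
    }

    // Estimated probability that 'next' follows 'previous'
    // (0 if 'previous' was never seen with a successor).
    public static double Probability(Dictionary<string, Dictionary<string, int>> counts,
                                     string previous, string next)
    {
        Dictionary<string, int> following;
        if (!counts.TryGetValue(previous, out following)) return 0.0;
        int seen;
        following.TryGetValue(next, out seen);
        return (double)seen / following.Values.Sum();
    }

    // Choose the candidate tag that most often follows the previous tag.
    public static string BestTag(Dictionary<string, Dictionary<string, int>> counts,
                                 string previous, IEnumerable<string> candidates)
    {
        return candidates.OrderByDescending(c => Probability(counts, previous, c)).First();
    }

    public static void Main()
    {
        // Made-up tag sequence standing in for a hand-tagged corpus.
        string[] tags = { "DET", "N", "V", "DET", "JJ", "N", "V", "DET", "N", "P", "DET", "N" };
        var counts = CountTransitions(tags);

        // "fight" could be N or V; the previous token "the" is a DET.
        Console.WriteLine(BestTag(counts, "DET", new[] { "N", "V" }));
    }
}
```

With this toy data a noun follows a determiner three times out of four and a verb never does, so "fight" after "the" gets tagged N. Extending the key from one previous tag to two would give the trigram version.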
Once we’ve POS Tagged a Corpus, we can do a ‘frequency distribution’ on both the tokenised corpus text and the POS-tagged tokens, i.e. count the number of occurrences of each distinct element. This will help us with statistical analysis and further HMM discovery/use.
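A frequency distribution can be sketched in a few lines with LINQ. The class name FrequencySketch and the sample tokens/tags below are invented for illustration; the same Count method works for words and for tags.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FrequencySketch
{
    // Map each distinct item to the number of times it occurs.
    public static Dictionary<string, int> Count(IEnumerable<string> items)
    {
        return items.GroupBy(item => item)
                    .ToDictionary(group => group.Key, group => group.Count());
    }

    public static void Main()
    {
        // Made-up sample data: tokens and their (assumed) POS tags.
        string[] tokens = { "the", "cat", "saw", "the", "dog" };
        string[] tags = { "DET", "N", "V", "DET", "N" };

        // Most frequent items first.
        foreach (var pair in Count(tokens).OrderByDescending(p => p.Value))
            Console.WriteLine(pair.Key + ": " + pair.Value);
        foreach (var pair in Count(tags).OrderByDescending(p => p.Value))
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}
```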
We can also look at ‘Named Entities’ in a corpus – which are Nouns classified as Entity Types.
e.g. “Philip Gannon” can be classified with an entity type of [PERSON], while “Microsoft” would be classified as an entity type of [ORGANISATION].
The process of detecting named entities is known as “Named Entity Recognition”, which finds and tags all mentions of named entities. Named Entity Recognition is composed of two tasks: finding Named Entities by identifying their boundaries, and then identifying their entity types. Named Entity Recognition is achieved through ‘Chunking’: segmenting and labelling multi-token sequences. Chunking makes use of a well-defined ‘grammar’, a declarative specification of well-formedness.
Once we’ve tagged our corpus with POS Tags and Named Entities, we can proceed to identify the relationships between named entities, via “Relation Detection”.
For Example:
Given a corpus that indicates that the company Facebook is located in California, it could generate the triple ([ORG: 'Facebook'] ‘in’ [LOC: 'California']).
Through Relation Detection, we can identify and store triples, and from this form an ‘Ontology’ of our corpus.
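One simple way to hold such triples in memory is a small class plus a list. This is only a sketch of the data shape: the Triple and TripleStoreSketch names are hypothetical, and a real triple store would need indexing and persistence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one (subject, relation, object) triple.
public sealed class Triple
{
    public string Subject { get; private set; }
    public string Relation { get; private set; }
    public string Object { get; private set; }

    public Triple(string subject, string relation, string obj)
    {
        Subject = subject;
        Relation = relation;
        Object = obj;
    }

    public override string ToString()
    {
        return "(" + Subject + " '" + Relation + "' " + Object + ")";
    }
}

public static class TripleStoreSketch
{
    // Return every stored triple whose subject matches.
    public static IEnumerable<Triple> About(IEnumerable<Triple> store, string subject)
    {
        return store.Where(triple => triple.Subject == subject);
    }

    public static void Main()
    {
        var store = new List<Triple>
        {
            new Triple("Facebook", "in", "California")
        };

        foreach (Triple triple in About(store, "Facebook"))
            Console.WriteLine(triple);
    }
}
```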
An Ontology is a formal representation of knowledge as a set of concepts within a domain.
At every stage we’re going to be looking at ‘Accuracy’, basically the percentage of items we got right when compared against a known-correct answer.
We’ll check for accuracy after POS tagging, Named Entity Recognition and finally Relation Detection.
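For tagging, accuracy can be sketched as the share of predicted tags that agree with a hand-checked reference. The class name AccuracySketch and the sample tags are made up; the same comparison would apply to entity types or relations.

```csharp
using System;
using System.Collections.Generic;

public static class AccuracySketch
{
    // Percentage of positions where the predicted tag equals the expected tag.
    public static double Percent(IList<string> predicted, IList<string> expected)
    {
        if (predicted.Count != expected.Count)
            throw new ArgumentException("Both sequences must be the same length.");
        if (expected.Count == 0)
            return 0.0; // nothing to score

        int correct = 0;
        for (int i = 0; i < expected.Count; i++)
        {
            if (predicted[i] == expected[i]) correct++;
        }
        return 100.0 * correct / expected.Count;
    }

    public static void Main()
    {
        // Made-up example: the tagger got the last tag wrong.
        string[] predicted = { "DET", "N", "V", "N" };
        string[] expected = { "DET", "N", "V", "JJ" };

        // Three of the four tags agree.
        Console.WriteLine(Percent(predicted, expected));
    }
}
```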
Useful Programming Concepts to review:
Strings
Check out the MSDN Reference on string methods
Simple things like .Split() will give you a lot of mileage (in this case for Tokenisation and/or Sentence Segmentation).
For Example:
string corpus = "..."; // the body of text to segment
// Each char literal must hold exactly one character.
char[] sentenceSegmenters = { '!', '.', '?', '\n' };
string[] sentences = corpus.Split(sentenceSegmenters);
It’s also a good idea to check out the MSDN article on Best Practices when using strings
Regex
Check out the MSDN Reference on Regex and the Regular Expression Object Model
I’m going to be using regex for a number of things, especially for defining the grammar for chunking and for some parts of POS Tagging.
For Example:
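Chunking - <DET>?<JJ>*<N> (an optional determiner, any number of adjectives, then a noun) - would match:
[ ["the", DET] ["sleeping", JJ] ["cat", N] ]
or [ ["the", DET] ["little", JJ] ["sleeping", JJ] ["cat", N] ]
and also [ ["the", DET] ["cat", N] ], because * allows zero adjectives. Use <JJ>+ instead if at least one adjective is required.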
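One way to run a tag pattern like this with .NET's Regex is to write the tag sequence out as a string such as "<DET><JJ><N>" and match against that. The sketch below does so; the class name, the Chunk method and the grouping of each tag in parentheses (so the quantifier applies to the whole tag) are assumptions for illustration, not a fixed design.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ChunkerSketch
{
    // Returns the word sequences whose tags match tagPattern.
    // words[i] is assumed to carry the tag tags[i].
    public static List<string> Chunk(string[] words, string[] tags, string tagPattern)
    {
        // Encode the tags as one string, e.g. "<DET><JJ><N>".
        string encoded = string.Concat(tags.Select(tag => "<" + tag + ">"));

        var chunks = new List<string>();
        foreach (Match match in Regex.Matches(encoded, tagPattern))
        {
            if (match.Length == 0) continue;

            // Each '<' marks the start of one tag, so counting them turns
            // character offsets back into word positions.
            int first = encoded.Substring(0, match.Index).Count(c => c == '<');
            int length = match.Value.Count(c => c == '<');
            chunks.Add(string.Join(" ", words.Skip(first).Take(length)));
        }
        return chunks;
    }

    public static void Main()
    {
        string[] words = { "the", "sleeping", "cat", "sat" };
        string[] tags = { "DET", "JJ", "N", "V" };

        // Optional determiner, zero or more adjectives, then a noun.
        foreach (string chunk in Chunk(words, tags, @"(<DET>)?(<JJ>)*<N>"))
            Console.WriteLine(chunk);
    }
}
```

Swapping `(<JJ>)*` for `(<JJ>)+` makes at least one adjective compulsory, so a bare determiner plus noun would no longer be chunked.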
So we’ve covered some basic concepts and some high-level explanations of topics, as well as recommending some programming concepts to review before continuing.
In the next article I’ll show you the six-part process of parsing a corpus, and I’ll code up that process and some additional basic, yet necessary, functionality.
I’ve become enamored over the past while with NLP (Natural Language Processing), as it’s a large part of my fascination with Human/Computer interaction.
For those of you not in the know, NLP is basically
a [computer] system that will analyze, understand, and generate natural languages.
What’s a natural language? Good ‘aul spoken/written word. Effectively, the idea is to get a computer to understand (to some degree) the written or spoken word.
The applications for this technology are massive, but the focus I have in mind is as a Question/Answer based system. In this manner, you ask the system a question, and it will return a relevant answer.
I’ve also been doing a lot of research on Ontologies recently, and have been humming and hawing about organising a speedy triple store. My aim is to make an enhanced question/answer system that will be able to expand its own knowledge base as it is used. (continues to dream away…)
The reason I’m describing all this to you is simple: as a .NET developer, I’m surprised how little there is out there for NLP in the .NET environment. Over the next few articles, I’ll be showing you how to make a … “Natural Language Jackknife”, of sorts, with my own personal take on what algorithms to use and how to implement them. You should be able to use this to tackle common NLP problems, and it could be the basis for your own NLP Library down the line.
I’ll be covering topics like
- Part of Speech (POS) Tagging
- Chunking and (Named) Entity Detection
- Relation Detection between Entities
Which will lead us on to Using/Storing an Ontology, and later onto some Natural Language Generation.
All of this will be viewed from a C#/.NET perspective, and I’ll also be adding some cool stuff along the way (interface, etc…)
[Note]
Along the way, I’ll be making reference to “Natural Language Processing with Python” [Natural Language Processing with Python, O'Reilly, 2009][Web-Based Version] and the fantastic accompanying website[www.nltk.org].
This book, although written with Python (and specifically the Natural Language Toolkit for Python), is an excellent resource for starting from scratch with NLP. It provides a very solid foundation for getting to an intermediate level, along with many tasty nuggets and hooks for further research. Even though there is a free online version, I highly recommend you do these men a favour and buy a printed copy.
As with any code posted here (or that will be posted here): I’m experimenting with a lot of this. My day-job does not involve NLP in any way, shape or form and thus I can’t claim I’m an expert, by any means! Take everything with a pinch of salt. If you’d like to present any improvements to any code here (speed, efficiency or otherwise) please do comment. This will help others understand too, helping everybody.
If you’ve any recommendations, fire me an email or comment below!