Named Entity Recognition and Classification with Scikit-Learn

By Susan Li, Sr. Data Scientist

Named Entity Recognition and Classification (NERC) is the process of recognizing information units in unstructured text, such as names (person, organization and location names) and numeric expressions (time, date, money and percent expressions). The goal is to develop practical, domain-independent techniques that detect named entities automatically and with high accuracy.

Last week, we gave an introduction to Named Entity Recognition (NER) in NLTK and SpaCy. Today, we go a step further and train machine learning models for NER using some of Scikit-Learn’s libraries. Let’s get started!

The Data

The data is a feature-engineered corpus annotated with IOB and POS tags, and it can be found on Kaggle. Let’s take a quick peek at the first several rows of the data.

Figure 1

Essential info about entities:

geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon

Inside–outside–beginning (tagging)

IOB (short for inside, outside, beginning) is a common format for tagging tokens.

An I- prefix before a tag indicates that the tag is inside a chunk.
A B- prefix before a tag indicates that the tag is the beginning of a chunk.
An O tag indicates that a token belongs to no chunk (outside).
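To make the tagging scheme concrete, here is a minimal sketch that groups IOB-tagged tokens back into entity chunks. The helper name `iob_to_chunks` and the example sentence are made up for illustration and are not part of the data set.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) chunks."""
    chunks = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one first.
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open chunk.
            current_tokens.append(token)
        else:
            # An O tag closes any open chunk. A stray I- tag with no
            # matching open chunk is simply dropped in this sketch.
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks


# Hypothetical example sentence, tagged by hand.
tokens = ["Angela", "Merkel", "visited", "New", "York", "on", "Monday"]
tags = ["B-per", "I-per", "O", "B-geo", "I-geo", "O", "B-tim"]
print(iob_to_chunks(tokens, tags))
```

The B- prefix is what lets two adjacent entities of the same type be told apart, which a plain "entity or not" label could not do.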

The entire data set cannot fit into the memory of a single computer, so we select the first 100,000 records and use out-of-core learning algorithms to fetch and process the data efficiently.
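One way to keep only the first records is to cap the number of rows pandas parses. The sketch below uses a tiny in-memory CSV as a stand-in for the real file; the column names other than "Sentence #" are assumptions, and `nrows=3` plays the role of the 100,000 used in this article.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file. Column names other than
# "Sentence #" are assumptions made for this sketch.
csv_text = """Sentence #,Word,POS,Tag
Sentence: 1,London,NNP,B-geo
,is,VBZ,O
,big,JJ,O
Sentence: 2,Paris,NNP,B-geo
,too,RB,O
"""

# nrows caps how many records are parsed, so the rest of the file
# is never loaded into memory.
df = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(df.shape)
```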

Figure 2

Figure 3

Data Preprocessing

We notice that there are many NaN values in the ‘Sentence #’ column, so we fill each NaN with the preceding value.
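In pandas, filling from the preceding value is a forward fill. A minimal sketch on toy data (the rows below are invented; only the "Sentence #" column name comes from the data set):

```python
import numpy as np
import pandas as pd

# Toy frame: only the first token of each sentence carries the sentence id.
df = pd.DataFrame({
    "Sentence #": ["Sentence: 1", np.nan, np.nan, "Sentence: 2", np.nan],
    "Word": ["London", "is", "big", "Paris", "too"],
})

# Forward fill propagates the last seen sentence id down to the NaN rows.
df["Sentence #"] = df["Sentence #"].ffill()
print(df)
```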

(4544, 10922, 17)

We have 4,544 sentences that contain 10,922 unique words and are tagged with 17 tags. The tags are not evenly distributed.
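Counts like these, and the tag distribution, can be computed with `nunique` and `value_counts`. The sketch below runs on invented toy rows, so its numbers are not those of the real data; the "Word" and "Tag" column names are assumptions.

```python
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame({
    "Sentence #": ["Sentence: 1"] * 3 + ["Sentence: 2"] * 3,
    "Word": ["London", "is", "big", "Paris", "is", "old"],
    "Tag": ["B-geo", "O", "O", "B-geo", "O", "O"],
})

# (number of sentences, number of unique words, number of distinct tags)
counts = (
    df["Sentence #"].nunique(),
    df["Word"].nunique(),
    df["Tag"].nunique(),
)
print(counts)

# Number of tokens per tag, most frequent first.
tag_distribution = df["Tag"].value_counts()
print(tag_distribution)
```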

Figure 4

The following code transforms the text data into vectors using DictVectorizer and then splits it into train and test sets.

((67000, 15507), (67000,))

Out-of-core Algorithms

We will try some of the out-of-core algorithms that support the partial_fit method. They are designed to process data that is too large to fit into a single computer’s memory.
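The idea behind partial_fit is to feed the model one batch at a time instead of the whole matrix. Here is a minimal sketch with a Perceptron on synthetic data; the batch size, the random features and the two toy labels are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Synthetic stand-in for the vectorized features and tags.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = np.where(X[:, 0] > 0.5, "B-geo", "O")

# partial_fit needs the full list of classes up front, because a single
# batch may not contain every tag.
classes = np.unique(y)

clf = Perceptron(random_state=0)
batch_size = 50
for start in range(0, X.shape[0], batch_size):
    batch = slice(start, start + batch_size)
    # Only this batch has to be in memory while the model is updated.
    clf.partial_fit(X[batch], y[batch], classes=classes)

preds = clf.predict(X[:5])
print(preds)
```

In a real out-of-core setup the batches would be read from disk one after another rather than sliced from an array that is already in memory.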

Perceptron

Figure 5

Tag “O” (outside) is the most common tag, and including it would make our results look much better than they actually are. So we remove tag “O” when we evaluate classification metrics.
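One way to leave “O” out of the metrics is to pass an explicit list of labels to `classification_report`. The true and predicted tags below are invented for illustration and are not results from any of the models in this article.

```python
from sklearn.metrics import classification_report

# Invented true and predicted tags.
y_true = ["O", "O", "B-geo", "I-geo", "O", "B-per", "O", "O"]
y_pred = ["O", "O", "B-geo", "O", "O", "B-geo", "O", "O"]

# Report on every tag except "O".
labels = sorted(set(y_true) - {"O"})

# zero_division=0 avoids warnings for tags that were never predicted.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)
print(report)
```

With “O” included, the many correct “O” predictions would dominate the averages and hide the mistakes made on the entity tags.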

Figure 6

Figure 7

Linear classifiers with SGD training

Figure 8

Figure 9

Naive Bayes classifier for multinomial models

Figure 10

Figure 11

Passive Aggressive Classifier

Figure 12

Figure 13

None of the above classifiers produced satisfactory results. Classifying named entities with regular classifiers is clearly not going to be easy.