NLP for Log Analysis – Tokenization

This is part 1 of a series of posts based on a presentation I gave at the Silicon Valley Cyber Security Meetup on behalf of my company, Insight Engines. Some of the ideas are speculative and I do not know if they are used in practice. If you have any experience applying these techniques on logs, please share in the comments below.

Natural language processing is the art of applying software algorithms to human language. However, the techniques operate on text, and there’s a lot of text that is not natural language. These techniques have been applied to code authorship classification, so why not apply them to log analysis?

Tokenization

To process any kind of text, you need to tokenize it. For natural language, this means splitting the text into sentences and words. But for logs, the tokens are different. Some tokens may be words, but other tokens could be symbols, timestamps, numbers, and more.
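For ordinary prose, NLTK's standard sentence and word tokenizers do this directly. A minimal sketch (the sample sentence is just an illustration):

    import nltk
    nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models

    from nltk.tokenize import sent_tokenize, word_tokenize

    text = 'The server restarted. It came back up in 30 seconds.'
    for sentence in sent_tokenize(text):
        print(word_tokenize(sentence))
    # ['The', 'server', 'restarted', '.']
    # ['It', 'came', 'back', 'up', 'in', '30', 'seconds', '.']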

Another difference is punctuation. For many human languages, punctuation is mostly regular and predictable, although social media and short-form writing have been challenging this assumption.

Logs come in a wide variety of formats. If you only have one type of log, such as Apache access logs, you may be able to tokenize it with a regular expression (a sketch of this follows below). But when you have multiple types of logs, regular expressions can become overwhelming or even unusable. Many logs are written by humans, and there are few rules or conventions when it comes to formatting or use of punctuation. A generic tokenizer could be a useful first pass at parsing arbitrary logs.
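Here's a rough sketch of that regex approach for Apache's Common Log Format; the sample line is the stock example from the Apache documentation, and the named groups are just illustrative labels:

    import re

    # Apache Common Log Format: host, identity, user, timestamp, request, status, size
    CLF = re.compile(
        r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
        r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\S+)'
    )

    line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
    match = CLF.match(line)
    if match:
        print(match.groupdict())
    # {'host': '127.0.0.1', 'ident': '-', 'user': 'frank', 'timestamp': '10/Oct/2000:13:55:36 -0700', ...}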

Whitespace Tokenizer

Tokenizing on whitespace is an obvious thing to try first. Here’s an example log and the result when run through my NLTK tokenization demo.
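In code it looks roughly like this, with a hypothetical syslog-style line standing in for the demo input:

    from nltk.tokenize import WhitespaceTokenizer

    log = 'Nov 28 10:32:01 host1 sshd[1234]: Failed password for invalid user admin from 10.0.0.5 port 22 ssh2'
    print(WhitespaceTokenizer().tokenize(log))
    # ['Nov', '28', '10:32:01', 'host1', 'sshd[1234]:', 'Failed', 'password', 'for',
    #  'invalid', 'user', 'admin', 'from', '10.0.0.5', 'port', '22', 'ssh2']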

As you can see, it does get some tokens, but there’s punctuation in weird places that would have to be cleaned up later.

Wordpunct Tokenizer

My preferred NLTK tokenizer is the WordPunctTokenizer, since it’s fast and the behavior is predictable: split on whitespace and punctuation. But this is a terrible choice for log tokenization.

Logs are filled with punctuation, so this produces far too many tokens.
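Here's the same hypothetical line from above through wordpunct_tokenize (a minimal sketch):

    from nltk.tokenize import wordpunct_tokenize

    log = 'Nov 28 10:32:01 host1 sshd[1234]: Failed password for invalid user admin from 10.0.0.5 port 22 ssh2'
    print(wordpunct_tokenize(log))
    # Splitting on every run of punctuation shatters structured fields:
    # '10:32:01' becomes ['10', ':', '32', ':', '01'] and
    # '10.0.0.5' becomes ['10', '.', '0', '.', '0', '.', '5'].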

Treebank Tokenizer

Out of curiosity, I tried the TreebankWordTokenizer on the log example. This tokenizer implements the Penn Treebank tokenization conventions, which were designed for news text, and it does surprisingly well.

There's no weird punctuation in the tokens: some is separated out, and other punctuation stays within tokens. It all looks pretty logical and useful. This was an unexpected result, and it suggests that logs may often be closer to natural language text than one might think.
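On the same hypothetical line as before (a sketch, not the original demo input):

    from nltk.tokenize import TreebankWordTokenizer

    log = 'Nov 28 10:32:01 host1 sshd[1234]: Failed password for invalid user admin from 10.0.0.5 port 22 ssh2'
    print(TreebankWordTokenizer().tokenize(log))
    # The brackets and the trailing colon come out as separate tokens, while
    # internally punctuated tokens like the timestamp and the IP address stay intact.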

Next Steps

After tokenization, you'll want to do something with the log tokens: maybe extract certain features, and then cluster or classify the log. These topics will be covered in a future post.