The series “Data Mining with Python on Medical Datasets for Data Mining” is a series in which several data mining techniques are highlighted. The series are written in collaboration with John Snow Labs which provided me the medical datasets. In this article basic Text Mining techniques will be highlighted and some of the results are presented.
By the way, if you are interested in Deep Learning you should definitely read this article on implementing a GRU in Python using Tensorflow.
Introduction
A medical dataset is given which contains written diagnoses of people. The goal of this article is to extract causal relationships from these diagnoses. For example, a diagnosis could be that Bob has broken his leg due to falling from a cliff. Then the cause of Bob’s broken leg is the falling from a cliff. In order to extract such a patterns, we need to dive a little into text mining. Before that can happen, we need to clean the data. A modified sample of the original dataset which will be used in this article can be downloaded here:
Download the dataset sample here
Data preprocessing
First, we need to load the data into memory. This can be done easily by using pandas (Python Data Analysis Library):
Now the text mining part can begin!
Text mining
Text mining is data mining applied on text. In this article, some of the techniques of text mining are applied on our toy dataset.
Information extraction
The main goal of this article is to extract causal relations from the descriptions. This could be done by using entity recognition and then applying information extraction but that is outside the scope of this article. Here, we will use a simple template based alternative.
First, we extract all seperate clauses from a description. For example, if we have the description “rain causing wet grass, wet grass due to sprinkler”, then there are two clauses: “rain causing wet grass” and “wet grass due to sprinkler”. Notice that there are two types of causal relations. “A causing B” and “B due to A”, where A is the subject of the causal relation and B is the caused entity (the event). Note that A and B are swapped by the relation label “due to”. This is taken into account in the information extraction method. The relation labels “causing” and “due to” are both replaced by the relation token “->” (for simplicity). The information extraction is quite simple this way:
Results
The results are displayed using the following piece of code:
For the toy dataset, the following results were obtained:
So, our text mining application works on this particular dataset. You see for example that “hot water” causes “burn”. Notice that it will not work in general! Therefore, often applications are tested using a separate train dataset and test dataset (and there are even better testing techniques).
Exercise
Extend the given dataset and add a new relation. Extend the application such that it can extract the new relation from the dataset.