Data Cleaning, Categorization and Normalization

Definition of Clean Data

Happy families are all alike; every unhappy family is unhappy in its own way – Leo Tolstoy

Like families, clean datasets are all alike but every messy dataset is unreadable by our modeling algorithms.Clean datasets provide a standardized way to link the structure of a dataset with its semantics.

We will take the following text file with dirty data:

%% DataSonu ,1861, 1892, maleArun , 1892, M1871, Monica, 1937, Female1880, RUCHI, FGeetu, 1850, 1950, femBaLa, 1893,1863% Names, birth dates, death dates, gender

Let us start with tidying the above data. We’ll adopt following steps for the same:

  1. Read the unclean data from the text file and analyse the structure, content and quality of data.

The following functions lets us read the data that is technically correct or close to it:

  • read.table

  • read.csv

  • read.csv2

  • read.delim

  • read.delim2

When the rows in the data file are not uniformly formatted, we can consider reading in the text line-by-line and transforming the data to rectangular text ourself.

1
2
3
4
5
6
7
8
## [1] "%% Data " 
## [2] " Sonu ,1861, 1892, male " 
## [3] "Arun , 1892, M" 
## [4] "1871, Monica, 1937, Female" 
## [5] "1880, RUCHI, F" 
## [6] "Geetu, 1850, 1950, fem" 
## [7] "BaLa, 1893,1863" 
## [8] "% Names, birth dates, death dates, gender"
1
## [1] "character"
1
## chr [1:8] "%% Data " " Sonu ,1861, 1892, male " ...

The variable txt is a vercot of type “character” having 8 elements, equal to number of lines in our txt file.

  1. Delete the irrelevant / duplicate data. This improves data protection, increases the speed of processing and reduces the overall costs.

In our eg., we will delete the comments from the data. Comments are the lines followed by “%” sign using following code.

Following is the code:

1
2
3
i=grep("%", txt)
txt=txt[-i]
txt
1
2
3
## [1] " Sonu ,1861, 1892, male " "Arun , 1892, M" 
## [3] "1871, Monica, 1937, Female" "1880, RUCHI, F" 
## [5] "Geetu, 1850, 1950, fem" "BaLa, 1893,1863"
  1. Split lines into seperate fields. This can be done using strsplit function:
1
2
fields=strsplit(txt,",")
fields
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
## [[1]]
## [1] " Sonu " "1861" " 1892" " male " 
## 
## [[2]]
## [1] "Arun " " 1892" " M" 
## 
## [[3]]
## [1] "1871" " Monica" " 1937" " Female" 
## 
## [[4]]
## [1] "1880" " RUCHI" " F" 
## 
## [[5]]
## [1] "Geetu" " 1850" " 1950" " fem" 
## 
## [[6]]
## [1] "BaLa" " 1893" "1863"
1
fields[[1]][2]
1
## [1] "1861"
1
class(fields)
1
## [1] "list"

Here, txt was a vector of type characters, while, after splitting we have our output stored in the variable “fields” which is of “list” type.

  1. Standardize and categorize fields

It is a crucial process where the data is defined, formatted, represented and structured in all data layers. Development of schema or attributes is involved in this process.

The goal of this step is to make sure every row has same number of fields, and the fields are in the same order. In read.table command, any fields that are less than the maximum number of fields in a row get appended by NA. One advantage of do-it-yourself approach is that we don’t have to make this assumption. The easiest way to standardize rows is to write a function that takes a single character vector as input, and assigns the values in the right order.

1
2
3
4
5
6
7
8
9
10
11
12
Categorize <- function(x)
{ 
out <- character(length(x))
i<-grep("[[:alpha:]]",x) 
out[1] <- x[i[1]] 
out[4] = ifelse(length(i)==2, x[i[2]], NA)
i <- which(as.numeric(x) < 1890) 
out[2] <- ifelse(length(i)>0, x[i], NA) 
i <- which(as.numeric(x) > 1890) 
out[3] <- ifelse(length(i)>0, x[i], NA)
return(out) 
}

The above function takes each line in the txt as the input (as x). The function returns a vector of class character in a standard order: Name, Birth Year, Death Year, Gender. Where-ever, the field will be missing, NA will be introduced. The grep statement is used to get the location of alphabetical characters in our input variable x. There are 2 kinds of alphabetical characters in our input string, one representing name and other representing sex. Both are accordingly assigned in the out vector at 1st and 4th positions. The year of birth and year of death can be recognized as “less than 1890” or “greater than 1890” respectively.

To retrieve the fields for each row in the example, we need to apply this function to every row of fields.

stdfields=lapply(fields, Categorize)

1
stdfields
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
## [[1]]
## [1] " Sonu " "1861" " 1892" " male " 
## 
## [[2]]
## [1] "Arun " NA " 1892" " M" 
## 
## [[3]]
## [1] " Monica" "1871" " 1937" " Female" 
## 
## [[4]]
## [1] " RUCHI" "1880" NA " F" 
## 
## [[5]]
## [1] "Geetu" " 1850" " 1950" " fem" 
## 
## [[6]]
## [1] "BaLa" "1863" " 1893" NA

Our Categorize function here is quite fragile, it crashes for eg. when our input vector contains 3 or more character fields. Only the data analyst should determine how generalized should our categorize function be.

  1. Transform to Data.Frame type

First we will copy the elements of list to a matrix which is then coerced into a data-frame.

1
2
3
4
a=unlist(stdfields)
M=matrix(a, nrow=length(stdfields), byrow=TRUE)
colnames(M)=c("Name", "Birth_Date", "Death_Date","Gender")
M
1
2
3
4
5
6
7
## Name Birth_Date Death_Date Gender 
## [1,] " Sonu " "1861" " 1892" " male " 
## [2,] "Arun " NA " 1892" " M" 
## [3,] " Monica" "1871" " 1937" " Female"
## [4,] " RUCHI" "1880" NA " F" 
## [5,] "Geetu" " 1850" " 1950" " fem" 
## [6,] "BaLa" "1863" " 1893" NA
1
2
dat=as.data.frame(M, stringsAsFactors = FALSE)
dat
1
2
3
4
5
6
7
## Name Birth_Date Death_Date Gender
## 1 Sonu 1861 1892 male 
## 2 Arun <NA> 1892 M
## 3 Monica 1871 1937 Female
## 4 RUCHI 1880 <NA> F
## 5 Geetu 1850 1950 fem
## 6 BaLa 1863 1893 <NA>

Here, we have made a matrix by row. The number of rows in the matrix will be same as number of elements in stdfields. There-after we have converted the matrix in data.frame formats.

  1. Data Normalization

It is the systematic process to ensure the data structure is suitable or serves the purpose. Here the undesirable characteristics of the data are eliminated or updated to improve the consistency and the quality. The goal of this process is to reduce redundancy, inaccuracy and to organize the data.

String normalization techniques are aimed at transforming a variety of strings to a smaller set of string values which are more easily processed.

  • Remove the white spaces. We can use str_trim function from stringr library.
1
2
3
4
library(stringr)
dat=sapply(dat,str_trim)
dat=as.data.frame(dat, stringsasFactors=FALSE)
dat
1
2
3
4
5
6
7
## Name Birth_Date Death_Date Gender
## 1 Sonu 1861 1892 male
## 2 Arun <NA> 1892 M
## 3 Monica 1871 1937 Female
## 4 RUCHI 1880 <NA> F
## 5 Geetu 1850 1950 fem
## 6 BaLa 1863 1893 <NA>

sapply function will return object of type matrix. So we again recovert it into data-frame type

  • Converting all the letters in the column “Name” to upper-case for standardization
1
2
dat$Name=toupper(dat$Name)
dat
1
2
3
4
5
6
7
## Name Birth_Date Death_Date Gender
## 1 SONU 1861 1892 male
## 2 ARUN <NA> 1892 M
## 3 MONICA 1871 1937 Female
## 4 RUCHI 1880 <NA> F
## 5 GEETU 1850 1950 fem
## 6 BALA 1863 1893 <NA>
  • Normalize the gender variable We have to normalize the gender variable. It is in 5 different formats. Following are the two ways to normalize gender variable:

One is using ^ operator. This will find the words in the Gender column which begin with m and f respectively.

Following is the code:

1
2
3
4
5
6
datf=dat #Making a copy of the dataframe
locm=grep("^m", dat$Gender, ignore.case = TRUE)
dat$Gender[locm]="M"
locf=grep("^f", dat$Gender, ignore.case = TRUE)
dat$Gender[locf]="F"
dat
1
2
3
4
5
6
7
## Name Birth_Date Death_Date Gender
## 1 SONU 1861 1892 M
## 2 ARUN <NA> 1892 M
## 3 MONICA 1871 1937 F
## 4 RUCHI 1880 <NA> F
## 5 GEETU 1850 1950 F
## 6 BALA 1863 1893 <NA>

There is another method of approximate string matching using string distances. A string distance measures how much 2 strings differ from each other. It measures how many operations are required to turn one string to another. Eg.

1
adist("pqr","qpr")
1
2
## [,1]
## [1,] 2

So, here there are 2 operations required to convert pqr to qpr

  • replace q: pqr -> ppr

  • replace p: ppr -> qpr

We’ll use adist function in the gender column in the following manner:

1
2
3
codes=c("male", "female")
D=adist(datf$Gender, codes)
D
1
2
3
4
5
6
7
## [,1] [,2]
## [1,] 0 2
## [2,] 4 6
## [3,] 2 1
## [4,] 4 6
## [5,] 4 3
## [6,] NA NA

First is male – 0 replacement away from “male” and 2 replacements away from “female”.Second is M is 4 replacements away from “male” and 6 replacements away from “female”.Third is Female 2 replacements away from “male” and 1 replacement away from “female”.

We can use which.min() function on each row to get minimum distance from the codes.

1
2
i=apply(D,1,which.min)
i
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 2
## 
## [[4]]
## [1] 1
## 
## [[5]]
## [1] 2
## 
## [[6]]
## integer(0)

i contains the column number where the distance is minimum from our codes. If the distance from column number 1 is minimum, then we will substitute gender with code[1], and if the distance from column number 2 is minimum, then we will substitute the gender with code[2]. Following is the code:

1
2
3
4
5
6
7
i[[6]]=0
datf$Gender=as.character(datf$Gender)
for(j in 1:nrow(datf))
{
datf$Gender[j]=ifelse(i[[j]]==1, codes[1],(ifelse(i[[j]]==2, codes[2], NA)))
}
datf
1
2
3
4
5
6
7
## Name Birth_Date Death_Date Gender
## 1 SONU 1861 1892 male
## 2 ARUN <NA> 1892 male
## 3 MONICA 1871 1937 female
## 4 RUCHI 1880 <NA> male
## 5 GEETU 1850 1950 female
## 6 BALA 1863 1893 <NA>
  • Normalize the data-types

Let’s retreive the classes of each column in data-frame dat:

1
sapply(dat, class)
1
2
## Name Birth_Date Death_Date Gender 
## "character" "factor" "factor" "factor"

We will convert Birth and Death columns to Numeric data-types using transform function.

1
2
dat=transform(datf, Birth_Date=as.numeric(as.character(Birth_Date)), Death_Date=as.numeric(as.character(Death_Date)))
dat
1
2
3
4
5
6
7
## Name Birth_Date Death_Date Gender
## 1 SONU 1861 1892 male
## 2 ARUN NA 1892 male
## 3 MONICA 1871 1937 female
## 4 RUCHI 1880 NA male
## 5 GEETU 1850 1950 female
## 6 BALA 1863 1893 <NA>

The data as represented in dat dataframe is clean, standardized, categorized and normalized.