R 101

HAPPY HOLIDAYS!!!�⛄��� In the spirit of the coming new year and new beginnings, we created a tutorial for getting started or restarted with R. If you are new to R or have dabbled in R but haven’t used it much recently, then this post is for you. We will focus on data classes and types, as well as data wrangling, and we will provide basic statistics and basic plotting examples using real data. Enjoy!

By C.Wright

As with most programming tutorials, let’s start with a good’ol “Hello World�.

1) First Command
1
print("Hello World")
1
1
## [1] "Hello World"
2) Install and Load Packages and Data

Now we need some data. Packages are collections of functions and/or data. There are published packages that you can use from the community such as these two packages, or you can make your own package for your own private use.

1
2
install.packages("babynames") 
install.packages("titanic")

Now that we have installed the packages, we need to load them.

1
2
library("babynames")
library("titanic")

Each installation of R comes with quite a bit of data! Now we want to load the “quake� data – there are lots of other options.

1
2
data("quakes")
data() #this will list all of the datasets available
3) Assigning Objects

Objects can be many different things ranging from a simple number to a giant matrix, but they refer to things that you can manipulate in R.

1
2
myString <- "Hello World" #notice how we need "" around words, aka strings
myString #take a look at myString
1
1
## [1] "Hello World"
1
2
A <- 14 #now we do not need "" around numbers
A #take a look at A
1
## [1] 14
1
2
A = 5 #can also use the equal sign to assign objects
A #notice how A has changed
1
## [1] 5
4) Assigning Objects with Multiple Elements

Now lets assign a more complex object

1
2
height <- c(5.5, 4.5, 5, 5.6, 5.8, 5.2, 6, 6.2, 5.9, 5.8, 6, 5.9) #this is called a vector
colors_to_use <- c("red", "blue")# a vector of strings
5) Classes

There are a variety of different object classes. We can use the function class() to tell us what class an object belongs to.

1
class(height) #this is a numeric vector
1
## [1] "numeric"
1
class(colors_to_use) #this is a character vector
1
## [1] "character"
1
2
heightdf<-data.frame(height, gender =c("F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M"))
heightdf #take a look at the dataframe
1
2
3
4
5
6
7
8
9
10
11
12
13
## height gender
## 1 5.5 F
## 2 4.5 F
## 3 5.0 F
## 4 5.6 F
## 5 5.8 F
## 6 5.2 F
## 7 6.0 M
## 8 6.2 M
## 9 5.9 M
## 10 5.8 M
## 11 6.0 M
## 12 5.9 M
1
class(heightdf) #check the class
1
## [1] "data.frame"
1
heightdf$height # we can refer to indivdual columns based on the column name
1
## [1] 5.5 4.5 5.0 5.6 5.8 5.2 6.0 6.2 5.9 5.8 6.0 5.9
1
class(heightdf$gender)#here we see a factor(categorical variable - stored in R as with integer levels
1
## [1] "factor"
1
2
logical_variable<-height == heightdf$height #this shows that all the elements in the height column of the heightdf dataframe are equivalent to those of the height vector
class(logical_variable)
1
## [1] "logical"
1
2
matrix_variable <- matrix(height, nrow = 2, ncol = 3)#now we will make a matrix
matrix_variable #take a look at the matrix
1
2
3
## [,1] [,2] [,3]
## [1,] 5.5 5.0 5.8
## [2,] 4.5 5.6 5.2
1
class(matrix_variable)
1
## [1] "matrix"
6) Subsetting Data

Now that we can assign or instantiate objects, let’s try to look at or manipulate specific parts of more complex objects.

Lets create an object of male heights by grabbing rows from heightdf.

1
2
maleIndex<-which(heightdf$gender == "M") #lets try subsetting just the male data out of the heightdf - first we need to determine which rows of the dataframe are male
maleIndex # this is now just a list of rows
1
## [1] 7 8 9 10 11 12
1
2
heightmale<-heightdf[maleIndex,] #now we will use the brackets to grab these rows - we use the comma to indicate that we want rows not columns
heightmale # now this is just the males
1
2
3
4
5
6
7
## height gender
## 7 6.0 M
## 8 6.2 M
## 9 5.9 M
## 10 5.8 M
## 11 6.0 M
## 12 5.9 M

Here is another way using a package called dpylr:

1
install.packages("dplyr")

Here we are creating an object of height data for males over 6 feet.

1
library(dplyr) #load a useful package for subsetting data
1
2
## 
## Attaching package: 'dplyr'
1
2
3
## The following objects are masked from 'package:stats':
## 
## filter, lag
1
2
3
## The following objects are masked from 'package:base':
## 
## intersect, setdiff, setequal, union
1
2
3
4
#heightmale_over6feet <- dplyr::subset(heightdf, gender =="M" & height >=6)#need to use column names to describe what we want to pull out of our data
heightmale_over6feet <- subset(heightdf, gender =="M" & height >=6)#need to use column names to describe what we want to pull out of our data

heightmale_over6feet#now we just have the males 6 feet or over
1
2
3
4
## height gender
## 7 6.0 M
## 8 6.2 M
## 11 6.0 M

Now let’s create an object by grabbing part of an object based on its columns.

1
2
gender1<-heightdf[2]#notice how here we use the brackets but no comma
gender1
1
2
3
4
5
6
7
8
9
10
11
12
13
1
2
3
4
5
6
7
8
9
10
11
12
13
1
2
3
4
5
6
7
8
9
10
11
12
13
## gender
## 1 F
## 2 F
## 3 F
## 4 F
## 5 F
## 6 F
## 7 M
## 8 M
## 9 M
## 10 M
## 11 M
## 12 M
1
2
gender2<-heightdf$gender#this does the same thing #notice this way we loose the data architecture - no longer a dataframe
gender2
1
2
## [1] F F F F F F M M M M M M
## Levels: F M
1
2
gender2<-data.frame(gender =heightdf$gender)# this however stays as a dataframe
gender2
1
2
3
4
5
6
7
8
9
10
11
12
13
1
2
3
4
5
6
7
8
9
10
11
12
13
1
2
3
4
5
6
7
8
9
10
11
12
13
## gender
## 1 F
## 2 F
## 3 F
## 4 F
## 5 F
## 6 F
## 7 M
## 8 M
## 9 M
## 10 M
## 11 M
## 12 M
1
2
genderindex<- which(colnames(heightdf) == "gender") #now wwe will use which() to select columns named gender
genderindex
1
## [1] 2
1
2
gender3 <-heightdf[genderindex]#now we will use the brackets to grab just this column
gender3
1
2
3
4
5
6
7
8
9
10
11
12
13
1
2
3
4
5
6
7
8
9
10
11
12
13
1
2
3
4
5
6
7
8
9
10
11
12
13
## gender
## 1 F
## 2 F
## 3 F
## 4 F
## 5 F
## 6 F
## 7 M
## 8 M
## 9 M
## 10 M
## 11 M
## 12 M
1
identical(gender1, gender2)#lets see if they are identical - this is a helpful function - can only compare two variables at a time
1
## [1] TRUE
1
gender1==gender2 # are they the same? should say true if they are
1
2
3
4
5
6
7
8
9
10
11
12
13
1
2
3
4
5
6
7
8
9
10
11
12
13
## gender
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
## [7,] TRUE
## [8,] TRUE
## [9,] TRUE
## [10,] TRUE
## [11,] TRUE
## [12,] TRUE
1
gender2==gender3 # are they the same? should say true if they are
1
2
3
4
5
6
7
8
9
10
11
12
13
1
2
3
4
5
6
7
8
9
10
11
12
13
## gender
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
## [7,] TRUE
## [8,] TRUE
## [9,] TRUE
## [10,] TRUE
## [11,] TRUE
## [12,] TRUE

Now let’s try to look at/grab specific values.

1
2
height2<-c(6, 5.5, 6, 6, 6, 6, 4.3) #6 and 5.5 are in our orignal height vector but not 4.3
which(height %in% height2) # what of our orignial data is also found in height2
1
## [1] 1 7 11
1
heightdf[which(height %in% height2),] # here we skipped making another variable for the index
1
2
3
4
## height gender
## 1 5.5 F
## 7 6.0 M
## 11 6.0 M
1
2
3
#we can also use a function clalled grep
wanted_heights_index<-grep(5.9, heightdf$height)
heightdf[wanted_heights_index,] #now we just have the samples who are 5.9
1
2
3
## height gender
## 9 5.9 M
## 12 5.9 M
1
2
#say we want to know the value of an element at a particular location
heightdf$height[2] #second value in the height column
1
## [1] 4.5
1
heightdf$height[1:3]# first three valeus in the height column
1
## [1] 5.5 4.5 5.0

This allows you to grab random data points.

1
sample(heightdf$height, 2)#takes a random sample from a vector of the specified number of elements
1
## [1] 5.5 5.9
1
sample.int(1000999, 2)#takes a random sample from 1 to the first whole number specified. Thue number of random values to output is given by the second number.
1
## [1] 161093 430020
8) More Statistical Tests
1
t.test(heightdf$height~heightdf$gender)#try a t test between male height and female height - this is significant!
1
2
3
4
5
6
7
8
9
10
## Welch Two Sample t-test
## 
## data: heightdf$height by heightdf$gender
## t = -3.4903, df = 5.8325, p-value = 0.01359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.194177 -0.205823
## sample estimates:
## mean in group F mean in group M 
## 5.266667 5.966667
1
2
3
#if p<0.05 it is generally considered significant
fit <-aov(heightdf$height~heightdf$gender + heightdf$age)#now lets perform an anova or multiple regression
summary(fit)# here are the results
1
2
3
4
5
6
## Df Sum Sq Mean Sq F value Pr(>F) 
## heightdf$gender 1 1.4700 1.4700 12.167 0.00685 **
## heightdf$age 1 0.1193 0.1193 0.987 0.34638 
## Residuals 9 1.0874 0.1208 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1
anova(fit)# same results
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
## Analysis of Variance Table
## 
## Response: heightdf$height
## Df Sum Sq Mean Sq F value Pr(>F) 
## heightdf$gender 1 1.47000 1.47000 12.1668 0.006851 **
## heightdf$age 1 0.11928 0.11928 0.9872 0.346383 
## Residuals 9 1.08739 0.12082 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1
2
fit <-lm(heightdf$height~heightdf$gender + heightdf$age)# performing as multiple regression
summary(fit) #gives the same result as above - this is an anova but the results are presented differently
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
## 
## Call:
## lm(formula = heightdf$height ~ heightdf$gender + heightdf$age)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -0.83944 -0.07318 0.02918 0.12062 0.36706 
## 
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) 5.01995 0.28600 17.552 2.86e-08 ***
## heightdf$genderM 0.68225 0.20148 3.386 0.00805 ** 
## heightdf$age 0.01065 0.01072 0.994 0.34638 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3476 on 9 degrees of freedom
## Multiple R-squared: 0.5938, Adjusted R-squared: 0.5035 
## F-statistic: 6.577 on 2 and 9 DF, p-value: 0.01736
1
anova(fit)#also gives the same result
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
## Analysis of Variance Table
## 
## Response: heightdf$height
## Df Sum Sq Mean Sq F value Pr(>F) 
## heightdf$gender 1 1.47000 1.47000 12.1668 0.006851 **
## heightdf$age 1 0.11928 0.11928 0.9872 0.346383 
## Residuals 9 1.08739 0.12082 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s do a more classic anova – using a categorical variable with more than two categories.

1
2
3
heightdf$country <-c("British", "French", "British", "Dutch", "Dutch", "French", "Dutch", "Dutch", "British", "French", "British", "French")
fit <-aov(heightdf$height~heightdf$gender + heightdf$age + heightdf$country)
summary(fit)
1
2
3
4
5
6
7
## Df Sum Sq Mean Sq F value Pr(>F) 
## heightdf$gender 1 1.4700 1.4700 19.157 0.00325 **
## heightdf$age 1 0.1193 0.1193 1.554 0.25258 
## heightdf$country 2 0.5503 0.2751 3.586 0.08471 . 
## Residuals 7 0.5371 0.0767 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1
2
fit <-aov(heightdf$height~ heightdf$country)
summary(fit)
1
2
3
## Df Sum Sq Mean Sq F value Pr(>F)
## heightdf$country 2 0.6067 0.3033 1.319 0.315
## Residuals 9 2.0700 0.2300
1
anova(fit)# we see the results of country but not each country
1
2
3
4
5
6
## Analysis of Variance Table
## 
## Response: heightdf$height
## Df Sum Sq Mean Sq F value Pr(>F)
## heightdf$country 2 0.60667 0.30333 1.3188 0.3146
## Residuals 9 2.07000 0.23000
1
TukeyHSD(fit)# this is how we get these results - none are significant
1
2
3
4
5
6
7
8
9
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## Fit: aov(formula = heightdf$height ~ heightdf$country)
## 
## $`heightdf$country`
## diff lwr upr p adj
## Dutch-British 0.30 -0.6468152 1.2468152 0.6627841
## French-British -0.25 -1.1968152 0.6968152 0.7484769
## French-Dutch -0.55 -1.4968152 0.3968152 0.2860337
1
2
fit <-lm(heightdf$height~heightdf$gender + heightdf$age + heightdf$country)
summary(fit) #gives the same result as above - this is an anova just different output
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
## 
## Call:
## lm(formula = heightdf$height ~ heightdf$gender + heightdf$age + 
## heightdf$country)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -0.36474 -0.13454 0.03405 0.15333 0.33097 
## 
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) 5.46051 0.28225 19.346 2.46e-07 ***
## heightdf$genderM 0.71905 0.16131 4.458 0.00294 ** 
## heightdf$age -0.01143 0.01263 -0.905 0.39547 
## heightdf$countryDutch 0.46574 0.26813 1.737 0.12596 
## heightdf$countryFrench -0.25286 0.19590 -1.291 0.23778 
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.277 on 7 degrees of freedom
## Multiple R-squared: 0.7993, Adjusted R-squared: 0.6847 
## F-statistic: 6.971 on 4 and 7 DF, p-value: 0.01375
Free courses and tutorials

Also follow our blog for more helpful posts.

This image came from: https://www.pinterest.com/pin/89790586304535333/

Acknowledgements

This blog post was made possible thanks to:

Reproducibility

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value 
## version R version 3.4.0 (2017-04-21)
## os macOS Sierra 10.12.6 
## system x86_64, darwin15.6.0 
## ui X11 
## language (EN) 
## collate en_US.UTF-8 
## ctype en_US.UTF-8 
## tz America/New_York 
## date 2018-11-19 
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date lib source 
## assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.4.0) 
## babynames * 0.3.0 2017-04-14 [1] CRAN (R 3.4.0) 
## backports 1.1.2 2018-04-18 [1] Github (r-lib/[email protected]) 
## bibtex 0.4.2 2017-06-30 [1] CRAN (R 3.4.1) 
## bindr 0.1 2016-11-13 [1] CRAN (R 3.4.0) 
## bindrcpp 0.2 2017-06-17 [1] CRAN (R 3.4.0) 
## BiocStyle * 2.6.1 2017-11-30 [1] Bioconductor 
## blogdown 0.5.9 2018-03-08 [1] Github (rstudio/[email protected])
## bookdown 0.7 2018-02-18 [1] CRAN (R 3.4.3) 
## cli 1.0.0 2017-11-05 [1] CRAN (R 3.4.2) 
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.4.1) 
## curl 3.2 2018-03-28 [1] CRAN (R 3.4.4) 
## digest 0.6.15 2018-01-28 [1] CRAN (R 3.4.3) 
## dplyr * 0.7.4 2017-09-28 [1] CRAN (R 3.4.2) 
## evaluate 0.11 2018-07-17 [1] CRAN (R 3.4.4) 
## glue 1.3.0 2018-07-17 [1] CRAN (R 3.4.4) 
## htmltab * 0.7.1 2016-12-29 [1] CRAN (R 3.4.0) 
## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.4.0) 
## httr 1.3.1 2017-08-20 [1] CRAN (R 3.4.1) 
## jsonlite 1.5 2017-06-01 [1] CRAN (R 3.4.0) 
## knitcitations * 1.0.8 2017-07-04 [1] CRAN (R 3.4.1) 
## knitr 1.20 2018-02-20 [1] CRAN (R 3.4.3) 
## lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.4.4) 
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.4.0) 
## pillar 1.2.1 2018-02-27 [1] CRAN (R 3.4.3) 
## pkgconfig 2.0.1 2017-03-21 [1] CRAN (R 3.4.0) 
## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.4.0) 
## R6 2.2.2 2017-06-17 [1] CRAN (R 3.4.0) 
## Rcpp 0.12.16 2018-03-13 [1] CRAN (R 3.4.4) 
## RefManageR 1.2.0 2018-04-25 [1] CRAN (R 3.4.4) 
## reshape2 * 1.4.3 2017-12-11 [1] CRAN (R 3.4.3) 
## rlang 0.2.0 2018-02-20 [1] CRAN (R 3.4.3) 
## rmarkdown 1.10 2018-06-11 [1] CRAN (R 3.4.4) 
## rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.4.3) 
## sessioninfo * 1.1.1 2018-11-05 [1] CRAN (R 3.4.4) 
## stringi 1.2.4 2018-07-20 [1] CRAN (R 3.4.4) 
## stringr 1.3.1 2018-05-10 [1] CRAN (R 3.4.4) 
## tibble 1.4.2 2018-01-22 [1] CRAN (R 3.4.3) 
## titanic * 0.1.0 2015-08-31 [1] CRAN (R 3.4.0) 
## utf8 1.1.3 2018-01-03 [1] CRAN (R 3.4.3) 
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.4.4) 
## xfun 0.3 2018-07-06 [1] CRAN (R 3.4.4) 
## XML 3.98-1.10 2018-02-19 [1] CRAN (R 3.4.3) 
## xml2 1.2.0 2018-01-24 [1] CRAN (R 3.4.3) 
## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.4.4) 
## 
## [1] /Library/Frameworks/R.framework/Versions/3.4/Resources/library

Related