How to work with strings in base R – An overview of 20+ methods for daily use.

In this post in the R:case4base series we will look at string manipulation with base R, and provide an overview of a wide range of functions for our string working needs.

We will use simple examples to learn to perform basic string operations, concatenate strings, work with substrings, switch cases, quote, find and replace within strings and more. Some interesting bonuses will also be included.

As always, some popular alternatives to base R will also be suggested and many useful references provided for further reading.

This post is aimed to serve as an overview of functionality provided by base R to work with strings. Note that the term “string” is used somewhat loosely and refers to character vectors and character strings. In R documentation, references to character string, refer to character vectors of length 1.

Also since this is an overview, we will not examine the details of the functions, but rather list examples with simple, intuitive explanations trading off technical precision.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# String constants can be assigned using
# double quotes 
a <- "this is a character string"
# or single quotes 
b <- 'this is a character string, too'
# To use literal quotes, we can escape with `

In this post in the R:case4base series we will look at string manipulation with base R, and provide an overview of a wide range of functions for our string working needs.

We will use simple examples to learn to perform basic string operations, concatenate strings, work with substrings, switch cases, quote, find and replace within strings and more. Some interesting bonuses will also be included.

As always, some popular alternatives to base R will also be suggested and many useful references provided for further reading.

This post is aimed to serve as an overview of functionality provided by base R to work with strings. Note that the term “string” is used somewhat loosely and refers to character vectors and character strings. In R documentation, references to character string, refer to character vectors of length 1.

Also since this is an overview, we will not examine the details of the functions, but rather list examples with simple, intuitive explanations trading off technical precision.

: c <- "this is \"it\"" # To make a character vector with multiple elements: d <- c("this", "vector", "has", "five", "elements") # To get the length of a character vector # (how many elements are in a character vector) length(d)
1
## [1] 5
1
2
3
# To get the number of characters in elemets of a vector
# ("how many characters in each of the elements")
nchar(d)
1
## [1] 4 6 3 4 8
1
2
# To create a missing character value
NA_character_
1
## [1] NA
1
2
# To test if an object is a character vector
is.character("is this a character vector?")
1
## [1] TRUE
1
2
3
4
5
6
7
# To convert other objects to character vectors
# Can surprise the unwary
as.character(c(
 42,
 Sys.time(),
 factor("A", levels = LETTERS)
))
1
## [1] "42" "1543060557.44415" "1"
1
2
# One of the ways to output a vector is `cat`
cat("Show me this")
1
## Show me this
1
2
3
# To include line breaks use `"\n"`
# To include tabs use `"\t"`:
cat("Break\ta\ta\nline")
1
2
## Break a a
## line
1
2
3
4
5
6
7
8
9
10
11
12
13
# When in doubt about an object
# str or summary may help
weirdList <- list(
 "What is this?",
 Sys.time(),
 b = 5L,
 c = c("one", 2),
 d = factor(c("red", "blue")),
 e = NA_character_,
 f = NA_integer_
)

str(weirdList)
1
2
3
4
5
6
7
8
## List of 7
## $ : chr "What is this?"
## $ : POSIXct[1:1], format: "2018-11-24 11:55:57"
## $ b: int 5
## $ c: chr [1:2] "one" "2"
## $ d: Factor w/ 2 levels "blue","red": 2 1
## $ e: chr NA
## $ f: int NA
1
summary(weirdList)
1
2
3
4
5
6
7
8
## Length Class Mode 
## 1 -none- character
## 1 POSIXct numeric 
## b 1 -none- numeric 
## c 2 -none- character
## d 2 factor numeric 
## e 1 -none- character
## f 1 -none- numeric

String concatenation is the process of “joining” two strings together and one the most common operations.

Simple concatenation

1
2
# We will use these vectors for our examples:
1:3
1
## [1] 1 2 3
1
month.name
1
2
3
## [1] "January" "February" "March" "April" "May" 
## [6] "June" "July" "August" "September" "October" 
## [11] "November" "December"
1
2
3
# Use paste to concatenate
# R recycles 1:3 4 times to fit the length of month.name
paste(1:3, month.name)
1
2
3
## [1] "1 January" "2 February" "3 March" "1 April" "2 May" 
## [6] "3 June" "1 July" "2 August" "3 September" "1 October" 
## [11] "2 November" "3 December"
1
2
3
# Specify the sep argument to 
# separate the elements differently
paste(1:3, month.name, sep = ": ")
1
2
3
1
2
3
## [1] "1: January" "2: February" "3: March" "1: April" 
## [5] "2: May" "3: June" "1: July" "2: August" 
## [9] "3: September" "1: October" "2: November" "3: December"
1
2
# A shorthard for sep = ""
paste0(1:3, month.name)
1
2
3
## [1] "1January" "2February" "3March" "1April" "2May" 
## [6] "3June" "1July" "2August" "3September" "1October" 
## [11] "2November" "3December"
1
2
# Alternatively, sprintf is very useful
sprintf("%s: %s", 1:3, month.name)
1
2
3
1
2
3
## [1] "1: January" "2: February" "3: March" "1: April" 
## [5] "2: May" "3: June" "1: July" "2: August" 
## [9] "3: September" "1: October" "2: November" "3: December"

Concatenate a vector into a single character string

1
2
3
# Provide the collapse argument to paste
# to get a character string (length 1 vector):
paste(1:3, month.name, sep = ": ", collapse = ", ")
1
1
## [1] "1: January, 2: February, 3: March, 1: April, 2: May, 3: June, 1: July, 2: August, 3: September, 1: October, 2: November, 3: December"
1
2
# Or, use toString
toString(paste(1:3, month.name, sep = ": "))
1
1
## [1] "1: January, 2: February, 3: March, 1: April, 2: May, 3: June, 1: July, 2: August, 3: September, 1: October, 2: November, 3: December"

String lengths

1
2
# How many elements does a vector have?
length(month.name)
1
## [1] 12
1
2
3
# To get the number of characters in elemets of a vector
# ("how many characters in each of the elements?")
nchar(month.name)
1
## [1] 7 8 5 5 3 4 4 6 9 7 8 8
1
2
# Are the elements non-empty strings?
nzchar(month.name)
1
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Switching to upper/lower case

1
2
# Switch to all lower case
tolower(month.name)
1
2
3
1
2
3
## [1] "january" "february" "march" "april" "may" 
## [6] "june" "july" "august" "september" "october" 
## [11] "november" "december"
1
2
# Switch to all upper case
toupper(month.name)
1
2
3
1
2
3
## [1] "JANUARY" "FEBRUARY" "MARCH" "APRIL" "MAY" 
## [6] "JUNE" "JULY" "AUGUST" "SEPTEMBER" "OCTOBER" 
## [11] "NOVEMBER" "DECEMBER"
1
2
# Casefold is a wrapper for S-PLUS compatibility
casefold(month.name, upper = FALSE)
1
2
3
1
2
3
## [1] "january" "february" "march" "april" "may" 
## [6] "june" "july" "august" "september" "october" 
## [11] "november" "december"
1
casefold(month.name, upper = TRUE)
1
2
3
1
2
3
## [1] "JANUARY" "FEBRUARY" "MARCH" "APRIL" "MAY" 
## [6] "JUNE" "JULY" "AUGUST" "SEPTEMBER" "OCTOBER" 
## [11] "NOVEMBER" "DECEMBER"
1
2
# Also, custom translation:
chartr("OIZEASGTC", "01234567(" , toupper(month.name))
1
2
3
## [1] "J4NU4RY" "F3BRU4RY" "M4R(H" "4PR1L" "M4Y" 
## [6] "JUN3" "JULY" "4U6U57" "53P73MB3R" "0(70B3R" 
## [11] "N0V3MB3R" "D3(3MB3R"

Removing white spaces

1
2
# Remove all leading and trailing whitespaces
trimws(" This has trailing spaces. ")
1
## [1] "This has trailing spaces."
1
2
# Remove leading whitespaces
trimws(" This has trailing spaces. ", which = "left")
1
## [1] "This has trailing spaces. "
1
2
# Remove trailing whitespaces
trimws(" This has trailing spaces. ", which = "right")
1
## [1] " This has trailing spaces."

Encoding conversion

1
2
# Convert a character vector between encodings
iconv("šibrinkuje", "UTF-8", "ASCII", "?")
1
## [1] "??ibrinkuje"

Quoting

1
2
# Quoting text for fancier priting:
sQuote(month.name)
1
2
3
## [1] "'January'" "'February'" "'March'" "'April'" "'May'" 
## [6] "'June'" "'July'" "'August'" "'September'" "'October'" 
## [11] "'November'" "'December'"
1
dQuote(month.name)
1
2
3
## [1] "\"January\"" "\"February\"" "\"March\"" "\"April\"" 
## [5] "\"May\"" "\"June\"" "\"July\"" "\"August\"" 
## [9] "\"September\"" "\"October\"" "\"November\"" "\"December\""
1
2
3
4
5
# Not to be confused with quoting strings for passing to OS shell
system(paste("echo", shQuote("Weird\nstuff")))

# Also not be confused with quoting expressions
str(quote(1 + 1))
1
## language 1 + 1

Retrieving and working with substrings

1
2
# Get the first three characters from all the month.names
substr(month.name, 1, 3)
1
2
1
2
1
2
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
1
2
# Get the last three characters from all the month.names
substr(month.name, nchar(month.name) - 2, nchar(month.name))
1
2
## [1] "ary" "ary" "rch" "ril" "May" "une" "uly" "ust" "ber" "ber" "ber"
## [12] "ber"
1
2
# Wrapper around substr for S Compability:
substring(month.name, 1, 3)
1
2
1
2
1
2
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
1
2
# Check whether elements start with a string
startsWith(month.name, "J")
1
2
## [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [12] FALSE
1
2
# Check whether elements end with a string
endsWith(month.name, "ember")
1
2
1
2
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [12] TRUE
1
2
# Trim character strings to specified display widths.
strtrim(month.name, 3)
1
2
1
2
1
2
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
1
2
# Abbreviate strings to at least minlength characters
abbreviate(month.name, minlength = 3)
1
2
3
4
## January February March April May June July 
## "Jnr" "Fbr" "Mrc" "Apr" "May" "Jun" "Jly" 
## August September October November December 
## "Ags" "Spt" "Oct" "Nvm" "Dcm"

Pattern matching and replacement using regular expressions in an extremely powerful feature, however it is out of scope of this overview to cover them.

Check the references for better resources if you are interested. A lot more useful detail can also be found in R’s documentation.

The following is just to show very basic use and list useful functions.

Replace substring with other strings

1
2
3
4
5
myStrings <- paste(1:3, month.name, sep = ". ")

# Replace all ones with zeros:
# fixed will match the first argument as is
gsub("1", "0", myStrings, fixed = TRUE)
1
2
3
## [1] "0. January" "2. February" "3. March" "0. April" 
## [5] "2. May" "3. June" "0. July" "2. August" 
## [9] "3. September" "0. October" "2. November" "3. December"
1
2
# Replace only the first "a" in each for "A"
sub("a", "A", myStrings, fixed = TRUE)
1
2
3
## [1] "1. JAnuary" "2. FebruAry" "3. MArch" "1. April" 
## [5] "2. MAy" "3. June" "1. July" "2. August" 
## [9] "3. September" "1. October" "2. November" "3. December"
1
2
3
# Replace any number with 0
# note that the fixed argument is now FALSE (default)
gsub("[0-9]", "0", myStrings)
1
2
3
## [1] "0. January" "0. February" "0. March" "0. April" 
## [5] "0. May" "0. June" "0. July" "0. August" 
## [9] "0. September" "0. October" "0. November" "0. December"
1
2
# Replace literal dots with 0
gsub(".", "0", myStrings, fixed = TRUE)
1
2
3
## [1] "10 January" "20 February" "30 March" "10 April" 
## [5] "20 May" "30 June" "10 July" "20 August" 
## [9] "30 September" "10 October" "20 November" "30 December"
1
2
# This will replace all characters with zeros
gsub(".", "0", myStrings)
1
2
3
## [1] "0000000000" "00000000000" "00000000" "00000000" 
## [5] "000000" "0000000" "0000000" "000000000" 
## [9] "000000000000" "0000000000" "00000000000" "00000000000"

Check if a pattern is present within elements of a character vector

1
2
3
4
myStrings <- paste(1:3, month.name, sep = ". ")

# Is a pattern present (returns a logical vector)?
grepl("ember", myStrings)
1
2
1
2
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [12] TRUE
1
2
# In which elements is a pattern present (returns indices)?
grep("ember", myStrings)
1
## [1] 9 11 12
1
2
# In which elements is a pattern present (returns the values)?
grep("ember", myStrings, value = TRUE)
1
## [1] "3. September" "2. November" "3. December"

Check where the matches are within the elements of a character vector

1
2
3
4
5
myStrings <- paste(1:3, month.name, sep = ". ")

# Where is the first "a" located in each of the elements?
# pattern if not found in that element, returns -1
regexpr("a", myStrings)
1
2
3
4
5
## [1] 5 9 5 -1 5 -1 -1 -1 -1 -1 -1 -1
## attr(,"match.length")
## [1] 1 1 1 -1 1 -1 -1 -1 -1 -1 -1 -1
## attr(,"useBytes")
## [1] TRUE
1
2
3
# Where are all the "a" located in each of the elements?
# If pattern not found in that element, returns -1
gregexpr("a", myStrings)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
## [[1]]
## [1] 5 8
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] 9
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 5
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[4]]
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[5]]
## [1] 5
## attr(,"match.length")
## [1] 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[6]]
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[7]]
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[8]]
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[9]]
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[10]]
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[11]]
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[12]]
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
1
2
# Where are all the "a" located in the first element?
gregexpr("a", myStrings[1])
1
2
3
4
5
6
## [[1]]
## [1] 5 8
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE
1
2
# or also 
gregexpr("a", myStrings)[[1]]
1
2
3
4
5
## [1] 5 8
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE

We skip regexec here as parenthesized sub-expressions are very much out of scope of this post.

Strings

1
2
# The Levenshtein distance between strings
adist(c("lazy", "lasso", "lassie"), c("lazy", "lazier", "laser"))
1
2
3
4
## [,1] [,2] [,3]
## [1,] 0 3 3
## [2,] 3 4 2
## [3,] 4 3 3
1
2
# Repeat elements of a character vector a given number of times 
strrep(c(":)", ":P ", ";) "), 1:3)
1
## [1] ":)" ":P :P " ";) ;) ;) "
1
2
# Convert strings to integers of a given base
strtoi(c("101010", "11111000101"), base = 2L)
1
1
## [1] 42 1989
1
strtoi(c("2A", "7C5"), base = 16L)
1
1
## [1] 42 1989
1
2
3
# Symbolic Number Coding
cors <- lapply(split(iris, iris$Species), function(x) cor(x[, 1:4]))
lapply(cors, symnum, abbr.colnames = 6)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
## $setosa
## Spl.Ln Spl.Wd Ptl.Ln Ptl.Wd
## Sepal.Length 1 
## Sepal.Width , 1 
## Petal.Length 1 
## Petal.Width . 1 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
## 
## $versicolor
## Spl.Ln Spl.Wd Ptl.Ln Ptl.Wd
## Sepal.Length 1 
## Sepal.Width . 1 
## Petal.Length , . 1 
## Petal.Width . , , 1 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
## 
## $virginica
## Spl.Ln Spl.Wd Ptl.Ln Ptl.Wd
## Sepal.Length 1 
## Sepal.Width . 1 
## Petal.Length + . 1 
## Petal.Width . . 1 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

Using the tidyverse’s stringr and glue

Stringr is built on top of stringiand focuses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine.

glue strings to data in R. Small, fast, dependency free interpreted string literals.

Using stringi

  • Stringi is an R package for very fast, correct, consistent, and convenient string/text processing in each locale and any native character encoding.

Did you find the article helpful or interesting? Help others find it by sharing

Related