# a number
4.5
[1] 4.5
# check structure
str(4.5)
num 4.5
# stored as double
is.double(4.5)
[1] TRUE
Read through the R Basics section and then complete all actions in the Tidyverse basics section. This lab is for your own benefit and no submission is expected.
Objectives:
Review R basics (data types and object classes);
introduce core tidyverse libraries (dplyr, tidyr, and ggplot);
replicate portions of in-class analysis of class survey data
Here we’ll cover some bare essentials in R. You can find a more thorough introduction in MSDR Appendix B.
There are five main data types in R.
Numeric (double- or single-precision floating point) data represent real numbers. Numeric data are abbreviated num
and by default are stored as double-precision floating point.
# a number
4.5
[1] 4.5
# check structure
str(4.5)
num 4.5
# stored as double
is.double(4.5)
[1] TRUE
Integer data are integers. For the most part they behave like numeric data, except occupy less memory, which can in some cases be convenient. To distinguish integers from doubles, R uses a trailing L
after values; the data type is abbreviated int
.
# an integer
4L
[1] 4
# check structure
str(4L)
int 4
Logical data are binary and represented in R as having values TRUE
and FALSE
. They are abbreviated logi
in R. Often they are automatically coerced to integer data with values 0 (false) and 1 (true) to perform arithmetic and other operations.
# logical value
TRUE
[1] TRUE
# check structure
str(TRUE)
logi TRUE
# arithmetic
TRUE + FALSE
[1] 1
# check structure
str(FALSE + FALSE)
int 0
Character data represent strings of text and are sometimes called ‘strings’. They are abbreviated chr
in R and values are surrounded by quotation marks; this distinguishes, for example, the character 4 from the number 4. Single quotations can be used to input strings as well as double quotations. Arithmetic is not possible with strings for obvious reasons.
# a character string
'yay'
[1] "yay"
# check structure
str('yay')
chr "yay"
# string arithmetic won't work
'4' + '1'
Error in "4" + "1": non-numeric argument to binary operator
# but can be performed after coercing character to string
as.numeric('4') + as.numeric('1')
[1] 5
Factor data represent categorical variables. In R these are encoded numerically according to the number of ‘levels’ of the factor, which represent the unique values of the categorical variable, and each level is labeled. R will print the labels, not the levels, of factors; the data type is abbreviated fct
.
# a factor
factor(1, levels = c(1, 2), labels = c('blue', 'red'))
[1] blue
Levels: blue red
# less verbose definition
factor('blue', levels = c('blue', 'red'))
[1] blue
Levels: blue red
# check structure
str(factor('blue', levels = c('blue', 'red')))
Factor w/ 2 levels "blue","red": 1
Usually factors won’t be defined explicitly, but instead interpreted from character data. The levels and labels of factors can be manipulated using a variety of helper functions.
The most basic type of object in R is a vector. Vectors are concatenations of data values of the same type. They are defined using the concatenation operator c()
and are indexed by consecutive integers; subvectors can be retrieved by specifying the indices between square brackets.
# numeric vector
c(1, 4, 7)
[1] 1 4 7
# character vector
c('blue', 'red')
[1] "blue" "red"
# indexing
c(1, 4, 7)[1]
[1] 1
c(1, 4, 7)[2]
[1] 4
c(1, 4, 7)[3]
[1] 7
c(1, 4, 7)[2:3]
[1] 4 7
c(1, 4, 7)[c(1, 3)]
[1] 1 7
Usually objects are assigned names for easy retrieval. Vectors will not show any special object class if the structure is examined; str()
will simply return the data type, index range, and the values.
# assign a name
<- c(1, 4, 7)
my_vec
# check structure
str(my_vec)
num [1:3] 1 4 7
Next up in complexity are arrays. These are blocks of data values of the same type indexed along two or more dimensions. For arrays, str()
will return the data type, index structure, and data values; when printed directly, data values are arranged according to the indexing.
# an array
<- array(data = c(1, 2, 3, 4, 5, 6, 7, 8),
my_ary dim = c(2, 4))
my_ary
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
str(my_ary)
num [1:2, 1:4] 1 2 3 4 5 6 7 8
# another array
<- array(data = c(1, 2, 3, 4, 5, 6, 7, 8),
my_oth_ary dim = c(2, 2, 2))
my_oth_ary
, , 1
[,1] [,2]
[1,] 1 3
[2,] 2 4
, , 2
[,1] [,2]
[1,] 5 7
[2,] 6 8
str(my_oth_ary)
num [1:2, 1:2, 1:2] 1 2 3 4 5 6 7 8
For arrays, elements can be retrieved by index coordinates, and slices can be retrieved by leaving index positions blank, which will return all elements along the corresponding indices.
# one element
1, 2] my_ary[
[1] 3
# one element
1, 2, 1] my_oth_ary[
[1] 3
# a slice (second row)
2, ] my_ary[
[1] 2 4 6 8
# a slice (first layer)
1] my_oth_ary[ , ,
[,1] [,2]
[1,] 1 3
[2,] 2 4
Next there are lists, which are perhaps the most flexible data structure. A list is an indexed collection of any objects.
# a list
list('cat', c(1, 4, 7), TRUE)
[[1]]
[1] "cat"
[[2]]
[1] 1 4 7
[[3]]
[1] TRUE
# a named list
list(animal = 'cat',
numbers = c(1, 4, 7),
short = TRUE)
$animal
[1] "cat"
$numbers
[1] 1 4 7
$short
[1] TRUE
List elements can be retrieved by index in double square brackets, or by name.
# assign a name
<- list(animal = 'cat',
my_lst numbers = c(1, 4, 7),
short = TRUE)
# check structure
str(my_lst)
List of 3
$ animal : chr "cat"
$ numbers: num [1:3] 1 4 7
$ short : logi TRUE
# retrieve an element
1]] my_lst[[
[1] "cat"
# equivalent
$animal my_lst
[1] "cat"
Finally, data frames are type-heterogeneous lists of vectors of equal length. More informally, they are 2D arrays with columns of differing data types. str()
will essentially show the list structure; but when printed, data frames will appear arranged in a table.
# a data frame
<- data.frame(animal = c('cat', 'hare', 'tortoise'),
my_df has.fur = c(TRUE, TRUE, FALSE),
weight.lbs = c(9.1, 8.2, 22.7))
str(my_df)
'data.frame': 3 obs. of 3 variables:
$ animal : chr "cat" "hare" "tortoise"
$ has.fur : logi TRUE TRUE FALSE
$ weight.lbs: num 9.1 8.2 22.7
my_df
animal has.fur weight.lbs
1 cat TRUE 9.1
2 hare TRUE 8.2
3 tortoise FALSE 22.7
The data frame is the standard object type for representing datasets in R. For the most part, modern computing in R is designed around the data frame.
R packages are add-ons that can include special functions, datasets, object classes, and the like. They are published software and can be installed using install.packages('PACKAGE NAME')
and, once installed, loaded via library('PACKAGE NAME')
or require('PACKAGE NAME')
.
We will illustrate the use of tidyverse functions to reproduce the analysis shown in class of the survey data.
Load class survey data
library(tidyverse)
# retrieve class survey data
<- 'https://raw.githubusercontent.com/pstat197/pstat197a/main/materials/labs/lab2-tidyverse/data/'
url
<- paste(url, 'background-clean.csv', sep = '') %>%
background read_csv()
<- paste(url, 'interest-clean.csv', sep = '') %>%
interest read_csv()
<- paste(url, 'survey-metadata.csv', sep = '') %>%
metadata read_csv()
You can view the data in one of two ways:
# print the data frame for inspection in the console
background
# A tibble: 59 × 30
response.id prog.prof prog.comf math.prof math.comf stat.prof stat.comf
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 1 Adv 5 Int 3 Adv 4
2 2 Int 5 Int 3 Int 4
3 3 Int 3 Int 3 Int 4
4 4 Int 3 Int 3 Int 3
5 5 Int 3 Adv 4 Int 3
6 7 Int 3 Adv 4 Int 3
7 8 Adv 4 Adv 5 Adv 4
8 9 Int 3 Int 3 Int 3
9 10 Int 4 Int 3 Adv 4
10 12 Int 5 Int 4 Int 4
# … with 49 more rows, and 23 more variables: updv.num <chr>, dom <chr>,
# consent <chr>, PSTAT100 <chr>, `PSTAT120A-B` <chr>, PSTAT126 <chr>,
# PSTAT127 <chr>, PSTAT134 <chr>, CS9 <chr>, PSTAT131 <chr>, PSTAT174 <chr>,
# LING104 <chr>, LING105 <chr>, PSTAT115 <chr>, PSTAT122 <chr>,
# `PSTAT160A-B` <chr>, CS16 <chr>, `CS5A-B` <chr>, POLS15 <chr>,
# LING110 <chr>, `CS130A-B` <chr>, `CS165A-B` <chr>, rsrch <lgl>
# open as a spreadsheet in a separate viewer
view(background)
Open the metadata in the viewer and have a look. Take a moment to inspect the datasets.
The tidyverse is a collection of packages for data manipulation, visualization, and statistical modeling. Some are specialized, such as forcats
or lubridate
, which contain functions for manipulating factors and dates and times, respectively. The packages share some common underyling principles.
%>%
The tidyverse facilitates programming in readable sequences of steps that are performed on dataframe. For example:
%>% STEP1() %>% STEP2() %>% STEP3() my_df
If it helps, imagine that step 1 is defining a new variable, step 2 is selecting a subset of columns, and step 3 is fitting a model of some kind.
tidyverse packages leverage a slight generalization of the data frame called a tibble. For the most part, tibbles behave as data frames do, but they are slightly more flexible in ways you’ll encounter later.
For now, think of a tibble as just another name for a data frame.
%>%
In short, x %>% f(y)
is equivalent to f(x, y)
.
In other words, the pipe operator ‘pipes’ the result of the left-hand operation into the first argument of the right-hand function.
# a familiar example
<- c(1, 2, 5)
my_vec str(my_vec)
num [1:3] 1 2 5
# use the pipe operator instead
%>% str() my_vec
num [1:3] 1 2 5
The dplyr
package contains functions for manipulating data frames (tibbles). The functions are named with verbs that describe common operations.
For each verb listed below, copy the code chunk into your script and execute.
Go through the list with your neighbor and check your understanding by describing what the code example accomplishes.
filter – filter the rows of a data frame according to a condition and return a subset of rows meeting that condition
# filter rows
%>%
background filter(math.comf > 3)
# A tibble: 37 × 30
response.id prog.prof prog.comf math.prof math.comf stat.prof stat.comf
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 5 Int 3 Adv 4 Int 3
2 7 Int 3 Adv 4 Int 3
3 8 Adv 4 Adv 5 Adv 4
4 12 Int 5 Int 4 Int 4
5 13 Int 4 Adv 4 Adv 4
6 15 Adv 3 Adv 4 Adv 4
7 17 Int 3 Int 4 Int 4
8 18 Int 4 Adv 5 Adv 5
9 19 Adv 5 Adv 5 Adv 5
10 20 Adv 4 Int 4 Adv 5
# … with 27 more rows, and 23 more variables: updv.num <chr>, dom <chr>,
# consent <chr>, PSTAT100 <chr>, `PSTAT120A-B` <chr>, PSTAT126 <chr>,
# PSTAT127 <chr>, PSTAT134 <chr>, CS9 <chr>, PSTAT131 <chr>, PSTAT174 <chr>,
# LING104 <chr>, LING105 <chr>, PSTAT115 <chr>, PSTAT122 <chr>,
# `PSTAT160A-B` <chr>, CS16 <chr>, `CS5A-B` <chr>, POLS15 <chr>,
# LING110 <chr>, `CS130A-B` <chr>, `CS165A-B` <chr>, rsrch <lgl>
select – select a subset of columns from a data frame
# select a column
%>%
background select(math.comf)
# A tibble: 59 × 1
math.comf
<dbl>
1 3
2 3
3 3
4 3
5 4
6 4
7 5
8 3
9 3
10 4
# … with 49 more rows
pull – extract a single column from a data frame
# pull a column
%>%
background pull(rsrch)
[1] FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE
[13] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
[25] FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[37] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
[49] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
mutate – define a new column as a function of existing columns
# define a new variable
%>%
background mutate(avg.comf = (math.comf + prog.comf + stat.comf)/3)
# A tibble: 59 × 31
response.id prog.prof prog.comf math.prof math.comf stat.prof stat.comf
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 1 Adv 5 Int 3 Adv 4
2 2 Int 5 Int 3 Int 4
3 3 Int 3 Int 3 Int 4
4 4 Int 3 Int 3 Int 3
5 5 Int 3 Adv 4 Int 3
6 7 Int 3 Adv 4 Int 3
7 8 Adv 4 Adv 5 Adv 4
8 9 Int 3 Int 3 Int 3
9 10 Int 4 Int 3 Adv 4
10 12 Int 5 Int 4 Int 4
# … with 49 more rows, and 24 more variables: updv.num <chr>, dom <chr>,
# consent <chr>, PSTAT100 <chr>, `PSTAT120A-B` <chr>, PSTAT126 <chr>,
# PSTAT127 <chr>, PSTAT134 <chr>, CS9 <chr>, PSTAT131 <chr>, PSTAT174 <chr>,
# LING104 <chr>, LING105 <chr>, PSTAT115 <chr>, PSTAT122 <chr>,
# `PSTAT160A-B` <chr>, CS16 <chr>, `CS5A-B` <chr>, POLS15 <chr>,
# LING110 <chr>, `CS130A-B` <chr>, `CS165A-B` <chr>, rsrch <lgl>,
# avg.comf <dbl>
These operations can be chained together, for example:
# sequence of verbs
%>%
background filter(stat.prof == 'Adv') %>%
mutate(avg.comf = (math.comf + prog.comf + stat.comf)/3) %>%
select(avg.comf, rsrch)
# A tibble: 36 × 2
avg.comf rsrch
<dbl> <lgl>
1 4 FALSE
2 4.33 TRUE
3 3.67 FALSE
4 4 FALSE
5 3.67 FALSE
6 4.67 TRUE
7 5 TRUE
8 4.33 TRUE
9 4.33 TRUE
10 4 TRUE
# … with 26 more rows
Summaries are easily computed across rows using summarize()
. So if for example we want to use the filtering and selection from before to find the proportion of advanced students in statistics with research experience, use:
# a summary
%>%
background filter(stat.prof == 'Adv') %>%
mutate(avg.comf = (math.comf + prog.comf + stat.comf)/3) %>%
select(avg.comf, rsrch) %>%
summarize(prop.rsrch = mean(rsrch))
# A tibble: 1 × 1
prop.rsrch
<dbl>
1 0.611
# equivalent
%>%
background filter(stat.prof == 'Adv') %>%
mutate(avg.comf = (math.comf + prog.comf + stat.comf)/3) %>%
select(avg.comf, rsrch) %>%
pull(rsrch) %>%
mean()
[1] 0.6111111
The advantage of summarize
, however, is that multiple summaries can be computed at once:
%>%
background filter(stat.prof == 'Adv') %>%
mutate(avg.comf = (math.comf + prog.comf + stat.comf)/3) %>%
select(avg.comf, rsrch) %>%
summarize(prop.rsrch = mean(rsrch),
med.comf = median(avg.comf))
# A tibble: 1 × 2
prop.rsrch med.comf
<dbl> <dbl>
1 0.611 4
The variant summarize_all
computes the same summary across all columns. (Notice the use of the helper verb contains()
to select all columns containing a particular string.)
# average comfort levels across all students
%>%
background select(contains('comf')) %>%
summarise_all(.funs = mean)
# A tibble: 1 × 3
prog.comf math.comf stat.comf
<dbl> <dbl> <dbl>
1 3.97 3.85 4.08
Grouped summaries are summaries computed separately among subsets of observations. To define a grouping structure using an existing column, use group_by()
. Notice the ‘groups’ attribute printed with the output.
# create a grouping
%>%
background group_by(stat.prof)
# A tibble: 59 × 30
# Groups: stat.prof [3]
response.id prog.prof prog.comf math.prof math.comf stat.prof stat.comf
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 1 Adv 5 Int 3 Adv 4
2 2 Int 5 Int 3 Int 4
3 3 Int 3 Int 3 Int 4
4 4 Int 3 Int 3 Int 3
5 5 Int 3 Adv 4 Int 3
6 7 Int 3 Adv 4 Int 3
7 8 Adv 4 Adv 5 Adv 4
8 9 Int 3 Int 3 Int 3
9 10 Int 4 Int 3 Adv 4
10 12 Int 5 Int 4 Int 4
# … with 49 more rows, and 23 more variables: updv.num <chr>, dom <chr>,
# consent <chr>, PSTAT100 <chr>, `PSTAT120A-B` <chr>, PSTAT126 <chr>,
# PSTAT127 <chr>, PSTAT134 <chr>, CS9 <chr>, PSTAT131 <chr>, PSTAT174 <chr>,
# LING104 <chr>, LING105 <chr>, PSTAT115 <chr>, PSTAT122 <chr>,
# `PSTAT160A-B` <chr>, CS16 <chr>, `CS5A-B` <chr>, POLS15 <chr>,
# LING110 <chr>, `CS130A-B` <chr>, `CS165A-B` <chr>, rsrch <lgl>
Sometimes it can be helpful to simply count the observations in each group:
# count observations
%>%
background group_by(stat.prof) %>%
count()
# A tibble: 3 × 2
# Groups: stat.prof [3]
stat.prof n
<chr> <int>
1 Adv 36
2 Beg 2
3 Int 21
To compute a grouped summary, first group the data frame and then specify the summary of interest:
# a grouped summary
%>%
background group_by(stat.prof) %>%
select(contains('.comf')) %>%
summarize_all(.funs = mean)
# A tibble: 3 × 4
stat.prof prog.comf math.comf stat.comf
<chr> <dbl> <dbl> <dbl>
1 Adv 4.06 4.14 4.39
2 Beg 3.5 2.5 2.5
3 Int 3.86 3.48 3.71
Grouped summaries
In general, tidyr verbs reshape data frames in various ways. For now, we’ll just cover two tidyr verbs.
Suppose we want to calculate multiple summaries of multiple variables using the techniques above. By default, the output is one row with one column for each summary/variable combination:
# many variables, many summaries
<- background %>%
comf_sum select(contains('comf')) %>%
summarise_all(.funs = list(mean = mean,
median = median,
min = min,
max = max))
comf_sum
# A tibble: 1 × 12
prog.comf_mean math.comf_mean stat.comf_mean prog.comf_median math.comf_median
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3.97 3.85 4.08 4 4
# … with 7 more variables: stat.comf_median <dbl>, prog.comf_min <dbl>,
# math.comf_min <dbl>, stat.comf_min <dbl>, prog.comf_max <dbl>,
# math.comf_max <dbl>, stat.comf_max <dbl>
It would be much better to reshape this into a table. gather will reshape the data frame from wide format to long format by ‘gathering’ the columns together.
# gather columns into long format
%>% gather(stat, val) comf_sum
# A tibble: 12 × 2
stat val
<chr> <dbl>
1 prog.comf_mean 3.97
2 math.comf_mean 3.85
3 stat.comf_mean 4.08
4 prog.comf_median 4
5 math.comf_median 4
6 stat.comf_median 4
7 prog.comf_min 3
8 math.comf_min 2
9 stat.comf_min 2
10 prog.comf_max 5
11 math.comf_max 5
12 stat.comf_max 5
This is a little better, but it would be more legible in a 2x2 table. We can separate the ‘stat’ variable that has the column names into two columns:
# separate into rows and columns
%>%
comf_sum gather(stat, val) %>%
separate(stat, into = c('variable', 'stat'), sep = '_')
# A tibble: 12 × 3
variable stat val
<chr> <chr> <dbl>
1 prog.comf mean 3.97
2 math.comf mean 3.85
3 stat.comf mean 4.08
4 prog.comf median 4
5 math.comf median 4
6 stat.comf median 4
7 prog.comf min 3
8 math.comf min 2
9 stat.comf min 2
10 prog.comf max 5
11 math.comf max 5
12 stat.comf max 5
And then spread the stat column over a few rows, resulting in a table where the rows are the variables and the columns are the summaries:
# spread into table
%>%
comf_sum gather(stat, val) %>%
separate(stat, into = c('variable', 'stat'), sep = '_') %>%
spread(stat, val)
# A tibble: 3 × 5
variable max mean median min
<chr> <dbl> <dbl> <dbl> <dbl>
1 math.comf 5 3.85 4 2
2 prog.comf 5 3.97 4 3
3 stat.comf 5 4.08 4 2
The ggplot package is for data visualization. The syntax takes some getting used to if you haven’t seen it before. We’ll just look at one example.
Suppose we want to summarize the prior coursework in the class.
# summary of classes taken
<- background %>%
classes select(11:29) %>%
mutate_all(~factor(.x, levels = c('no', 'yes'))) %>%
mutate_all(~as.numeric(.x) - 1) %>%
summarize_all(mean) %>%
gather(class, proportion)
classes
# A tibble: 19 × 2
class proportion
<chr> <dbl>
1 PSTAT100 0.271
2 PSTAT120A-B 0.983
3 PSTAT126 0.932
4 PSTAT127 0.169
5 PSTAT134 0.305
6 CS9 0.576
7 PSTAT131 0.424
8 PSTAT174 0.169
9 LING104 0.0339
10 LING105 0.0339
11 PSTAT115 0.102
12 PSTAT122 0.356
13 PSTAT160A-B 0.458
14 CS16 0.492
15 CS5A-B 0.0508
16 POLS15 0.0169
17 LING110 0.0169
18 CS130A-B 0.0508
19 CS165A-B 0.0339
We could report the results in a table, in which case perhaps arranging in descending order may be helpful:
%>% arrange(desc(proportion)) classes
# A tibble: 19 × 2
class proportion
<chr> <dbl>
1 PSTAT120A-B 0.983
2 PSTAT126 0.932
3 CS9 0.576
4 CS16 0.492
5 PSTAT160A-B 0.458
6 PSTAT131 0.424
7 PSTAT122 0.356
8 PSTAT134 0.305
9 PSTAT100 0.271
10 PSTAT127 0.169
11 PSTAT174 0.169
12 PSTAT115 0.102
13 CS5A-B 0.0508
14 CS130A-B 0.0508
15 LING104 0.0339
16 LING105 0.0339
17 CS165A-B 0.0339
18 POLS15 0.0169
19 LING110 0.0169
Let’s say we’d rather plot this data. We’ll put the course number on one axis and the proportion of students who took it on the other.
# plot it
%>%
classes ggplot(aes(x = proportion, y = class)) +
geom_point()
These commands work by defining plot layers. In the chunk above, the first argument to ggplot()
is the data. Then, aes()
defines an ‘aesthetic mapping’ of the columns of the input data frame to graphical elements. This defines a set of axes. Then, a layer of points is added to the plot with geom_point()
; no arguments are needed because the geometric object (‘geom’) inherits attributes (x and y coordinates) from the aesthetic mapping.
Again we might prefer to arrange the classes by descending order in proportion.
<- classes %>%
fig ggplot(aes(x = proportion, y = reorder(class, proportion))) +
geom_point()
fig
And perhaps fix the plot labels:
# adjust labels
+ labs(x = 'proportion of class', y = '') fig
Notice that ggplot allows for a plot to be stored by name and then further modified with additional layers.