Data Visualizing In R

Data Analytics (ECMP 5005B)

Esam Mahdi

School of Mathematics and Statistics
Master of Engineering - Engineering Practice
Carleton University

Wednesday, September 6, 2023

Learning objectives

By the end of this chapter, you should be able to do the following:

  • Set up the R environment with RStudio.
  • Install the tidyverse package and other packages, and use their functions to:
    • Import data from different sources (txt, csv, xlsx, dta, sav, sas7bdat, etc.)
    • Recap some useful R commands.
    • Manipulate data (cleaning, tidying, transforming, summarizing data, etc.)
  • Visualize data and build effective graphs and elegant interactive plots.
  • Use R Quarto/Markdown to weave together narrative text and code to produce elegantly formatted documents.

Typical steps for data analytics


Source: Robert I. Kabacoff. R in Action: Data analysis and graphics with R and Tidyverse. 2nd ed., Manning, 2022.

Getting started

  • As of July 27, 2023, the CRAN package repository featured 19,880 available packages.

Do not trust all of these packages!

  • To see which packages are installed on your computer, type the following command:
library()
  • To show the directory where packages are stored on your computer, type the following command:
.libPaths() # Exercise: Why does it show two library paths?
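
Before calling library(), it can help to check whether a package is actually installed. The following is a minimal sketch using base R only; the helper name is_installed() is hypothetical (it is not part of any package), and "notARealPackage123" is assumed not to exist on your machine.

```r
# Hypothetical helper: TRUE if the package is found in one of the library paths
is_installed <- function(pkg) {
  nzchar(system.file(package = pkg))
}

is_installed("stats")               # a base package, so this is TRUE
is_installed("notARealPackage123")  # assumed to be missing, so this is FALSE

# Names of all installed packages across the paths listed by .libPaths()
pkgs <- rownames(installed.packages())
head(pkgs)
```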

Getting started

  • By default, R comes with the following 15 base packages:
    • base, compiler, datasets, grDevices, graphics, grid, methods, parallel, splines, stats, stats4, tcltk, tools, translations, utils.
      • You can check what functions are available from the package stats by typing the following code:
help(package = "stats")
  • In addition to the base packages, there are 15 recommended packages available from CRAN:
    • KernSmooth, MASS, Matrix, boot, class, cluster, nlme, rpart, spatial, codetools, foreign, lattice, mgcv, nnet, survival.
    • For example, you may need to use the function lm.gls() implemented in the package MASS to fit a linear model using the generalized least squares method.
library(MASS) #load the package MASS 
help(lm.gls)  #get help on this function. Same as typing ?lm.gls  

Getting started

A common problem when we load and attach several packages in R is that they may export functions with the same name; the function from the package attached last then masks the earlier one. For example, the function lag() exists in both the stats and dplyr packages, and it performs a different task in each. Thus, you need to be careful when calling lag() while the package dplyr is attached. The :: operator states explicitly which version to use.

stats::lag #to explicitly use the lag() function from the stats package
dplyr::lag #to explicitly use the lag() function from the dplyr package

Example

set.seed(1)    #set reproducible results 
x <- rnorm(5)  #generate 5 observations from the standard normal distribution N(0,1)
stats::lag(x, 2) #shift the time base back by 2 (all 5 values are kept; only the time index changes)
[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
attr(,"tsp")
[1] -1  3  1
dplyr::lag(x, 2) #shift the values forward by 2 positions (the first 2 become NA, the last 2 are dropped)
[1]         NA         NA -0.6264538  0.1836433 -0.8356286
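
Masking is not specific to dplyr: any object defined later on the search path hides an earlier one with the same name. The sketch below illustrates this with a deliberately useless, user-defined sd() (a hypothetical function, for illustration only) that masks stats::sd().

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# A user-defined function with the same name as stats::sd() (illustration only)
sd <- function(x) "masked!"

sd(x)         # calls the user-defined version, which is found first
stats::sd(x)  # the :: operator always reaches the version in the stats package

rm(sd)        # remove the masking definition
sd(x)         # stats::sd() is visible again
```

In an interactive session, conflicts(detail = TRUE) lists the names that are currently defined in more than one place on the search path.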

Simple R code

Let’s start with the following R code

# create two numeric vectors, each with 8 observations
wt <- c(60,70,63,55,48,49,58,58)
age = c(20,17,23,24,19,19,16,26) #note "<-" symbol can be replaced by "="
# get a random sample (without replacement) of 8 observations
set.seed(5) #set seed to reproduce the same random sample
z=sample(150:190, size = 8)
# replicate the string "Male" 3 times to get a character vector
Male=rep("Male", times = 3)
# replicate the string "Female" 5 times to get a character vector
Female=rep("Female", times = 5)
# combine the two categorical variables into one nominal variable
s = c(Male, Female)
# create an ordinal categorical variable 
income=c("Low","High","Low","Low","Middle","Middle","Middle","High")
# store categorical values as factors (integer codes with labels)
sex=factor(s)
income=factor(income,ordered=TRUE,levels=c("Low","Middle","High"))
# create a data frame and name the variables
mydata=data.frame(id=1:8,weight=wt,age=age,z=z,sex=sex,Sex=s,income)
# Create a new dataset
status = c("worker","student","worker","worker","student","student",
           "student","student")
my_newdata <- data.frame(id=c(1:4,8:11),status=status)
#return first 6 observations
head(mydata)
  id weight age   z    sex    Sex income
1  1     60  20 151   Male   Male    Low
2  2     70  17 164   Male   Male   High
3  3     63  23 160   Male   Male    Low
4  4     55  24 170 Female Female    Low
5  5     48  19 179 Female Female Middle
6  6     49  19 156 Female Female Middle
#get variables names
names(mydata)
[1] "id"     "weight" "age"    "z"      "sex"    "Sex"    "income"
#display the Structure of the data
str(mydata)
'data.frame':   8 obs. of  7 variables:
 $ id    : int  1 2 3 4 5 6 7 8
 $ weight: num  60 70 63 55 48 49 58 58
 $ age   : num  20 17 23 24 19 19 16 26
 $ z     : int  151 164 160 170 179 156 168 152
 $ sex   : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1 1 1
 $ Sex   : chr  "Male" "Male" "Male" "Female" ...
 $ income: Ord.factor w/ 3 levels "Low"<"Middle"<..: 1 3 1 1 2 2 2 3
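
The difference between a nominal and an ordinal factor shows up when values are compared. The following sketch uses a small illustrative vector (not the data above) to show what the ordering of the levels makes possible.

```r
income_demo <- factor(c("Low", "High", "Middle", "Low"),
                      levels = c("Low", "Middle", "High"),
                      ordered = TRUE)

as.integer(income_demo)   # underlying integer codes follow the level order
income_demo < "High"      # comparisons are meaningful for ordered factors
max(income_demo)          # the largest observed level
table(income_demo)        # counts are reported in level order, not alphabetically
```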

mat1 <- matrix(wt, nrow = 5, ncol = 2) #create a 5x2 matrix; wt has only 8 values, so R recycles the first 2 (with a warning)
mat2 <- matrix(age, nrow = 2, ncol = 5) #create a 2x5 matrix; age is recycled in the same way
mat3 = cbind(wt,age,sex,income) #combine vectors by columns. Exercise: Type the code: rbind(wt,age,sex,income) and explain the outcome!
mylist <- list(wt,age,sex,income)   #create a list of 4 vectors
dim(mydata) # dimension
[1] 8 7
dim(mat1)
[1] 5 2
dim(mat3)
[1] 8 4
length(mylist)  
[1] 4
class(wt) 
[1] "numeric"
class(sex)
[1] "factor"
class(s)
[1] "character"
class(mydata)
[1] "data.frame"
class(mat1)
[1] "matrix" "array" 
class(mylist)
[1] "list"

income[1:4] #first 4 elements of income 
[1] Low  High Low  Low 
Levels: Low < Middle < High
wt[-2] #all weights without 2nd one 
[1] 60 63 55 48 49 58 58
mylist[1] #sub-list containing the 1st component
[[1]]
[1] 60 70 63 55 48 49 58 58
mylist[[1]][2] #2nd element of the 1st component
[1] 70
mydata$income #values of the income variable
[1] Low    High   Low    Low    Middle Middle Middle High  
Levels: Low < Middle < High
mydata[2,] #values of the 2nd row
  id weight age   z  sex  Sex income
2  2     70  17 164 Male Male   High
mydata[, 4] #values of the 4th column
[1] 151 164 160 170 179 156 168 152
# Extract a subset data
# rows 1,2,3,5,7 and columns id, weight, sex
mydata[c(1:3,5,7),c("id","weight","sex")] 
  id weight    sex
1  1     60   Male
2  2     70   Male
3  3     63   Male
5  5     48 Female
7  7     58 Female
#value in 3rd row and 2nd column
mydata[3, 2] 
[1] 63
# Merging datasets
merge(my_newdata,mydata, by = "id")
  id  status weight age   z    sex    Sex income
1  1  worker     60  20 151   Male   Male    Low
2  2 student     70  17 164   Male   Male   High
3  3  worker     63  23 160   Male   Male    Low
4  4  worker     55  24 170 Female Female    Low
5  8 student     58  26 152 Female Female   High
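
By default, merge() keeps only the ids found in both data sets, which is why ids 5, 6, 7, 9, 10 and 11 disappear above. The sketch below, on two small made-up data frames, shows how the all, all.x and all.y arguments change which rows are kept.

```r
left  <- data.frame(id = c(1, 2, 3), weight = c(60, 70, 63))
right <- data.frame(id = c(2, 3, 4), status = c("student", "worker", "worker"))

inner     <- merge(left, right, by = "id")                # ids in both: 2, 3
left_only <- merge(left, right, by = "id", all.x = TRUE)  # all ids from left: 1, 2, 3
full      <- merge(left, right, by = "id", all = TRUE)    # ids from either: 1, 2, 3, 4

full  # unmatched cells are filled with NA
```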

mean(mydata$weight) #mean weight
[1] 57.625
#standard deviation of weight variable
sd(mydata[,"weight"]) 
[1] 7.190023
#Pearson correlation coefficient between weight and age
cor(mydata[,2],mydata[,"age"]) 
[1] -0.07085679
#transpose the matrix
t(mat1)
     [,1] [,2] [,3] [,4] [,5]
[1,]   60   70   63   55   48
[2,]   49   58   58   60   70
summary(mydata)
       id           weight           age              z             sex   
 Min.   :1.00   Min.   :48.00   Min.   :16.00   Min.   :151.0   Female:5  
 1st Qu.:2.75   1st Qu.:53.50   1st Qu.:18.50   1st Qu.:155.0   Male  :3  
 Median :4.50   Median :58.00   Median :19.50   Median :162.0             
 Mean   :4.50   Mean   :57.62   Mean   :20.50   Mean   :162.5             
 3rd Qu.:6.25   3rd Qu.:60.75   3rd Qu.:23.25   3rd Qu.:168.5             
 Max.   :8.00   Max.   :70.00   Max.   :26.00   Max.   :179.0             
     Sex               income 
 Length:8           Low   :3  
 Class :character   Middle:3  
 Mode  :character   High  :2  
                              
                              
                              
# matrix product of two matrices
# try: crossprod(mat2,t(mat1)) and compare the result with t(mat1 %*% mat2)
mat1 %*% mat2 
     [,1] [,2] [,3] [,4] [,5]
[1,] 2033 2556 2071 2234 2033
[2,] 2386 3002 2432 2628 2386
[3,] 2246 2841 2299 2516 2246
[4,] 2120 2705 2185 2440 2120
[5,] 2150 2784 2242 2588 2150
mat.prod <- mat2 %*% mat1
solve(mat.prod) #inverse matrix
           [,1]       [,2]
[1,]  0.1209329 -0.1149430
[2,] -0.1222463  0.1163559
#generate 5 random values from uniform(-1,2) 
set.seed(123) #to reproduce the results
runif(5, min = -1, max = 2)
[1] -0.1372674  1.3649154  0.2269308  1.6490522  1.8214019
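
A quick way to check an inverse is to multiply it back. The sketch below uses a small illustrative matrix A (not the matrices above) and also shows what crossprod() computes.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a small invertible matrix
A_inv <- solve(A)

A %*% A_inv                            # approximately the 2 x 2 identity matrix
all.equal(A %*% A_inv, diag(2))        # TRUE up to floating-point tolerance

# crossprod(A, B) computes t(A) %*% B without forming the transpose explicitly
B <- matrix(1:4, nrow = 2)
all.equal(crossprod(A, B), t(A) %*% B)
```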

# Creating new variables
mydata$status0 <- paste0("grade",1:8)
mydata
  id weight age   z    sex    Sex income status0
1  1     60  20 151   Male   Male    Low  grade1
2  2     70  17 164   Male   Male   High  grade2
3  3     63  23 160   Male   Male    Low  grade3
4  4     55  24 170 Female Female    Low  grade4
5  5     48  19 179 Female Female Middle  grade5
6  6     49  19 156 Female Female Middle  grade6
7  7     58  16 168 Female Female Middle  grade7
8  8     58  26 152 Female Female   High  grade8
# Creating new variables
mydata$status1 <- paste("grade",1:8)
mydata
  id weight age   z    sex    Sex income status0 status1
1  1     60  20 151   Male   Male    Low  grade1 grade 1
2  2     70  17 164   Male   Male   High  grade2 grade 2
3  3     63  23 160   Male   Male    Low  grade3 grade 3
4  4     55  24 170 Female Female    Low  grade4 grade 4
5  5     48  19 179 Female Female Middle  grade5 grade 5
6  6     49  19 156 Female Female Middle  grade6 grade 6
7  7     58  16 168 Female Female Middle  grade7 grade 7
8  8     58  26 152 Female Female   High  grade8 grade 8
# Recoding variables: recode age 20 as a missing value
mydata$age[mydata$age == 20] <- NA 
mydata[1:4,] #display the first 4 rows
  id weight age   z    sex    Sex income status0 status1
1  1     60  NA 151   Male   Male    Low  grade1 grade 1
2  2     70  17 164   Male   Male   High  grade2 grade 2
3  3     63  23 160   Male   Male    Low  grade3 grade 3
4  4     55  24 170 Female Female    Low  grade4 grade 4
#calculate the sum including missing values
sum(mydata$age)
[1] NA
#calculate the sum excluding missing values
sum(mydata$age, na.rm=TRUE)
[1] 144
# Renaming variables
names(mydata)[4] <- "height"
# Dropping variables
mydata[,-c(4,9)] #Dropping the variables "height" and "status1"
  id weight age    sex    Sex income status0
1  1     60  NA   Male   Male    Low  grade1
2  2     70  17   Male   Male   High  grade2
3  3     63  23   Male   Male    Low  grade3
4  4     55  24 Female Female    Low  grade4
5  5     48  19 Female Female Middle  grade5
6  6     49  19 Female Female Middle  grade6
7  7     58  16 Female Female Middle  grade7
8  8     58  26 Female Female   High  grade8
# Removing all rows with missing data
newdata <- na.omit(mydata)
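
Before dropping rows, it is usually worth checking where the missing values are. A minimal sketch on a small made-up data frame:

```r
df <- data.frame(id = 1:4,
                 age = c(NA, 17, 23, 24),
                 weight = c(60, NA, 63, 55))

colSums(is.na(df))           # number of missing values per column
mean(df$age, na.rm = TRUE)   # mean of the observed ages only
complete.cases(df)           # TRUE for rows with no missing value
na.omit(df)                  # drops rows 1 and 2, which each contain an NA
```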

par(mfrow = c(2, 2)) #create a 2 x 2 plotting matrix
plot(wt,age); plot(mydata$weight, mydata$age) #type ?plot to get help about the function plot()
plot(wt,age, xlab = "Weight", ylab = "Age", col = "red")
plot(density(rnorm(500)),col="blue") #plot the estimated density of 500 random draws from the standard Gaussian distribution

Reading data into R

After setting up the R environment with RStudio, you can import data from different file formats.

  • The built-in R package utils has several functions for reading delimited ASCII files (.txt, .csv).
    • read.table(): Import data with extension .txt.
    • read.csv(): Import data where “,” are used as separators and “.” are used as decimals.
    • read.csv2(): Import data where “;” are used as separators and “,” are used as decimals.
> read.table
  function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = FALSE, fileEncoding = "", encoding = "unknown", text, skipNul = FALSE) 
> read.csv
  function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) 
> read.csv2
  function (file, header = TRUE, sep = ";", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...) 
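
The differences between these three functions are easiest to see on a tiny file. The sketch below writes two temporary files with made-up values (they are illustrative, not the course data) and reads them back.

```r
# Write two small illustrative files to a temporary location
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("Date,Price", "1/22/2020,8664.5", "1/23/2020,8404.1"), f1)
writeLines(c("Date;Price", "1/22/2020;8664,5", "1/23/2020;8404,1"), f2)

d1 <- read.csv(f1)    # "," separates fields, "." is the decimal mark
d2 <- read.csv2(f2)   # ";" separates fields, "," is the decimal mark
d3 <- read.table(f1, header = TRUE, sep = ",")  # the general function, arguments set by hand

str(d1)
all.equal(d1, d2)     # both files hold the same data
```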

Reading data into R: Example

Bitcoin daily price (in US dollars) from January 22, 2020 to September 1, 2021 (during COVID-19).

# change the working directory to a different location on your computer
dat1 <- read.table("data/BTC.txt", header = TRUE, fill = TRUE) 
dat2 <- read.csv("data/BTC.csv", header = TRUE)
str(dat1)
'data.frame':   589 obs. of  5 variables:
 $ Date : chr  "1/22/2020" "1/23/2020" "1/24/2020" "1/25/2020" ...
 $ Price: num  8664 8404 8447 8354 8622 ...
 $ Open : num  8734 8669 8404 8447 8351 ...
 $ High : num  8800 8669 8522 8447 8622 ...
 $ Low  : num  8581 8297 8248 8280 8304 ...

Use the skim() function from the skimr package to get useful summary statistics.

#first, make sure that you have the package "skimr" installed on your computer;
#if it is not installed, you need to install it
install.packages("skimr") 
library("skimr") 
skim(dat1)


Exercise:

  • Explain the arguments row.names, col.names, stringsAsFactors in the function read.table().
  • Give examples to demonstrate how we can use these functions.

Reading data into R

The readr package provides functions to read rectangular data with extension .csv, .txt or .tsv.

  • read_csv() and read_tsv() are special cases of the more general read_delim().
  • To see all the arguments for the function read_csv(), use ?read_csv() within R.
  • By default, read_csv() and read_tsv() convert blank cells to missing data (NA).
  • You can type the following command to install and load this package.
#first, make sure that you have the package "readr" installed on your computer;
#if it is not installed, you need to install it
install.packages("readr")
library(readr) # Load the package
# Read from a path that specifies the location of the data on your computer
name_data <- read_csv("file_data - Sheet1.csv") # import data from a comma-delimited file
# Read from a remote path (e.g., mtcars data set from GitHub website)
name_data <- read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv") 
name_data <- read_tsv("file_data.txt") # import data from a tab-delimited text file
name_data <- read_tsv("file_data.tsv") # import data from a tab-delimited file (read_tsv() has no sheet argument)

Reading data into R

The readxl package can import tabular data from Excel workbooks. Both xls and xlsx formats are supported.

#first, make sure that you have the package "readxl" installed on your computer;
#if it is not installed, you need to install it
install.packages("readxl") 
library(readxl) # Load the package
name_data <- read_excel("file_data.xlsx", sheet=1) # import data from an Excel workbook

The haven package can import data with .sav, .dta, and .sas7bdat extensions.

#first, make sure that you have the package "haven" installed on your computer;
#if it is not installed, you need to install it
install.packages("haven") 
library(haven) # Load the package
name_data <- read_sav("file_data.sav")      # import data from SPSS
name_data <- read_dta("file_data.dta")      # import data from Stata
name_data <- read_sas("file_data.sas7bdat") # import data from SAS 

Example (reading data into R):

Import the BTC data set using the functions read_tsv() and read_csv() from the package readr.

library(readr)
dat3 <- read_tsv("data/BTC.txt") 
dat4 <- read_csv("data/BTC.csv")
head(dat3) #same results using head(dat4)
# A tibble: 6 × 5
  Date      Price  Open  High   Low
  <chr>     <dbl> <dbl> <dbl> <dbl>
1 1/22/2020 8664. 8734. 8800. 8581.
2 1/23/2020 8404. 8669. 8669. 8297.
3 1/24/2020 8447. 8404. 8522. 8248.
4 1/25/2020 8354. 8447. 8447. 8280 
5 1/26/2020 8622. 8351. 8622  8304.
6 1/27/2020 8912  8622  9002. 8585.

Note that head() prints differently from before because the result is a tibble. Tibbles are data frames, slightly tweaked to work better in the tidyverse, which we will discuss later!

The Excel workbook BTC2.xlsx has two sheets named BTC and BTC2. In the sheet BTC2, the data are stored in the cell range G7:K38. The first cell, A1, provides a quick description of the data. The data have some missing values.

library(readxl)
dat5 <- read_excel("data/BTC2.xlsx", 
                   sheet= "BTC2",
                   range = "G7:K38")
head(dat5)
# A tibble: 6 × 5
  Date                Price  Open  High    Low
  <dttm>              <dbl> <dbl> <dbl>  <dbl>
1 2021-01-01 00:00:00 29346 28933 29498 28932 
2 2021-01-02 00:00:00 32185 29346 33168 29192 
3 2021-01-03 00:00:00 32971 32183 34253 32110 
4 2021-01-04 00:00:00    NA    NA    NA    NA 
5 2021-01-05 00:00:00 33996 32020 33996 30979.
6 2021-01-06 00:00:00 36755 33986 36755 33901 

Tibbles for tidy data frames

Many popular packages, such as readr, tidyr, dplyr, and purrr, store data frames as tibbles. When you work with tibbles, be aware of the following properties:

  • Data frames can be converted to tibbles using the function as_tibble().
library(tibble)
class(mtcars) #the class of mtcars data before tibble 
[1] "data.frame"
mtcars <- as_tibble(mtcars) 
class(mtcars) #the class of mtcars data after tibble
[1] "tbl_df"     "tbl"        "data.frame"
  • tibbles never change the names of variables.
  • tibbles never convert character variables to factors.
  • tibbles don’t support row names.
  • Subsetting a tibble with [ returns a tibble by default, not a vector.
mtcars[, "mpg"] 
# A tibble: 32 × 1
     mpg
   <dbl>
 1  21  
 2  21  
 3  22.8
 4  21.4
 5  18.7
 6  18.1
 7  14.3
 8  24.4
 9  22.8
10  19.2
# ℹ 22 more rows
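
The following sketch contrasts the three common ways of pulling a column out of a tibble. It assumes the tibble package is installed and uses datasets::mtcars explicitly, so that it does not depend on the mtcars object converted above.

```r
library(tibble)

tb <- as_tibble(head(datasets::mtcars, 3))

tb[, "mpg"]     # single bracket: still a tibble with one column
tb[["mpg"]]     # double bracket: a plain numeric vector
tb$mpg          # $ also extracts the vector

# For comparison, a base data frame drops to a vector with single-bracket subsetting
class(head(datasets::mtcars, 3)[, "mpg"])
```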

Exercise: Try the code: mtcars[, "mpg", drop = TRUE]

Basic dplyr functions to manipulate data frames

Function | Use | Syntax
mutate() | Transform or recode variables | dataframe <- mutate(dataframe, new_variable = expression)
select() | Select variables/columns | dataframe <- select(dataframe, selected_variables)
filter() | Select observations/rows | dataframe <- filter(dataframe, expression)
rename() | Rename variables/columns | dataframe <- rename(dataframe, new_variable_name = old_variable_name)
recode() | Recode variable values | variable <- recode(variable, old_value = new_value)
arrange() | Order rows by variable values | dataframe <- arrange(dataframe, sort_variables)
group_by() | Group by one or more variables | dataframe <- group_by(dataframe, grouping_variables)

Examples:

  • In financial time series, log returns are given by $r_t = \log\dfrac{P_t}{P_{t-1}}$, where $r_t$ and $P_t$ denote the return and the asset price at time $t$.
library(dplyr)
dat5 <- mutate(dat5,BeforeClose=dplyr::lag(Price), returns=log(Price)-log(BeforeClose)) #use the BTC2 dataset
dat5[1:2,]
# A tibble: 2 × 7
  Date                Price  Open  High   Low BeforeClose returns
  <dttm>              <dbl> <dbl> <dbl> <dbl>       <dbl>   <dbl>
1 2021-01-01 00:00:00 29346 28933 29498 28932          NA NA     
2 2021-01-02 00:00:00 32185 29346 33168 29192       29346  0.0923
  • Use the pipe operator (%>%) to chain statements.
dat5 <- dat5 %>% 
  mutate(Date = lubridate::as_date(Date), #Date is already a date-time here, so convert it to a Date with as_date() from the "lubridate" package; mdy() is meant for character dates such as "1/22/2020" and would return NA
         BeforeClose=dplyr::lag(Price),
         returns=log(Price)-log(BeforeClose))
dat5[1:2,] #note that the first return is missing ("NA"). To remove rows with "NA", add: %>% tidyr::drop_na() 
# A tibble: 2 × 7
  Date       Price  Open  High   Low BeforeClose returns
  <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>   <dbl>
1 2021-01-01 29346 28933 29498 28932          NA NA     
2 2021-01-02 32185 29346 33168 29192       29346  0.0923
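
The same log returns can be computed in base R. The sketch below uses the first three prices printed for the BTC2 data and shows that lagging by hand and diff(log(P)) give the same values.

```r
P <- c(29346, 32185, 32971)   # first three prices of the BTC2 data shown earlier

r_lag  <- log(P) - log(c(NA, P[-length(P)]))  # same idea as log(Price) - log(lag(Price))
r_diff <- diff(log(P))                        # equivalent, without the leading NA

round(r_lag, 4)
round(r_diff, 4)
```

diff(log(P)) returns one value fewer than the input, so it cannot be stored directly as a new column of the same data frame without padding it with a leading NA.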

# Select the variables id, weight, sex, and income
mydata %>% select(id,weight,sex,income) 
  id weight    sex income
1  1     60   Male    Low
2  2     70   Male   High
3  3     63   Male    Low
4  4     55 Female    Low
5  5     48 Female Middle
6  6     49 Female Middle
7  7     58 Female Middle
8  8     58 Female   High
# A minus sign (-) is used to exclude variables
mydata %>% select(-height, -Sex,-status1) 
  id weight age    sex income status0
1  1     60  NA   Male    Low  grade1
2  2     70  17   Male   High  grade2
3  3     63  23   Male    Low  grade3
4  4     55  24 Female    Low  grade4
5  5     48  19 Female Middle  grade5
6  6     49  19 Female Middle  grade6
7  7     58  16 Female Middle  grade7
8  8     58  26 Female   High  grade8
# Select females aged 19, together with anyone whose weight is greater than 59 (& is evaluated before |)
mydata %>% filter(sex == "Female" &
                    age == 19 | weight > 59) 
  id weight age height    sex    Sex income status0 status1
1  1     60  NA    151   Male   Male    Low  grade1 grade 1
2  2     70  17    164   Male   Male   High  grade2 grade 2
3  3     63  23    160   Male   Male    Low  grade3 grade 3
4  5     48  19    179 Female Female Middle  grade5 grade 5
5  6     49  19    156 Female Female Middle  grade6 grade 6
mydata$Sex <- recode(mydata$Sex,
       "Male" = "M", "Female" = "F")
mydata[c(1,5),1:6]
  id weight age height    sex Sex
1  1     60  NA    151   Male   M
5  5     48  19    179 Female   F
mydata <- mydata %>% 
  rename(gender = "sex")
mydata[1:3,1:5]
  id weight age height gender
1  1     60  NA    151   Male
2  2     70  17    164   Male
3  3     63  23    160   Male
mydata <- mydata %>% arrange(age, weight)
mydata[c(1:5),c("id","age","weight","gender","income")]
  id age weight gender income
1  7  16     58 Female Middle
2  2  17     70   Male   High
3  5  19     48 Female Middle
4  6  19     49 Female Middle
5  3  23     63   Male    Low
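
The filter() example above mixes & and | in one condition. In R, & is evaluated before |, so parentheses are needed whenever the "or" should be applied first. A minimal sketch on three made-up individuals:

```r
sex_demo    <- c("Male", "Female", "Female")
age_demo    <- c(30, 19, 40)
weight_demo <- c(70, 48, 65)

# Without parentheses: (Female AND age 19) OR weight > 59
no_parens <- sex_demo == "Female" & age_demo == 19 | weight_demo > 59

# With parentheses: Female AND (age 19 OR weight > 59)
with_parens <- sex_demo == "Female" & (age_demo == 19 | weight_demo > 59)

no_parens     # the male is kept because his weight exceeds 59
with_parens   # the male is excluded
```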

  • Calculate the average age for each income category
mydata %>% group_by(income) %>%
  summarize(avg_age = mean(age))
# A tibble: 3 × 2
  income avg_age
  <ord>    <dbl>
1 Low       NA  
2 Middle    18  
3 High      21.5

Note that the age of the first individual (id = 1) is missing (“NA”). This individual has low income. Thus, the average age for the low-income group is missing (“NA”).

Exercise: How do you solve this issue?

  • Use the %>% operator to chain multiple statements
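mutate_data <- mydata %>% 
  select(id, age, height, weight,gender,income,status0) %>%
  mutate(height_foot = 0.033 * height) %>% 
  rename(status = status0) %>%
  filter(income %in% c("Low","Middle")) %>% # use %in% (not ==) to match any of several values
  arrange(age, income) # "income" is an ordinal variable
mutate_data
  id age height weight gender income status height_foot
1  7  16    168     58 Female Middle grade7       5.544
2  5  19    179     48 Female Middle grade5       5.907
3  6  19    156     49 Female Middle grade6       5.148
4  3  23    160     63   Male    Low grade3       5.280
5  4  24    170     55 Female    Low grade4       5.610
6  1  NA    151     60   Male    Low grade1       4.983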
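
When a condition should match any of several values, == and %in% behave very differently. The sketch below uses the income values in the row order of mydata after arrange(age, weight), typed in by hand, to show why.

```r
income_chr <- c("Middle", "High", "Middle", "Middle", "Low", "Low", "High", "Low")

# == recycles c("Low", "Middle") along the vector and compares position by position
eq_match <- income_chr == c("Low", "Middle")

# %in% asks, for each element, whether it belongs to the set
in_match <- income_chr %in% c("Low", "Middle")

sum(eq_match)   # 2: only the positions that happen to line up
sum(in_match)   # 6: every "Low" or "Middle" value
```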

Exercise:

  • Use the function group_by() to group data by status and calculate the five-number summary (minimum, first quartile, median, third quartile, and maximum) for each status.
  • Group data by status and income and calculate the mean and standard deviation of each group based on their age.

Probability functions in R

  • In many scenarios, we use probability functions to generate simulated data.
  • In R, probability functions take the form [dpqr]distribution_abbreviation(), where the first letter indicates which aspect of the distribution is returned:
    • d = Density
    • p = Cumulative distribution function
    • q = Quantile function
    • r = Random generation
Distribution | Syntax | Distribution | Syntax | Distribution | Syntax
Beta | beta() | Binomial | binom() | Cauchy | cauchy()
Chi-squared | chisq() | Exponential | exp() | F | f()
Gamma | gamma() | Geometric | geom() | Hypergeometric | hyper()
Lognormal | lnorm() | Logistic | logis() | Multinomial | multinom()
Negative binomial | nbinom() | Normal | norm() | Poisson | pois()
Wilcoxon signed rank | signrank() | T | t() | Uniform | unif()
Weibull | weibull() | Wilcoxon rank sum | wilcox() | |

Example:

  • Q1: What is the area under the standard normal curve to the left of z = 2.1?
    • Answer: Use the command pnorm(2.1) to get 0.9821356.
  • Q2: What is the value of the 95th percentile of a normal distribution with a mean of 100 and a standard deviation of 20?
    • Answer: Use the command qnorm(0.95, mean =100, sd = 20) to get 132.8971
  • Q3: Generate 300 random normal deviates with a mean of 80 and a standard deviation of 10.
    • Answer: Use the command rnorm(300, mean =80, sd = 10) to get the simulated series.
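
The four prefixes work the same way for every distribution in the table. A short sketch with the normal, binomial, and exponential distributions (the parameter values are arbitrary choices for illustration):

```r
dnorm(0)                          # density of N(0, 1) at 0, i.e. 1 / sqrt(2 * pi)
pnorm(1.96)                       # P(Z <= 1.96), roughly 0.975
qnorm(pnorm(1.96))                # the quantile function inverts the CDF

dbinom(3, size = 10, prob = 0.5)  # P(X = 3) for X ~ Binomial(10, 0.5)
pbinom(3, size = 10, prob = 0.5)  # P(X <= 3)

set.seed(1)                       # make the simulation reproducible
sim <- rexp(1000, rate = 2)       # 1000 draws from an exponential distribution
mean(sim)                         # sample mean, close to the true mean 1 / rate = 0.5
```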

RStudio cheat sheets

The Posit Cheatsheets website suggests some favorite data science packages to use!


Visualizing data in the tidyverse

  • You may install and load the tidyverse by running the following code:
> install.packages("tidyverse")
> library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
Warning messages:
1: package ‘tidyverse’ was built under R version 4.3.1 
2: package ‘readr’ was built under R version 4.3.1 

A graphing template

  • ggplot2, a core package in the tidyverse, implements the grammar of graphics and can be used to visualize data in an elegant way.
  • To get started plotting in ggplot2, you will start with the two basics (data and a geom) and then add additional layers using the + sign.
library(tidyverse) # or type library(ggplot2) to load the package ggplot2 only
ggplot(data = DATAFRAME, mapping = aes(Options)) + 
  geom_TYPE()
#
# Alternatively, you may place the mapping within the geom function, where aes stands for aesthetics.
ggplot(data = DATAFRAME) + 
  geom_TYPE(mapping = aes(Options))
  • Each geom_ function takes a mapping argument, which is paired with aes().
  • The x and y arguments of aes() specify which variables to map to the x and y coordinates.
  • You can add a third variable, like z and map it to an aesthetic like the size, shape, or color/colour of the points of your plot.
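
As a concrete instance of the template, the sketch below fills in DATAFRAME, geom_TYPE, and Options using the built-in mtcars data. The choice of variables and axis labels is an assumption made for illustration; ggplot2 must be installed.

```r
library(ggplot2)

# DATAFRAME -> mtcars, geom_TYPE -> geom_point, Options -> x, y, and color
p <- ggplot(data = datasets::mtcars,
            mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

p  # printing the object draws the plot
```

Here cyl is wrapped in factor() so that the third variable is treated as categorical and receives a discrete color legend.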

Geom functions

Function | Layer | Options
geom_point() | Scatterplot | color, alpha, shape, size
geom_line() | Line graph | color, alpha, linetype, size
geom_jitter() | Jittered points | color, size, alpha, shape
geom_bar() | Bar chart | color, fill, alpha
geom_boxplot() | Box plot | color, fill, alpha, notch, width
geom_histogram() | Histogram | color, fill, alpha, linetype, binwidth
geom_smooth() | Fitted line | method, formula, color, fill, linetype, size
geom_density() | Density plot | color, fill, alpha, linetype
geom_hline() | Horizontal lines | color, alpha, linetype, size
geom_vline() | Vertical lines | color, alpha, linetype, size
geom_rug() | Rug plot | color, sides
geom_violin() | Violin plot | color, fill, alpha, linetype
geom_text() | Text annotations | see the help for this function

Source: https://nbisweden.github.io/RaukR-2019/ggplot/presentation/ggplot_presentation.html#1.
See also https://clauswilke.com/dataviz/directory-of-visualizations.html

A worked example

Let’s use our first graph to answer the following questions about the mpg data frame available from the package ggplot2:

  • Do cars with big engines use more fuel than cars with small engines?
  • What does the relationship between engine size and fuel efficiency look like?
    • Is it positive or negative?
    • Is it linear or nonlinear?
head(mpg) # Returns the first 6 rows
# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Creating a ggplot - static plots

# Creates a coordinate system that you can add layers to 
ggplot(data = mpg) 

The first argument is the dataset that you need to use in the plot. The result of this code is an empty graph (default theme used by ggplot2 is theme_gray()).

Now you can add one or more layers to ggplot(). The function geom_point() adds a layer of points (scatterplot) to your plot.

# Put displ on the x-axis and hwy on the y-axis 
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) 

The plot shows a negative relationship between the car engine size (in liters) and the car’s fuel efficiency on the highway (in miles per gallon). The bigger the size of the engine, the less efficient it is in consuming fuel.

Aesthetic mappings

The aes() (stands for aesthetics) function is used to map variables to the visual characteristics of a plot.

# Map the colors of points to the variable "class", which indicates the class of each car
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = class)) 

  • The graph indicates that there are some unusual points (outliers). The colors reveal that many of these points are two-seater cars.
  • Mapping an aesthetic to a third variable creates a legend that explains which levels correspond to which values.

Aesthetic mappings

# Map the size of points to the variable "class"
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, size = class)) 

Mapping class to the size aesthetic also makes the outliers in these data stand out, although ggplot2 warns that using size for a discrete variable is not advised.

Aesthetic mappings

# Map the variable "class" to the alpha aesthetic
# alpha controls the transparency of the points.
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, alpha = class)) 

The resulting plot is not clear. We can fine-tune the appearance of the graph using themes and better aesthetic choices.

Aesthetic mappings

# Map the variable "class" to the shape, color, and size aesthetics and use the classic theme
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, shape = class, colour = class, size = class)) +
  theme_classic() 

  • By default, ggplot2 uses only six shapes at a time. The variable class has seven levels, so the suv points go unplotted when class is mapped to the shape aesthetic.
  • Both the British spelling colour and the American spelling color work.
  • Exercise: Is it possible to abbreviate an argument name in R (e.g., color) using only its first few characters (e.g., col)?

Set the aesthetic properties of the geom manually

Here, I will use the theme theme_bw() (for black and white).

ggplot(data = mpg) +
 geom_point(mapping=aes(x=displ, 
                        y=hwy,color="blue")) +
  theme_bw() 

Exercise: Why are the points not blue?

Note: Instead of using the character name of the color “blue”, you can use the “#0000FF” hex code of this color.

ggplot(data = mpg) +
 geom_point(mapping=aes(x=displ, 
                        y=hwy),color="#0000FF") +
  theme_bw()

Exercise: The points are blue now! Why?

ggplot(data = mpg) + 
  geom_point(aes(x=displ, hwy),size = 4)

ggplot(data = mpg) + 
  geom_point(aes(displ,hwy,col=drv),size = 4) #a legend to the right will be created

Scales

Scale functions (which start with scale_) allow you to modify the default scaling provided by ggplot2.

Function | Description
scale_x_continuous() | Scales the x-axis for quantitative variables. Options include breaks for specifying tick marks, labels for specifying tick mark labels, and limits to control the range of the values displayed
scale_y_continuous() | Same as above for the y-axis
scale_x_discrete() | Same as above for an x-axis representing a categorical variable
scale_y_discrete() | Same as above for a y-axis representing a categorical variable
scale_color_manual() | Specifies the colors (with the option values) used to represent the levels of a categorical variable

Facets

facet_wrap() and facet_grid() are used to partition a plot into a matrix of panels (side-by-side graphs), particularly useful for categorical variables.

Function: Description
  • facet_wrap(~var, nrow = r): Partition plots for each level of the variable (var), arranged into r rows.
  • facet_wrap(~var, ncol = c): Partition plots for each level of the variable (var), arranged into c columns.
  • facet_grid(row_var~col_var): Partition plots for each combination of the rows variable (row_var) and the columns variable (col_var).
  • facet_grid(rows = row_var): Partition plots for each level of the rows variable (row_var), arranged as a single column.
  • facet_grid(cols = col_var): Partition plots for each level of the columns variable (col_var), arranged as a single row.

ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy,
                     color = drv, shape=drv)) +
  geom_point(alpha = .7, size = 3) +
  scale_x_continuous(breaks = seq(1, 7, 0.5)) +
  scale_y_continuous(breaks = seq(10, 45, 5)) +
  scale_color_manual(values = c("darkgreen","cornflowerblue","indianred3")) +
  theme_bw()
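
The example above uses only the continuous and manual color scales. As a complementary sketch, the discrete scale functions from the table can be used in the same way; here limits reorders the categories and labels renames them. The label text follows the meaning of the drv codes (4, f, r); the object names are illustrative.

```r
library(ggplot2)

# Display labels for the three drive-train codes in mpg$drv
drv_labels <- c("4" = "Four-wheel", "f" = "Front-wheel", "r" = "Rear-wheel")

p_discrete <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  # limits sets the order of the categories, labels their tick text
  scale_x_discrete(limits = c("f", "r", "4"), labels = drv_labels) +
  scale_y_continuous(breaks = seq(10, 45, 5), limits = c(10, 45)) +
  theme_bw()
p_discrete
```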

Facets: Facet the plot by a single variable

# Facet the scatterplot by the variable "class" (two rows of panels) and use the black-and-white theme
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2) +
  theme_bw()

  • To facet your plot by a single variable, use facet_wrap( ~ name of this variable).
  • Note that the variable that you pass to facet_wrap() should be discrete.
  • Exercise: What happens if you replace facet_wrap(~ class, nrow = 2) by facet_grid(~ class)?

# Bar graphs 
ggplot(data=mpg,mapping=aes(x=hwy))+
     geom_bar() + facet_grid(rows = vars(drv), margins = TRUE, scales = "free_y")

Note: The default argument scales = “fixed” is used if x and y scales are fixed across all panels; scales = “free_x” if x scale is free and y scale is fixed; scales = “free_y” if y scale is free and x scale is fixed; and scales = “free” if x and y scales vary across panels.
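
The scales argument works the same way in facet_wrap(). A small sketch, assuming we facet the displ/hwy scatterplot by drv and let both axes vary across panels; the object names are illustrative.

```r
library(ggplot2)

p_free <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv, scales = "free") +  # each panel gets its own x and y range
  theme_bw()
p_free

# One row per panel; SCALE_X and SCALE_Y record which scale each panel uses
panel_layout <- ggplot_build(p_free)$layout$layout
panel_layout
```

Free scales make within-panel patterns easier to see, at the cost of making comparisons across panels harder.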

Facets: Facet the plot on the combination of two variables

# Facet the scatterplot by drive train (rows) and number of cylinders (columns), using the black-and-white theme
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl) +
  theme_bw()

  • To facet your plot on the combination of two variables, use facet_grid(name of 1st variable ~ name of 2nd variable).
  • Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.
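
Some panels of a facet_grid() plot can be empty because not every combination of the two variables occurs in the data. One way to check this beforehand is a simple cross-tabulation, sketched here with base R; the object names are illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Number of cars for each drive train (rows) and cylinder count (columns)
combo_counts <- with(mpg, table(drv, cyl))
combo_counts

# Combinations with a zero count correspond to empty panels
empty_panels <- sum(combo_counts == 0)
empty_panels
```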

Legend layout

  • To control the position of the legend, you need to use a theme() setting.
  • The theme setting legend.position controls where the legend is drawn:
    • + theme(legend.position = "right") # the default
    • + theme(legend.position = "left")
    • + theme(legend.position = "top")
    • + theme(legend.position = "bottom")
  • You may include other options in the theme() as seen below!
suv <- mpg %>% filter(class == "suv")
p <- ggplot(suv, aes(displ, hwy, color = drv)) +
  geom_point(size = 4) + theme_bw()

p + labs(title = "Fuel economy data",
       subtitle = "Suv cars",
       x = "Engine displacement, in litres",
       y = "Highway miles per gallon",
       color = "Type of drive train") +
  scale_color_manual(labels = c("4wd", "Rear wheel drive"), 
                     values = c("blue", "red")) +
  theme(legend.position="bottom", 
        legend.key.size = unit(1.4, "cm"),
        legend.key.height=unit(0.5, "cm"),
        legend.key = element_rect(fill = "gray90", color = "red"),
        text=element_text(family="serif")) 

You can map the values of a numeric (non-categorical) variable to a continuous color scale using the function scale_color_gradient().

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple")
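
In mpg, year takes only two distinct values, so the gradient above shows just its two end colors. As a hedged variation, the same scale can be applied to a variable with many distinct values; the sketch below assumes cty (city miles per gallon) as that variable, and the object names are illustrative.

```r
library(ggplot2)

# year has only two distinct model years in mpg
year_values <- sort(unique(mpg$year))
year_values

# Mapping a variable with many distinct values uses the whole gradient
p_gradient <- ggplot(data = mpg) +
  geom_point(aes(displ, hwy, colour = cty), size = 4) +
  scale_color_gradient(low = "green", high = "purple") +
  labs(colour = "City mpg")
p_gradient
```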

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple") + 
  theme_minimal()

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple") +
  theme_dark()

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple") +
  theme_void()

You can also use the themes available from the package ggthemes

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple") + 
  ggthemes::theme_fivethirtyeight()

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple") + 
  ggthemes::theme_economist()

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple") +
  ggthemes::theme_excel_new()

Or you can use the themes from the package hrbrthemes

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple") +
  hrbrthemes::theme_ft_rc()

Or the package ggdark

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple") +
  ggdark::dark_theme_classic()

You can even create your own theme. The source code of the theme theme_bluewhite() used below can be found at https://www.datanovia.com/en/blog/ggplot-themes-gallery/

ggplot(data = mpg) + 
  geom_point(aes(displ,
                 hwy,
                 col=year),
             size = 4) + 
  scale_color_gradient(low = "green", 
                       high = "purple") + 
  theme_bluewhite()
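
A custom theme is just a function that returns a theme object. The following is a minimal sketch of that idea, not the source of theme_bluewhite(): the name theme_my_blue() and all of its settings are hypothetical choices made for illustration.

```r
library(ggplot2)

# Hypothetical custom theme: start from theme_bw() and override a few elements
theme_my_blue <- function(base_size = 11, base_family = "") {
  theme_bw(base_size = base_size, base_family = base_family) +
    theme(
      panel.background = element_rect(fill = "aliceblue"),
      panel.grid.minor = element_blank(),
      axis.text        = element_text(colour = "steelblue4"),
      plot.title       = element_text(face = "bold")
    )
}

p_custom <- ggplot(data = mpg) +
  geom_point(aes(displ, hwy, col = year), size = 4) +
  theme_my_blue()
p_custom
```

Building on an existing complete theme keeps every element defined, so only the overrides need to be listed.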

Boxplots

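A boxplot helps visualize whether the distribution of a data set is symmetric or skewed and whether it contains unusual observations (outliers). The graph displays the five-number summary (minimum, first quartile, median, third quartile, and maximum); in geom_boxplot(), observations more than 1.5 times the interquartile range beyond the quartiles are drawn as individual points.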

ggplot(data=mpg,
       mapping=aes(x=class,y=hwy))+
  geom_boxplot() + 
  theme_bw()
# Flip the coordinate systems (horizontal boxplots)
ggplot(data=mpg,
       mapping=aes(x=class,y=hwy))+
  geom_boxplot() + 
  theme_bw() + 
  coord_flip()
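
The numbers behind a single box can be inspected with base R. This is a sketch for the suv class only, using fivenum() and boxplot.stats(); note that these functions compute hinges that may differ slightly from the quartiles geom_boxplot() uses, and the object names are illustrative.

```r
library(ggplot2)  # provides the mpg data set

hwy_suv <- mpg$hwy[mpg$class == "suv"]

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(hwy_suv)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
five

# Whisker ends and the points flagged as outliers (1.5 x IQR rule)
bs <- boxplot.stats(hwy_suv)
whisker_ends <- bs$stats[c(1, 5)]
outliers <- bs$out
whisker_ends
outliers
```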

Boxplots

ggplot(data=mpg,
    mapping=aes(x=drv,y=hwy,color=class))+
  geom_boxplot(size=1) + 
  theme_bw()
ggplot(data=mpg,
    mapping=aes(x=class,y=hwy,fill=drv))+
  geom_boxplot(size=1) + 
  theme_bw()

Line chart (time series) and Histogram

  • Use the function tq_get() available from the package tidyquant to get stock price data (say, Apple) from several web sources.
  • Define the returns as a time series and plot this series using a line chart.
  • Plot the distribution of the returns using a histogram and annotate the plot.
  • Use the function grid.arrange() available from the package gridExtra to place the two plots on one page.
  • Use the function annotate() in ggplot2 to add a text label to your plot.
# Load "tidyquant" and "tidyverse" packages 
library(tidyquant)
library(tidyverse)

# Get daily stock prices of Apple from the web in a tibble format
Apple <- tq_get("AAPL",from="2010-01-04",
                to="2018-12-31",get="stock.prices")

# Add the previous day's close and the log-returns series "ret"
Ap <- Apple %>% 
  mutate(Date = ymd(date), 
         Beforeclose = dplyr::lag(close),
         ret = log(close) - log(Beforeclose)) %>%
  drop_na(ret) #remove "NA"
# Plot log-returns series
P1 <- ggplot(Ap)+
  geom_line(aes(x=Date,y=ret),color="gray30")+
  labs(y="Log Returns", x="") +
  scale_x_date(date_labels="%Y %b", 
               date_breaks="12 months") +
  theme_bw()

# Plot histogram
P2 <- ggplot(Ap)+ 
  geom_histogram(aes(ret),binwidth=0.004,
                 col="gray30",fill="gray80")+
  annotate("text",x=c(-0.1,-0.1),y=c(70,60),
           label=c("Skewness:-0.1738",
                   "Ex.kurtosis:3.5783"),
           color=c("gray30","gray30"))+
  labs(y="", x="Log Returns") +
  theme_bw()

# Load "gridExtra" package
library(gridExtra)

# Place the two plots on one page
grid.arrange(P2, P1, nrow=1, 
   top="Apple, Inc. stock price from January 04, 2010 to December 31, 2018")
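
The skewness and excess kurtosis shown in the histogram annotation above are typed in by hand. The sketch below shows how such values could be computed with base R, assuming simple moment-based estimators; the slides do not say which estimator or package produced the annotated numbers, so results may differ slightly. The helper names sample_skewness() and sample_excess_kurtosis() are hypothetical.

```r
# Moment-based sample skewness: third central moment over sd^3
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
sample_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Small symmetric toy vector used only to exercise the functions
toy <- c(-2, -1, 0, 1, 2)
toy_skew <- sample_skewness(toy)
toy_kurt <- sample_excess_kurtosis(toy)

# With the data from the slides one would call, for example:
# sample_skewness(Ap$ret); sample_excess_kurtosis(Ap$ret)
```

The results could then be pasted into the label argument of annotate(), for instance with paste0("Skewness:", round(value, 4)).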

Correlation matrix: ggcorrplot package

  • The function ggcorrplot() in the package ggcorrplot visualizes a correlation matrix using ggplot2. It is inspired by the package corrplot.
  • Consider the motor trend car road tests data set, mtcars. More information about the data can be found by typing ?mtcars
library(ggcorrplot)
corr <- round(cor(mtcars), 1)
ggcorrplot(corr,
           type = "lower",
           lab = TRUE)
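
As a possible extension, the correlations that are not statistically significant can be hidden. This sketch assumes the helper cor_pmat() and the arguments p.mat, insig, and hc.order of ggcorrplot(); the object names are illustrative.

```r
library(ggcorrplot)

corr <- round(cor(mtcars), 1)

# Matrix of p-values for the pairwise correlation tests
p_mat <- cor_pmat(mtcars)

p_corr <- ggcorrplot(corr,
                     type = "lower",
                     lab = TRUE,
                     hc.order = TRUE,   # reorder variables by hierarchical clustering
                     p.mat = p_mat,
                     insig = "blank")   # leave non-significant cells empty
p_corr
```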

Interactive plots: plotly package

  • The package plotly provides R bindings to the JavaScript plotting library plotly.js.
  • Consider the iris data with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
library(plotly)
iris %>% 
  plot_ly(x=~Sepal.Length,
          y=~Sepal.Width,
          color=~Species,
          size=1)%>% 
  add_markers()

Interactive plots: plotly package

The function ggplotly() from the package plotly can be used to convert a static ggplot2 object into an interactive plot.

library(plotly)
p <- ggplot(iris,
            aes(x=Sepal.Length,
                y=Sepal.Width,
                col=Species)) + 
  geom_point(size = 3) + 
  labs(x="Sepal Length", y="Sepal Width") + 
  theme_bw()
ggplotly(p)
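
The hover text of the converted plot can be customized. A minimal sketch, assuming the text aesthetic together with the tooltip argument of ggplotly(); the tooltip wording and the object names are illustrative.

```r
library(ggplot2)
library(plotly)

p_tip <- ggplot(iris,
                aes(x = Sepal.Length,
                    y = Sepal.Width,
                    col = Species,
                    # text is not drawn by ggplot2; ggplotly() uses it for the tooltip
                    text = paste("Petal length:", Petal.Length))) +
  geom_point(size = 3) +
  labs(x = "Sepal Length", y = "Sepal Width") +
  theme_bw()

# Show only the custom text and the x and y values on hover
ip <- ggplotly(p_tip, tooltip = c("text", "x", "y"))
ip
```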

Interactive plots: plotly package

  • The package plotly can also be used to make contour and surface plots.
  • Consider the 87 × 61 matrix of altitudes of the volcano Maunga Whau (Mt Eden) in Auckland, stored in the dataset volcano. More information about the data can be found by typing ?volcano
library(plotly)
plot_ly(z = ~volcano) %>% 
  add_contour(type="contour", 
    contours=list(showlabels=TRUE))

To plot the surface, use the following command:
plot_ly(z=~volcano) %>% add_surface()

Interactive plots for matrix plot: plotly and GGally packages

  • The ggpairs() function of the GGally package builds a matrix of pairwise plots (scatterplots, boxplots, histograms, densities, and correlations).
  • The ggplotly() function of the package plotly turns this matrix into an interactive plot.
library(GGally)
p <- ggpairs(iris, 
        mapping = aes(color = Species),
        title="Correlogram") 
plotly::ggplotly(p)

Exercise: Try the following

library(ggcorrplot); library(plotly)
corr <- round(cor(mtcars), 1)
ggcorrplot(corr)
ggplotly(ggcorrplot(corr,lab=TRUE))

Interactive plots: ggiraph package

  • The package ggiraph converts a static ggplot2 object into an interactive plot.
library(ggiraph)
p <- ggplot(iris,aes(x=Sepal.Length,
                     y=Petal.Length, colour=Species))+
  geom_point_interactive(aes(tooltip=
                             paste0("<b>Petal Length:</b>",
                                Petal.Length,"\n<b>Sepal Length:</b>",
                                Sepal.Length,"\n<b>Species:</b>",
                                Species)),size=1)+ 
  theme_bw()
tooltip_css <- "background-color:#f8f9f9;
                padding:10px;
                border-style:solid;
                border-width:2px;
                border-color:#125687;
                border-radius:5px;"
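# girafe() supersedes the older ggiraph() function in current versions of the package
girafe(ggobj=p,
       width_svg=4,height_svg=4,
       options=list(
         opts_hover(css="cursor:pointer;stroke:black;fill-opacity:0.3"),
         opts_zoom(max=5),
         opts_tooltip(css=tooltip_css,opacity=0.9),
         opts_sizing(width=1)))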

Interactive plots: rbokeh package

  • The package rbokeh can be used to produce interactive plots.
library(rbokeh)
figure(height=480,width=620,
       xlab="Sepal Length",
       ylab="Sepal Width") %>%
 ly_points(Sepal.Length,
           Sepal.Width,
           data=iris,
           color=Species,
           glyph=Species,
           hover=list(Sepal.Length,
                      Sepal.Width))

Interactive plots: highcharter package

The package highcharter is a wrapper around the JavaScript library Highcharts.

library(highcharter)
p <- iris %>%
  hchart("scatter",
         hcaes(x="Sepal.Length",
               y="Sepal.Width",group="Species")) %>%
  hc_xAxis(title=list(text="Sepal Length"),
           crosshair=TRUE) %>%
  hc_yAxis(title=list(text="Sepal Width"),
           crosshair=TRUE) %>%
  hc_chart(zoomType="xy",inverted=FALSE) %>%
  hc_legend(verticalAlign="top",align="right") %>% 
  hc_size(height=500,width=500)

htmltools::tagList(list(p))

Interactive plots: gganimate package

Consider the gapminder data set on life expectancy, GDP per capita, and population by country.

library(gganimate)
library(gapminder)
p <- ggplot(gapminder,
       aes(x=gdpPercap, 
           y=lifeExp, 
           size=pop,
           color=country)) + 
  geom_point(show.legend=F,
             alpha=0.7) + 
  scale_color_viridis_d() + 
  scale_size(range=c(2, 12)) + 
  scale_x_log10()+ 
  theme_bw() + 
  labs(x="GDP per capita",
      y="Life expectancy")
p + 
   transition_time(year) + 
   labs(title="Year: {frame_time}")

Interactive plots: gganimate package

Consider the same gapminder data set as in the previous slide. Here, we facet the animation by continent to compare continents.

p <- ggplot(gapminder,
            aes(x=gdpPercap, 
                y=lifeExp, 
                size=pop,
                color=country)) + 
  geom_point(show.legend=F,
             alpha=0.7) + 
  scale_color_viridis_d() + 
  scale_size(range=c(2, 12)) + 
  scale_x_log10()+ 
  theme_bw() + 
  labs(x="GDP per capita",
       y="Life expectancy") + 
  facet_wrap(~continent)
p + 
  transition_time(year) + 
  labs(title="Year: {frame_time}")

Interactive plots: dygraphs package

  • The package dygraphs provides R bindings to the JavaScript library dygraphs for charting time series data.
  • Consider the three time series giving the monthly deaths from lung diseases in the UK from 1974 to 1979 for both sexes (ldeaths), males (mdeaths) and females (fdeaths).
library(dygraphs)
UKLungDeaths <- cbind(ldeaths, mdeaths, fdeaths)
dygraph(UKLungDeaths, main="Monthly Deaths from Lung Diseases in the UK")%>%
  dyOptions(colors=c("#66C2A5","#FC8D62","#8DA0CB"))
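
Interactive controls can be added to the same chart. This sketch assumes the functions dyRangeSelector() and dyLegend() from dygraphs and uses the native pipe (R 4.1 or later); the object name d is illustrative.

```r
library(dygraphs)

UKLungDeaths <- cbind(ldeaths, mdeaths, fdeaths)

d <- dygraph(UKLungDeaths,
             main = "Monthly Deaths from Lung Diseases in the UK") |>
  dyRangeSelector() |>          # adds a draggable date-range selector under the chart
  dyLegend(show = "always")     # keeps the legend visible without hovering
d
```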

Interactive plots: networkD3 package for network graph

The package networkD3 allows the use of interactive network graphs from the D3.js javascript library.

library(networkD3)
data(MisLinks,MisNodes)
forceNetwork(Links=MisLinks,
             Nodes=MisNodes,
             Source="source",
             Target="target",
             Value="value",
             NodeID="name",
             Group="group", 
             arrows = TRUE,
             legend = FALSE,
             opacity=0.9,
             height=500,
             width=700,
             fontSize=30)

Interactive plots: leaflet package

The package leaflet provides R bindings to the JavaScript mapping library Leaflet (leafletjs).

library(leaflet)
Carleton_University <- leaflet(height=360) %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=-75.698312, lat=45.383082, 
             popup="Carleton University") 
Carleton_University 

Interactive plots: crosstalk package

R package crosstalk allows crosstalk enabled plotting libraries to be linked. Through the shared key variable, data points can be manipulated simultaneously on two independent plots.

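# Load crosstalk and htmltools, plus leaflet and plotly, which the code below also uses
invisible(lapply(c("crosstalk","htmltools","leaflet","plotly"), library, character.only = TRUE))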
shared_quakes <- SharedData$new(quakes[sample(nrow(quakes), 100),])
lf <- leaflet(shared_quakes,height=300) %>%
        addTiles(urlTemplate='http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png') %>% addMarkers()
py <- plot_ly(shared_quakes,x=~depth,y=~mag,size=~stations,height=300) %>% add_markers()
div(div(lf,style="float:left;width:45%"),div(py,style="float:right;width:45%"))

R Markdown

R Markdown is a powerful tool to write up a good-looking report by combining R code chunks, analysis, and reporting into the same document.

  • To start using R Markdown in the RStudio environment, you need to install the package rmarkdown. This can be done by typing the following code:
install.packages("rmarkdown")
  • If you want to produce PDF output, or use LaTeX to write mathematical equations in it, you also need a LaTeX distribution such as TinyTeX. It can be installed through the package tinytex by typing the following code:
install.packages("tinytex")
tinytex::install_tinytex()
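
To illustrate how narrative text, R code, and equations sit together in one file, the sketch below writes a very small R Markdown document to a temporary folder and renders it to HTML if rmarkdown and pandoc are available. The file name, title, and content are hypothetical, chosen only for illustration.

```r
# Build the chunk delimiter (three backticks) without typing it literally
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"A minimal report\"",
  "output: html_document",
  "---",
  "",
  "The mean highway mileage is `r round(mean(ggplot2::mpg$hwy), 1)` miles per gallon.",
  "",
  paste0(fence, "{r scatter, echo=TRUE}"),
  "library(ggplot2)",
  "ggplot(mpg, aes(displ, hwy)) + geom_point()",
  fence,
  "",
  "An equation written in LaTeX: $\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i$"
)

# Hypothetical file name in the session's temporary directory
rmd_file <- file.path(tempdir(), "minimal_report.Rmd")
writeLines(rmd_lines, rmd_file)

# Render only when the required tools are present
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In RStudio, the same document can be rendered with the Knit button instead of calling render() directly.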

This document is prepared by R Markdown.

Exercises

Q1: (see Chapter 3 of Rob Kabacoff's book, Data Visualization with R): The Marriage dataset from the package mosaicData contains the marriage records of 98 individuals in Mobile County, Alabama.
  • Plot a bar chart to display the distribution of wedding participants by race.
  • Plot the distribution of race with modified colors and labels.
  • Plot the distribution of race as percentages.
  • Sort the bar chart with percent labels.
  • Plot a stacked bar chart as seen in Section 4.1.1 and add labels to it.
  • Plot a grouped bar chart as seen in Section 4.1.2 and add labels to it.
Q3: (see Chapter 11 of Rob Kabacoff's book, Data Visualization with R):
  • Save the graphs from Q1 and Q2 in several formats (pdf, jpeg, tiff, png, svg, wmf) using the function ggsave().
  • Save the graphs from Q1 and Q2 using the menu in the Plots panel: Plots panel –> Export –> Save as Image or Save as PDF.