Data Visualization in R (Part-2)



In this article, I will plot some more advanced charts using the ggplot2 package. If you want to learn about basic plots first, you can refer to my earlier article, Data Visualization in R (Part 1). You can also view other posts related to visualization here.


Data Smoothing in plots

Smoothing means using an algorithm to remove noise from a data set so that the important patterns stand out. To add a smoothing line we use the geom geom_smooth(). For small data sets (fewer than 1,000 observations) it uses LOESS by default, which stands for locally estimated scatterplot smoothing (also called locally weighted scatterplot smoothing).

library(ggplot2)  # needed for every plot in this article

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() + geom_smooth()


If we want the previous plot to use ordinary linear model smoothing instead, we can pass the argument method = "lm".

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() + geom_smooth(method = "lm")


The shaded region in the plots above shows the 95% confidence interval around the fitted line. It is computed from the standard error of the fit, which is why the argument that controls it is called se. We can remove the shaded region with se = FALSE.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
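
To see where the shaded band comes from, the following base R sketch fits the same linear model with lm() and asks predict() for a 95% confidence interval at a few weights. The object names (fit, new_wt, band) and the chosen weights are only illustrative, and geom_smooth() may evaluate the fit on a different grid of x values.

```r
# Fit the same straight line that geom_smooth(method = "lm") draws
fit <- lm(mpg ~ wt, data = mtcars)

# Illustrative weights (in 1000 lbs) at which to evaluate the fit
new_wt <- data.frame(wt = c(2, 3, 4, 5))

# "fit" is the line; "lwr" and "upr" bound the 95% confidence band
band <- predict(fit, newdata = new_wt, interval = "confidence", level = 0.95)
print(cbind(new_wt, band))
```

Setting se = FALSE simply drops the lwr/upr band and keeps the fitted line.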


Grouping variables in plots

Sometimes we want to see how a pattern differs between subgroups defined by a categorical variable. We can show this by mapping the variable to the col aesthetic, as follows:

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)


In the ggplot command above, the smooth is calculated separately for each subgroup because there is an invisible group aesthetic that inherits from col. The following plot also adds a smoothing line for the complete data set alongside the subgroup lines, by overriding the grouping with aes(group = 1).

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  stat_smooth(method = "lm", se = FALSE, aes(group = 1), linetype = 5)
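
As a rough check on what the grouped and ungrouped smooths represent, the sketch below fits the same linear model once per cylinder group and once on the pooled data, using base R only. The names slopes and pooled_slope are illustrative and not part of the original plot code.

```r
# One linear fit per cylinder group, mirroring the col/group aesthetic
fits_by_cyl <- lapply(split(mtcars, mtcars$cyl),
                      function(d) lm(mpg ~ wt, data = d))

# Slope of mpg on wt within each group (4, 6 and 8 cylinders)
slopes <- sapply(fits_by_cyl, function(f) coef(f)[["wt"]])

# Slope for the whole data set, mirroring aes(group = 1)
pooled_slope <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

print(round(slopes, 2))
print(round(pooled_slope, 2))
```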


Mapping different models in plots

In the plot below we add two different models, lm and loess, to the same plot. Here lm stands for linear model, which is fitted by ordinary least squares (OLS). LOESS smoothing is a non-parametric form of regression that uses a weighted average over a sliding window to calculate a line of best fit. We can control the size of this window with the span argument, which only affects the loess smooth.

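library(RColorBrewer)  # provides brewer.pal()

myColors <- c(brewer.pal(3, "Dark2"), "black")
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  # one straight-line fit per cylinder group (span has no effect on lm)
  stat_smooth(method = "lm", se = FALSE) +
  # one LOESS curve for the whole data set, labelled "All" in the legend
  stat_smooth(method = "loess", 
              aes(group = 1, col = "All"), 
              se = FALSE, span = 0.7) +
  scale_color_manual("Cylinders", values = myColors)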
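
To get a feel for the span argument outside of a plot, this sketch fits two LOESS curves with base R's loess() and compares their predictions at a few weights. The span values 0.5 and 0.9 and the evaluation points are arbitrary choices for illustration.

```r
# Smaller span: narrower window, so the curve follows the data more closely
fit_narrow <- loess(mpg ~ wt, data = mtcars, span = 0.5)

# Larger span: wider window, so the curve is smoother
fit_wide <- loess(mpg ~ wt, data = mtcars, span = 0.9)

# Illustrative weights inside the observed range of wt
grid <- data.frame(wt = c(2, 3, 4, 5))

comparison <- cbind(grid,
                    narrow = predict(fit_narrow, newdata = grid),
                    wide   = predict(fit_wide, newdata = grid))
print(comparison)
```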


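ColorBrewer palettes contain a limited number of colors (9 for a sequential palette such as "YlOrRd"), so a manual color scale built from one fails when a categorical variable has more levels than the palette has colors. When the grouping variable is numeric, as year is below, we can avoid the problem by mapping it to a continuous color gradient with scale_color_gradientn(), which interpolates between the palette colors.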

# Vocab ships with the carData package (assumed to be installed)
data(Vocab, package = "carData")

ggplot(Vocab, aes(x = education, y = vocabulary, col = year, group = factor(year))) +
  geom_jitter(alpha = 0.6) +
  stat_smooth(method = "lm", se = FALSE, alpha = 0.2, size = 2) + 
  scale_color_gradientn(colors = brewer.pal(9, "YlOrRd"))
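
The same interpolation idea can be sketched for a truly categorical variable with base R's colorRampPalette(), which stretches a few anchor colors into as many as needed. The three hex codes below are assumed light-to-dark anchors in the spirit of "YlOrRd", and 15 is an arbitrary number of levels; the resulting vector could then be passed as values to scale_color_manual().

```r
# Assumed anchor colors running from light yellow to dark red
anchors <- c("#FFFFCC", "#FD8D3C", "#800026")

# colorRampPalette() returns a function that interpolates between anchors
make_palette <- colorRampPalette(anchors)

# Ask for more colors than a 9-color palette could supply
pal <- make_palette(15)
print(pal)
```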


Adding statistics in plots

We can write our own functions to calculate summary statistics and then use those statistics in our plots. For example, the function below returns the range of a variable in the form that stat_summary() expects, a data frame with the columns ymin and ymax:

plot_range <- function(x) {
  data.frame(ymin = min(x), 
             ymax = max(x))
}

The function below returns the median together with the first and third quartiles:

plot_IQR <- function(x) {
  data.frame(y = median(x), 
             ymin = quantile(x)[2], 
             ymax = quantile(x)[4])
}

# Shift the two am groups slightly apart so their ranges do not overlap
posn.d <- position_dodge(width = 0.1)

base_plot <- ggplot(mtcars, aes(x = factor(cyl),y = wt, col = factor(am), fill = factor(am), group = factor(am)))

base_plot + 
  stat_summary(geom = "linerange", fun.data = plot_IQR, 
               position = posn.d, size = 3) +
  stat_summary(geom = "linerange", fun.data = plot_range, 
               position = posn.d, size = 3, 
               alpha = 0.4) +
  # fun replaces fun.y from ggplot2 3.3.0 onwards
  stat_summary(geom = "point", fun = median, 
               position = posn.d, size = 3, 
               col = "black", shape = "X")
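
To make the role of the summary functions more concrete, the sketch below reproduces by hand roughly what stat_summary() does: it splits wt into one group per combination of cyl and am and applies a summary function to each group. The helper iqr_summary is a hypothetical stand-in written for this example, not the function used in the plot above.

```r
# Hypothetical helper: median plus first and third quartiles of a vector
iqr_summary <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
  data.frame(y = q[2], ymin = q[1], ymax = q[3])
}

# One vector of weights per combination of cyl and am, as in the plot
groups <- split(mtcars$wt, list(mtcars$cyl, mtcars$am))

# Apply the helper to each group and stack the one-row results
summaries <- do.call(rbind, lapply(groups, iqr_summary))
print(summaries)
```

Each row corresponds to one line range in the plot: ymin and ymax give its ends and y gives the position of the median marker.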


Creating pie charts

Pie charts can be thought of as a modification of stacked bar charts. Imagine a stacked bar plot: if we take the y-axis and bend it until it loops back on itself, we get a pie chart. In the code below we first create a stacked bar plot and then convert it to a pie chart using polar coordinates.

ggplot(mtcars, aes(x = 1, fill = factor(cyl))) +
  geom_bar() + 
  coord_polar(theta = "y")

[Figure: pie chart created with ggplot2]
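
The size of each slice follows directly from the bar heights. As a small sketch in base R, the counts that geom_bar() stacks can be converted into shares of the whole and into angles; the data frame slices and its column names are illustrative only.

```r
# Number of cars per cylinder count: the heights of the stacked bar segments
counts <- table(mtcars$cyl)

slices <- data.frame(cyl = names(counts), count = as.vector(counts))

# Share of the full bar, and the angle that share covers once bent into a circle
slices$share <- slices$count / sum(slices$count)
slices$degrees <- 360 * slices$share

print(slices)
```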

Use of facetting in plots

Facets split a plot into panels according to one or more categorical variables, which also lets us include more variables in a single chart. The following plot represents a total of 7 variables: wt, mpg, cyl, am, disp, vs and gear. We use a trick to map two variables onto two properties of color, hue and lightness, by combining cyl and am into a single variable, cyl_am. To accommodate this we also build a new color palette that alternates between blue and red with increasing darkness.

mtcars$cyl_am <- paste(mtcars$cyl, mtcars$am, sep = "_")
myCol <- rbind(brewer.pal(9, "Blues")[c(3,6,8)],
               brewer.pal(9, "Reds")[c(3,6,8)])

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl_am, size = disp)) +
  geom_point() + 
  scale_color_manual(values = myCol) + 
  facet_grid(vs ~ gear)
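
The alternating palette relies on how R stores a matrix: rbind() puts the blues in the first row and the reds in the second, and reading the matrix column by column interleaves them. The sketch below uses plain labels instead of real color codes to show which group ends up with which shade, assuming the values are matched to the sorted cyl_am groups in that column-major order.

```r
# Placeholder labels standing in for the three blues and three reds
blues <- c("light blue", "mid blue", "dark blue")
reds  <- c("light red", "mid red", "dark red")

# Two rows (hue) by three columns (lightness)
pal <- rbind(blues, reds)

# Flattening reads column by column, which alternates blue and red
color_order <- as.vector(pal)

# The combined groups in sorted order, as a discrete scale would list them
groups <- sort(unique(paste(mtcars$cyl, mtcars$am, sep = "_")))

print(data.frame(cyl_am = groups, color = color_order))
```

Under that assumption, hue distinguishes am (blue for 0, red for 1) and lightness distinguishes cyl (darker for more cylinders).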