
Data Visualisation in R (Part-3)


Introduction

In this article, I plot some more advanced charts using the ggplot2 package. For the basic plots, see my earlier articles, Data Visualization in R (Part 1) and Data Visualization in R (Part 2).

library(Hmisc)
library(dplyr)
library(ggplot2)
library(ggplot2movies)
library(RColorBrewer)
library(PerformanceAnalytics)
library(GGally)

Boxplots and Variable Transformation in plots

Below, I randomly sample 1,000 observations from the movies dataset in the ggplot2movies package. The variable rating is stored as a numeric variable, so we round it to the nearest integer and convert it into a factor to get one box per rating value.

set.seed(123)
movies_small <- movies[sample(nrow(movies), 1000), ]
movies_small$rating <- factor(round(movies_small$rating))
d <- ggplot(movies_small, aes(x = rating, y = votes)) +
  geom_point() +
  geom_boxplot() +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red")

d + scale_y_log10()


Sometimes we need to convert a continuous variable into a categorical one, so that we can draw boxplots where we would otherwise draw a scatterplot and understand the data better.

data("diamonds")
p <- ggplot(diamonds, aes(x = carat, y = price))

p + geom_boxplot(aes(group = cut_interval(carat, n = 10)))


p + geom_boxplot(aes(group = cut_number(carat, n = 10)))


p + geom_boxplot(aes(group = cut_width(carat, width = 0.25)))


In the above plots, the functions cut_interval(), cut_number() and cut_width() convert a continuous variable into a categorical one. cut_interval(carat, n = 10) splits carat into 10 groups of equal range; cut_number(carat, n = 10) makes 10 groups with approximately equal numbers of observations; and cut_width(carat, width = 0.25) makes groups that are each 0.25 carat wide.
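
To make the difference between the three helpers concrete, the sketch below applies them to a small made-up vector and tabulates how many values fall in each group. The vector x and the object names are illustrative only and are not part of the diamonds data.

```r
library(ggplot2)

# Illustrative vector with one large outlier (not from the article's data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# cut_interval(): n groups of equal range, so the outlier ends up alone
by_range <- cut_interval(x, n = 2)
table(by_range)

# cut_number(): n groups with (roughly) equal numbers of observations
by_count <- cut_number(x, n = 2)
table(by_count)

# cut_width(): groups of a fixed width, however many groups that takes
by_width <- cut_width(x, width = 25)
table(by_width)
```

With skewed data such as carat, this is why cut_interval() tends to produce a few crowded boxes and several sparse ones, while cut_number() keeps the boxes comparably populated.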

Boxplots with varying widths

Boxplots generally do not show the sample size of each group, so we do not know how many data points went into each box. We can address this by making the width of each box depend on the group size.

ggplot(diamonds, aes(x = cut, y = price)) + 
  geom_boxplot(varwidth = TRUE, aes(col = color)) + 
  facet_grid(. ~ color) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
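
According to the geom_boxplot() documentation, varwidth = TRUE draws boxes with widths proportional to the square roots of the group sizes. As a rough cross-check, the sketch below counts the observations behind each box using base R; the name group_sizes is just an illustrative choice.

```r
library(ggplot2)  # provides the diamonds dataset

# Number of diamonds behind each box: one row per cut/color combination
group_sizes <- as.data.frame(
  table(cut = ggplot2::diamonds$cut, color = ggplot2::diamonds$color),
  responseName = "n"
)

# The smallest and largest groups, to compare with the narrowest and widest boxes
group_sizes[which.min(group_sizes$n), ]
group_sizes[which.max(group_sizes$n), ]
```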


Weighted density plots and violin plots

When a categorical variable has several levels, it is useful to see the density of each level relative to the whole dataset. To do that, we weight the density plots so that they are comparable with each other: each density is weighted by the proportion of the data that falls in that level. For example, the following code computes the proportion of data points for each level of the categorical variable cut in the diamonds dataset and stores it in a new column n:

# Add a column n holding each cut's share of all rows (used as a weight below)
diamonds <- diamonds %>%
  group_by(cut) %>%
  mutate(n = n() / nrow(diamonds)) %>%
  ungroup()
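
As a quick sanity check on these weights, the sketch below recomputes the share of each cut from the packaged dataset and confirms that the shares add up to 1. The names cut_shares, rows and share are illustrative and do not appear elsewhere in the article.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# One row per cut, with its share of all rows (the same quantity as the weight n)
cut_shares <- ggplot2::diamonds %>%
  count(cut, name = "rows") %>%
  mutate(share = rows / sum(rows))

cut_shares
sum(cut_shares$share)  # the shares should add up to 1
```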

After calculating the weights, we can draw the weighted density plots as follows:

ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_density(aes(weight = n), col = NA, alpha = 0.35) 
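
A related way to put the curves on a common scale, offered here as an alternative sketch rather than the method used above, is to map y to the computed variable count, which stat_density() provides as the density multiplied by the number of observations in the group. This assumes a ggplot2 version that supports after_stat() (3.3.0 or later); p_count is an illustrative name.

```r
library(ggplot2)

# Scale each curve by its group size via the computed variable `count`
p_count <- ggplot(ggplot2::diamonds, aes(x = price, fill = cut)) +
  geom_density(aes(y = after_stat(count)), col = NA, alpha = 0.35)

p_count
```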


Violin plots are another useful type of plot. Here they show the distribution of each level, weighted by the proportion of the whole dataset that falls in that level:

ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_violin(aes(weight = n), col = NA)


2D density plots

data("faithful")
plot <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  scale_y_continuous(limits = c(1, 5.5), expand = c(0, 0)) +
  scale_x_continuous(limits = c(40, 100), expand = c(0, 0)) +
  coord_fixed(60 / 4.5)
plot + geom_density_2d()


plot + stat_density_2d(aes(col = after_stat(level)), h = c(5, 0.5))


Plotting Correlation matrices

The plots below show the correlation matrices between the variables in a dataset. The function chart.Correlation() from the PerformanceAnalytics package only works with continuous variables, whereas the function ggpairs() from the GGally package works well with both continuous and categorical variables.

chart.Correlation(iris[1:4])


ggpairs(mtcars[1:3])


The above plots are also known as scatterplot matrices, or SPLOMs (Scatter PLOt Matrices).
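
The correlation coefficients summarised in such a plot can also be inspected directly with base R. The sketch below computes them for the same four iris columns; iris_cor is an illustrative name.

```r
# Pairwise Pearson correlations of the four numeric iris columns
iris_cor <- cor(iris[1:4])
round(iris_cor, 2)

# The matrix is symmetric, with ones on the diagonal
isSymmetric(iris_cor)
```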

Network Plots

library(geomnet)
data("madmen")

The dataset madmen is a list of two data frames, edges and vertices, so we need to merge them into a single data frame before we can draw a network plot from it:

madmen_merged <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)
ggplot(data = madmen_merged, aes(from_id = Name1, to_id = Name2)) +
  geom_net(aes(col = Gender),
           size = 6,
           linewidth = 1,
           labelon = TRUE,
           fontsize = 3,
           labelcolour = "black",
           directed = TRUE) +
  scale_color_manual(values = c("#FF69B4", "#0099ff")) +
  xlim(c(-0.05, 1.05)) +
  theme(legend.key = element_blank())
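
The effect of all = TRUE in the merge step above can be seen on a toy example. The two small data frames below are made up for illustration and only mimic the column names used with madmen; they are not the real data.

```r
# Hypothetical edge list and vertex table (illustrative values only)
toy_edges <- data.frame(Name1 = c("A", "A", "B"),
                        Name2 = c("B", "C", "C"))
toy_vertices <- data.frame(label  = c("A", "B", "C", "D"),
                           Gender = c("female", "male", "female", "male"))

# all = TRUE keeps vertices that never appear in Name1 (here C and D);
# they get NA in Name2 instead of being dropped from the result
toy_merged <- merge(toy_edges, toy_vertices,
                    by.x = "Name1", by.y = "label",
                    all = TRUE)
toy_merged
```

Keeping those rows means every vertex still carries its attributes, such as Gender, in the merged data, even when no edge starts from it.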


Heatmaps

The autoplot() function from the ggfortify package can create a heatmap when it is given a distance matrix. For example, the eurodist dataset contains the distances between various European cities and is stored as an object of class dist.

library(ggfortify)
autoplot(eurodist) +
  coord_fixed() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
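
To see what is being plotted, the dist object can be inspected with base R alone. The sketch below is only an exploratory aid; dist_mat is an illustrative name.

```r
# eurodist ships with R (datasets package) as a "dist" object
class(eurodist)
attr(eurodist, "Size")   # number of cities

# Expanding to a full symmetric matrix shows the values behind the tiles
dist_mat <- as.matrix(eurodist)
dist_mat[1:3, 1:3]
```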


Plotting K-Means Clusters

set.seed(123)  # k-means starts from random centres; fix the seed for a repeatable result
iris_clust <- kmeans(iris[-5], 3)
autoplot(iris_clust, data = iris, frame = TRUE, shape = 'Species')
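
A cross-tabulation is a simple numeric companion to the cluster plot above: it shows how the three clusters line up with the known species. This is a minimal sketch with an arbitrary seed, and iris_clust_check is an illustrative name chosen so as not to overwrite the object used above.

```r
set.seed(123)  # arbitrary seed so the cluster labels are repeatable
iris_clust_check <- kmeans(iris[-5], centers = 3)

# Rows are k-means clusters, columns are the true species
cluster_vs_species <- table(cluster = iris_clust_check$cluster,
                            species = iris$Species)
cluster_vs_species
```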


Cartographic Maps

library(ggmap)
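# Recent versions of ggmap need a Google Maps API key, registered once per
# session with register_google(key = "..."), before get_map() can query Google.
mumbai_map <- get_map("Mumbai, India", zoom = 13)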
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Mumbai,+India&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Mumbai,%20India&sensor=false
ggmap(mumbai_map)

