# Data Visualisation in R (Part-3)

## Introduction

In this report I will plot some more advanced charts using the `ggplot2` package. If you want to learn more about basic plots, you can refer to my earlier articles, Data Visualization in R (Part 1) and Data Visualization in R (Part 2).

```r
library(Hmisc)
library(dplyr)
library(ggplot2)
library(ggplot2movies)
library(RColorBrewer)
library(PerformanceAnalytics)
library(GGally)
```

## Boxplots and Variable Transformation in plots

Below I have randomly sampled 1000 observations from the `movies` dataset available in the `ggplot2movies` package. The variable `rating` is coded as a numerical variable in the dataset, so we round it to the nearest integer and convert it into a factor variable.

```r
set.seed(123)
# Randomly sample 1000 rows, then round the ratings and treat them as a factor
movies_small <- movies[sample(nrow(movies), 1000), ]
movies_small$rating <- factor(round(movies_small$rating))
```

```r
d <- ggplot(movies_small, aes(x = rating, y = votes)) +
  geom_point() +
  geom_boxplot() +
  # Mean with normal-theory confidence limits (relies on the Hmisc package)
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red")

# Show the vote counts on a log10 scale
d + scale_y_log10()
```

Sometimes we need to convert a continuous variable into a categorical one so that we can draw boxplots instead of a scatterplot and better understand the data.

```r
data("diamonds")
```

```r
p <- ggplot(diamonds, aes(x = carat, y = price))

p + geom_boxplot(aes(group = cut_interval(carat, n = 10)))
```

```r
p + geom_boxplot(aes(group = cut_number(carat, n = 10)))
```

```r
p + geom_boxplot(aes(group = cut_width(carat, width = 0.25)))
```

In the above plots, the functions `cut_interval()`, `cut_number()` and `cut_width()` are used to convert a continuous variable into a categorical variable. `cut_interval(carat, n = 10)` splits `carat` into 10 groups of equal range, `cut_number(carat, n = 10)` makes 10 groups with approximately equal numbers of observations, and `cut_width(carat, width = 0.25)` makes groups that are each 0.25 carat wide.
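The difference between the three binning helpers is easier to see on a tiny vector with one extreme value. The following is only an illustrative sketch; the vector `x` is made up and is not part of the article's data.

```r
library(ggplot2)

# Toy vector (hypothetical data): nine small values and one outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Equal-range bins: the range 1..100 is halved, so most values share one bin
interval_counts <- table(cut_interval(x, n = 2))
interval_counts

# Equal-count bins: the split point moves so that each bin holds half the values
number_counts <- table(cut_number(x, n = 2))
number_counts

# Fixed-width bins: the number of bins now depends on the range of the data
width_counts <- table(cut_width(x, width = 50))
width_counts
```

With skewed variables such as `carat`, this is why `cut_number()` tends to give boxes built from comparable amounts of data, while `cut_interval()` and `cut_width()` can leave some groups nearly empty.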

## Boxplots with varying widths

Boxplots generally do not show the sample size of each group, so we do not know how many data points went into each box. We can deal with this problem by using variable-width boxes: with `varwidth = TRUE`, the width of each box is proportional to the square root of the number of observations in its group.
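To see the group sizes that a plain boxplot hides, the counts per group can also be tabulated directly. This is a minimal sketch using base R's `table()` on the `color` and `cut` columns of `diamonds`; the name `group_sizes` is just an illustrative choice.

```r
library(ggplot2)  # provides the diamonds dataset

# Number of diamonds for every combination of color (rows) and cut (columns)
group_sizes <- table(diamonds$color, diamonds$cut)
group_sizes

# Every row of the dataset falls into exactly one group
sum(group_sizes) == nrow(diamonds)
```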

```r
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(varwidth = TRUE, aes(col = color)) +
  facet_grid(. ~ color) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

## Weighted density plots and violin plots

When we have categorical variables with several levels, it is important to observe the density of each level with respect to the whole dataset. So we need to weight the density plots so that they are relative to each other. Each density is given a weight equal to the proportion of the whole dataset that falls in that level. For example, the following code computes the proportion of data points for each level of the categorical variable `cut` in the `diamonds` dataset and stores it in a new column `n`:

```r
# Add a column n holding the share of all rows that belongs to each cut level
diamonds <- diamonds %>%
  group_by(cut) %>%
  mutate(n = n() / nrow(diamonds)) %>%
  ungroup()
```
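As a quick sanity check on such weights, the per-level proportions should add up to 1. The sketch below recomputes them with base R's `table()` and `prop.table()` on the untouched `ggplot2::diamonds` data, so it does not depend on the pipeline above; `cut_props` is an illustrative name.

```r
library(ggplot2)  # provides the diamonds dataset

# Proportion of rows in each level of cut
cut_props <- prop.table(table(ggplot2::diamonds$cut))
cut_props

# The proportions across all levels add up to 1
sum(cut_props)
```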

After we calculate the weights, we can draw the weighted density plots as follows:

```r
ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_density(aes(weight = n), col = NA, alpha = 0.35)
```

Violin plots are another useful plot type. Here each violin is weighted by the proportion of data points in its level with respect to the whole dataset:

```r
ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_violin(aes(weight = n), col = NA)
```

## 2D density plots

```r
data("faithful")
```

```r
plot <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  scale_y_continuous(limits = c(1, 5.5), expand = c(0, 0)) +
  scale_x_continuous(limits = c(40, 100), expand = c(0, 0)) +
  coord_fixed(60 / 4.5)
plot + geom_density_2d()
```

```r
plot + stat_density_2d(aes(col = ..level..), h = c(5, 0.5))
```

## Plotting Correlation matrices

The plots below show the correlation matrices between the variables in a dataset. The function `chart.Correlation()` from the `PerformanceAnalytics` package only works with continuous variables, whereas the function `ggpairs()` from the `GGally` package works well with both continuous and categorical variables.

```r
chart.Correlation(iris[1:4])
```

```r
ggpairs(mtcars[1:3])
```

The above matrices are also known as `SPLOM`, which stands for `S`catter `PLO`t `M`atrices.
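The correlation coefficients behind such a plot can also be inspected numerically. This is a minimal base-R sketch, assuming the default Pearson method; `cor_mat` is an illustrative name.

```r
# Pairwise Pearson correlations of the four numeric iris columns
cor_mat <- cor(iris[1:4])

# Round for readability; the matrix is symmetric with 1s on the diagonal
round(cor_mat, 2)
```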

## Network Plots

```r
library(geomnet)
```

The dataset `madmen` is a list of two data frames, `edges` and `vertices`, so we need to merge these two data frames before we can make network plots from it.

```r
madmen_merged <- merge(madmen$edges, madmen$vertices,
                       by.x = "Name1", by.y = "label",
                       all = TRUE)
```
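To illustrate what `merge()` does with `by.x`, `by.y` and `all = TRUE`, here is a sketch on two small, made-up data frames (`toy_edges` and `toy_vertices` are hypothetical and only mimic the column names used above). Because of `all = TRUE`, a vertex that never appears in `Name1` is still kept, with `NA` in the `Name2` column.

```r
# Hypothetical edge list: A points to B and C, B points to C
toy_edges <- data.frame(Name1 = c("A", "A", "B"),
                        Name2 = c("B", "C", "C"),
                        stringsAsFactors = FALSE)

# Hypothetical vertex attributes: C has no outgoing edge
toy_vertices <- data.frame(label = c("A", "B", "C"),
                           Gender = c("female", "male", "female"),
                           stringsAsFactors = FALSE)

# Full outer join on Name1 == label, as in the madmen merge
toy_merged <- merge(toy_edges, toy_vertices,
                    by.x = "Name1", by.y = "label",
                    all = TRUE)
toy_merged
```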
```r
ggplot(data = madmen_merged, aes(from_id = Name1, to_id = Name2)) +
  geom_net(aes(col = Gender),
           size = 6,
           linewidth = 1,
           labelon = TRUE,
           fontsize = 3,
           labelcolour = "black",
           directed = TRUE) +
  scale_color_manual(values = c("#FF69B4", "#0099ff")) +
  xlim(c(-0.05, 1.05)) +
  theme(legend.key = element_blank())
```

## Heatmaps

The `autoplot()` function from the `ggfortify` package creates a heatmap when it is given a distance matrix. For example, the `eurodist` dataset contains the distances between various European cities and is stored as an object of class `dist`.
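Before plotting, it can help to confirm what kind of object `eurodist` is. The following is a minimal base-R sketch; converting with `as.matrix()` is only for inspection here, and `dist_mat` is an illustrative name.

```r
# eurodist is stored as a compact "dist" object, not as a full matrix
class(eurodist)

# Expand it to a square matrix to peek at the first few cities
dist_mat <- as.matrix(eurodist)
dim(dist_mat)
dist_mat[1:3, 1:3]
```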

```r
library(ggfortify)
```

```r
autoplot(eurodist) +
  coord_fixed() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

## Plotting K-Means Clusters

```r
set.seed(123)  # k-means starts from random centres, so fix the seed for reproducibility
iris_clust <- kmeans(iris[-5], 3)
autoplot(iris_clust, data = iris, frame = TRUE, shape = 'Species')
```

## Cartographic Maps

```r
library(ggmap)
# Note: current ggmap versions need a Google API key, registered with register_google(), before this call works
mumbai_map <- get_map("Mumbai, India", zoom = 13)
```

```
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Mumbai,+India&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Mumbai,%20India&sensor=false
```

```r
ggmap(mumbai_map)
```