
Data Visualization-R (Part-1)




In this post, I use several datasets and the ggplot2 package to build plots that reveal meaningful patterns in the data. A separate post explains how to visualize maps in R using the ggmap package. This post covers the basics of data visualisation in R.

Some basic plots

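First, load the ggplot2 package. The mtcars dataset ships with R, so it is available without any extra loading step.

library(ggplot2)  # provides ggplot() and the geoms used throughout this post
ggplot(mtcars, aes(x=mpg, y=0)) + geom_jitter() + scale_y_continuous(limits = c(-2,2))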

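The above plot is known as a strip chart. It is a univariate plot: only mpg carries information, and the vertical jitter merely spreads the points out so that they overlap less.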

ggplot(mtcars, aes(x = cyl, y = mpg)) + geom_point()


If we inspect the mtcars dataset, we see that the variable cyl is categorical in nature but is stored as numeric. So we need to tell ggplot2 that cyl is a categorical variable.

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_point()


Now ggplot2 treats cyl as a factor. This time the x-axis does not contain values like 5 or 7; it contains only the values that are present in the dataset.
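The point about cyl can be checked directly in base R. The following is a minimal sketch; the variable names cyl_class, cyl_values and cyl_levels are introduced here only for illustration.

```r
# Inspect how cyl is stored and which values it actually takes.
cyl_class  <- class(mtcars$cyl)           # storage type of the column
cyl_values <- sort(unique(mtcars$cyl))    # distinct values present in the data
cyl_levels <- levels(factor(mtcars$cyl))  # levels ggplot2 sees after factor()

print(cyl_class)
print(cyl_values)
print(cyl_levels)
```

Because factor() creates one level per distinct value, the axis shows only the cylinder counts that occur in the data.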

ggplot(mtcars, aes(x = wt, y = mpg, color = disp)) + geom_point()


The above plot shows the relationship between the weight (wt) and fuel efficiency (mpg) of each car, with engine displacement (disp) mapped to a continuous color scale.

ggplot(mtcars, aes(x = wt, y = mpg, size = disp)) + geom_point()


This plot is the same as the one above, except that engine displacement is now mapped to point size instead of color.

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(am), fill = factor(cyl))) + 
  geom_point(shape = 21, size = 4, alpha = 0.6)


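The above plot is useful whenever we need to distinguish the data points by two categorical variables: here the outline color (col) encodes am and the fill encodes cyl. Shape 21 is a circle that has both an outline and a fill, which is what makes the two mappings visible at once.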

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, linetype = 2)


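The above plot shows a separate linear fit for each cyl subgroup (solid lines), plus a single linear fit across all cars (the dashed line, produced by aes(group = 1) and linetype = 2).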
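To see the numbers behind such lines, the same kind of fit can be reproduced with lm() in base R. This is a sketch under the assumption that geom_smooth(method = "lm") fits the default y ~ x formula, here mpg ~ wt; the names by_cyl, group_coefs and overall_coefs are hypothetical.

```r
# Fit mpg ~ wt separately within each cyl group.
by_cyl <- split(mtcars, mtcars$cyl)
group_coefs <- sapply(by_cyl, function(d) coef(lm(mpg ~ wt, data = d)))

# Fit one model across all cars, the counterpart of aes(group = 1).
overall_coefs <- coef(lm(mpg ~ wt, data = mtcars))

print(group_coefs)    # one column per cyl group, rows: intercept and slope
print(overall_coefs)
```

Comparing the per-group slopes with the overall slope shows how much the pooled fit hides differences between the subgroups.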

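val <- c("#E41A1C", "#377EB8")
# In mtcars, am is coded 0 = automatic, 1 = manual, so the labels follow that order
lab <- c("Automatic", "Manual")
ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "dodge") +
  scale_x_discrete("Cylinders") + 
  scale_y_continuous("Number") +
  scale_fill_manual("Transmission",  # legend title; the scale call was missing here
                    values = val,
                    labels = lab) 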
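The dodged bar chart counts the cars in each combination of cyl and am. The same counts can be checked with table() in base R. This is a small sketch; am_label and counts are names chosen here for illustration, and the labels rely on the mtcars documentation, where am is coded 0 = automatic and 1 = manual.

```r
# Label the transmission codes, then cross-tabulate against cylinders.
am_label <- factor(mtcars$am, levels = c(0, 1),
                   labels = c("Automatic", "Manual"))
counts <- table(Cylinders = mtcars$cyl, Transmission = am_label)

print(counts)  # each cell corresponds to the height of one bar
```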


Plotting several distributions in the same panel

ggplot(mtcars, aes(x=mpg, col=factor(cyl))) + geom_histogram(binwidth = 1, position = "identity") + geom_freqpoly(binwidth = 1)


In the above plot, the distribution of mpg for each of the three cyl groups is displayed on the same panel. The lines drawn by geom_freqpoly() form what is known as a frequency polygon, which connects the bin counts that the histogram shows as bars.
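A frequency polygon is just a line through per-bin counts, so the underlying numbers can be approximated in base R. This is a rough sketch: the break points below are an assumption made for illustration and will not necessarily match the bin edges that ggplot2 chooses for binwidth = 1.

```r
# Bin mpg into 1-unit intervals and count cars per bin and cyl group.
breaks <- seq(10, 34, by = 1)   # assumed edges covering the mpg range
mpg_bin <- cut(mtcars$mpg, breaks = breaks, right = FALSE)
bin_counts <- table(mpg_bin, cyl = mtcars$cyl)

print(bin_counts)  # each column is the series behind one polygon
```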

Diamonds Dataset

Reducing the overplotting problem

ggplot(diamonds, aes(x=clarity, y=carat, color=price)) + geom_point()
ggplot(diamonds, aes(x=clarity, y=carat, color=price)) + geom_point(alpha=0.5, position = "jitter")


Adding a smoothing line

ggplot(diamonds, aes(x = carat, y = price)) + geom_point() + geom_smooth()


ggplot(diamonds, aes(x = carat, y = price, color = clarity)) + geom_point(alpha = 0.2)


The alpha argument inside the geom_point() function makes the data points semi-transparent, so regions where many points overlap appear darker.

ggplot(diamonds, aes(x = carat, y = price)) + geom_point(alpha = 0.2) + geom_smooth(aes(col = clarity), se = FALSE)


Tidying the data to make a Plot

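In this section we use the iris dataset and rearrange it to make simple but meaningful plots. The measurement columns are named Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, so each name combines a flower part and a measure. Tidying gathers these four columns into key/value pairs and then splits the key at the dot into Part and Measure.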

library(tidyr)  # provides gather(), separate() and the %>% pipe
iris_tidy <- iris %>% gather(Key, Value, -Species) %>% separate(Key, c("Part", "Measure"), "\\.")  
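To make the reshaping step less of a black box, here is a base R sketch that builds a comparable long table by hand. It is not the tidyr implementation, only an illustration of the result; iris_long, measure_cols and parts are hypothetical names.

```r
# Stack the four measurement columns into one Value column.
measure_cols <- setdiff(names(iris), "Species")
iris_long <- data.frame(
  Species = rep(iris$Species, times = length(measure_cols)),
  Key     = rep(measure_cols, each = nrow(iris)),
  Value   = unlist(iris[measure_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split names such as "Sepal.Length" at the dot into Part and Measure.
parts <- strsplit(iris_long$Key, ".", fixed = TRUE)
iris_long$Part    <- vapply(parts, `[`, character(1), 1)
iris_long$Measure <- vapply(parts, `[`, character(1), 2)
iris_long$Key     <- NULL

str(iris_long)
```

Each original row becomes four rows, one per measurement, which is the shape that facetting by Measure and coloring by Part requires.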

Now our dataset is ready for plotting

ggplot(iris_tidy, aes(x = Species, y = Value, col = Part)) + geom_jitter() + facet_grid(. ~ Measure)


Bar Plots with Color Ramp

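We will use the Vocab dataset from the car package (in car 3.0 and later, the data itself lives in the companion carData package). With position = "fill", each bar shows proportions rather than counts, and the color ramp stretches nine Brewer blues into the 11 shades requested by blue_range(11).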

library(car)           # provides the Vocab dataset
library(RColorBrewer)  # provides brewer.pal()
blues <- brewer.pal(9, "Blues")        # nine shades of blue
blue_range <- colorRampPalette(blues)  # function that interpolates any number of shades
ggplot(Vocab, aes(x = factor(education), fill = factor(vocabulary))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = blue_range(11))
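colorRampPalette() returns a function, which is why blue_range can be called with the number of colors needed. The sketch below shows this with base R only; the two hex endpoints are assumptions picked for illustration, standing in for the Brewer palette.

```r
# Build a ramp between a light and a dark blue (illustrative endpoints).
ramp_fun <- colorRampPalette(c("#F7FBFF", "#08306B"))

# Ask the ramp for 11 evenly spaced colors.
shades <- ramp_fun(11)

print(shades)
```

Requesting a different number, for example ramp_fun(5), interpolates the same range with fewer steps, so the palette can be resized to match the number of factor levels.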

