05.04.2018 R

Visualising multi-dimensional data in two-dimensional space is a common problem. When you plot the relationship between two variables using just points, you lose the information stored in the other variables. One solution is to **replace points with small pie charts**. The size of each pie chart can also represent the sample size! This effect can be achieved in R using the `plotrix` package.

To showcase how to do it, I’m going to use the well-known `mtcars` dataset. We’ll try to visualise the average distance (in miles) a car can go on a gallon of fuel (`mpg`) depending on the number of carburetors (`carb`). Additionally, the pie charts will show the fraction of cars having V and straight engines (`vs`). Their sizes will correspond to the number of observations.

The following code performs several actions:

1. Load all necessary libraries (plotrix will be used later to plot the piecharts).

2. Calculate the average miles per gallon, the fraction of V and straight engines, and the desired area of each pie chart (the share of all cars that falls into the group).

    # install.packages("plotrix")
    # install.packages("dplyr")
    library(dplyr)
    library(plotrix)

    # Prepare the data
    df <- mtcars %>%
      group_by(carb) %>%
      summarise(
        avg_mpg = mean(mpg),
        straight_engine = sum(vs == 1) / length(vs),
        V_engine = sum(vs == 0) / length(vs),
        count = length(vs) / nrow(mtcars)
      )

If we simply plot the number of carburetors against the average miles per gallon, the result looks like this:
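Such a plot can be drawn with base graphics. The following is only a minimal sketch: it recomputes the group averages with `aggregate()` so that it runs on its own, and the variable name `avg` and the styling choices are assumptions made for illustration, not necessarily what was used for the original figure.

```r
# Recompute the per-group averages with base R so the sketch is self-contained
avg <- aggregate(mpg ~ carb, data = mtcars, FUN = mean)

# Plain scatter plot: one point per carburetor count
plot(avg$carb, avg$mpg,
     pch = 19,
     xlab = "Number of carburetors",
     ylab = "Average MPG",
     las = 1)
grid()
```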

One could expect that cars with more carburetors have bigger engines and therefore higher fuel consumption (lower mpg). However, if we look at the plot above, this is not always the case. For some reason, cars with 6 carburetors have a higher mpg than those with 3 or 4 carburetors. Why is that?

Let’s plot the pie charts and encode the sample size in the size of each pie. The following code first replaces all zeros with a very small positive number, which is needed for the `floating.pie()` function to keep the category colours consistent when one of the categories has nothing to plot. Then a new plot is created and the pies are added one by one in a loop using `floating.pie()`.

    # This is needed to preserve the colours of the categories
    # in case some of them are not to be plotted
    df[df == 0] <- 2.225074e-308

    # Set some plot aesthetics (assumes the "Open Sans" font is installed)
    par(family = "Open Sans", mar = c(5.1, 5.1, 1, 1.1))

    # This time we need to build the plot from scratch, so initialise it first
    plot.new()

    # Set the axis limits
    plot.window(xlim = c(0, 8), ylim = c(10, 30))

    # Draw a grid
    grid()

    # Iterate through the calculated dataset and draw the pies
    for (i in seq_len(nrow(df))) {
      floating.pie(
        xpos = df$carb[i],
        ypos = df$avg_mpg[i],
        x = as.numeric(df[i, 3:4]),   # straight_engine, V_engine
        radius = sqrt(df$count[i]),   # area proportional to sample size
        border = NA,
        col = c("#0091aa", "#b72301")
      )
    }

    # Show axes
    axis(1)
    axis(2, las = 2)

    # Add labels and a legend
    mtext(text = "Number of carburetors", side = 1, line = 3, cex = 1.4)
    mtext(text = "Average MPG", side = 2, line = 3, cex = 1.4)
    legend(
      x = 4, y = 30,
      legend = c("Straight engine", "V engine"),
      fill = c("#0091aa", "#b72301"),
      border = NA, box.lwd = 0, ncol = 1, bg = "transparent"
    )
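The zero-replacement step can be tried in isolation. The sketch below uses a small made-up data frame (`fractions` is a hypothetical stand-in for the engine-share columns) and R's built-in `.Machine$double.xmin`, the smallest positive normalised double, which is practically the same value as the constant used above. It only illustrates that the replacement leaves the proportions effectively unchanged while making every slice strictly positive.

```r
# Hypothetical engine shares: the first and last rows have an empty category
fractions <- data.frame(
  straight_engine = c(1, 0.5, 0),
  V_engine        = c(0, 0.5, 1)
)

# Smallest positive normalised double available in R
tiny <- .Machine$double.xmin

# Replace exact zeros so that every slice is strictly positive
fractions[fractions == 0] <- tiny

# The rows still sum to 1 at floating-point precision
row_totals <- unname(rowSums(fractions))
```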

The result looks as follows.

Now, with the pie sizes reflecting sample sizes, we can see that there are very few cars with 6 carburetors in the dataset (in fact, there is only one). This single car might simply be a poor representative of 6-carburetor cars in general.
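The group sizes can also be checked directly. This is a small sketch using only base R and the built-in `mtcars` data; the variable names `carb_counts` and `six_carb_cars` are chosen here for illustration.

```r
# Number of cars for each carburetor count
carb_counts <- table(mtcars$carb)
carb_counts

# Which cars have 6 carburetors?
six_carb_cars <- rownames(mtcars)[mtcars$carb == 6]
six_carb_cars
```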

Moreover, we can draw some side conclusions. For example, straight-engine cars in the dataset tend to have fewer carburetors.
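One way to double-check this impression is a simple cross-tabulation of engine type against carburetor count. The sketch below relies only on base R; the labels "straight" and "V" follow the `mtcars` documentation (`vs` is 1 for straight and 0 for V-shaped engines), and the object names are illustrative.

```r
# Label the engine type, then count cars per engine type and carburetor count
engine <- ifelse(mtcars$vs == 1, "straight", "V")
engine_by_carb <- table(engine = engine, carb = mtcars$carb)
engine_by_carb

# Average number of carburetors for each engine type
mean_carb <- tapply(mtcars$carb, engine, mean)
mean_carb
```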

And we did it all with a single plot!