Thursday, June 2, 2016

Basic Data Analysis with dplyr

dplyr is a very useful R package for data manipulation. Created and maintained by Hadley Wickham, it provides a small set of functions (often called verbs) for filtering, transforming, and summarising data frames. Here, I will show some of the most basic but important functions for data analysis.

For this exercise, we'll use the Cars93 dataset, which ships with the R package MASS, and analyze it with dplyr. To load both packages, we'll use the following lines of code:
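As a minimal sketch of the loading step (assuming both packages are already installed), the order of the two calls matters a little: MASS also exports a function named select, so I attach MASS first and dplyr second.

```r
# Attach MASS first. It exports its own select() function, so attaching
# dplyr afterwards makes sure a plain select() call finds dplyr's version.
library(MASS)
library(dplyr)
```

If the packages end up attached in the other order, writing dplyr::select explicitly avoids the clash.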


Now we'll call the data
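A short sketch of this step: data() copies the dataset into the workspace. The package argument is my addition so that the snippet also works when MASS has not been attached.

```r
# Load Cars93 from MASS into the workspace
data(Cars93, package = "MASS")

# Quick sanity check: number of rows and columns
dim(Cars93)
```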


Before digging in, we'll first see what the data looks like. R has a cool function, View (from the built-in utils package, not dplyr), that opens the dataset in RStudio's data viewer.
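Here is a sketch of how the dataset can be inspected. View comes with R's utils package and only makes sense in an interactive session, so I guard the call. As a console alternative I use glimpse, which dplyr provides.

```r
library(dplyr)
data(Cars93, package = "MASS")

# Opens the spreadsheet-style viewer; skipped when run as a script
if (interactive()) View(Cars93)

# Console alternative: one line per column with its type and first values
glimpse(Cars93)
```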


As you can see, the dataset has 93 rows and 27 columns (although the image above only shows 17 rows and 7 columns). It is a data frame of car manufacturers, their models, and variables such as price, horsepower, and engine size. For the rest of this exercise, I'll only show the first few rows and columns because showing all of them would be very tedious.

The head of the dataset looks like this:
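A sketch of one way to get that view (restricting the output to the first seven columns is my own choice, made for readability):

```r
data(Cars93, package = "MASS")

# head() returns the first six rows by default
head(Cars93[, 1:7])

# Pass n to ask for a different number of rows
first_three <- head(Cars93, n = 3)
```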


Now that we have an idea of what the dataset looks like, let's get started with the functions.

The filter function returns the rows that satisfy certain conditions. It is similar to the base R subset function. The first argument is the name of the data frame, and the subsequent arguments are the conditions used to choose rows. For this example, let's say we only want the cars whose Type is Small and whose maximum price is below 30 (prices in this dataset are in thousands of US dollars).

filter(Cars93, Type=="Small" & Max.Price<30)

As you can see, only cars whose Type is Small are shown. There are 21 such cars in the dataframe.

Similarly, we can also use this function to choose only cars that have airbags for both the driver and the passenger.

filter(Cars93, AirBags == "Driver & Passenger")
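As a small sketch of how conditions combine in filter: separating conditions with commas means AND, while | means OR. The variable names below are just for illustration.

```r
library(dplyr)
data(Cars93, package = "MASS")

# Comma-separated conditions are combined with AND,
# so these two calls select the same rows
small_cheap <- filter(Cars93, Type == "Small", Max.Price < 30)
same_rows   <- filter(Cars93, Type == "Small" & Max.Price < 30)

# Use | (or %in%) when a row may match any one of several values
small_or_sporty <- filter(Cars93, Type %in% c("Small", "Sporty"))
```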

The slice function selects rows by their position. Let's say we only want to see the first twenty rows of this dataset. This is similar to the base head function, although head returns only the first 6 rows by default.

slice(Cars93, 1:20)
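slice is not limited to a leading block of rows. A quick sketch with arbitrary positions and with negative indices, which drop rows:

```r
library(dplyr)
data(Cars93, package = "MASS")

first_twenty      <- slice(Cars93, 1:20)        # rows 1 to 20
picked            <- slice(Cars93, c(1, 5, 10)) # three specific rows
without_first_ten <- slice(Cars93, -(1:10))     # everything except rows 1 to 10
```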

The mutate function is used to add new variables. What's cool is that the new variables can be functions of other variables. So let's say I want a new variable that measures how wide each car's price range is. To do this, I'll create a new variable ratio by dividing the maximum price by the minimum price.

mutate(Cars93, ratio = Max.Price/Min.Price)

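mutate returns a new data frame and leaves Cars93 itself unchanged, so the result has to be assigned to keep it. The newly created column will be the last column (column 28, since Cars93 has 27 columns). To move ratio to the third position, I reorder the columns by index:

Cars93 <- mutate(Cars93, ratio = Max.Price/Min.Price)
Cars93 <- Cars93[c(1, 2, 28, 3:27)]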
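An alternative sketch that avoids counting column positions by hand: select combined with everything() puts the named columns first and keeps the rest in their original order. The names with_ratio and reordered are only illustrative.

```r
library(dplyr)
data(Cars93, package = "MASS")

# mutate() returns a new data frame, so keep the result in a variable
with_ratio <- mutate(Cars93, ratio = Max.Price / Min.Price)

# Manufacturer, Model and ratio first, then every remaining column
reordered <- select(with_ratio, Manufacturer, Model, ratio, everything())
names(reordered)[1:4]
```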

The select function allows you to select specific columns. The first argument is the dataset, and the subsequent arguments specify which columns you want to keep. For example, if you want only the first five columns, you can use:

select(Cars93, 1:5)

Similarly, if you want only the columns Manufacturer, Model, Price, and Horsepower, you can use:

select(Cars93, Manufacturer, Model, Price, Horsepower)

If you want to see all columns except the columns from Manufacturer to Type, you can use:

select(Cars93, -(Manufacturer:Type))
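select also accepts helper functions that match columns by name. A brief sketch using starts_with and contains (the result names are only for illustration):

```r
library(dplyr)
data(Cars93, package = "MASS")

# Columns whose names start with "MPG", plus two identifying columns
mpg_cols <- select(Cars93, Manufacturer, Model, starts_with("MPG"))

# Every column whose name contains "Price"
price_cols <- select(Cars93, contains("Price"))

# Dropping a range of columns, as in the example above
no_ids <- select(Cars93, -(Manufacturer:Type))
```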

Summarise and Group_by
The summarise function collapses multiple values of a variable into a single summary value. Used in conjunction with other functions, like group_by, it can be a very useful tool to analyze data. Here, I want to group the cars by type and see the average mileage of each group. To do this, I use:

# assumed definition: meanmpg is the average of city and highway mileage
Cars93 <- mutate(Cars93, meanmpg = (MPG.city + MPG.highway) / 2)
summarise(group_by(Cars93, Type), mean(meanmpg))
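A sketch of the same pattern with named summary columns, which makes the output easier to read. The column names n_cars, mean_city, and mean_highway are my own, and n() counts the rows in each group.

```r
library(dplyr)
data(Cars93, package = "MASS")

by_type <- group_by(Cars93, Type)

# One output row per car type
type_summary <- summarise(
  by_type,
  n_cars       = n(),
  mean_city    = mean(MPG.city),
  mean_highway = mean(MPG.highway)
)
type_summary
```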

The n_distinct function returns the number of distinct values in a vector. The code:

n_distinct(Cars93$Type)
[1] 6

shows that there are six different types of car (Small, Midsize, Compact, etc.) in the dataset.
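n_distinct is also handy inside summarise. As a sketch, the following counts how many different manufacturers offer each type of car (n_makers is just a name I picked):

```r
library(dplyr)
data(Cars93, package = "MASS")

# Equivalent to length(unique(Cars93$Type))
n_types <- n_distinct(Cars93$Type)

# Number of distinct manufacturers within each car type
makers_per_type <- summarise(group_by(Cars93, Type),
                             n_makers = n_distinct(Manufacturer))
makers_per_type
```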

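The top_n function is similar to slice in that it returns a fixed number of rows, but it picks them by rank rather than by position: it returns the n rows with the highest values of a chosen variable (possibly more than n if there are ties). Here, I ask for the five cars with the best highway mileage.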

top_n(Cars93, 5, MPG.highway)
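In dplyr 1.0.0 and later, top_n is superseded by slice_max. A sketch of the newer call, assuming a recent dplyr version; with_ties = FALSE guarantees exactly five rows.

```r
library(dplyr)
data(Cars93, package = "MASS")

# top_n keeps ties, so it can return more than five rows
top_old <- top_n(Cars93, 5, MPG.highway)

# slice_max with with_ties = FALSE returns exactly five rows
top_new <- slice_max(Cars93, MPG.highway, n = 5, with_ties = FALSE)
select(top_new, Manufacturer, Model, MPG.highway)
```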

The arrange function sorts rows by one or more variables. Here, I sort the dataset by price in descending order using the desc function, so the most expensive models show up first.

arrange(Cars93, desc(Price))
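arrange accepts several sort keys. As a sketch, this sorts by Type first and then by descending price within each type:

```r
library(dplyr)
data(Cars93, package = "MASS")

by_price <- arrange(Cars93, desc(Price))

# Later keys break ties in earlier ones: Type first, then Price (high to low)
by_type_price <- arrange(Cars93, Type, desc(Price))
head(select(by_type_price, Type, Manufacturer, Model, Price))
```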

The sample_n function is used to select random rows from a table.

sample_n(Cars93, size=10)

This code gives us 10 randomly selected rows (cars) with all of their variables.
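Because the rows are random, every run gives a different result. A sketch of how to make the sample reproducible with set.seed (the seed value 42 is arbitrary), together with slice_sample, which supersedes sample_n in dplyr 1.0.0 and later:

```r
library(dplyr)
data(Cars93, package = "MASS")

# Fixing the seed makes the random draw repeatable
set.seed(42)
first_draw <- sample_n(Cars93, size = 10)

set.seed(42)
second_draw <- sample_n(Cars93, size = 10)

# Newer equivalent, assuming dplyr 1.0.0 or later
newer_draw <- slice_sample(Cars93, n = 10)
```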

The pipe operator, %>%, passes the result of the expression on its left as the first argument of the function on its right, which lets you chain several steps together. This is particularly helpful when an analysis takes several steps and you do not want to store or print the output of every single one.

For example, let's say from the dataset Cars93, I only want car models that have airbags in both the driver and passenger seats; then I want to group the car types and then summarize according to mileage on highways.

Cars93 %>% filter(AirBags == "Driver & Passenger") %>% group_by(Type) %>% summarise(mean(MPG.highway))
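To see what the pipe buys us, here is a sketch of the same computation written both ways. The summary column name mean_highway is my own addition.

```r
library(dplyr)
data(Cars93, package = "MASS")

# Nested calls: has to be read from the inside out
nested <- summarise(
  group_by(filter(Cars93, AirBags == "Driver & Passenger"), Type),
  mean_highway = mean(MPG.highway)
)

# Piped: reads top to bottom, one step per line
piped <- Cars93 %>%
  filter(AirBags == "Driver & Passenger") %>%
  group_by(Type) %>%
  summarise(mean_highway = mean(MPG.highway))
```

Both versions produce the same table; the piped one simply puts the steps in the order they happen.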

These are very basic examples, but it's quite easy to see why they'll be very useful in data analysis.

This brings me to the end of this article. I hope you found it useful. Please feel free to leave a comment below.