Linear Models in R – A Brief Reference

Linear models, also called regression models, describe how the mean of a dependent variable (Y) changes with one or more independent, or explanatory, variables (X’s). They are meant for a continuous target variable Y and take the form:

Y = β0 + β1X1 + β2X2 + ϵ

Here,

Y is the target variable, which is continuous.
X1 and X2 are the covariates, or explanatory variables.
β0, β1 and β2 are the unknown parameters, and
ϵ is the error term.
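To make the equation concrete, here is a minimal sketch that simulates data from a model of this form and lets lm() estimate the β’s. The coefficient values (2, 3 and -1.5), the noise level and the variable names are arbitrary choices made for illustration only.

```r
# Simulate data from Y = 2 + 3*X1 - 1.5*X2 + error (values chosen arbitrarily)
set.seed(42)
n  <- 500
x1 <- runif(n)
x2 <- runif(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

# lm() estimates the unknown betas by least squares
sim_fit <- lm(y ~ x1 + x2)
coef(sim_fit)
```

The estimates should land close to the values used in the simulation, and they get closer as n grows.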


Creating Linear Models in R with Exploratory Data Analysis

To create linear models in R, we first need to load the required libraries. We are going to use the ‘diamonds’ dataset for this modeling exercise.



#Imports the required libraries

#For data visualization

library(ggplot2)

#For modeling

library(modelr)

#Loads dplyr and the other core tidyverse packages

library(tidyverse)

options(na.action = na.warn)

The comments above explain why each library is needed. Next, we need the ‘diamonds’ dataset. It ships with the ggplot2 library, so all you need to do is assign it to a variable and peek into it.



#Load the data

df <- diamonds

#Opens the data in the viewer (interactive sessions)

View(df)

Diamonds Data



#Understand the data

str(df)



tibble[,10] [53,940 x 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Our job here is to understand the price of diamonds and what affects it. Looking at the data, three variables stand out: carat, cut and color.

Carat – The carat variable gives the weight, and so the size, of a particular diamond.
Cut – The cut variable shows the quality of the cut of a particular diamond.
Color – This variable shows the color grade of each diamond.
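The str() output abbreviates the factor levels, so it can help to print them in full. The sketch below assumes only that ggplot2 is installed, since that is where the diamonds data lives.

```r
library(ggplot2)  # provides the diamonds dataset

# Levels are printed in their stored order. For cut and clarity this runs
# from worst to best; for color, D is the best grade and J the worst.
levels(diamonds$cut)
levels(diamonds$color)
levels(diamonds$clarity)

# Number of diamonds in each cut category
table(diamonds$cut)
```

Knowing the order of the levels makes the boxplots later in the article easier to read.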

Let’s plot a graph using the qplot function to understand how price varies with the quality (cut) and the size (carat) of diamonds.



#Qplot

qplot(carat, price, data = df, color = cut)

Qplot Of Diamonds Data- linear models in R
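As far as I know, qplot() has been marked as deprecated since ggplot2 3.4.0, so on a recent installation you may see a warning. A rough equivalent written with ggplot() would look like the sketch below; the object name p is just a placeholder.

```r
library(ggplot2)

# Same scatter plot as the qplot() call, written with ggplot()
p <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()
p
```

Both calls should draw the same picture, so use whichever your ggplot2 version is happy with.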

Wow! From the data and this visualization, a few points become clear.

Price rises with the size (carat) of the diamond.
Some diamonds are expensive even though their cut quality is low.
Many high-quality diamonds are cheap simply because they are small.
Similarly, some low-quality diamonds fetch the highest prices because they are large.
So, carat appears to be the major contributor to the price.

Diving Deeper into the Diamonds Data

We just found a surprising relationship between the price and the quality of diamonds. So, let’s understand why low-quality diamonds are expensive.

For this, we use the ggplot2 library to draw boxplots of price against each quality variable: cut, color and clarity. We will plot three graphs, one for each.



#Plots price against cut

ggplot(diamonds, aes(cut, price)) + geom_boxplot()



ggplot(diamonds, aes(color, price)) + geom_boxplot()



ggplot(diamonds, aes(clarity, price)) + geom_boxplot()

Note: The lowest-quality color is J, the lowest-quality clarity is I1 and the lowest-quality cut is Fair.

The plots show that the lowest-quality categories tend to have higher prices, and the reason is a single covariate: the size of the diamonds (carat). Carat is the major contributor to price, and low-quality diamonds tend to be bigger.
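The claim that low-quality diamonds are bigger can also be checked with numbers rather than plots. This is a small sketch using base R’s aggregate(); the name by_cut is made up for this example.

```r
library(ggplot2)  # provides the diamonds dataset

# Median carat and median price for each cut level
by_cut <- aggregate(cbind(carat, price) ~ cut, data = diamonds, FUN = median)
by_cut
```

If the reasoning above holds, the Fair row should show both a larger median carat and a higher median price than the Ideal row. The same call works for color and clarity by swapping the grouping variable.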

Price V/s Carat Analysis – Linear Models

Since lower-quality diamonds tend to be heavier, let’s visualize price against carat directly and make our final observations.



#Plots the price vs carat data

ggplot(diamonds, aes(carat, price)) + geom_hex(bins = 75)

Price Vs Carat

Now, it’s important to see how variables other than carat contribute to the price of diamonds. To do that, we first tweak the data in two ways.

As the plot above shows, over 95% of the data has a carat value of 2.5 or less, so we can drop the diamonds above 2.5 carats.
We can take the log of the price and carat variables. The relationship between raw price and carat is curved; on the log scale it becomes close to a straight line, which is what a linear model needs.
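One way to see why the log transformation helps: a straight line on the log scale, log2(price) = a + b·log2(carat), is the same thing as the power law price = 2^a · carat^b on the original scale. The sketch below fits a straight line to the raw and to the log-transformed data and compares R². The names small, fit_raw and fit_log are placeholders, and the two R² values are measured on different scales, so treat the comparison as a rough guide only.

```r
library(ggplot2)  # provides the diamonds dataset

small <- subset(diamonds, carat <= 2.5)

# Straight-line fit on the raw scale
fit_raw <- lm(price ~ carat, data = small)
# Straight-line fit after log2-transforming both variables
fit_log <- lm(log2(price) ~ log2(carat), data = small)

# Share of variance explained by each fit (on its own scale)
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log = r2_log)
```

A higher value for the log fit would support the idea that the relationship is closer to linear after the transformation.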



#Filter the data and add log-transformed columns

df <- df %>%
  filter(carat <= 2.5) %>%
  mutate(log_price = log2(price), log_carat = log2(carat))

#Plots the graph with log transformed values

ggplot(df, aes(log_carat, log_price)) + geom_hex(bins = 45)

Log Transformed

The results are satisfying: the pattern is now close to linear. Next, we fit a model to this strong linear relationship so that we can remove it from the data. Let’s do it.



#Modeling and visualization

#Linear model

my_linear_model <- lm(log_price ~ log_carat, data = df)

#Builds a grid of carat values and back-transforms the predictions to prices

my_plot <- df %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(log_carat = log2(carat)) %>%
  add_predictions(my_linear_model, "log_price") %>%
  mutate(price = 2 ^ log_price)

#Plots the linear relationship

ggplot(df, aes(carat, price)) + geom_hex(bins = 50) + geom_line(data = my_plot, color = "Orange", size = 1.5)

Linear Model

Now, we have to understand what this relationship is telling us.

Large diamonds are cheaper than the model predicts.
This is probably because no diamond in the data costs more than about $19,000, as the plot confirms.
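The fitted line can also be read as a number. Here is a minimal sketch that refits the same kind of model in base R and turns the slope into a price multiplier; small, fit, slope and price_multiplier are illustrative names, not part of the original code.

```r
library(ggplot2)  # provides the diamonds dataset

small <- subset(diamonds, carat <= 2.5)
fit   <- lm(log2(price) ~ log2(carat), data = small)

slope <- coef(fit)[["log2(carat)"]]

# On the log2 scale, doubling carat adds `slope` to log2(price),
# so it multiplies the predicted price by 2^slope
price_multiplier <- 2^slope
price_multiplier
```

This is where using log2 pays off: a one-unit step on the log2 scale is simply a doubling on the original scale.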

Use of residuals in linear models in R

Residuals help us check whether we removed the strong linear pattern we saw earlier. They let us verify our assumption.



#Adds residuals to data

my_model <- df %>% add_residuals(my_linear_model, 'log_resid')

View(my_model)

Residuals

You can see the log values and the residuals added to the data as new columns.



#Creates a residual plot

ggplot(my_model, aes(log_carat, log_resid))+geom_hex(bins = 45)

Residual Plot
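The residuals are in log2 units, which takes a moment to get used to. This sketch computes them with base R’s resid() instead of add_residuals(), checks that they are uncorrelated with log_carat, and converts them into price ratios. The names small, fit and price_ratio are assumptions made for this example.

```r
library(ggplot2)  # provides the diamonds dataset

small <- subset(diamonds, carat <= 2.5)
small$log_price <- log2(small$price)
small$log_carat <- log2(small$carat)

fit <- lm(log_price ~ log_carat, data = small)
small$log_resid <- resid(fit)

# Least-squares residuals are uncorrelated with the predictor by construction
cor(small$log_carat, small$log_resid)

# A residual of -1 on the log2 scale means the diamond sells for
# half the price the model predicts from its carat alone
small$price_ratio <- 2^small$log_resid
summary(small$price_ratio)
```

A ratio above 1 means the diamond is more expensive than its size alone would suggest, and a ratio below 1 means it is cheaper.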

Wow! Take some time to appreciate yourself. You did a fantastic job: the strong linear pattern is gone from the residuals. Now, as a final check, we can plot the residuals against cut, instead of the raw price, to see how quality relates to the price of diamonds. Let’s roll!



#Boxplot

ggplot(my_model, aes(cut, log_resid))+geom_boxplot()

Boxplots

Whoo! Now you can see that the price of diamonds, relative to what their size predicts, rises as the quality of the cut increases. That is what we were looking for, and here it is.
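The boxplot can be summarised in one line per cut. The following sketch takes the median residual for each cut level and converts it from log2 units to a price ratio; median_ratio is a made-up name, and the model is refit in base R so that the snippet runs on its own.

```r
library(ggplot2)  # provides the diamonds dataset

small <- subset(diamonds, carat <= 2.5)
fit   <- lm(log2(price) ~ log2(carat), data = small)

# Median residual per cut level, converted from log2 units to a price ratio
median_ratio <- 2^tapply(resid(fit), small$cut, median)
round(median_ratio, 2)
```

Values below 1 mean that a typical diamond of that cut sells for less than its carat alone predicts, and values above 1 mean it sells for more.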

I have added both the plots for easy understanding.

Before
After

It’s not the end. You can go for more complicated models and include other variables to build a more robust model.
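As one possible next step, the quality variables can be added to the model as extra predictors. This is only a sketch of the idea, not a tuned model; fit_simple and fit_full are placeholder names.

```r
library(ggplot2)  # provides the diamonds dataset

small <- subset(diamonds, carat <= 2.5)
small$log_price <- log2(small$price)
small$log_carat <- log2(small$carat)

# Carat-only model, as used in this article
fit_simple <- lm(log_price ~ log_carat, data = small)
# Adds the three quality variables as extra predictors
fit_full <- lm(log_price ~ log_carat + cut + color + clarity, data = small)

c(simple = summary(fit_simple)$r.squared,
  full   = summary(fit_full)$r.squared)
```

R² can only go up when predictors are added, so a higher value on its own does not prove the bigger model is better. Looking at its residuals, as we did above, is a more honest check.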

Ending note – linear models in R

Linear models are regression models that we use often to find relationships in data. As we have seen in this article, even a simple linear model can separate the effect of size from the effect of quality. Feel free to make some additions to the model and play with it. That’s all for now. Happy R!!!

Further reading: lm function in R