R Tutorial: Programming Linear Regression in R


Introduction

Regression is an important Machine Learning technique for creating prediction models. Forecasting/prediction is an important data analysis technique in today's marketplace. You are given data from disparate domains and asked to find relationships in the data, and thus predict future trends/patterns in the data. There are various software packages available that do the model creation and learning for you automatically. However, in this tutorial we will focus on programming linear regression in the R programming language.

We will go through the internal workings of a linear regression process. We will focus on the details of the mathematical/statistical steps involved and, in parallel, show how to program these mathematical/statistical constructs in R. After completing this tutorial you will be able to program your own linear regression models and change the parameters to suit your needs.

Programming Linear Regression in R

Programming linear regression in R is a straightforward task; it does not require any extra packages. We will discuss each mathematical step of linear regression and simultaneously program it in R.

Step 1: Data Modelling

Data Modelling is an important aspect of any Machine Learning task. In data modelling we try to figure out what attributes the data will have, how it should be represented, how it should be processed, and so on. For our current discussion we will consider the most commonly used housing price prediction problem as our use case. In housing price prediction we have data with two dimensions:

  1. Area of the house will act as the X-dimension (independent attribute)
  2. Price of the house will act as the Y-dimension (dependent attribute)

In regression analysis we have a dependent attribute whose value we are interested in computing. We also have a set of independent attributes that are used to find the value of the dependent attribute. In our case Area is the independent attribute and Price is the dependent attribute. From here onwards we will represent Area as X and Price as Y.
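As a minimal sketch of this modelling step, the same data could also be held in an R data frame with one column per attribute. The object name houses and the column names area and price are purely illustrative choices; the values are the ones used as training data later in this tutorial:

houses = data.frame(
    area  = c(2,2,3,4,3,6,5,4,5,7,6,9),   # X: independent attribute
    price = c(2,3,3,4,5,5,6,7,7,7,8,9)    # Y: dependent attribute
)
str(houses)   # inspect the structure of the modelled data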

Step 2: Hypothesis-  h_\theta(x)

The hypothesis represents the relationship between X and Y. In other words, we want to represent Y as a function of X:

 Y=f(X)          eq. (1)

As we are using linear regression, f() will be a linear function of the type:

 Y=\theta_0 x_0 + \theta_1 x_1         eq.  (2)

Here we treat x_0 as always equal to 1, and x_1 represents the area of the house. So the final hypothesis will be:

 h_\theta(x)=Y=\theta_0 + \theta_1 x_1         eq.  (3)
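For reference, the hypothesis in eq. (3) can be written as a small R function. This is just a sketch; the name h is an illustrative choice:

h=function(x,theta0,theta1){
    return(theta0+theta1*x)   # predicted price for a house of area x
}
# Example: predicted price for area 5 with theta0=10 and theta1=10
h(5,10,10)   # returns 60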

We took a few sample values for X and Y and plotted them as follows:

Fig 1. Learning Data

In Fig 1 the x-axis represents the area of the house and the y-axis represents the corresponding price of the house. Now, to create a linear regression prediction model we need to fit a line that best represents the given data, something similar to the green line in Fig 2.

Fig 2. Linear Regression Line

We could draw various lines that pass through the given data, but in linear regression we try to find the line that optimally fits the data, i.e. the line that is at the minimum possible distance from the data points.

The training data for the X-dimension and Y-dimension is:

x=c(2,2,3,4,3,6,5,4,5,7,6,9)
y=c(2,3,3,4,5,5,6,7,7,7,8,9)
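If you want to reproduce something like Fig 1 and Fig 2 yourself, you could plot these vectors and overlay a candidate line with abline(). The intercept and slope below are arbitrary guesses, just to illustrate one of the many possible lines:

plot(x,y)                       # scatter plot of the training data, as in Fig 1
abline(a=1, b=1, col="green")   # one arbitrary candidate line: intercept 1, slope 1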

Step 3: Cost Function-  J(\theta)

Now that we have defined the hypothesis, we need to focus on evaluating its parameters. In the hypothesis given in eq. (3), Y and x_1 are variables while \theta_0 and \theta_1 are constants whose values we need to compute. To compute the values of the thetas we define a cost function as follows:

 J(\theta)= \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2           eq. (4)

Substituting the value of h_\theta(x) we get:

 J(\theta_0, \theta_1)= \frac{1}{2m} \sum_{i=1}^{m} ((\theta_0+\theta_1 x^{(i)}) - y^{(i)})^2           eq. (5)

Where m is the number of training examples, i.e. the length of the vectors x and y above. The super-script (i) is not a power; it denotes the i^{th} training example. The factor of 2 in the denominator is there purely for convenience: it cancels when we differentiate J(\theta) later.

Here in eq. (4) we are trying to find the difference between the predicted value of y (price of house), denoted as h_\theta(x^{(i)}), and the actual value of y, denoted as y^{(i)}. For programming linear regression in R we need initial values of \theta_0 and \theta_1. Let's use the following values as the initial values of the theta parameters (you could use any random values):

theta0=10
theta1=10

The R code for computing the cost function J(\theta) is as follows:

J=function(x,y,theta0,theta1){
    m=length(x)          # number of training examples
    sum=0
    for(i in 1:m){
        # squared error of the prediction for the i-th training example
        sum=sum+((theta0+theta1*x[i]-y[i])^2)
    }
    sum=sum/(2*m)        # cost as defined in eq. (4)/(5)
    return(sum)
}
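With the initial values theta0 = 10 and theta1 = 10 and the training vectors defined above, the cost can be evaluated directly:

J(x,y,theta0,theta1)   # approximately 1478 for the sample data and initial thetas above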

Step 4: Minimizing the Cost Function -  J(\theta)

From eq. (5) we can see that the cost function J(\theta) depends on h_\theta(x), which in turn depends on the values of \theta_0 and \theta_1. So a plot of J(\theta) with respect to \theta_0 will look something like:

Fig 3. Cost function curve

A similar graph could be drawn for \theta_1. In order to minimize J(\theta) we need to minimize the term ((\theta_0+\theta_1 x^{(i)}) - y^{(i)})^2 summed over the training examples. As x and y are supplied externally in the form of training examples, to minimize this term we need to find the values of \theta_0 and \theta_1 that minimize it.

If we take an arbitrary point on the curve in Fig 3 and make a small change in \theta_0, there will be a small change in the value of J(\theta). Similarly, changing the value of \theta_1 by a small amount will create a small change in the value of J(\theta). If we keep taking such small steps in the direction that reduces J(\theta), we will slowly proceed towards its minimum value. Mathematically we could represent this as:

 \theta_0=\theta_0-\eta \frac{\partial J(\theta)}{\partial \theta_0}          eq. (6)

Similarly

 \theta_1=\theta_1-\eta \frac{\partial J(\theta)}{\partial \theta_1}          eq. (7)

Generalizing we could write:

 \theta_j=\theta_j-\eta \frac{\partial J(\theta)}{\partial \theta_j}          eq. (8)

where j=0,1 and \eta is the learning rate (called alpha in the code below).

Note: one point of consideration here is that both \theta_0 and \theta_1 must be updated simultaneously, i.e. within one iteration the updated value of one must not be used in calculating the other.
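A small sketch of what "simultaneous" means in R is shown below. The values grad0 and grad1 are arbitrary placeholders for the partial derivatives, whose real expressions are derived next:

theta0=10; theta1=10; alpha=0.0001
grad0=1                      # placeholder for the partial derivative w.r.t. theta0
grad1=1                      # placeholder for the partial derivative w.r.t. theta1
temp0=theta0-alpha*grad0     # compute both new values from the old thetas...
temp1=theta1-alpha*grad1
theta0=temp0                 # ...and only then overwrite them
theta1=temp1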

Now we just need to compute the values of \frac{\partial J(\theta)}{\partial \theta_0} and \frac{\partial J(\theta)}{\partial \theta_1} to perform the updates in eq. (8):

 \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} = \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^{m}(\theta_0+\theta_1 x^{(i)} - y^{(i)})^2

\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0}= \frac{1}{m} \sum_{i=1}^{m}(\theta_0+\theta_1 x^{(i)}-y^{(i)})           eq. (9)

and

 \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} = \frac{\partial}{\partial \theta_1} \frac{1}{2m} \sum_{i=1}^{m}(\theta_0+\theta_1 x^{(i)} - y^{(i)})^2

\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1}= \frac{1}{m} \sum_{i=1}^{m}(\theta_0+\theta_1 x^{(i)}-y^{(i)}) x^{(i)}           eq. (10)

The R code for updating theta is:

updateTheta=function(x,y,theta0,theta1){
    sum0=0
    sum1=0
    m=length(x)
    for(i in 1:m){
        sum0=sum0+(theta0+theta1*x[i]-y[i])            # accumulate eq. (9)
        sum1=sum1+((theta0+theta1*x[i]-y[i])*x[i])     # accumulate eq. (10)
    }
    sum0=sum0/m
    sum1=sum1/m
    theta0=theta0-(alpha*sum0)    # eq. (6): alpha is the learning rate
    theta1=theta1-(alpha*sum1)    # eq. (7)

    return(c(theta0,theta1))      # both thetas updated simultaneously
}
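Note that updateTheta reads the learning rate alpha from the enclosing environment, so it must be defined before the function is called. A single gradient descent step then looks like this (a sketch using the data and initial thetas from above):

alpha=0.0001                            # learning rate (eta in eq. 8)
thetas=updateTheta(x,y,theta0,theta1)   # one gradient descent step
theta0=thetas[1]
theta1=thetas[2]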

Complete Code of Linear Regression in R

Now that we have discussed the technicalities of linear regression, the complete code is given below. You could change the parameters and play with the code. Once you execute the code you will see a red line in action that shows the learning process. After the model completes the learning, the line turns green.

x=c(2,2,3,4,3,6,5,4,5,7,6,9)
y=c(2,3,3,4,5,5,6,7,7,7,8,9)

plot(x,y)

theta0=10                  # initial value of theta0
theta1=10                  # initial value of theta1
alpha=0.0001               # learning rate
initialJ=100000            # large initial cost, used to detect divergence
learningIterations=200000  # maximum number of gradient descent iterations

J=function(x,y,theta0,theta1){
    m=length(x)
    sum=0
    for(i in 1:m){
        sum=sum+((theta0+theta1*x[i]-y[i])^2)
    }
    sum=sum/(2*m)
    return(sum)
}

updateTheta=function(x,y,theta0,theta1){
    sum0=0
    sum1=0
    m=length(x)
    for(i in 1:m){
        sum0=sum0+(theta0+theta1*x[i]-y[i])
        sum1=sum1+((theta0+theta1*x[i]-y[i])*x[i])
    }
    sum0=sum0/m
    sum1=sum1/m
    theta0=theta0-(alpha*sum0)
    theta1=theta1-(alpha*sum1)
    
    return(c(theta0,theta1))
}    

for(i in 1:learningIterations){
    thetas=updateTheta(x,y,theta0,theta1)   # one gradient descent step
    tempSoln=J(x,y,theta0,theta1)           # cost at the current thetas, before the update is applied
    if(tempSoln<initialJ){
        initialJ=tempSoln                   # track the lowest cost seen so far
    }
    if(tempSoln>initialJ){
        break                               # stop if the cost starts increasing
    }

    theta0=thetas[1]
    theta1=thetas[2]
    #print(thetas)
    #print(initialJ)
    plot(x,y)
    lines(x,(theta0+theta1*x), col="red")   # current fit, shown in red while learning

}
lines(x,(theta0+theta1*x), col="green")     # final fitted line, shown in green
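As a sanity check (not part of the gradient descent implementation itself), you could compare the learned thetas against R's built-in lm() function, which fits the same model analytically:

fit=lm(y~x)               # ordinary least squares fit of price on area
coef(fit)                 # intercept and slope to compare with theta0 and theta1
abline(fit, col="blue")   # overlay the lm() line on the current plot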

References

  1. https://en.wikipedia.org/wiki/Linear_regression
  2. http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning
  3. https://en.wikipedia.org/wiki/Gradient_descent