Regression is an important machine learning technique for building prediction models. Forecasting and prediction are key data-analysis tasks in today's marketplace: you are given data from disparate domains and asked to find relationships in it, and thus predict future trends and patterns. Various software packages will do the model creation and learning automatically for you. In this tutorial, however, we will focus on programming linear regression ourselves in the R programming language.

We will go through the internal workings of the linear regression process. We will focus on the mathematical/statistical steps involved and, in parallel, show how to program each of these constructs in R. After completing this tutorial you will be able to program your own linear regression models and change the parameters to suit your needs.

Programming linear regression in R is a straightforward task that does not require any extra package. We will discuss each mathematical step of linear regression and simultaneously program it in R.

Data modelling is an important aspect of any machine learning task. In data modelling we try to figure out what attributes the data will have, how they should be represented, how they should be processed, and so on. For our discussion we will consider the commonly used housing-price-prediction problem as our use case. In housing price prediction we have data with two dimensions:

- Area of the house will act as the **X-dimension (independent attribute)**
- Price of the house will act as the **Y-dimension (dependent attribute)**

In regression analysis we have a **dependent attribute** whose value we are interested in computing. We also have a set of **independent attributes** that are used to find the value of the dependent attribute. In our case **Price** is the dependent attribute and **Area** is the independent attribute. **From here onwards we will represent Area as X and Price as Y.**

The hypothesis represents the relationship between X and Y. In other words, we want to represent Y as a function of X:

$y = h_\theta(x)$   eq. (1)

As we are using linear regression, $h_\theta(x)$ will be a linear function of the type:

$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1$   eq. (2)

Here we will treat $x_0$ to always be 1, and $x_1$ will represent the **Area** of the house. So the final hypothesis will be:

$h_\theta(x) = \theta_0 + \theta_1 x$   eq. (3)
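The hypothesis of eq. (3) translates directly into R. The function name `h` below is our own choice for this sketch; `theta0` and `theta1` match the notation above:

```r
# Hypothesis from eq. (3): predicted price for area x (vectorized over x)
h=function(x,theta0,theta1){
  theta0+theta1*x
}

h(c(2,4),1,2)   # predictions for areas 2 and 4: 5 and 9
```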

We took a few random values for X and Y and plotted them as follows:

In Fig 1 the x-axis represents the area of the house and the y-axis represents the corresponding price. Now, to create a linear regression prediction model, we need to fit a line that best represents the given data, something similar to the green line in Fig 2.

We could have various lines passing through the given data, but in linear regression we try to find the line that optimally fits the data, i.e. the line that is at the minimum possible total distance from the data points.

The test data for X-dimension and Y-dimension is:

```r
x=c(2,2,3,4,3,6,5,4,5,7,6,9)
y=c(2,3,3,4,5,5,6,7,7,7,8,9)
```
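As a reference point (not part of the algorithm we are building), R's built-in `lm()` function computes the least-squares line for this data directly. We will implement the same fit by hand below; this snippet only shows what answer to expect:

```r
x=c(2,2,3,4,3,6,5,4,5,7,6,9)
y=c(2,3,3,4,5,5,6,7,7,7,8,9)

fit=lm(y~x)     # built-in least-squares fit
coef(fit)       # intercept ~1.38 and slope ~0.88 for this data
```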

Now that we have defined the hypothesis, we need to focus on evaluating its parameters. In the hypothesis of eq. (3), $x$ and $y$ are variables, while $\theta_0$ and $\theta_1$ are constants whose values we need to compute. To compute the values of the **thetas** we need to create a **cost function** as follows:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$   eq. (4)

Substituting the value of $h_\theta(x^{(i)})$ we get:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$   eq. (5)

**Where:** $m$ = number of training examples, i.e. the length of the vectors x and y above. The superscript $(i)$ is not a power; it denotes the $i$-th training example.

Here in eq. (4) we are trying to find the difference between the predicted value of y (the price of the house), denoted as $h_\theta(x^{(i)})$, and the actual value of y, denoted as $y^{(i)}$. For programming linear regression in R we need initial values of $\theta_0$ and $\theta_1$. Let's take the following values as initial values of the thetas (you could pick any random values):

```r
theta0=10
theta1=10
```

The R code for computing the cost function is as follows:

```r
J=function(x,y,theta0,theta1){
  m=length(x)
  sum=0
  for(i in 1:m){
    sum=sum+((theta0+theta1*x[i]-y[i])^2)   # squared error for example i
  }
  sum=sum/(2*m)                             # eq. (5)
  return(sum)
}
```
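A quick sanity check of the cost function (its definition is repeated here so the snippet is self-contained): with both thetas set to zero the hypothesis predicts 0 everywhere, so eq. (5) reduces to $\sum y^2 / 2m$, which we can verify directly:

```r
x=c(2,2,3,4,3,6,5,4,5,7,6,9)
y=c(2,3,3,4,5,5,6,7,7,7,8,9)

J=function(x,y,theta0,theta1){
  m=length(x)
  sum=0
  for(i in 1:m){
    sum=sum+((theta0+theta1*x[i]-y[i])^2)
  }
  sum=sum/(2*m)
  return(sum)
}

J(x,y,0,0)                             # equals sum(y^2)/(2*m) when predictions are all 0
J(x,y,0,0)==sum(y^2)/(2*length(y))     # TRUE
```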

From eq. (5) we can see that the cost function depends on $h_\theta(x)$, which in turn depends on the values of $\theta_0$ and $\theta_1$. So a plot of $J$ with respect to $\theta_1$ will look something like:

A similar graph could be drawn for $\theta_0$. In order to minimize $J$ we need to minimize the term $(\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2$. As $x^{(i)}$ and $y^{(i)}$ are supplied externally in the form of training examples, to minimize this term we need to find the values of $\theta_0$ and $\theta_1$ that minimize it.

If we take an arbitrary point on Fig 3 and make a small change in $\theta_1$, there will be a small change in the value of $J$. If we repeatedly adjust $\theta_1$ in the direction that reduces $J$, then step by step we will proceed towards the minimum value of $J$. Mathematically we could represent this as:

$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$   eq. (6)

Similarly:

$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$   eq. (7)

Generalizing we could write:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   eq. (8)

where $j = 0, 1$ and $\alpha$ is the learning rate.

**Note:** One point of consideration here is that both $\theta_0$ and $\theta_1$ must be updated simultaneously, i.e. within one iteration the updated value of one must not be used in calculating the other.

Now we just need to compute the partial derivatives $\frac{\partial J}{\partial \theta_0}$ and $\frac{\partial J}{\partial \theta_1}$ used in these updates:

$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)$   eq. (9)

and

$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}$   eq. (10)
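Eq. (9) and eq. (10) can be verified numerically by comparing the analytic gradients against central finite differences of the cost function (a sketch; `eps` is an arbitrary small step, and this simplified `J` takes only the thetas as arguments):

```r
x=c(2,2,3,4,3,6,5,4,5,7,6,9)
y=c(2,3,3,4,5,5,6,7,7,7,8,9)
m=length(x)

J=function(t0,t1) sum((t0+t1*x-y)^2)/(2*m)   # eq. (5)

t0=10; t1=10
# Analytic gradients from eq. (9) and eq. (10)
g0=sum(t0+t1*x-y)/m
g1=sum((t0+t1*x-y)*x)/m

# Central finite differences of J
eps=1e-6
num0=(J(t0+eps,t1)-J(t0-eps,t1))/(2*eps)
num1=(J(t0,t1+eps)-J(t0,t1-eps))/(2*eps)
# num0 should be very close to g0, and num1 to g1
```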

The R code for updating theta is:

```r
updateTheta=function(x,y,theta0,theta1){
  sum0=0
  sum1=0
  m=length(x)
  for(i in 1:m){
    sum0=sum0+(theta0+theta1*x[i]-y[i])          # summation of eq. (9)
    sum1=sum1+((theta0+theta1*x[i]-y[i])*x[i])   # summation of eq. (10)
  }
  sum0=sum0/m
  sum1=sum1/m
  theta0=theta0-(alpha*sum0)   # alpha is defined in the global environment
  theta1=theta1-(alpha*sum1)
  return(c(theta0,theta1))     # both thetas updated simultaneously
}
```
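Since R arithmetic is vectorized, the same update can also be written without the explicit loop. The version below (the name `updateThetaVec` is our own) is behaviorally equivalent to `updateTheta` above, except that `alpha` is passed in explicitly rather than read from the global environment:

```r
updateThetaVec=function(x,y,theta0,theta1,alpha){
  m=length(x)
  err=theta0+theta1*x-y                # residuals for all m examples at once
  c(theta0-alpha*sum(err)/m,           # eq. (6) with eq. (9)
    theta1-alpha*sum(err*x)/m)         # eq. (7) with eq. (10)
}

x=c(2,2,3,4,3,6,5,4,5,7,6,9)
y=c(2,3,3,4,5,5,6,7,7,7,8,9)
updateThetaVec(x,y,10,10,0.0001)       # one gradient step from theta0=theta1=10
```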

Now that we have discussed the technicalities of linear regression, the complete code is given below. You can change the parameters and play with the code. Once you execute it you will see a red line in action that shows the learning process; after the model completes learning, the line turns green.

```r
x=c(2,2,3,4,3,6,5,4,5,7,6,9)
y=c(2,3,3,4,5,5,6,7,7,7,8,9)
plot(x,y)

theta0=10
theta1=10
alpha=0.0001
initialJ=100000
learningIterations=200000

# Cost function, eq. (5)
J=function(x,y,theta0,theta1){
  m=length(x)
  sum=0
  for(i in 1:m){
    sum=sum+((theta0+theta1*x[i]-y[i])^2)
  }
  sum=sum/(2*m)
  return(sum)
}

# Simultaneous gradient-descent update, eq. (9) and eq. (10)
updateTheta=function(x,y,theta0,theta1){
  sum0=0
  sum1=0
  m=length(x)
  for(i in 1:m){
    sum0=sum0+(theta0+theta1*x[i]-y[i])
    sum1=sum1+((theta0+theta1*x[i]-y[i])*x[i])
  }
  sum0=sum0/m
  sum1=sum1/m
  theta0=theta0-(alpha*sum0)
  theta1=theta1-(alpha*sum1)
  return(c(theta0,theta1))
}

for(i in 1:learningIterations){
  thetas=updateTheta(x,y,theta0,theta1)
  tempSoln=J(x,y,theta0,theta1)
  if(tempSoln<initialJ){
    initialJ=tempSoln
  }
  if(tempSoln>initialJ){
    break                                 # stop early if the cost starts increasing
  }
  theta0=thetas[1]
  theta1=thetas[2]
  #print(thetas)
  #print(initialJ)
  plot(x,y)
  lines(x,(theta0+theta1*x), col="red")   # current fit while learning
}
lines(x,(theta0+theta1*x), col="green")   # final fitted line
```
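As a final check, the parameters learned by gradient descent should approach the least-squares solution that `lm()` computes directly. The compact sketch below reruns the same update rule with the same data, learning rate, and iteration count (plotting and the early-stopping check are omitted for brevity); with this small `alpha` the intercept converges slowly, so expect it to land within a few tenths of the `lm()` value:

```r
x=c(2,2,3,4,3,6,5,4,5,7,6,9)
y=c(2,3,3,4,5,5,6,7,7,7,8,9)

theta=c(10,10)
alpha=0.0001
for(i in 1:200000){
  err=theta[1]+theta[2]*x-y
  theta=theta-alpha*c(sum(err),sum(err*x))/length(x)   # eq. (9) and eq. (10)
}

theta            # gradient-descent estimate of (theta0, theta1)
coef(lm(y~x))    # closed-form least-squares answer, for comparison
```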