44. Simple Linear Regression Model

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The simple regression model estimates the relationship between two variables $x_i$ and $y_i$

$$y_i = \alpha + \beta x_i + \epsilon_i, \quad i = 1,2,\dots,N$$

where $\epsilon_i$ represents the error between the line of best fit and the sample values for $y_i$ given $x_i$.
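
To make the role of $\epsilon_i$ concrete, here is a minimal sketch that simulates observations from this model. The parameter values, and the assumption that the errors are drawn i.i.d. from a normal distribution, are purely illustrative.

# Simulate N observations from y_i = α + β x_i + ε_i
# (illustrative values only; ε_i ~ N(0, σ²) is an assumption, not part of the model above)
rng = np.random.default_rng(0)
N = 10
α_true, β_true, σ = 10.0, 60.0, 100.0
x_sim = rng.uniform(0, 35, size=N)    # e.g. temperatures in Celsius
ε = rng.normal(0, σ, size=N)          # error term
y_sim = α_true + β_true * x_sim + ε   # observed values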

Our goal is to choose values for α and β to build a line of “best” fit for some data that is available for variables $x_i$ and $y_i$.

Let us consider a simple dataset of 10 observations for variables $x_i$ and $y_i$:

|     | $y_i$ | $x_i$ |
|-----|-------|-------|
| 1   | 2000  | 32    |
| 2   | 1000  | 21    |
| 3   | 1500  | 24    |
| 4   | 2500  | 35    |
| 5   | 500   | 10    |
| 6   | 900   | 11    |
| 7   | 1100  | 22    |
| 8   | 1500  | 21    |
| 9   | 1800  | 27    |
| 10  | 250   | 2     |

Let us think about $y_i$ as sales for an ice-cream cart, while $x_i$ is a variable that records the day’s temperature in Celsius.

x = [32, 21, 24, 35, 10, 11, 22, 21, 27, 2]
y = [2000, 1000, 1500, 2500, 500, 900, 1100, 1500, 1800, 250]
df = pd.DataFrame([x,y]).T
df.columns = ['X', 'Y']
df

We can use a scatter plot of the data to see the relationship between $y_i$ (ice-cream sales in dollars ($'s)) and $x_i$ (degrees Celsius).

ax = df.plot(
    x='X', 
    y='Y', 
    kind='scatter', 
    ylabel='Ice-cream sales ($\'s)', 
    xlabel='Degrees Celsius'
)

Figure 1: Scatter plot

As you can see, the data suggests that more ice-cream is typically sold on hotter days.

To build a linear model of the data, we need to choose values for α and β that represent a line of “best” fit such that

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

Let’s start with $\alpha = 5$ and $\beta = 10$

α = 5
β = 10
df['Y_hat'] = α + β * df['X']
fig, ax = plt.subplots()
ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)
plt.show()

Figure 2: Scatter plot with a line of fit

We can see that this model does a poor job of estimating the relationship.

We can continue to guess and iterate towards a line of “best” fit by adjusting the parameters

β = 100
df['Y_hat'] = α + β * df['X']
fig, ax = plt.subplots()
ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)
plt.show()

Figure 3: Scatter plot with a line of fit #2

β = 65
df['Y_hat'] = α + β * df['X']
fig, ax = plt.subplots()
ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
plt.show()

Figure 4: Scatter plot with a line of fit #3

However, we need to formalize this guessing process by treating it as an optimization problem.

Let’s consider the error $\epsilon_i$ and define the residuals as the difference between the observed values $y_i$ and the estimated values $\hat{y}_i$

$$\begin{aligned} \hat{e}_i &= y_i - \hat{y}_i \\ &= y_i - \hat{\alpha} - \hat{\beta} x_i \end{aligned}$$
df['error'] = df['Y'] - df['Y_hat']  # residuals e_i = y_i - y_hat_i
df
fig, ax = plt.subplots()
ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r')
plt.show()

Figure 5: Plot of the residuals

The Ordinary Least Squares (OLS) method chooses α and β so as to minimize the sum of squared residuals (SSR).

$$\min_{\alpha,\beta} \sum_{i=1}^{N} \hat{e}_i^2 = \min_{\alpha,\beta} \sum_{i=1}^{N} (y_i - \alpha - \beta x_i)^2$$

Let’s call this a cost function

$$C = \sum_{i=1}^{N} (y_i - \alpha - \beta x_i)^2$$

that we would like to minimize with parameters α and β.
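
To make the cost function concrete, here is a minimal sketch of a helper that evaluates $C$ for a candidate pair of parameters using the data in df (the name ssr is a hypothetical helper, not something defined elsewhere in this lecture):

def ssr(α, β, x, y):
    """Sum of squared residuals for the candidate line y = α + β x."""
    return ((y - α - β * x) ** 2).sum()

# Compare the cost of two of the guesses used above
print(ssr(5, 10, df['X'], df['Y']))
print(ssr(5, 65, df['X'], df['Y']))

Lower values of ssr correspond to lines that sit closer to the scatter, which is what the guessing above was trying to achieve by eye.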

44.1 How does error change with respect to α and β

Let us first look at how the total error changes with respect to β (holding the intercept α constant)

We know from the next section that the optimal values for α and β are:

β_optimal = 64.38
α_optimal = -14.72

We can then calculate the total error (here measured as the sum of absolute errors) for a range of β values

errors = {}
for β in np.arange(20,100,0.5):
    errors[β] = abs((α_optimal + β * df['X']) - df['Y']).sum()

Plotting the error

ax = pd.Series(errors).plot(xlabel='β', ylabel='error')
plt.axvline(β_optimal, color='r');

Figure 6: Plotting the error

Now let us vary α (holding β constant)

errors = {}
for α in np.arange(-500,500,5):
    errors[α] = abs((α + β_optimal * df['X']) - df['Y']).sum()

Plotting the error

ax = pd.Series(errors).plot(xlabel='α', ylabel='error')
plt.axvline(α_optimal, color='r');

Figure 7: Plotting the error (2)
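
Before turning to the analytical solution, note that we could also minimize the cost function numerically. The sketch below uses scipy.optimize.minimize; treating SciPy as available here is an assumption on our part, and the next section instead derives the solution with calculus.

from scipy.optimize import minimize

# The cost C as a function of the parameter vector (α, β)
def cost(params):
    a, b = params
    return ((df['Y'] - a - b * df['X']) ** 2).sum()

res = minimize(cost, x0=[0.0, 0.0])
print(res.x)  # should be close to (α_optimal, β_optimal)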

44.2 Calculating optimal values

Now let us use calculus to solve the optimization problem and compute the optimal values for α and β to find the ordinary least squares solution.

First taking the partial derivative with respect to α

$$\frac{\partial C}{\partial \alpha} = \frac{\partial}{\partial \alpha} \sum_{i=1}^{N} (y_i - \alpha - \beta x_i)^2$$

and setting it equal to 0

$$0 = \sum_{i=1}^{N} -2(y_i - \alpha - \beta x_i)$$

we can remove the constant -2 from the summation by dividing both sides by -2

$$0 = \sum_{i=1}^{N} (y_i - \alpha - \beta x_i)$$

Now we can split this equation up into the components

$$0 = \sum_{i=1}^{N} y_i - \sum_{i=1}^{N} \alpha - \beta \sum_{i=1}^{N} x_i$$

The middle term is a straightforward sum of the constant α over $i=1,\dots,N$

$$0 = \sum_{i=1}^{N} y_i - N\alpha - \beta \sum_{i=1}^{N} x_i$$

and rearranging terms

$$\alpha = \frac{\sum_{i=1}^{N} y_i - \beta \sum_{i=1}^{N} x_i}{N}$$

We observe that both fractions resolve to the sample means $\bar{y}$ and $\bar{x}$

$$\alpha = \bar{y} - \beta\bar{x}$$
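
As a quick check against our dataset: the sample means are $\bar{y} = 1305$ and $\bar{x} = 20.5$, so whatever value of β we arrive at, the intercept must satisfy

$$\alpha = 1305 - 20.5\,\beta$$

With the (unrounded) value of β computed below, this gives α ≈ -14.72, the value quoted earlier.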

Now let’s take the partial derivative of the cost function $C$ with respect to β

$$\frac{\partial C}{\partial \beta} = \frac{\partial}{\partial \beta} \sum_{i=1}^{N} (y_i - \alpha - \beta x_i)^2$$

and setting it equal to 0

$$0 = \sum_{i=1}^{N} -2 x_i (y_i - \alpha - \beta x_i)$$

we can again take the constant outside of the summation and divide both sides by -2

$$0 = \sum_{i=1}^{N} x_i (y_i - \alpha - \beta x_i)$$

which becomes

$$0 = \sum_{i=1}^{N} (x_i y_i - \alpha x_i - \beta x_i^2)$$

now substituting for α

$$0 = \sum_{i=1}^{N} \left( x_i y_i - (\bar{y} - \beta \bar{x}) x_i - \beta x_i^2 \right)$$

and rearranging terms

$$0 = \sum_{i=1}^{N} \left( x_i y_i - \bar{y} x_i + \beta \bar{x} x_i - \beta x_i^2 \right)$$

This can be split into two summations

$$0 = \sum_{i=1}^{N} (x_i y_i - \bar{y} x_i) + \beta \sum_{i=1}^{N} (\bar{x} x_i - x_i^2)$$

and solving for β yields

$$\beta = \frac{\sum_{i=1}^{N}(x_i y_i - \bar{y} x_i)}{\sum_{i=1}^{N}(x_i^2 - \bar{x} x_i)}$$
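
Since $\sum_{i=1}^{N}(x_i y_i - \bar{y} x_i) = \sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$ and $\sum_{i=1}^{N}(x_i^2 - \bar{x} x_i) = \sum_{i=1}^{N}(x_i - \bar{x})^2$, this is equivalent to the familiar ratio of the sample covariance of $x$ and $y$ to the sample variance of $x$:

$$\beta = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}$$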

We can now use the expressions derived above for α and β to calculate their optimal values.

Calculating β

df = df[['X','Y']].copy()  # Original Data

# Calculate the sample means
x_bar = df['X'].mean()
y_bar = df['Y'].mean()

Now we compute the numerator and denominator terms for each of the 10 observations and then sum them

# Compute the Sums
df['num'] = df['X'] * df['Y'] - y_bar * df['X']
df['den'] = pow(df['X'],2) - x_bar * df['X']
β = df['num'].sum() / df['den'].sum()
print(β)
64.37665782493369

Calculating α

α = y_bar - β * x_bar
print(α)
-14.72148541114052
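
As an optional sanity check (not part of the lecture's derivation), we can compare these values against NumPy's built-in least-squares polynomial fit, which should return essentially the same slope and intercept:

# A degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(df['X'], df['Y'], 1)
print(slope, intercept)  # should match β and α computed above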

Now we can plot the OLS solution

df['Y_hat'] = α + β * df['X']
df['error'] = df['Y'] - df['Y_hat']  # residuals e_i = y_i - y_hat_i

fig, ax = plt.subplots()
ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)
ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')
plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');

Figure 8: OLS line of best fit


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.