Linear Regression is one of the simplest and most important Machine Learning algorithms used for predicting numerical values. It helps us understand the relationship between an input variable and an output variable.

In this example, the goal is to predict a student’s score from the number of hours they study.

Understanding the Dataset

What Does Linear Regression Do?

Linear Regression tries to draw the best-fit straight line through the data points.

Linear Regression Equation
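For a single input variable, the line can be written in the same notation the analysis section of this post uses. Here $y$ is the predicted score, $x_1$ is the number of hours studied, $a_0$ is the intercept and $a_1$ is the slope:

```latex
y = a_0 + a_1 x_1
```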

Understanding the Slope

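In our student dataset, the slope is positive: scores tend to rise as study hours increase. The model fitted below has a slope of about 9.74, meaning each additional hour of study corresponds to roughly 9.7 more points.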

Best-Fit Line Concept

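The regression line is called the best-fit line because it stays as close as possible to all data points: among all straight lines, it minimizes the sum of the squared vertical distances between the data points and the line (ordinary least squares).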
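Written as a formula, "as close as possible" is usually taken to mean the least-squares criterion below. The index $i$, the count $n$, and the means $\bar{x}$ and $\bar{y}$ are notation introduced here for illustration; $x_i$ and $y_i$ are the hours and score of the $i$-th student, and $a_0$, $a_1$ are the intercept and slope of the line.

```latex
\min_{a_0,\, a_1} \; \sum_{i=1}^{n} \bigl( y_i - (a_0 + a_1 x_i) \bigr)^2

% For one input variable the minimizer has a closed form:
a_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a_0 = \bar{y} - a_1 \bar{x}
```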

Why Linear Regression is Important

Real-World Applications

Linear Regression: Simple

Task

You are given data that contains students' study hours and their scores.

Design an ML model that predicts a student's score from the number of hours they studied.

Steps:

Collect data -> Split into training and testing sets -> Feed the training data to the model (i.e., train the model) -> Evaluate the model on the testing data

In [2]:

import pandas as pd

In [4]:

# collect the data


data = {
    "Hours": [
        2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5,
        8.3, 2.7, 7.7, 5.9, 4.5, 3.3, 1.1, 8.9,
        2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8, 6.9,
        7.8, 7.0, 7.9, 4.0, 3.0, 4.8, 3.2, 5.0,
        2.0, 7.8
    ],

    "Scores": [
        21, 47, 27, 75, 30, 20, 88, 60,
        81, 25, 85, 62, 41, 42, 17, 95,
        30, 24, 67, 69, 30, 54, 35, 76,
        86, 88, 86, 45, 34, 56, 32, 55,
        33, 90
    ]
}
df = pd.DataFrame(data)

print(df)

    Hours  Scores
0     2.5      21
1     5.1      47
2     3.2      27
3     8.5      75
4     3.5      30
5     1.5      20
6     9.2      88
7     5.5      60
8     8.3      81
9     2.7      25
10    7.7      85
11    5.9      62
12    4.5      41
13    3.3      42
14    1.1      17
15    8.9      95
16    2.5      30
17    1.9      24
18    6.1      67
19    7.4      69
20    2.7      30
21    4.8      54
22    3.8      35
23    6.9      76
24    7.8      86
25    7.0      88
26    7.9      86
27    4.0      45
28    3.0      34
29    4.8      56
30    3.2      32
31    5.0      55
32    2.0      33
33    7.8      90

Let's visualize

Our initial question was whether students who study longer score higher. In essence, we are asking about the relationship between Hours and Scores. A scatter plot is a good way to explore the relationship between two numeric variables.
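Alongside the plot, the strength of a linear relationship can be summarized with the Pearson correlation coefficient (values near 1 indicate a strong positive linear relationship). The sketch below is a pure-Python illustration on the dataset above; the helper name `pearson_r` is hypothetical and not part of the original notebook.

```python
import math

hours = [
    2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5,
    8.3, 2.7, 7.7, 5.9, 4.5, 3.3, 1.1, 8.9,
    2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8, 6.9,
    7.8, 7.0, 7.9, 4.0, 3.0, 4.8, 3.2, 5.0,
    2.0, 7.8,
]
scores = [
    21, 47, 27, 75, 30, 20, 88, 60,
    81, 25, 85, 62, 41, 42, 17, 95,
    30, 24, 67, 69, 30, 54, 35, 76,
    86, 88, 86, 45, 34, 56, 32, 55,
    33, 90,
]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(hours, scores)
print(f"Pearson correlation between Hours and Scores: {r:.4f}")
```

With pandas, the same quantity should be available as `df['Hours'].corr(df['Scores'])`.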

In [5]:

# Plot a scatter graph with x=Hours and y=Scores

df.plot.scatter(x='Hours', y='Scores', title='Scatterplot of hours and scores percentages')

# Observation: the relationship between x and y looks roughly linear

Out[5]:

<Axes: title={'center': 'Scatterplot of hours and scores percentages'}, xlabel='Hours', ylabel='Scores'>

[Figure: scatter plot of Hours (x-axis) against Scores (y-axis), titled 'Scatterplot of hours and scores percentages']

In [6]:

# Now we split the dataset into training and testing sets

# Step 1: Separate the data into feature X and response y
X = df[['Hours']] # The feature used to predict y
y = df['Scores']  # The value to predict: label/response/target

In [7]:

# Step 2: Hold out 20% of the rows for testing and train on the rest
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [8]:

# See how many records are in the training and testing datasets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (27, 1)
X_test shape: (7, 1)
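Because `train_test_split` shuffles the rows randomly, the split above (and every number printed after it) can change from run to run. A minimal sketch, assuming a fixed seed is acceptable: passing `random_state` makes the split repeatable. The seed value 42 and the helper `split_with_seed` are arbitrary illustrative choices, and only the first eight rows of the dataset are used to keep the example short.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# First eight rows of the dataset above, kept short for illustration
df_small = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5],
    "Scores": [21, 47, 27, 75, 30, 20, 88, 60],
})
X_small = df_small[["Hours"]]
y_small = df_small["Scores"]


def split_with_seed(seed):
    """Split the small dataset 80/20 using a fixed random seed."""
    return train_test_split(X_small, y_small, test_size=0.2, random_state=seed)


# Two calls with the same seed select the same rows for testing
X_train_a, X_test_a, y_train_a, y_test_a = split_with_seed(42)
X_train_b, X_test_b, y_train_b, y_test_b = split_with_seed(42)

same_split = list(X_test_a.index) == list(X_test_b.index)
print("Test rows (first call): ", list(X_test_a.index))
print("Test rows (second call):", list(X_test_b.index))
print("Identical split:", same_split)
```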

In [9]:

# Modelling and Training

from sklearn.linear_model import LinearRegression

model = LinearRegression()

# Training time grows with the size of the training data
model.fit(X_train, y_train)

Out[9]:

LinearRegression()


In [10]:

# Predict the score of a student who studies for hours=9.5,
# and check it against the plot above

sample = [[9.5]]
sample_df = pd.DataFrame(sample, columns=X_train.columns)

score = model.predict(sample_df)

print("The score is", score)

The score is [96.61912726]

Model Performance Evaluation

In [11]:

# Step 1: Make predictions on the test data set
y_pred = model.predict(X_test)

print(y_pred)

[90.77311148 35.23596161 80.05541589 35.23596161 50.82533701 14.77490639
 28.41560987]

In [12]:

# Step 2: Look at the actual test scores
# print(X_test)
print(y_test.values)

[95 32 90 27 56 17 30]

In [13]:

# Step 3: Compute a performance metric for the model
from sklearn.metrics import r2_score

# Compute R-squared value
r2 = r2_score(y_test, y_pred)

print(f'R-squared: {r2:.4f}')  # closer to 1 is better

R-squared: 0.9618
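To see what `r2_score` measures, the value can be recomputed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses the test scores and predictions printed above (copied to the displayed precision); the function name `r_squared` is a hypothetical helper, not part of the notebook.

```python
# Actual test scores and model predictions, copied from the outputs above
y_test_values = [95, 32, 90, 27, 56, 17, 30]
y_pred_values = [
    90.77311148, 35.23596161, 80.05541589, 35.23596161,
    50.82533701, 14.77490639, 28.41560987,
]


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: squared errors of the model
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: squared errors of always predicting the mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


r2_manual = r_squared(y_test_values, y_pred_values)
print(f"R-squared (manual): {r2_manual:.4f}")
```

This reproduces the 0.9618 reported by `r2_score` for this particular split.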

A pretty good R-squared value: the model explains about 96% of the variance in the test scores

STOP

OPTIONAL

In [14]:

# Now let's see the absolute difference between the actual and predicted scores, and also the percentage difference.

df_preds = pd.DataFrame({
    'hours': X_test.squeeze(),
    'actual_score': y_test.squeeze(),
    'predicted_score': y_pred.squeeze()})

df_preds['difference'] = (df_preds['actual_score'] - df_preds['predicted_score']).abs()

# Add the percentage difference column
df_preds['percentage_difference'] = (
    df_preds['difference'] / df_preds['actual_score']
) * 100

print(df_preds)

    hours  actual_score  predicted_score  difference  percentage_difference
15    8.9            95        90.773111    4.226889               4.449356
30    3.2            32        35.235962    3.235962              10.112380
33    7.8            90        80.055416    9.944584              11.049538
2     3.2            27        35.235962    8.235962              30.503562
29    4.8            56        50.825337    5.174663               9.240470
14    1.1            17        14.774906    2.225094              13.088786
16    2.5            30        28.415610    1.584390               5.281300
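The per-row differences in the table can be condensed into single error figures. The sketch below computes the mean absolute error (MAE) and root mean squared error (RMSE) from the actual and predicted scores shown above; these two summaries are an addition for illustration and were not computed in the original notebook.

```python
import math

# Actual and predicted scores, in the row order of the table above
actual_scores = [95, 32, 90, 27, 56, 17, 30]
predicted_scores = [
    90.773111, 35.235962, 80.055416, 35.235962,
    50.825337, 14.774906, 28.415610,
]

# Signed error for each test row
errors = [a - p for a, p in zip(actual_scores, predicted_scores)]

# MAE: average size of the error, in score points
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: like MAE, but penalizes large errors more heavily
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Both figures are in the same units as the scores, which makes them easier to interpret than R-squared when the question is "how many points off is a typical prediction?".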

Analyze:

The y-intercept

To find the equation of the line, $$ y = a_0 + a_1 x_1 $$ we need the intercept $a_0$ and the slope $a_1$. Both can be read from the fitted model above.
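For a single feature, the least-squares intercept and slope also have a closed form: the slope is the sum of products of deviations from the means divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below applies this to all 34 rows, whereas the model above was fitted on a random training split, so the numbers need not match the model's exactly. The names `fit_line`, `a0_all`, and `a1_all` are hypothetical.

```python
hours = [
    2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5,
    8.3, 2.7, 7.7, 5.9, 4.5, 3.3, 1.1, 8.9,
    2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8, 6.9,
    7.8, 7.0, 7.9, 4.0, 3.0, 4.8, 3.2, 5.0,
    2.0, 7.8,
]
scores = [
    21, 47, 27, 75, 30, 20, 88, 60,
    81, 25, 85, 62, 41, 42, 17, 95,
    30, 24, 67, 69, 30, 54, 35, 76,
    86, 88, 86, 45, 34, 56, 32, 55,
    33, 90,
]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


a0_all, a1_all = fit_line(hours, scores)
print(f"Intercept on all rows: {a0_all:.4f}")
print(f"Slope on all rows:     {a1_all:.4f}")
```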

In [15]:

# Read the fitted parameters from the model
a0 = model.intercept_  # intercept
a1 = model.coef_[0]  # x-coefficient (slope)

print(f"The intercept a0: {a0}")
print(f"The x-coeff   a1: {a1}")

The intercept a0: 4.057210803555726
The x-coeff   a1: 9.743359626743649

In [16]:

# Round them to 4 decimal places
a0 = round(a0, 4)  # intercept
a1 = round(a1, 4)  # x-coefficient (slope)

print(f"The intercept (a0): {a0}")
print(f"The x-coeff (a1)  : {a1}")

The intercept (a0): 4.0572
The x-coeff (a1)  : 9.7434

These values can be plugged directly into the formula from before: $$ y = a_0 + a_1 x_1 $$

score = a0 + a1 * hours = 4.0572 + 9.7434 * hours

In [19]:

# Plot the model line on top of the data
import numpy as np
import matplotlib.pyplot as plt

x = df['Hours']
y = df['Scores']

# Create the scatter plot
plt.scatter(x, y, label="Data Points", color="blue")

# Generate x values for the line
x_line = np.linspace(x.min(), x.max(), 100)  # Create a range of x values
y_line = a0 + a1 * x_line   # Compute corresponding y values using the equation

# Plot the regression line
plt.plot(x_line, y_line, color="green", label=f"y = {a0} + {a1} * x")

# Add labels and title
plt.xlabel("Hours Studied")
plt.ylabel("Scores Percentage")
plt.title("Scatterplot of Hours and Scores with Regression Line")

plt.legend() # Show legend

plt.show() # Display the plot

Let's calculate the score manually and also using the predict method

In [20]:

# Calculate the score from the equation:
# score = a0 + a1 * hours
a0 = model.intercept_  # intercept
a1 = model.coef_[0]  # x-coefficient (slope)

hours = 9.5
score = a0 + a1 * hours

print(score)

96.61912725762039

In [21]:

new_data = pd.DataFrame([9.5,], columns=X_train.columns)
score = model.predict(new_data)
print(score)

[96.61912726]

Both are the same

The value returned by model.predict matches the one computed by hand from the intercept and slope.

Predict values for a larger dataset:

In [22]:

# Suppose someone gives you a larger set of inputs to predict on
new_data = pd.DataFrame([8.3, 2.5, 5.9], columns=X_train.columns) # hours

# Make predictions on the new data
y_pred = model.predict(new_data)
print(y_pred)

[84.92709571 28.41560987 61.5430326 ]
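A batch prediction is the line equation applied to each input in turn. As a quick check, the sketch below uses the intercept and slope printed earlier (copied from that output, so they are specific to this particular training split) and applies `score = a0 + a1 * hours` to the same three inputs.

```python
# Fitted parameters copied from the output printed earlier
a0 = 4.057210803555726  # intercept
a1 = 9.743359626743649  # slope

new_hours = [8.3, 2.5, 5.9]

# Apply the line equation to every input value
manual_preds = [a0 + a1 * h for h in new_hours]

for h, p in zip(new_hours, manual_preds):
    print(f"hours={h}: predicted score={p:.4f}")
```

The results should agree with the array printed by `model.predict` above to the displayed precision.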

Model Deployment:

Save the model, then send it to a server where users can access it.

In [23]:

# Save the trained model to disk
import joblib

joblib.dump(model, 'linear_regression_model.pkl')

Out[23]:

['linear_regression_model.pkl']

Load the model later

In [24]:

# Load the saved model
import joblib

loaded_model = joblib.load('linear_regression_model.pkl')

In [25]:

# make prediction
new_data = pd.DataFrame([9.5], columns=X_train.columns) # hours
score = loaded_model.predict(new_data)
print(score)

[96.61912726]
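Before shipping a saved model, it can be worth confirming that the save/load round trip preserves its behaviour. The following is a minimal sketch under stated assumptions: it fits a throwaway model on the first eight rows of the dataset (so its numbers differ from the notebook's model), writes it to a temporary directory rather than the working directory, and compares predictions before and after loading.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# First eight rows of the study-hours data, used only to get a fitted model
hours = np.array([[2.5], [5.1], [3.2], [8.5], [3.5], [1.5], [9.2], [5.5]])
scores = np.array([21, 47, 27, 75, 30, 20, 88, 60])
model = LinearRegression().fit(hours, scores)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_regression_model.pkl")
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

# The loaded model should give the same prediction as the original
sample = np.array([[9.5]])
original_pred = model.predict(sample)
loaded_pred = loaded_model.predict(sample)
round_trip_ok = bool(np.allclose(original_pred, loaded_pred))

print("Round trip preserved predictions:", round_trip_ok)
```

Files written by `joblib.dump` are pickle-based, so `joblib.load` should only be used on model files from a trusted source.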

Linear Regression Explained Using Student Study Hours Example

