Linear Regression is one of the simplest and most important Machine Learning algorithms used for predicting numerical values. It helps us understand the relationship between an input variable and an output variable.
In this example, we want to study the relationship between:
- Hours Studied → Input Variable (Independent Variable)
- Exam Scores → Output Variable (Dependent Variable)
The main goal is to predict a student’s score based on the number of hours they study.
Understanding the Dataset
Our dataset contains two columns. A few sample rows:
| Hours Studied | Exam Score |
|---|---|
| 2.5 | 21 |
| 5.1 | 47 |
| 8.5 | 75 |
| 9.2 | 88 |
From the data, we can observe a clear pattern:
- Students who study more hours generally score higher marks.
- Students who study fewer hours tend to score lower marks.
This type of relationship is called a positive linear relationship.
What Does Linear Regression Do?
Linear Regression tries to draw the best-fit straight line through the data points.
This line helps us:
- Understand the trend in data
- Predict future values
- Estimate unknown outputs
For example:
- If a student studies for 6 hours, the model can estimate the expected exam score.
- If a student studies for 10 hours, the model predicts an even higher score.
Linear Regression Equation
The straight-line equation used in Linear Regression is:
y = mx + b
Where:
- y = Predicted Score
- x = Hours Studied
- m = Slope of the line
- b = Intercept
Understanding the Slope
The slope (m) tells us how much the predicted score changes for each additional hour of study.
For example:
- If the slope is positive, scores increase as study hours increase.
- A steeper slope means scores improve rapidly with more study time.
In our student dataset, the slope is positive because higher study hours usually lead to better marks.
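As a rough illustration of the equation and the slope, the sketch below evaluates y = mx + b for two study times that are one hour apart. The function name `predict_score` and the values m = 10 and b = 5 are made up for this example; they are not fitted to the dataset.

```python
def predict_score(hours: float, m: float, b: float) -> float:
    """Evaluate the straight-line equation y = m*x + b."""
    return m * hours + b


# Illustrative values only (assumed for this sketch, not fitted to any data)
m = 10.0  # slope: points gained per additional hour of study
b = 5.0   # intercept: predicted score at zero hours

six = predict_score(6, m, b)
seven = predict_score(7, m, b)

# One extra hour changes the prediction by exactly the slope
print(six, seven, seven - six)  # prints: 65.0 75.0 10.0
```

Whatever values m and b take, the gap between the predictions for x and x + 1 is always m, which is why the slope is read as "points per extra hour".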
Best-Fit Line Concept
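The regression line is called the best-fit line because, of all possible straight lines, it stays closest to the data points overall. In ordinary least squares, the standard method and the one scikit-learn's LinearRegression uses, "closest" means that the sum of the squared vertical distances between the points and the line is as small as possible.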
Some points may lie:
- Above the line
- Below the line
But overall, the line captures the general trend of the data.
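To make "best fit" concrete, here is a minimal sketch that computes the least-squares slope and intercept by hand for the four sample rows shown in the table earlier. It assumes the ordinary least-squares criterion (minimizing the sum of squared vertical distances), and the variable names are chosen for this example only.

```python
import numpy as np

# The four sample rows from the table above
hours = np.array([2.5, 5.1, 8.5, 9.2])
scores = np.array([21, 47, 75, 88], dtype=float)

# Closed-form least-squares estimates for a single input variable
x_mean, y_mean = hours.mean(), scores.mean()
m = np.sum((hours - x_mean) * (scores - y_mean)) / np.sum((hours - x_mean) ** 2)
b = y_mean - m * x_mean

# Residuals: vertical distances between each point and the line
residuals = scores - (m * hours + b)
sse = np.sum(residuals ** 2)

# Any other line, e.g. one with a slightly different slope, has a larger SSE
sse_other = np.sum((scores - ((m + 0.5) * hours + b)) ** 2)

print(f"slope={m:.3f}, intercept={b:.3f}, SSE={sse:.3f}")
print("Nudged line has larger SSE:", sse_other > sse)
```

Some residuals are positive (points above the line) and some are negative (points below it); for a least-squares line with an intercept they sum to zero.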
Why Linear Regression is Important
Linear Regression is widely used because:
- It is simple and easy to understand
- It works well for numerical prediction problems
- It helps explain relationships between variables
- It forms the foundation of many advanced Machine Learning algorithms
Real-World Applications
Linear Regression is commonly used in:
- Student performance prediction
- House price prediction
- Sales forecasting
- Financial analysis
- Business analytics
- Risk prediction
Linear Regression: Simple
Task
You are given data containing students' study hours and their exam scores.
Design an ML model that predicts a student's score from the number of hours studied.
Steps:
Collect data -> Split into training + testing sets -> Feed the training data to the model (i.e. train the model) -> Evaluate the model on the testing data
import pandas as pd
# collect the data
data = {
"Hours": [
2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5,
8.3, 2.7, 7.7, 5.9, 4.5, 3.3, 1.1, 8.9,
2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8, 6.9,
7.8, 7.0, 7.9, 4.0, 3.0, 4.8, 3.2, 5.0,
2.0, 7.8
],
"Scores": [
21, 47, 27, 75, 30, 20, 88, 60,
81, 25, 85, 62, 41, 42, 17, 95,
30, 24, 67, 69, 30, 54, 35, 76,
86, 88, 86, 45, 34, 56, 32, 55,
33, 90
]
}
df = pd.DataFrame(data)
print(df)
    Hours  Scores
0     2.5      21
1     5.1      47
2     3.2      27
3     8.5      75
4     3.5      30
5     1.5      20
6     9.2      88
7     5.5      60
8     8.3      81
9     2.7      25
10    7.7      85
11    5.9      62
12    4.5      41
13    3.3      42
14    1.1      17
15    8.9      95
16    2.5      30
17    1.9      24
18    6.1      67
19    7.4      69
20    2.7      30
21    4.8      54
22    3.8      35
23    6.9      76
24    7.8      86
25    7.0      88
26    7.9      86
27    4.0      45
28    3.0      34
29    4.8      56
30    3.2      32
31    5.0      55
32    2.0      33
33    7.8      90
Let's visualize
Our initial question was whether studying longer leads to a higher score. In essence, we are asking about the relationship between Hours and Scores. A scatter plot is a good way to explore the relationship between two numerical variables.
# Let's draw a scatter plot with x=Hours and y=Scores
df.plot.scatter(x='Hours', y='Scores', title='Scatterplot of hours and scores percentages')
# Observation: the points fall roughly along a straight line, so x and y are linearly related
<Axes: title={'center': 'Scatterplot of hours and scores percentages'}, xlabel='Hours', ylabel='Scores'>
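One way to put a number on what the scatter plot shows is the Pearson correlation coefficient, which ranges from -1 to 1 and is close to 1 for a strong positive linear relationship. The notebook itself does not compute it; this is a minimal sketch using NumPy on the same Hours and Scores values, with variable names chosen for the example.

```python
import numpy as np

# Same values as the "Hours" and "Scores" columns of the dataset
hours = np.array([
    2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5,
    8.3, 2.7, 7.7, 5.9, 4.5, 3.3, 1.1, 8.9,
    2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8, 6.9,
    7.8, 7.0, 7.9, 4.0, 3.0, 4.8, 3.2, 5.0,
    2.0, 7.8,
])
scores = np.array([
    21, 47, 27, 75, 30, 20, 88, 60,
    81, 25, 85, 62, 41, 42, 17, 95,
    30, 24, 67, 69, 30, 54, 35, 76,
    86, 88, 86, 45, 34, 56, 32, 55,
    33, 90,
])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the correlation between hours and scores
r = np.corrcoef(hours, scores)[0, 1]
print(f"Pearson correlation: {r:.4f}")
```

For this dataset the coefficient comes out above 0.9, which supports fitting a straight line.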
# Now we split the dataset into training and testing sets
# Step 1: Separate the data into feature X and response y
X = df[['Hours']] # The feature used to predict y (kept 2-D, as scikit-learn expects)
y = df['Scores'] # The value to predict: label/response/target
# Step 2: Hold out 20% of the rows for testing
from sklearn.model_selection import train_test_split
# No random_state is passed, so the split is random and the numbers
# printed below will differ from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Let's see how many records are in the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (27, 1)
X_test shape: (7, 1)
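Because the split above is random, rerunning the notebook gives a different test set and therefore different metrics. If reproducible results are wanted, `train_test_split` accepts a `random_state` seed. A small sketch, using the first eight rows of the dataset as a stand-in and an arbitrarily chosen seed of 42:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: the first eight rows of the dataset
df = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5],
    "Scores": [21, 47, 27, 75, 30, 20, 88, 60],
})
X, y = df[["Hours"]], df["Scores"]

# random_state=42 is an arbitrary seed picked for this sketch
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

# With the same seed, both calls select exactly the same test rows
same_test_rows = split_a[1].index.equals(split_b[1].index)
print("Same test rows on both calls:", same_test_rows)
print("Train/test sizes:", len(split_a[0]), len(split_a[1]))
```

The outputs shown in the rest of this notebook came from an unseeded split, so adding a seed to the original cell would change them.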
# Modelling and Training
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# Fitting can take a long time when the training set is large
model.fit(X_train, y_train)
LinearRegression()
# Let's predict the score of a student who studies for 9.5 hours
# and check it against the plot above
sample = [[9.5]]
sample_df = pd.DataFrame(sample, columns=X_train.columns)
score = model.predict(sample_df)
print("The score is", score)
The score is [96.61912726]
Model Performance Evaluation
# Step 1: Make predictions on the test set
y_pred = model.predict(X_test)
print(y_pred)
[90.77311148 35.23596161 80.05541589 35.23596161 50.82533701 14.77490639 28.41560987]
# step2: look at the actual test data
# print(X_test)
print(y_test.values)
[95 32 90 27 56 17 30]
# Step 3: Measure the model's performance
from sklearn.metrics import r2_score
# Compute the R-squared value
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2:.4f}') # 1.0 is a perfect fit; closer to 1 is better
R-squared: 0.9618
That is a good R-squared value: the model explains about 96% of the variance in the test scores.
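For intuition, R-squared can be reproduced by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below hard-codes the test values and predictions printed above, so it applies to that particular run only.

```python
# Actual test scores and model predictions, copied from the output above
y_true = [95, 32, 90, 27, 56, 17, 30]
y_hat = [90.77311148, 35.23596161, 80.05541589, 35.23596161,
         50.82533701, 14.77490639, 28.41560987]

mean_true = sum(y_true) / len(y_true)

# Sum of squared errors of the model
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
# Sum of squared errors of a baseline that always predicts the mean
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r2_manual = 1 - ss_res / ss_tot
print(f"R-squared (manual): {r2_manual:.4f}")
```

This matches the value reported by `r2_score` to four decimal places. Read this way, R-squared says how much better the line does than always guessing the average score.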
STOP
OPTIONAL
# Now let's look at the absolute difference between the actual and predicted scores, and that difference as a percentage of the actual score.
df_preds = pd.DataFrame({
'hours': X_test.squeeze(),
'actual_score': y_test.squeeze(),
'predicted_score': y_pred.squeeze()})
df_preds['difference'] = (df_preds['actual_score'] - df_preds['predicted_score']).abs()
# Add the percentage difference column
df_preds['percentage_difference'] = (
df_preds['difference'] / df_preds['actual_score']
) * 100
print(df_preds)
    hours  actual_score  predicted_score  difference  percentage_difference
15    8.9            95        90.773111    4.226889               4.449356
30    3.2            32        35.235962    3.235962              10.112380
33    7.8            90        80.055416    9.944584              11.049538
2     3.2            27        35.235962    8.235962              30.503562
29    4.8            56        50.825337    5.174663               9.240470
14    1.1            17        14.774906    2.225094              13.088786
16    2.5            30        28.415610    1.584390               5.281300
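The per-row differences can also be summarized in a single number. Two common choices, which this notebook does not compute itself, are the mean absolute error (MAE) and the root mean squared error (RMSE). A sketch using the actual and predicted values from this particular run:

```python
import math

# Actual and predicted test scores from the run shown above
actual = [95, 32, 90, 27, 56, 17, 30]
predicted = [90.77311148, 35.23596161, 80.05541589, 35.23596161,
             50.82533701, 14.77490639, 28.41560987]

errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average size of the error, in score points
mae = sum(abs(e) for e in errors) / len(errors)
# RMSE: like MAE, but squaring penalizes large errors more heavily
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

For this run the predictions are off by roughly 5 points on average (MAE about 4.9, RMSE about 5.7). Both are in the same units as the scores, which makes them easier to interpret than R-squared.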
# Extract the parameters of the fitted line
a0 = model.intercept_ # intercept
a1 = model.coef_[0] # x-coefficient (slope)
print(f"The intercept a0: {a0}")
print(f"The x-coeff a1: {a1}")
The intercept a0: 4.057210803555726
The x-coeff a1: 9.743359626743649
# Let's round them to four decimal places
a0 = round(a0, 4) # intercept
a1 = round(a1, 4) # x-coefficient (slope)
print(f"The intercept (a0): {a0}")
print(f"The x-coeff (a1) : {a1}")
The intercept (a0): 4.0572
The x-coeff (a1) : 9.7434
These values can be plugged directly into the line equation from before, with a0 as the intercept (b) and a1 as the slope (m): $$ y = a_0 + a_1 x $$
score = a0 + a1 * hours
# Let's plot the model line over the data
import numpy as np
import matplotlib.pyplot as plt
x = df['Hours']
y = df['Scores']
# Create the scatter plot
plt.scatter(x, y, label="Data Points", color="blue")
# Generate x values for the line
x_line = np.linspace(x.min(), x.max(), 100) # Create a range of x values
y_line = a0 + a1 * x_line # Compute corresponding y values using the equation
# Plot the regression line
plt.plot(x_line, y_line, color="green", label=f"y = {a0} + {a1} * x")
# Add labels and title
plt.xlabel("Hours Studied")
plt.ylabel("Scores Percentage")
plt.title("Scatterplot of Hours and Scores with Regression Line")
plt.legend() # Show legend
plt.show() # Display the plot
Let's calculate the score manually and also with the predict method
# A simple calculation of the score from the equation:
# score = a0 + a1 * hours
a0 = model.intercept_ # intercept
a1 = model.coef_[0] # x-coefficient (slope)
hours = 9.5
score = a0 + a1 * hours
print(score)
96.61912725762039
new_data = pd.DataFrame([9.5], columns=X_train.columns)
score = model.predict(new_data)
print(score)
[96.61912726]
Both are the same
The predict method returns exactly what the line equation gives: the intercept plus the slope times the number of hours.
Predict values for a larger dataset:
# Suppose someone gives you a batch of new inputs to score (three values here;
# the same call works for a much larger dataset)
new_data = pd.DataFrame([8.3, 2.5, 5.9], columns=X_train.columns) # hours
# Let's make predictions on the new data
y_pred = model.predict(new_data)
print(y_pred)
[84.92709571 28.41560987 61.5430326 ]
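Because the model is just a straight line, the same batch predictions can be reproduced without the model object. This sketch hard-codes the intercept and slope printed earlier for this particular run and applies the equation to all inputs at once with NumPy.

```python
import numpy as np

# Intercept and slope copied from the model output earlier in this run
a0 = 4.057210803555726
a1 = 9.743359626743649

hours = np.array([8.3, 2.5, 5.9])

# NumPy applies the line equation element-wise to the whole array
manual_pred = a0 + a1 * hours
print(manual_pred)
```

The result agrees with the output of `model.predict` above, which is a quick sanity check that the fitted model is nothing more than these two numbers.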
# Now let's save the trained model to disk
import joblib
joblib.dump(model, 'linear_regression_model.pkl')
['linear_regression_model.pkl']
Load the model later
# Let's load the model
import joblib
loaded_model = joblib.load('linear_regression_model.pkl')
# make prediction
new_data = pd.DataFrame([9.5], columns=X_train.columns) # hours
score = loaded_model.predict(new_data)
print(score)
[96.61912726]
