This notebook shows how logistic regression models a binary outcome, using stock data as the example. The features include RSI, moving averages, volume change, and lagged returns. We will train a logistic regression model to predict next-day stock movement.
Logistic Regression for Stock Movement Prediction
Teaching Notebook: Concept to Code
Learning Objectives
By the end of this notebook, students will be able to:
- Understand how logistic regression models a binary outcome
- Compute financial technical indicators (RSI, Moving Averages)
- Train a logistic regression model to predict next-day stock movement
- Evaluate the model and interpret its outputs
Section 1: The Big Idea
Question: Can we predict whether a stock will go up or down tomorrow?
This is a binary classification problem:
- Output = 1 → Stock goes UP
- Output = 0 → Stock goes DOWN
The Sigmoid Function
Logistic regression uses the sigmoid function to squish any number into a probability between 0 and 1:
$$P(y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta x)}}$$
- If the output > 0.5 → predict UP (1)
- If the output < 0.5 → predict DOWN (0)
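As a quick numeric check of the formula above, the following sketch evaluates the sigmoid at a few scores z = β₀ + βx using only the standard library. The helper name `sigmoid` and the sample z values are illustrative assumptions, not outputs of the notebook's model.

```python
import math

def sigmoid(z):
    """Map any real number z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative linear scores z = beta_0 + beta * x (made-up values)
for z in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    p = sigmoid(z)
    label = 'UP (1)' if p > 0.5 else 'DOWN (0)'
    print(f'z = {z:+.1f}  ->  P(y=1) = {p:.3f}  ->  predict {label}')
```

Negative scores map below 0.5 and positive scores above it, so the sign of z decides the predicted class. A score of exactly 0 gives 0.5; this sketch assigns it to DOWN because the comparison is strict.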
Features We Will Use
| Feature | What it measures |
|---|---|
| RSI | Relative Strength Index: is the stock overbought or oversold? |
| MA5 | 5-day moving average: short-term trend |
| MA20 | 20-day moving average: medium-term trend |
Section 2: Install & Import Libraries
# Run this cell first to install required libraries
!pip install yfinance scikit-learn matplotlib seaborn pandas numpy --quiet
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
classification_report, confusion_matrix,
ConfusionMatrixDisplay, roc_curve, roc_auc_score
)
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
# Plot style
plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
print('All libraries imported successfully!')
All libraries imported successfully!
Section 3: Load Real Stock Data
We use Apple (AAPL) stock from 2022 to 2024: 3 years of daily prices, ~750 trading days.
# Download AAPL data
ticker = 'AAPL'
df = yf.download(ticker, start='2022-01-01', end='2024-12-31', auto_adjust=True)
# Flatten multi-level columns if present
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)
print(f'Shape: {df.shape}')
print(f'Date range: {df.index[0].date()} → {df.index[-1].date()}')
df.head()
Shape: (752, 5)
Date range: 2022-01-03 → 2024-12-30
| Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| 2022-01-03 | 177.939743 | 178.790298 | 173.735915 | 173.853227 | 104487900 |
| 2022-01-04 | 175.681351 | 178.848900 | 175.114320 | 178.545835 | 99310400 |
| 2022-01-05 | 171.008270 | 176.140865 | 170.734533 | 175.593390 | 94537600 |
| 2022-01-06 | 168.153595 | 171.379801 | 167.801645 | 168.837938 | 96904000 |
| 2022-01-07 | 168.319778 | 170.245725 | 167.205273 | 169.023678 | 86709100 |
# Visualise the closing price
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df.index, df['Close'], color='steelblue', linewidth=1.5)
ax.set_title(f'{ticker} Closing Price (2022–2024)', fontsize=14, fontweight='bold')
ax.set_ylabel('Price (USD)')
ax.set_xlabel('')
plt.tight_layout()
plt.show()
df.head(10)
| Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| 2022-01-03 | 177.939743 | 178.790298 | 173.735915 | 173.853227 | 104487900 |
| 2022-01-04 | 175.681351 | 178.848900 | 175.114320 | 178.545835 | 99310400 |
| 2022-01-05 | 171.008270 | 176.140865 | 170.734533 | 175.593390 | 94537600 |
| 2022-01-06 | 168.153595 | 171.379801 | 167.801645 | 168.837938 | 96904000 |
| 2022-01-07 | 168.319778 | 170.245725 | 167.205273 | 169.023678 | 86709100 |
| 2022-01-10 | 168.339310 | 168.642375 | 164.409205 | 165.298858 | 106765600 |
| 2022-01-11 | 171.164673 | 171.262428 | 166.999945 | 168.466400 | 76138300 |
| 2022-01-12 | 171.604630 | 173.217725 | 170.910516 | 172.181432 | 74805200 |
| 2022-01-13 | 168.339310 | 172.670234 | 167.948246 | 171.849023 | 84505800 |
| 2022-01-14 | 169.199646 | 169.893760 | 167.263914 | 167.508323 | 80440800 |
Section 4: Feature Engineering
We compute three technical indicators from the raw price data.
4a: Moving Averages (MA5 & MA20)
A moving average smooths out daily noise: $$MA_n = \frac{\text{Close}_{t} + \text{Close}_{t-1} + \ldots + \text{Close}_{t-n+1}}{n}$$
- MA5 captures short-term momentum
- MA20 captures the medium-term trend
- When MA5 > MA20 → often a bullish signal
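To make the formula concrete, here is a minimal pure-Python sketch of a simple moving average. The helper name `moving_average` is hypothetical, and the inputs are the first six AAPL closes from the table above, rounded to two decimals.

```python
def moving_average(values, n):
    """Simple n-period moving average; None until n values are available."""
    out = []
    for t in range(len(values)):
        if t + 1 < n:
            out.append(None)                      # not enough history yet
        else:
            window = values[t - n + 1 : t + 1]    # the n most recent values
            out.append(sum(window) / n)
    return out

# First six AAPL closes (2022-01-03 to 2022-01-10), rounded to two decimals
closes = [177.94, 175.68, 171.01, 168.15, 168.32, 168.34]
ma5 = moving_average(closes, 5)
print(ma5)
```

The first four entries are empty because a 5-day window needs five prices. The fifth and sixth values come out at roughly 172.22 and 170.30, which agrees with the MA5 column pandas produces for 2022-01-07 and 2022-01-10 up to rounding.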
df['MA5'] = df['Close'].rolling(window=5).mean()
df['MA20'] = df['Close'].rolling(window=20).mean()
# Plot
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df.index, df['Close'], label='Close', color='lightgray', linewidth=1)
ax.plot(df.index, df['MA5'], label='MA5', color='steelblue', linewidth=1.5)
ax.plot(df.index, df['MA20'], label='MA20', color='tomato', linewidth=1.5)
ax.set_title('Closing Price with Moving Averages', fontsize=13, fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()
# Two new columns: MA5 and MA20
df.head(25)
| Date | Close | High | Low | Open | Volume | MA5 | MA20 |
|---|---|---|---|---|---|---|---|
| 2022-01-03 | 177.939743 | 178.790298 | 173.735915 | 173.853227 | 104487900 | NaN | NaN |
| 2022-01-04 | 175.681351 | 178.848900 | 175.114320 | 178.545835 | 99310400 | NaN | NaN |
| 2022-01-05 | 171.008270 | 176.140865 | 170.734533 | 175.593390 | 94537600 | NaN | NaN |
| 2022-01-06 | 168.153595 | 171.379801 | 167.801645 | 168.837938 | 96904000 | NaN | NaN |
| 2022-01-07 | 168.319778 | 170.245725 | 167.205273 | 169.023678 | 86709100 | 172.220547 | NaN |
| 2022-01-10 | 168.339310 | 168.642375 | 164.409205 | 165.298858 | 106765600 | 170.300461 | NaN |
| 2022-01-11 | 171.164673 | 171.262428 | 166.999945 | 168.466400 | 76138300 | 169.397125 | NaN |
| 2022-01-12 | 171.604630 | 173.217725 | 170.910516 | 172.181432 | 74805200 | 169.516397 | NaN |
| 2022-01-13 | 168.339310 | 172.670234 | 167.948246 | 171.849023 | 84505800 | 169.553540 | NaN |
| 2022-01-14 | 169.199646 | 169.893760 | 167.263914 | 167.508323 | 80440800 | 169.729514 | NaN |
| 2022-01-18 | 166.002762 | 168.681478 | 165.621484 | 167.674513 | 90956700 | 169.262204 | NaN |
| 2022-01-19 | 162.512604 | 167.254149 | 162.229096 | 166.198300 | 94815000 | 167.531790 | NaN |
| 2022-01-20 | 160.831055 | 165.885436 | 160.508433 | 163.245819 | 91420500 | 165.377075 | NaN |
| 2022-01-21 | 158.778015 | 162.610350 | 158.670474 | 160.743060 | 122848900 | 163.464816 | NaN |
| 2022-01-24 | 158.005676 | 158.670477 | 151.240430 | 156.441466 | 162294600 | 161.226022 | NaN |
| 2022-01-25 | 156.206879 | 159.120233 | 153.508605 | 155.424766 | 115798400 | 159.266846 | NaN |
| 2022-01-26 | 156.118851 | 160.713742 | 154.290674 | 159.843645 | 108275300 | 157.988095 | NaN |
| 2022-01-27 | 155.659348 | 160.176025 | 154.740366 | 158.817111 | 121954600 | 156.953754 | NaN |
| 2022-01-28 | 166.520905 | 166.540461 | 159.159299 | 162.004227 | 179935700 | 158.502332 | NaN |
| 2022-01-31 | 170.871414 | 171.086496 | 165.719262 | 166.354735 | 115541600 | 161.075479 | 166.062891 |
| 2022-02-01 | 170.705200 | 170.930053 | 168.456632 | 170.118612 | 86213900 | 163.975143 | 165.701163 |
| 2022-02-02 | 171.907700 | 171.946813 | 169.453836 | 170.842079 | 84914300 | 167.132913 | 165.512481 |
| 2022-02-03 | 169.033417 | 172.298735 | 168.270861 | 170.578085 | 89418100 | 169.807727 | 165.413738 |
| 2022-02-04 | 168.749619 | 170.423515 | 167.075722 | 168.054605 | 82465400 | 170.253470 | 165.443539 |
| 2022-02-07 | 168.034988 | 170.276623 | 167.339975 | 169.209645 | 77251200 | 169.686185 | 165.429300 |
4b: RSI (Relative Strength Index)
RSI measures the speed and magnitude of recent price changes:
$$RSI = 100 - \frac{100}{1 + RS} \quad \text{where} \quad RS = \frac{\text{Avg Gain}}{\text{Avg Loss}}$$
- RSI > 70 → Overbought (stock may fall)
- RSI < 30 → Oversold (stock may rise)
- RSI ranges from 0 to 100
Average Gain & Average Loss
For a 14-day RSI, on each day you first compute the daily change:
$$\delta_t = \text{Close}_t - \text{Close}_{t-1}$$
Then split it into gain or loss:
$$\text{Gain}_t = \begin{cases} \delta_t & \text{if } \delta_t > 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{Loss}_t = \begin{cases} |\delta_t| & \text{if } \delta_t < 0 \\ 0 & \text{otherwise} \end{cases}$$
Then average over 14 days:
$$\text{Avg Gain} = \frac{\text{Gain}_1 + \text{Gain}_2 + \ldots + \text{Gain}_{14}}{14} \qquad \text{Avg Loss} = \frac{\text{Loss}_1 + \text{Loss}_2 + \ldots + \text{Loss}_{14}}{14}$$
Intuition for Students
| Scenario | Avg Gain | Avg Loss | RS | RSI |
|---|---|---|---|---|
| Strong uptrend | High | Low | Large | Close to 100 |
| Strong downtrend | Low | High | Small | Close to 0 |
| Sideways market | Equal | Equal | ~1 | ~50 |
The key insight: RSI measures momentum rather than price level, that is, how aggressively buyers are winning versus sellers over the last 14 days.
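The steps above can be traced by hand on the first 15 AAPL closes from the earlier table (rounded to two decimals), which give exactly 14 daily changes. This is a sketch using simple averages, as in the formulas above; the function name `rsi_from_closes` and the zero-loss convention are assumptions made for illustration.

```python
def rsi_from_closes(closes):
    """RSI over the len(closes) - 1 daily changes, using simple averages."""
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    gains = [d if d > 0 else 0.0 for d in deltas]     # Gain_t
    losses = [-d if d < 0 else 0.0 for d in deltas]   # Loss_t, as positive numbers
    avg_gain = sum(gains) / len(deltas)
    avg_loss = sum(losses) / len(deltas)
    if avg_loss == 0:
        return 100.0   # convention in this sketch: no losses in the window
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# AAPL closes from 2022-01-03 to 2022-01-24 (15 trading days), rounded
closes = [177.94, 175.68, 171.01, 168.15, 168.32, 168.34, 171.16, 171.60,
          168.34, 169.20, 166.00, 162.51, 160.83, 158.78, 158.01]
rsi = rsi_from_closes(closes)
print(f'RSI = {rsi:.2f}')
```

Losses dominate this window (about 24.24 in total against 4.31 of gains), so RS is small and the RSI lands near 15, deep in oversold territory.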
# Now we build RSI
def compute_rsi(series, period=14):
    delta = series.diff()                          # day-over-day price change
    gain = delta.clip(lower=0)                     # keep gains, zero out losses
    loss = -delta.clip(upper=0)                    # keep losses as positive numbers
    avg_gain = gain.rolling(window=period).mean()  # simple average over `period` days
    avg_loss = loss.rolling(window=period).mean()
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi
df['RSI'] = compute_rsi(df['Close'])
# Plot
fig, ax = plt.subplots(figsize=(12, 3))
ax.plot(df.index, df['RSI'], color='darkorange', linewidth=1.2)
ax.axhline(70, color='red', linestyle='--', linewidth=0.8, label='Overbought (70)')
ax.axhline(30, color='green', linestyle='--', linewidth=0.8, label='Oversold (30)')
ax.set_title('RSI (14-day)', fontsize=13, fontweight='bold')
ax.set_ylim(0, 100)
ax.legend()
plt.tight_layout()
plt.show()
# The DataFrame now has a new RSI column
df.head(20)
| Date | Close | High | Low | Open | Volume | MA5 | MA20 | RSI |
|---|---|---|---|---|---|---|---|---|
| 2022-01-03 | 177.939743 | 178.790298 | 173.735915 | 173.853227 | 104487900 | NaN | NaN | NaN |
| 2022-01-04 | 175.681351 | 178.848900 | 175.114320 | 178.545835 | 99310400 | NaN | NaN | NaN |
| 2022-01-05 | 171.008270 | 176.140865 | 170.734533 | 175.593390 | 94537600 | NaN | NaN | NaN |
| 2022-01-06 | 168.153595 | 171.379801 | 167.801645 | 168.837938 | 96904000 | NaN | NaN | NaN |
| 2022-01-07 | 168.319778 | 170.245725 | 167.205273 | 169.023678 | 86709100 | 172.220547 | NaN | NaN |
| 2022-01-10 | 168.339310 | 168.642375 | 164.409205 | 165.298858 | 106765600 | 170.300461 | NaN | NaN |
| 2022-01-11 | 171.164673 | 171.262428 | 166.999945 | 168.466400 | 76138300 | 169.397125 | NaN | NaN |
| 2022-01-12 | 171.604630 | 173.217725 | 170.910516 | 172.181432 | 74805200 | 169.516397 | NaN | NaN |
| 2022-01-13 | 168.339310 | 172.670234 | 167.948246 | 171.849023 | 84505800 | 169.553540 | NaN | NaN |
| 2022-01-14 | 169.199646 | 169.893760 | 167.263914 | 167.508323 | 80440800 | 169.729514 | NaN | NaN |
| 2022-01-18 | 166.002762 | 168.681478 | 165.621484 | 167.674513 | 90956700 | 169.262204 | NaN | NaN |
| 2022-01-19 | 162.512604 | 167.254149 | 162.229096 | 166.198300 | 94815000 | 167.531790 | NaN | NaN |
| 2022-01-20 | 160.831055 | 165.885436 | 160.508433 | 163.245819 | 91420500 | 165.377075 | NaN | NaN |
| 2022-01-21 | 158.778015 | 162.610350 | 158.670474 | 160.743060 | 122848900 | 163.464816 | NaN | NaN |
| 2022-01-24 | 158.005676 | 158.670477 | 151.240430 | 156.441466 | 162294600 | 161.226022 | NaN | 15.097523 |
| 2022-01-25 | 156.206879 | 159.120233 | 153.508605 | 155.424766 | 115798400 | 159.266846 | NaN | 15.344478 |
| 2022-01-26 | 156.118851 | 160.713742 | 154.290674 | 159.843645 | 108275300 | 157.988095 | NaN | 18.336770 |
| 2022-01-27 | 155.659348 | 160.176025 | 154.740366 | 158.817111 | 121954600 | 156.953754 | NaN | 20.416598 |
| 2022-01-28 | 166.520905 | 166.540461 | 159.159299 | 162.004227 | 179935700 | 158.502332 | NaN | 47.172681 |
| 2022-01-31 | 170.871414 | 171.086496 | 165.719262 | 166.354735 | 115541600 | 161.075479 | 166.062891 | 53.502864 |
4c: Create new features: Return, Volume_change, Return_lag1
# Daily return
df['Return'] = df['Close'].pct_change()
# Add more features (volume, momentum, lag returns)
df['Volume_change'] = df['Volume'].pct_change()
df['Return_lag1'] = df['Return'].shift(1)  # yesterday's return: known before today's market opens
df.head(10)
| Date | Close | High | Low | Open | Volume | MA5 | MA20 | RSI | Return | Volume_change | Return_lag1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022-01-03 | 177.939743 | 178.790298 | 173.735915 | 173.853227 | 104487900 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2022-01-04 | 175.681351 | 178.848900 | 175.114320 | 178.545835 | 99310400 | NaN | NaN | NaN | -0.012692 | -0.049551 | NaN |
| 2022-01-05 | 171.008270 | 176.140865 | 170.734533 | 175.593390 | 94537600 | NaN | NaN | NaN | -0.026600 | -0.048059 | -0.012692 |
| 2022-01-06 | 168.153595 | 171.379801 | 167.801645 | 168.837938 | 96904000 | NaN | NaN | NaN | -0.016693 | 0.025031 | -0.026600 |
| 2022-01-07 | 168.319778 | 170.245725 | 167.205273 | 169.023678 | 86709100 | 172.220547 | NaN | NaN | 0.000988 | -0.105206 | -0.016693 |
| 2022-01-10 | 168.339310 | 168.642375 | 164.409205 | 165.298858 | 106765600 | 170.300461 | NaN | NaN | 0.000116 | 0.231308 | 0.000988 |
| 2022-01-11 | 171.164673 | 171.262428 | 166.999945 | 168.466400 | 76138300 | 169.397125 | NaN | NaN | 0.016784 | -0.286865 | 0.000116 |
| 2022-01-12 | 171.604630 | 173.217725 | 170.910516 | 172.181432 | 74805200 | 169.516397 | NaN | NaN | 0.002570 | -0.017509 | 0.016784 |
| 2022-01-13 | 168.339310 | 172.670234 | 167.948246 | 171.849023 | 84505800 | 169.553540 | NaN | NaN | -0.019028 | 0.129678 | 0.002570 |
| 2022-01-14 | 169.199646 | 169.893760 | 167.263914 | 167.508323 | 80440800 | 169.729514 | NaN | NaN | 0.005111 | -0.048103 | -0.019028 |
Section 5: Create the Target Variable
We want to predict next-day movement:
$$\text{Target} = \begin{cases} 1 & \text{if } \text{Close}_{t+1} > \text{Close}_t \\ 0 & \text{otherwise} \end{cases}$$
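A small pure-Python sketch of this labelling rule, using six consecutive AAPL closes from the earlier table (2022-01-06 to 2022-01-13, rounded). The variable names are illustrative.

```python
# Six consecutive closes, oldest first
closes = [168.15, 168.32, 168.34, 171.16, 171.60, 168.34]

# Label each day by comparing its close with the NEXT day's close
targets = [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]
print(targets)
```

Six prices yield only five labels: the final day has no next-day close to compare against, so it cannot be labelled from this data alone.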
# Target: will NEXT day's price be higher? (shift -1 to look forward)
# Note: the last row has no next-day close, so the comparison is False and it is labelled 0
df['Target'] = (df['Close'].shift(-1) > df['Close']).astype(int)
df.head(20)
| Date | Close | High | Low | Open | Volume | MA5 | MA20 | RSI | Return | Volume_change | Return_lag1 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022-01-03 | 177.939743 | 178.790298 | 173.735915 | 173.853227 | 104487900 | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 2022-01-04 | 175.681351 | 178.848900 | 175.114320 | 178.545835 | 99310400 | NaN | NaN | NaN | -0.012692 | -0.049551 | NaN | 0 |
| 2022-01-05 | 171.008270 | 176.140865 | 170.734533 | 175.593390 | 94537600 | NaN | NaN | NaN | -0.026600 | -0.048059 | -0.012692 | 0 |
| 2022-01-06 | 168.153595 | 171.379801 | 167.801645 | 168.837938 | 96904000 | NaN | NaN | NaN | -0.016693 | 0.025031 | -0.026600 | 1 |
| 2022-01-07 | 168.319778 | 170.245725 | 167.205273 | 169.023678 | 86709100 | 172.220547 | NaN | NaN | 0.000988 | -0.105206 | -0.016693 | 1 |
| 2022-01-10 | 168.339310 | 168.642375 | 164.409205 | 165.298858 | 106765600 | 170.300461 | NaN | NaN | 0.000116 | 0.231308 | 0.000988 | 1 |
| 2022-01-11 | 171.164673 | 171.262428 | 166.999945 | 168.466400 | 76138300 | 169.397125 | NaN | NaN | 0.016784 | -0.286865 | 0.000116 | 1 |
| 2022-01-12 | 171.604630 | 173.217725 | 170.910516 | 172.181432 | 74805200 | 169.516397 | NaN | NaN | 0.002570 | -0.017509 | 0.016784 | 0 |
| 2022-01-13 | 168.339310 | 172.670234 | 167.948246 | 171.849023 | 84505800 | 169.553540 | NaN | NaN | -0.019028 | 0.129678 | 0.002570 | 1 |
| 2022-01-14 | 169.199646 | 169.893760 | 167.263914 | 167.508323 | 80440800 | 169.729514 | NaN | NaN | 0.005111 | -0.048103 | -0.019028 | 0 |
| 2022-01-18 | 166.002762 | 168.681478 | 165.621484 | 167.674513 | 90956700 | 169.262204 | NaN | NaN | -0.018894 | 0.130728 | 0.005111 | 0 |
| 2022-01-19 | 162.512604 | 167.254149 | 162.229096 | 166.198300 | 94815000 | 167.531790 | NaN | NaN | -0.021025 | 0.042419 | -0.018894 | 0 |
| 2022-01-20 | 160.831055 | 165.885436 | 160.508433 | 163.245819 | 91420500 | 165.377075 | NaN | NaN | -0.010347 | -0.035801 | -0.021025 | 0 |
| 2022-01-21 | 158.778015 | 162.610350 | 158.670474 | 160.743060 | 122848900 | 163.464816 | NaN | NaN | -0.012765 | 0.343778 | -0.010347 | 0 |
| 2022-01-24 | 158.005676 | 158.670477 | 151.240430 | 156.441466 | 162294600 | 161.226022 | NaN | 15.097523 | -0.004864 | 0.321091 | -0.012765 | 0 |
| 2022-01-25 | 156.206879 | 159.120233 | 153.508605 | 155.424766 | 115798400 | 159.266846 | NaN | 15.344478 | -0.011384 | -0.286493 | -0.004864 | 0 |
| 2022-01-26 | 156.118851 | 160.713742 | 154.290674 | 159.843645 | 108275300 | 157.988095 | NaN | 18.336770 | -0.000564 | -0.064967 | -0.011384 | 0 |
| 2022-01-27 | 155.659348 | 160.176025 | 154.740366 | 158.817111 | 121954600 | 156.953754 | NaN | 20.416598 | -0.002943 | 0.126338 | -0.000564 | 1 |
| 2022-01-28 | 166.520905 | 166.540461 | 159.159299 | 162.004227 | 179935700 | 158.502332 | NaN | 47.172681 | 0.069778 | 0.475432 | -0.002943 | 1 |
| 2022-01-31 | 170.871414 | 171.086496 | 165.719262 | 166.354735 | 115541600 | 161.075479 | 166.062891 | 53.502864 | 0.026126 | -0.357873 | 0.069778 | 0 |
# Drop rows with NaN (from rolling windows)
df_clean = df[['RSI', 'MA5', 'MA20', 'Return', 'Volume_change', 'Return_lag1', 'Target']].dropna()
print(f'Dataset size after cleaning: {df_clean.shape[0]} rows')
print(f"\nTarget distribution:")
counts = df_clean['Target'].value_counts()
print(f" UP (1): {counts[1]} days ({counts[1]/len(df_clean)*100:.1f}%)")
print(f" DOWN (0): {counts[0]} days ({counts[0]/len(df_clean)*100:.1f}%)")
Dataset size after cleaning: 733 rows

Target distribution:
 UP (1): 392 days (53.5%)
 DOWN (0): 341 days (46.5%)
df_clean.head(5)
| Date | RSI | MA5 | MA20 | Return | Volume_change | Return_lag1 | Target |
|---|---|---|---|---|---|---|---|
| 2022-01-31 | 53.502864 | 161.075479 | 166.062891 | 0.026126 | -0.357873 | 0.069778 | 0 |
| 2022-02-01 | 49.313896 | 163.975143 | 165.701163 | -0.000973 | -0.253828 | 0.026126 | 1 |
| 2022-02-02 | 50.442480 | 167.132913 | 165.512481 | 0.007044 | -0.015074 | -0.000973 | 0 |
| 2022-02-03 | 51.025096 | 169.807727 | 165.413738 | -0.016720 | 0.053039 | 0.007044 | 0 |
| 2022-02-04 | 49.323860 | 170.253470 | 165.443539 | -0.001679 | -0.077755 | -0.016720 | 0 |
# Quick look at descriptive stats for a few of our features
df_clean[['RSI', 'MA5', 'MA20', 'Return']].describe().round(3)
| Statistic | RSI | MA5 | MA20 | Return |
|---|---|---|---|---|
| count | 733.000 | 733.000 | 733.000 | 733.000 |
| mean | 53.814 | 175.772 | 174.888 | 0.001 |
| std | 17.912 | 29.292 | 27.979 | 0.017 |
| min | 7.865 | 125.075 | 128.759 | -0.059 |
| 25% | 39.833 | 152.833 | 152.728 | -0.008 |
| 50% | 54.475 | 170.886 | 171.440 | 0.001 |
| 75% | 67.678 | 189.131 | 187.467 | 0.010 |
| max | 96.163 | 254.886 | 247.686 | 0.089 |
Section 6: Train the Logistic Regression Model
Steps:
- Select features (X) and label (y)
- Split into train/test sets: never test on data you trained on!
- Scale features: scikit-learn's logistic regression is L2-regularised by default, so the fitted coefficients are sensitive to feature scale
- Fit the model
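The split and scaling steps can be sketched without any libraries to show what "time-ordered" and "fit on train only" mean in practice. The feature values below are made up, and the 80% cut via `int(len(values) * 0.8)` is a simplification of what `train_test_split` does.

```python
import statistics

# One made-up feature, oldest observation first
values = [50.0, 55.0, 60.0, 45.0, 40.0, 70.0, 65.0, 30.0, 35.0, 80.0]

# Time-ordered 80/20 split: the most recent 20% is held out, no shuffling
split = int(len(values) * 0.8)
train, test = values[:split], values[split:]

# Learn the scaling parameters from the training portion only
mean = statistics.fmean(train)
std = statistics.pstdev(train)    # population std (ddof=0), as StandardScaler uses

train_scaled = [(v - mean) / std for v in train]
test_scaled = [(v - mean) / std for v in test]    # reuse the train mean and std

print(f'train mean = {mean:.3f}, train std = {std:.3f}')
print('test (scaled):', [round(v, 3) for v in test_scaled])
```

Computing the mean and standard deviation on the full series would leak information about the test period into training, which is why the test values are transformed with the training statistics.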
# Step 1 β Features and label
X = df_clean[['RSI', 'MA5', 'MA20', 'Volume_change', 'Return_lag1']]
y = df_clean['Target']
# Step 2 β Train / test split (80% train, 20% test, time-ordered)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, shuffle=False # shuffle=False preserves time order
)
print(f'Training samples : {len(X_train)}')
print(f'Test samples : {len(X_test)}')
Training samples : 586
Test samples : 147
# Let's look at a few training records
X_train.iloc[0:3]
# X_test.iloc[0:3]
| Date | RSI | MA5 | MA20 | Volume_change | Return_lag1 |
|---|---|---|---|---|---|
| 2022-01-31 | 53.502864 | 161.075479 | 166.062891 | -0.357873 | 0.069778 |
| 2022-02-01 | 49.313896 | 163.975143 | 165.701163 | -0.253828 | 0.026126 |
| 2022-02-02 | 50.442480 | 167.132913 | 165.512481 | -0.015074 | -0.000973 |
# Step 3 β Standardise features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on train only!
X_test_scaled = scaler.transform(X_test) # apply same scale to test
# compare actual vs scaled values
print(list(X_train.iloc[0]))
print((X_train_scaled[0]))
[53.502864452503275, 161.07547912597656, 166.062890625, -0.35787284013122467, 0.06977773695505785]
[ 0.0706431 -0.15334537 0.15106034 -1.4165033 3.89624435]
# Step 4 β Train logistic regression
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train_scaled, y_train)
print('Model trained!')
print(f'\nTrain accuracy : {model.score(X_train_scaled, y_train):.3f}')
print(f'Test accuracy : {model.score(X_test_scaled, y_test):.3f}')
Model trained!

Train accuracy : 0.543
Test accuracy : 0.388
features = ['RSI', 'MA5', 'MA20', 'Volume_change', 'Return_lag1']
coefs = model.coef_[0]
coef_df = pd.DataFrame({'Feature': features, 'Coefficient': coefs})
coef_df['Direction'] = coef_df['Coefficient'].apply(lambda x: '↑ Bullish' if x > 0 else '↓ Bearish')
print(f'Intercept (β₀): {model.intercept_[0]:.4f}\n')
print(coef_df.to_string(index=False))
# Bar chart
colors = ['tomato' if c < 0 else 'steelblue' for c in coefs]
fig, ax = plt.subplots(figsize=(7, 3))
ax.barh(features, coefs, color=colors)
ax.axvline(0, color='black', linewidth=0.8)
ax.set_title('Logistic Regression Coefficients', fontsize=13, fontweight='bold')
ax.set_xlabel('Coefficient value')
plt.tight_layout()
plt.show()
Intercept (β₀): 0.0009

| Feature | Coefficient | Direction |
|---|---|---|
| RSI | 0.103775 | ↑ Bullish |
| MA5 | -0.406252 | ↓ Bearish |
| MA20 | 0.285262 | ↑ Bullish |
| Volume_change | -0.065997 | ↓ Bearish |
| Return_lag1 | -0.167302 | ↓ Bearish |
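One common way to read these numbers is as odds ratios: because the features were standardised, exp(β) is the factor by which the odds of UP are multiplied when that feature rises by one standard deviation, holding the others fixed. The sketch below applies this to the coefficients printed above; the dictionary names are illustrative.

```python
import math

# Coefficients copied from the output above (features were standardised)
coefs = {
    'RSI': 0.103775,
    'MA5': -0.406252,
    'MA20': 0.285262,
    'Volume_change': -0.065997,
    'Return_lag1': -0.167302,
}

# exp(beta) = multiplicative change in the odds of UP per +1 standard deviation
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f'{name:<14} beta = {coefs[name]:+.3f}   odds ratio = {ratio:.3f}')
```

A ratio above 1 pushes the prediction toward UP and a ratio below 1 toward DOWN. MA5 and MA20 track each other closely, so their opposite signs are probably best read together rather than one at a time.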
7b: Visualise the Sigmoid Function
This is the core of logistic regression: the S-shaped curve that maps any value to (0, 1).
z = np.linspace(-8, 8, 300)
sigmoid = 1 / (1 + np.exp(-z))
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(z, sigmoid,
        color='steelblue',
        linewidth=2.5,
        label='Sigmoid σ(z)'
)
ax.axhline(0.5, color='tomato', linestyle='--', linewidth=1, label='Decision boundary (0.5)')
ax.axvline(0, color='gray', linestyle='--', linewidth=0.8)
ax.fill_between(z, 0.5, sigmoid, where=(sigmoid > 0.5), alpha=0.1, color='green', label='Predict UP')
ax.fill_between(z, sigmoid, 0.5, where=(sigmoid < 0.5), alpha=0.1, color='red', label='Predict DOWN')
ax.set_title('The Sigmoid Function', fontsize=13, fontweight='bold')
ax.set_xlabel('z = β₀ + β·x')
ax.set_ylabel('P(y = 1 | x)')
ax.legend()
plt.tight_layout()
plt.show()
# Predicted probability of UP (class 1) for each test day
y_prob = model.predict_proba(X_test_scaled)[:, 1]
# Classification threshold: play around with this number in class
y_pred = (y_prob > 0.45).astype(int)
cm = confusion_matrix(y_test, y_pred)
print("confusion matrix:\n", cm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['DOWN (0)', 'UP (1)'])
fig, ax = plt.subplots(figsize=(5, 4))
disp.plot(ax=ax, colorbar=False, cmap='Blues')
ax.set_title('Confusion Matrix (Test Set)', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
confusion matrix:
[[52  4]
 [75 16]]
8b: Classification Report
| Metric | Meaning |
|---|---|
| Precision | Of all predicted UPs, how many were actually UP? |
| Recall | Of all actual UPs, how many did we catch? |
| F1-Score | Harmonic mean of precision and recall |
print(classification_report(y_test, y_pred, target_names=['DOWN (0)', 'UP (1)']))
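| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| DOWN (0) | 0.41 | 0.93 | 0.57 | 56 |
| UP (1) | 0.80 | 0.18 | 0.29 | 91 |
| accuracy | | | 0.46 | 147 |
| macro avg | 0.60 | 0.55 | 0.43 | 147 |
| weighted avg | 0.65 | 0.46 | 0.39 | 147 |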
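The UP-class figures in the report can be recomputed by hand from the confusion matrix printed earlier. This is a plain-Python sketch of the standard definitions; the variable names are illustrative.

```python
# Confusion matrix from the test set: rows = actual, columns = predicted
cm = [[52, 4],     # actual DOWN: 52 predicted DOWN, 4 predicted UP
      [75, 16]]    # actual UP:   75 predicted DOWN, 16 predicted UP

tn, fp = cm[0]
fn, tp = cm[1]

precision_up = tp / (tp + fp)    # of all predicted UPs, how many were UP?
recall_up = tp / (tp + fn)       # of all actual UPs, how many did we catch?
f1_up = 2 * precision_up * recall_up / (precision_up + recall_up)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'precision = {precision_up:.2f}, recall = {recall_up:.2f}, '
      f'F1 = {f1_up:.2f}, accuracy = {accuracy:.2f}')
```

The results round to 0.80, 0.18, 0.29 and 0.46, matching the UP row and the accuracy line of the report.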
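This is the classic tradeoff:
- When the model does predict UP, it's right 80% of the time (high precision), but it only catches 18% of actual UP days (low recall). That precision rests on just 20 UP predictions, so treat it as a noisy estimate.
- It's being very conservative about calling UP, which in finance can be reasonable: you'd rather miss opportunities than make bad trades. Note, though, that overall accuracy is 46%, below the 62% you would get by always predicting UP on this test set (91 of 147 days).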
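To see how the threshold drives this tradeoff, the sketch below scores a small made-up set of probabilities at two thresholds. The data and the helper `precision_recall` are illustrative assumptions, not results from the AAPL model.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall for the UP class at a given probability threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted probabilities of UP
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_prob = [0.90, 0.70, 0.58, 0.52, 0.47, 0.40, 0.30, 0.20]

for threshold in (0.45, 0.60):
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f'threshold = {threshold:.2f}: precision = {precision:.2f}, recall = {recall:.2f}')
```

On this toy data, raising the threshold from 0.45 to 0.60 lifts precision from 0.60 to 1.00 while recall drops from 0.75 to 0.50. Real data will not always move so cleanly, but the direction of the tradeoff is typical.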
8c: ROC Curve & AUC
The ROC curve shows the trade-off between true positives and false positives at different thresholds. AUC = 1.0 is perfect; AUC = 0.5 is random guessing.
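AUC also has a ranking interpretation: it equals the probability that a randomly chosen UP day receives a higher score than a randomly chosen DOWN day, with ties counted as half. The sketch below computes it that way on made-up data; `auc_by_pairs` is a hypothetical helper, not a scikit-learn function.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0    # positive correctly ranked above negative
            elif p == n:
                wins += 0.5    # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels and scores
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]
print(auc_by_pairs(y_true, y_score))
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75. A model whose scores carry no information would hover around 0.5 under the same calculation.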
y_prob = model.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
fig, ax = plt.subplots(figsize=(6, 5))
ax.plot(fpr, tpr, color='steelblue', linewidth=2, label=f'Logistic Regression (AUC = {auc:.3f})')
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random guess (AUC = 0.5)')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve', fontsize=13, fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()
Section 9: Make a Prediction (Live Example)
Let's manually predict whether the stock will go up tomorrow given hypothetical indicator values.
print(X_train.iloc[0]) # first row
print(y_train.iloc[0])  # positional access; the index holds dates, not integers
Price
RSI 53.502864
MA5 161.075479
MA20 166.062891
Volume_change -0.357873
Return_lag1 0.069778
Name: 2022-01-31 00:00:00, dtype: float64
0
# --- Change these values and re-run! ---
rsi_today = 53 # RSI between 0–100
ma5_today = 161 # 5-day average price
ma20_today = 166 # 20-day average price
volume_change_today = -0.35
return_lag1_today = 0.069
# ---------------------------------------
new_data = np.array([[rsi_today, ma5_today, ma20_today, volume_change_today, return_lag1_today]])
new_data_scaled = scaler.transform(new_data)
probability = model.predict_proba(new_data_scaled)[0][1]
prediction = model.predict(new_data_scaled)[0]
label = 'UP' if prediction == 1 else 'DOWN'
# print(f'Input → RSI={rsi_today}, MA5={ma5_today}, MA20={ma20_today}')
print(f'Input → {new_data}')
print(f'P(UP) → {probability:.3f}')
print(f'Prediction → {label}')
Input → [[ 5.30e+01 1.61e+02 1.66e+02 -3.50e-01 6.90e-02]]
P(UP) → 0.391
Prediction → DOWN
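The same probability can be approximated by hand from numbers already printed in this notebook: the intercept and coefficients from Section 7 and the scaled first training row from Section 6. This sketch assumes those rounded values and mirrors what `predict_proba` computes for a binary logistic regression, namely the sigmoid of the linear score.

```python
import math

# Values copied from earlier outputs in this notebook (rounded as printed)
intercept = 0.0009
coefs = [0.103775, -0.406252, 0.285262, -0.065997, -0.167302]
# Scaled first training row: RSI, MA5, MA20, Volume_change, Return_lag1
x_scaled = [0.0706431, -0.15334537, 0.15106034, -1.4165033, 3.89624435]

# Linear score z = beta_0 + sum(beta_i * x_i), then squash with the sigmoid
z = intercept + sum(b * x for b, x in zip(coefs, x_scaled))
p_up = 1.0 / (1.0 + math.exp(-z))

print(f'z = {z:.3f}, P(UP) = {p_up:.3f}')
```

The score is negative, driven mostly by the large positive lagged return meeting a negative coefficient, so P(UP) comes out at about 0.39. That is in line with the 0.391 printed above for nearly identical inputs.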
Section 10: Student Exercises
Try these on your own:
- Change the stock: Replace 'AAPL' with 'MSFT', 'TSLA', or 'GOOGL'. Does the model accuracy change?
- Add a new feature: Volume is often a predictor of price movement. Add df['Volume'] as a feature and retrain. Did AUC improve?
- Change the threshold: Instead of using 0.5 to classify UP/DOWN, try 0.55. Does precision improve? What happens to recall?
y_pred_custom = (y_prob > 0.55).astype(int)
- Try a different RSI period: Replace period=14 in compute_rsi() with period=7 (faster) or period=21 (slower). How does it affect the model?
- Interpret the coefficients: If RSI has a negative coefficient, what does that mean financially?
Key Takeaways & Caveats
- Logistic regression is interpretable and a great baseline model
- Stock markets are highly efficient: even a 55% accuracy is meaningful
- This model does not account for transaction costs, slippage, or regime changes
- Always test on out-of-sample (future) data; never on training data
- A model that works on AAPL 2022–2024 may not work in 2025
"All models are wrong, but some are useful." (George Box)
