Naive Bayes For Email Classification¶
In [8]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
1) Data Collection¶
In [2]:
df = pd.read_csv("data_spam.csv")
# df = pd.read_csv("https://raw.githubusercontent.com/ash322ash422/data/refs/heads/main/data_spam.csv")
df.head()
Out[2]:
| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only … |
| 1 | ham | Ok lar… Joking wif u oni… |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina… |
| 3 | ham | U dun say so early hor… U c already then say… |
| 4 | ham | Nah I don’t think he goes to usf, he lives aro… |
2) EDA¶
In [3]:
print(df.shape)
(5572, 2)
In [4]:
# How many messages are there in each category?
df.groupby('Category').describe()
Out[4]:
| Category | Message count | Message unique | Message top | Message freq |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I’ll call later | 30 |
| spam | 747 | 641 | Please call our customer service representativ… | 4 |
In [5]:
# Feature Engineering
# spam -> 1 and ham -> 0
df['spam']=df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()
Out[5]:
| | Category | Message | spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only … | 0 |
| 1 | ham | Ok lar… Joking wif u oni… | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina… | 1 |
| 3 | ham | U dun say so early hor… U c already then say… | 0 |
| 4 | ham | Nah I don’t think he goes to usf, he lives aro… | 0 |
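The same 0/1 encoding can also be produced without a lambda. The sketch below uses a small made-up frame (the `toy` rows are hypothetical, not taken from `data_spam.csv`) and compares the `Category` column against `'spam'` directly:

```python
import pandas as pd

# Hypothetical rows standing in for the notebook's df.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["Ok lar", "Free entry", "See you soon"],
})

# Boolean comparison, then cast to int: spam -> 1, ham -> 0.
toy["spam"] = (toy["Category"] == "spam").astype(int)
print(toy["spam"].tolist())
```

Both forms give the same column; the vectorized comparison simply avoids calling a Python function once per row.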
Separate features/target and split data¶
In [6]:
X = df['Message'] # features
y = df['spam'] # target
In [9]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [14]:
X_train.shape
Out[14]:
(4457,)
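The data is imbalanced (4825 ham vs. 747 spam), and the split above is random, so the spam share in the test set and the reported score can change from run to run. A minimal sketch of a reproducible, stratified split follows. The toy arrays and the `random_state=42` value are assumptions for illustration, not part of the original notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the notebook's imbalance:
# 87 ham (0) and 13 spam (1).
y_toy = np.array([0] * 87 + [1] * 13)
X_toy = np.arange(len(y_toy))  # placeholder features

# stratify keeps the spam share similar in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

In the notebook itself this would amount to passing `stratify=y` and a fixed `random_state` to the existing `train_test_split` call.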
Convert the text into vectors¶
corpus = [
'This document is the first document.', # doc1
'This document is the second document.', # doc2
'And this one is the third one.', # doc3
'Is this the first document?', # doc4
]
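Vocabulary learned from the corpus (unique lowercased words, sorted alphabetically):
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Vector representation of the above documents (each entry counts how often the vocabulary word at that position occurs in the document):
[0, 2, 1, 1, 0, 0, 1, 0, 1] --> doc1
[0, 2, 0, 1, 0, 1, 1, 0, 1] --> doc2
[1, 0, 0, 1, 2, 0, 1, 1, 1] --> doc3
[0, 1, 1, 1, 0, 0, 1, 0, 1] --> doc4
Here the number of documents is 4 and the vocabulary size is 9, so the count matrix has shape (4, 9).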
In [10]:
# Convert X_train into a matrix of word counts
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_vector = v.fit_transform(X_train.values)
In [11]:
# Let's look at the first 2 rows
X_train_vector.toarray()[:2]
Out[11]:
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
In [13]:
X_train_vector.shape # (Number of documents, size of vocabulary)
Out[13]:
(4457, 7817)
Naive Bayes classifiers are categorized based on the type of data they handle.¶
Bernoulli Naive Bayes: It is designed for binary (Boolean) features, where each feature is yes/no, true/false, or 0/1, for example whether a word appears in a message at all. This classifier is frequently employed in spam detection and sentiment analysis.
Multinomial Naive Bayes: It is designed for discrete count features, such as word frequencies in documents. It is commonly used in text classification and document categorization tasks.
Gaussian Naive Bayes: It is suited to continuous data and assumes that the features within each class follow a Gaussian (normal) distribution. This classifier is particularly useful for numerical data, such as measurements or sensor readings.
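To make the distinction concrete, here is a minimal sketch that fits all three variants on tiny made-up arrays. The feature values and column meanings are hypothetical and chosen only to match the kind of data each variant expects:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical word counts; columns: "free", "call", "meeting".
X_counts = np.array([
    [3, 1, 0],
    [2, 2, 0],
    [0, 0, 2],
    [0, 1, 3],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

# Multinomial: uses the counts themselves.
multinomial = MultinomialNB().fit(X_counts, y)

# Bernoulli: binarize=0.0 turns every count > 0 into 1 (word present or not).
bernoulli = BernoulliNB(binarize=0.0).fit(X_counts, y)

# Gaussian: continuous features (hypothetical measurements).
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
gaussian = GaussianNB().fit(X_cont, y)

new_counts = np.array([[2, 1, 0]])
pred_m = multinomial.predict(new_counts)
pred_b = bernoulli.predict(new_counts)
pred_g = gaussian.predict(np.array([[1.1, 2.0]]))
print(pred_m, pred_b, pred_g)
```

Since `CountVectorizer` produces word counts, `MultinomialNB` is the natural fit for this notebook.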
Modelling + Training¶
In [18]:
model = MultinomialNB()
model.fit(X_train_vector, y_train)
Out[18]:
MultinomialNB()
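What `fit` learns can be recomputed by hand on a toy count matrix. The sketch below assumes the default Laplace smoothing (`alpha=1.0`) and hypothetical data. It computes the log class priors and the smoothed per-class word log-probabilities, then compares them with the fitted model's `class_log_prior_` and `feature_log_prob_` attributes:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts (rows = messages, columns = vocabulary words).
X = np.array([
    [3, 1, 0],
    [2, 2, 0],
    [0, 0, 2],
    [0, 1, 3],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham
alpha = 1.0                 # Laplace smoothing

model = MultinomialNB(alpha=alpha).fit(X, y)

classes = np.unique(y)
# log P(class): fraction of training messages in each class.
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
# Total count of each word within each class, plus smoothing.
smoothed = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
# log P(word | class): smoothed counts normalised per class.
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score per class = log prior + sum over words of count * log P(word | class).
scores = X @ log_likelihood.T + log_prior
manual_pred = classes[scores.argmax(axis=1)]

print(np.allclose(log_prior, model.class_log_prior_))
print(np.allclose(log_likelihood, model.feature_log_prob_))
print(manual_pred, model.predict(X))
```

The "naive" part is the sum over words: each word contributes independently to the class score, given the class.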
In [19]:
# Test on 2 emails whose labels are known to us: 1 (spam) and 0 (ham)
emails = [
'Had your mobile 11 months or more? U R entitled to Update to the latest \
colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030',
"Nah I don't think he goes to usf, he lives around here though"
]
emails_vector = v.transform(emails)
model.predict(emails_vector) # 1 -> spam
Out[19]:
array([1, 0], dtype=int64)
In [20]:
# Let's check the accuracy on the test data
X_test_vector = v.transform(X_test)
model.score(X_test_vector, y_test)
Out[20]:
0.9928251121076234
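Accuracy alone can flatter a model on imbalanced data: with 4825 ham out of 5572 messages, always predicting ham would already score about 86.6%. The sketch below uses made-up labels (not the notebook's predictions) to show how a confusion matrix, precision, and recall expose missed spam that accuracy hides:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # one spam message is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
cm = confusion_matrix(y_true, y_pred)        # rows = true class, columns = predicted
precision = precision_score(y_true, y_pred)  # of predicted spam, how much is spam
recall = recall_score(y_true, y_pred)        # of actual spam, how much is caught

print(accuracy)
print(cm)
print(precision, recall)
```

In the notebook, the same functions could be applied to `y_test` and `model.predict(X_test_vector)`.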
STOP¶
OPTIONAL: Pipeline¶
In [10]:
from sklearn.pipeline import Pipeline
clf = Pipeline([
('vectorizer', CountVectorizer()),
('nb', MultinomialNB())
])
In [11]:
clf.fit(X_train, y_train)
Out[11]:
Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])
In [12]:
clf.score(X_test,y_test)
Out[12]:
0.9856424982053122
In [13]:
emails = [
'Had your mobile 11 months or more? U R entitled to Update to the latest \
colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030',
"Nah I don't think he goes to usf, he lives around here though"
]
clf.predict(emails)
Out[13]:
array([1, 0], dtype=int64)
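One practical benefit of the pipeline is that it can be passed straight to cross-validation, so the vectorizer is refit on the training part of each fold and never sees the held-out messages. A minimal sketch follows; the eight messages are made up for illustration, and `cv=2` is chosen only because the toy set is tiny:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical messages: 1 = spam, 0 = ham.
messages = np.array([
    "free prize call now",
    "win a free mobile today",
    "free entry to win cash",
    "call now to claim your prize",
    "are we meeting for lunch",
    "see you at home later",
    "ok i will call you later",
    "lunch at home today",
])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

toy_clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB()),
])

# For classifiers, an integer cv uses stratified folds.
scores = cross_val_score(toy_clf, messages, labels, cv=2)
print(scores, scores.mean())
```

With the real data, `cross_val_score(clf, X, y, cv=5)` would give a steadier estimate than a single random split.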
