Naive Bayes For Email Classification¶

In [8]:

import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

1) Data Collection¶

In [2]:

df = pd.read_csv("data_spam.csv")
# df = pd.read_csv("https://raw.githubusercontent.com/ash322ash422/data/refs/heads/main/data_spam.csv")


df.head()

Out[2]:

	Category	Message
0	ham	Go until jurong point, crazy.. Available only …
1	ham	Ok lar… Joking wif u oni…
2	spam	Free entry in 2 a wkly comp to win FA Cup fina…
3	ham	U dun say so early hor… U c already then say…
4	ham	Nah I don’t think he goes to usf, he lives aro…


2) EDA¶

In [3]:

print(df.shape)

(5572, 2)

In [4]:

# How many messages are there in each category?

df.groupby('Category').describe()

Out[4]:

	Message
	count	unique	top	freq
Category
ham	4825	4516	Sorry, I’ll call later	30
spam	747	641	Please call our customer service representativ…	4
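
The counts above show an imbalanced dataset: 747 of the 5572 messages (about 13%) are spam. A minimal sketch of reading the class proportions directly is shown here. `toy_df` is a hypothetical stand-in for `df`, built inline so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for df: 4 ham messages and 1 spam message.
toy_df = pd.DataFrame({
    "Category": ["ham", "ham", "spam", "ham", "ham"],
    "Message": ["ok", "see you soon", "win a free prize", "call me", "on my way"],
})

# Fraction of messages in each class (the fractions sum to 1.0).
class_share = toy_df["Category"].value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be `df["Category"].value_counts(normalize=True)`.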

In [5]:

# Feature Engineering

# spam -> 1 and ham -> 0
df['spam']=df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

Out[5]:

	Category	Message	spam
0	ham	Go until jurong point, crazy.. Available only …	0
1	ham	Ok lar… Joking wif u oni…	0
2	spam	Free entry in 2 a wkly comp to win FA Cup fina…	1
3	ham	U dun say so early hor… U c already then say…	0
4	ham	Nah I don’t think he goes to usf, he lives aro…	0


Separate Features/target and split data¶

In [6]:

X = df['Message'] # features
y = df['spam']   # target

In [9]:

# split data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [14]:

X_train.shape

Out[14]:

(4457,)
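
Because the split above is random, the exact scores change from run to run. One possible variation, sketched here on hypothetical toy data (`toy_X`, `toy_y`), passes `random_state` to make the split repeatable and `stratify` to keep the spam share the same in the train and test parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham (0) and 2 spam (1) messages.
toy_X = pd.Series([f"message {i}" for i in range(10)])
toy_y = pd.Series([0, 0, 0, 0, 1, 0, 0, 0, 0, 1])

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.5, random_state=42, stratify=toy_y
)

# Each half ends up with the same number of spam messages.
print(ytr.sum(), yte.sum())
```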


Convert the text into vectors¶

corpus = [
    'This document is the first document.',   # doc1
    'This document is the second document.',  # doc2
    'And this one is the third one.',         # doc3
    'Is this the first document?',            # doc4
]

Vocabulary (every distinct word in the corpus, sorted alphabetically):
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Vector representation of the above documents (each entry counts how often that word occurs):
[0,     2,          1,       1,    0,     0,        1,     0,       1]    --> doc1
[0,     2,          0,       1,    0,     1,        1,     0,       1]    --> doc2
[1,     0,          0,       1,    2,     0,        1,     1,       1]    --> doc3
[0,     1,          1,       1,    0,     0,        1,     0,       1]    --> doc4

Here the number of documents is 4 and the size of the vocabulary is 9.

In [10]:

# convert the X_train into vector

from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

X_train_vector = v.fit_transform(X_train.values)

In [11]:

# Let's look at the first 2 rows
X_train_vector.toarray()[:2]

Out[11]:

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [13]:

X_train_vector.shape # (Number of documents, size of vocabulary)

Out[13]:

(4457, 7817)
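
Each message uses only a handful of the vocabulary words, so almost every entry of this matrix is zero. `fit_transform` therefore returns a SciPy sparse matrix, and `.toarray()` expands it into a full dense array. A rough sketch of checking how sparse such a matrix is, using a hypothetical three-message corpus (`toy_messages`):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus standing in for X_train.
toy_messages = ["free entry to win", "ok see you later", "call me later"]
toy_matrix = CountVectorizer().fit_transform(toy_messages)

n_docs, n_words = toy_matrix.shape
# nnz is the number of stored non-zero entries in the sparse matrix.
density = toy_matrix.nnz / (n_docs * n_words)
print(toy_matrix.shape, toy_matrix.nnz, round(density, 3))

# Slice first, then densify, so only the rows of interest are expanded.
first_two = toy_matrix[:2].toarray()
print(first_two)
```

On the training data, `X_train_vector[:2].toarray()` gives the same rows as `X_train_vector.toarray()[:2]` without building the full dense array first.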


Naive Bayes classifiers are categorized based on the type of data they handle.¶

Bernoulli Naive Bayes: Designed for binary (Boolean) features, where each value is yes/no, true/false, or 0/1. For text, a feature records only whether a word occurs in a document. It is frequently used in spam detection and sentiment analysis.
Multinomial Naive Bayes: Designed for discrete count features, such as word frequencies in documents. It is commonly used in text classification and document categorization.
Gaussian Naive Bayes: Designed for continuous features, which it assumes follow a Gaussian (normal) distribution within each class. It is useful for numerical data such as measurements or sensor readings.
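
To make the distinction concrete, here is a small sketch that fits each variant on the kind of data it expects. All arrays below are made-up illustrative values, not taken from the spam dataset.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical word counts for 4 messages over a 3-word vocabulary.
counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0]])
labels = np.array([1, 1, 0, 0])

# Multinomial: models the counts themselves.
multinomial = MultinomialNB().fit(counts, labels)

# Bernoulli: models only presence/absence; binarize=0.0 maps any count > 0 to 1.
bernoulli = BernoulliNB(binarize=0.0).fit(counts, labels)

# Gaussian: models continuous features, e.g. hypothetical sensor readings.
readings = np.array([[5.1, 0.2], [4.9, 0.3], [1.0, 2.2], [1.2, 2.0]])
gaussian = GaussianNB().fit(readings, labels)

print(multinomial.predict([[4, 0, 0]]))
print(bernoulli.predict([[1, 0, 0]]))
print(gaussian.predict([[5.0, 0.25]]))
```

Since the features built in this notebook are word counts, `MultinomialNB` is the natural choice for the model below.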


Modelling + Training¶

In [18]:

model = MultinomialNB()

model.fit(X_train_vector, y_train)

Out[18]:

MultinomialNB()


In [19]:

# Test on 2 emails whose labels are already known to us: 1 (spam) and 0 (ham)
emails = [ 
    'Had your mobile 11 months or more? U R entitled to Update to the latest '
    'colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030',
    "Nah I don't think he goes to usf, he lives around here though"
]
emails_vector = v.transform(emails)

model.predict(emails_vector) # 1 -> spam

Out[19]:

array([1, 0], dtype=int64)
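
`predict` returns only the winning label. When the confidence of the model matters, `predict_proba` returns one probability per class, in the order given by `classes_`. A minimal sketch on a hypothetical four-message training set (`toy_messages`, `toy_labels`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: 1 = spam, 0 = ham.
toy_messages = [
    "win a free prize now",
    "free entry call now",
    "see you at lunch",
    "are you coming home",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# Reuse the fitted vectorizer so the new message maps onto the same vocabulary.
new_vector = toy_vectorizer.transform(["free prize call now"])
proba = toy_model.predict_proba(new_vector)

print(toy_model.classes_)  # column order of the probabilities
print(proba)               # one row per message, one column per class
```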

In [20]:

# Let's check the accuracy on the test data

X_test_vector = v.transform(X_test)

model.score(X_test_vector, y_test)

Out[20]:

0.9928251121076234
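
`score` reports plain accuracy. Because about 87% of the messages are ham, accuracy alone can hide missed spam, so it may be worth also looking at a confusion matrix and per-class precision and recall. The labels below are hypothetical, chosen to show a case where accuracy looks good while half of the spam is missed.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels: 8 ham (0) and 2 spam (1); one spam message is missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
```

On the real model the inputs would be `y_test` and `model.predict(X_test_vector)`.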

STOP¶


OPTIONAL: Pipeline¶

In [10]:

from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [11]:

clf.fit(X_train, y_train)

Out[11]:

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [12]:

clf.score(X_test,y_test)

Out[12]:

0.9856424982053122

In [13]:

emails = [ 
    'Had your mobile 11 months or more? U R entitled to Update to the latest '
    'colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030',
    "Nah I don't think he goes to usf, he lives around here though"
]

clf.predict(emails)

Out[13]:

array([1, 0], dtype=int64)
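
One practical benefit of wrapping both steps in a `Pipeline` is that it can be passed straight to cross-validation: the vectorizer is refit on the training part of every fold, so no vocabulary leaks in from the held-out messages. A rough sketch, assuming a tiny hypothetical dataset (`toy_messages`, `toy_labels`) and 2 folds:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical data: 4 spam (1) and 4 ham (0) messages.
toy_messages = [
    "win a free prize now", "free entry call now",
    "claim your free cash", "urgent call to win cash",
    "see you at lunch", "are you coming home",
    "ok talk to you later", "lunch at home today",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Each fold refits the whole pipeline (vectorizer + model) on its training part.
scores = cross_val_score(toy_clf, toy_messages, toy_labels, cv=2)
print(scores, scores.mean())
```

With the real data this would be `cross_val_score(clf, X, y, cv=5)`, which averages over several splits instead of relying on one.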

Naive Bayes Classifier (Part1)- e-mail Spam
