__author__ = "Chris Tran"
__email__ = "tranduckhanh96@gmail.com"
__website__ = "chriskhanhtran.github.io"
In order to predict whether a message is spam, I first vectorize the text messages into a form that machine learning algorithms can understand. Next, I train a machine learning model to discriminate between normal (ham) and spam messages. Finally, with the trained model, I classify unlabeled messages as ham or spam.
I have taken a great Machine Learning course by Jose Portilla on Udemy, and now I want to apply what I have learned so far in Natural Language Processing to analyze this SMS dataset.
The SMS Spam Collection Data Set is obtained from the UCI Machine Learning Repository. It is a set of SMS messages collected for SMS spam research, containing 5,574 messages in English, each tagged as ham (legitimate) or spam.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
sns.set_style('darkgrid')
sms = pd.read_csv('smsspamcollection/SMSSpamCollection', sep='\t',
names=["label", "message"])
sms.head()
Let's explore the data:
sms.describe()
The data has a total of 5572 messages.
sms.groupby('label').describe()
The target variable is either ham or spam. There are 4825 ham messages and 747 spam messages.
plt.figure(figsize=(8,4))
sns.countplot(x='label', data=sms)
plt.title('Count Plot')
Let's explore the length of messages:
sms['length'] = sms['message'].apply(len)
sms.head()
plt.figure(figsize=(8,4))
sns.distplot(sms['length'])
The data seems to have some outliers with more than 800 characters. I will use a box plot to discover these outliers.
plt.figure(figsize=(8,2))
sns.boxplot(sms['length'])
There seem to be 3 messages with about 600 characters, 1 with 800 characters and 1 with 900 characters. What are they?
sms[sms['length'] > 500]
for text in sms[sms['length'] > 550]['message']:
    print(text, "\n\n")
There are some interesting stories going on here, but let's go back to analyzing our data. How are ham and spam messages different in length?
g = sns.FacetGrid(data=sms, hue="label", height=4, aspect=2)
g.map(sns.distplot, 'length', bins=30)
g.set(xticks=np.arange(0,1000,50))
plt.legend()
The average length of ham messages is about 40 characters, while that of spam messages is about 160. That is a big difference, so length could be a good feature for classifying messages.
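To back this visual impression with numbers, one can compare summary statistics of message length per label; this is a quick check using the length column created above:
# Summary statistics of message length for ham vs. spam
sms.groupby('label')['length'].describe()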
Before vectorizing the messages, I will clean them to get the words I actually want: I remove punctuation and stop words (e.g., "the", "a", "to") and split each message into a list of words. This process is called tokenization, and I will use the NLTK library for this step.
import string
import nltk
from nltk.corpus import stopwords
# nltk.download_shell() #download stopwords
def text_preprocess(text):
    """
    1. Remove punctuation in the text
    2. Remove stop words in the text
    3. Return a list of words in the text
    """
    no_punctuation = "".join([c for c in text if c not in string.punctuation])
    return [word for word in no_punctuation.split() if word not in stopwords.words('english')]
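One caveat worth noting: stopwords.words('english') builds a fresh list on every call, so the comprehension above rescans that list for every word. For a dataset of this size it is fine, but a minimal variant that caches the stop words as a set (the name STOPWORDS is mine, not NLTK's) would be noticeably faster:
STOPWORDS = set(stopwords.words('english'))  # build the lookup set once

def text_preprocess_fast(text):
    """Same cleaning as text_preprocess, but with cached stop words."""
    no_punctuation = "".join(c for c in text if c not in string.punctuation)
    return [word for word in no_punctuation.split() if word not in STOPWORDS]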
sms['message'].head(5)
# Let's check the function
sms['message'].head(5).apply(text_preprocess)
In this step I will create a pipeline, in which:
- CountVectorizer converts each message into a vector of token counts, using text_preprocess as the analyzer;
- TfidfTransformer reweights those counts by TF-IDF;
- MultinomialNB fits a Naive Bayes classifier on the TF-IDF vectors.
But first, let's split the data into train and test data.
from sklearn.model_selection import train_test_split
X = sms['message']
y = sms['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
pipeline = Pipeline([
    ('vectorize', CountVectorizer(analyzer=text_preprocess)),
    ('tfidf', TfidfTransformer()),
    ('NBclassifier', MultinomialNB())
])
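Before fitting the full pipeline, it can help to peek at what the first two stages produce. The sketch below fits the vectorizer and TF-IDF transformer separately on the training data; the names bow_transformer, bow, and tfidf are mine, introduced just for illustration:
# Peek at the intermediate representations of the first two pipeline stages
bow_transformer = CountVectorizer(analyzer=text_preprocess).fit(X_train)
bow = bow_transformer.transform(X_train)  # sparse matrix of token counts
print('Vocabulary size:', len(bow_transformer.vocabulary_))
print('Bag-of-words shape:', bow.shape)

tfidf = TfidfTransformer().fit(bow).transform(bow)
print('TF-IDF shape:', tfidf.shape)  # same shape, reweighted values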
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(f"""
Confusion Matrix:
{confusion_matrix(y_test, y_pred)}
Classification Report:
{classification_report(y_test, y_pred)}
""")
Out of 246 spam messages in the test set, the model fails to identify 66 as spam; it does not misclassify any ham messages as spam. The overall accuracy is 96%.
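To see which spam messages slip through, one can pull out the misclassified rows; a quick sketch reusing X_test, y_test, and y_pred from above (missed_spam is a name I introduce here):
# Spam messages the model labeled as ham
missed_spam = X_test[(y_test == 'spam') & (y_pred == 'ham')]
print(missed_spam.head())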
Before using machine learning techniques to identify spam messages, I cleaned the dataset of 5,572 messages obtained from the UCI Machine Learning Repository by removing punctuation and stop words from each message. Then I created a pipeline that vectorized the text messages, calculated the TF-IDF of each vector, and trained a Naive Bayes classifier on the data. The model achieves an overall accuracy of 96%.
There are many approaches to processing, tokenizing, and training on text data. What I covered in this project are just some basic techniques to get started with Natural Language Processing.
If you have any questions, please feel free to contact me at tranduckhanh96@gmail.com. Thanks for reading!
Reference:
Python for Data Science and Machine Learning Course by Jose Portilla on Udemy