9/9/2018 - 5:54 PM

Spam Classifier

import pandas as pd
# Dataset from -
df = pd.read_table("smsspamcollection/SMSSpamCollection", sep='\t', names=['label','sms_message'])

# Convert the labels to binary values: 'ham' -> 0, 'spam' -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

print(df.shape) # returns (rows, columns)
df.head() # shows the first 5 rows

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

Here we build a frequency matrix on a small document set to see how the document-term matrix is generated.
The sample document set is 'documents'.
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']
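
To see what the frequency matrix should contain before handing the work to CountVectorizer, the counting can be sketched by hand. This is only an illustration, not how scikit-learn implements it: the helper 'tokenize' and the name 'manual_matrix' are made up for this sketch, and the regular expression assumes the default token_pattern (words of two or more characters, lowercased).

```python
import re
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

def tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of two or more
    # word characters (assumed to mirror the default token_pattern).
    return re.findall(r'\b\w\w+\b', text.lower())

# Sorted vocabulary across all documents: one column per word
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

# One row per document, holding the count of each vocabulary word
manual_matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    manual_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in manual_matrix:
    print(row)
```

Each row has one entry per vocabulary word, so the second document gets a 2 in the 'win' column and zeros for every word it does not contain.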

Practice note:
Print the 'count_vector' object, which is an instance of 'CountVectorizer()'.

print(count_vector)
# prints:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

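# Fit the vectorizer to the sample documents so it learns the vocabulary
count_vector.fit(documents)

# Convert the documents into a matrix of token counts
doc_array = count_vector.transform(documents).toarray()

# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
frequency_matrix = pd.DataFrame(doc_array,
                                columns=count_vector.get_feature_names_out())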

X_train is our training data for the 'sms_message' column.
y_train is our training data for the 'label' column.
X_test is our testing data for the 'sms_message' column.
y_test is our testing data for the 'label' column.
Print out the number of rows we have in each of our training and testing data.

# Split into training and testing sets.
# train_test_split lives in sklearn.model_selection; the older
# sklearn.cross_validation module was deprecated and later removed.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'])

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))
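
What a train/test split does can be sketched without scikit-learn. The function 'simple_split' below is a hypothetical stand-in, not the library's implementation: it shuffles row indices and holds out a fraction of them, keeping each message paired with its label. The 0.25 test fraction and the seed are assumptions chosen for the illustration.

```python
import random

def simple_split(features, labels, test_fraction=0.25, seed=0):
    # Hypothetical stand-in for a train/test split.
    # Shuffle the row indices reproducibly, then hold out the test share.
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    # Same return order as used above: X_train, X_test, y_train, y_test
    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data, made up for the sketch
messages = ['msg {}'.format(i) for i in range(8)]
labels = [0, 1, 0, 0, 1, 0, 0, 1]

X_tr, X_te, y_tr, y_te = simple_split(messages, labels)
print('Training rows: {}'.format(len(X_tr)))
print('Test rows: {}'.format(len(X_te)))
```

The important property is that every row lands in exactly one of the two sets, and that the labels are split with the same indices as the messages.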

# Create a fresh CountVectorizer instance
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
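
Why the test data is only transformed and never fitted can be shown with a small hand-written sketch. The helpers 'fit_vocabulary' and 'transform' are hypothetical names for this illustration, not scikit-learn functions: the vocabulary is learned from the training texts alone, so words that appear only in the test texts get no column and are ignored.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical helper: lowercase words of two or more characters
    return re.findall(r'\b\w\w+\b', text.lower())

def fit_vocabulary(texts):
    # "fit": learn the sorted vocabulary from the given texts only
    return sorted({token for text in texts for token in tokenize(text)})

def transform(texts, vocabulary):
    # "transform": count only words that already have a column
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in vocabulary])
    return rows

# Toy texts, made up for the sketch
train_texts = ['win money now', 'call me now']
test_texts = ['win a free prize now']

vocab = fit_vocabulary(train_texts)       # learned from training data only
test_rows = transform(test_texts, vocab)  # 'free' and 'prize' are dropped

print(vocab)
print(test_rows)
```

Because both matrices share the training vocabulary, they have the same number of columns, which is what the classifier needs at prediction time.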

from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

predictions = naive_bayes.predict(testing_data)
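
The idea behind a multinomial Naive Bayes classifier can be sketched in plain Python. This is a simplified illustration and not scikit-learn's implementation: the function names, the toy count rows, and the three-word vocabulary are all made up, and alpha=1.0 is assumed as the Laplace smoothing constant. Each class gets a log prior and one smoothed log probability per word; a message is assigned to the class with the highest total score.

```python
import math

def fit_multinomial_nb(count_rows, labels, alpha=1.0):
    # Hypothetical helper: estimate log priors and smoothed log likelihoods
    classes = sorted(set(labels))
    n_features = len(count_rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [r for r, y in zip(count_rows, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(count_rows))
        # Total count of each word across this class's documents
        word_totals = [sum(col) for col in zip(*rows)]
        denom = sum(word_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in word_totals]
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, count_rows):
    # Score each class as log prior + sum(count * log likelihood)
    classes, log_prior, log_likelihood = model
    predictions = []
    for row in count_rows:
        scores = {c: log_prior[c] + sum(n * lp for n, lp in zip(row, log_likelihood[c]))
                  for c in classes}
        predictions.append(max(scores, key=scores.get))
    return predictions

# Toy counts over the hypothetical vocabulary ['call', 'money', 'win']
train_rows = [[1, 0, 0], [0, 2, 1], [2, 0, 0], [0, 1, 2]]
train_labels = [0, 1, 0, 1]  # 0 = ham, 1 = spam

model = fit_multinomial_nb(train_rows, train_labels)
toy_predictions = predict_multinomial_nb(model, [[0, 1, 1], [1, 0, 0]])
print(toy_predictions)
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in a class from forcing that class's probability to zero.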

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions)))
print('Recall score: {}'.format(recall_score(y_test, predictions)))
print('F1 score: {}'.format(f1_score(y_test, predictions)))
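
The four scores above can be worked out by hand from the counts of true positives, false positives and false negatives. The sketch below uses made-up toy labels and a hypothetical helper 'binary_metrics', treating spam (1) as the positive class; it is meant to show the formulas, not to replace the sklearn.metrics functions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    # Hypothetical helper: count outcomes, then apply the standard formulas
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted spam, how much was spam
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual spam, how much was caught
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Toy labels, made up for the sketch: one spam message is missed
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 0, 1, 0]

accuracy, precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print('Accuracy score: {}'.format(accuracy))
print('Precision score: {}'.format(precision))
print('Recall score: {}'.format(recall))
print('F1 score: {}'.format(f1))
```

Here nothing is wrongly flagged as spam, so precision is perfect, while the one missed spam message lowers recall and, through it, the F1 score.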