# Introduction
In this analysis, I will be working with the SMS Spam Collection dataset. The goal is to build a classifier that accurately identifies spam messages. I will perform data cleaning, exploratory analysis, and then train a Naive Bayes classifier.
First, I will load the data set and explore its structure.
# Load the dataset
sms <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
# Display the structure of the dataset
str(sms)
## 'data.frame': 5574 obs. of 2 variables:
## $ type: chr "ham" "ham" "spam" "ham" ...
## $ text: chr "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...
# Convert the type variable to a factor
sms$type <- factor(sms$type)
# Display the distribution of message types
table(sms$type)
##
## ham spam
## 4827 747
# Data Preprocessing
Next, I will clean the text data to prepare it for analysis. This involves creating a text corpus, transforming the text to lowercase, removing numbers, punctuation, and stop-words, and applying stemming.
# Create a text corpus from the SMS text
sms_corpus <- VCorpus(VectorSource(sms$text))
print(sms_corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5574
# Display the first SMS text as a character string
as.character(sms_corpus[[1]])
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
# Display the first three SMS texts
lapply(sms_corpus[1:3], as.character)
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
##
## $`2`
## [1] "Ok lar... Joking wif u oni..."
##
## $`3`
## [1] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
# Clean the text data
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords("en"))
# Apply stemming to the cleaned corpus
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
# Display the cleaned text
as.character(sms_corpus_clean[[1]])
## [1] "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
# Document-Term Matrix
Now, I will create a Document-Term Matrix (DTM) to represent the frequency of words in the SMS messages.
# Create a Document-Term Matrix
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
sms_dtm
## <<DocumentTermMatrix (documents: 5574, terms: 6942)>>
## Non-/sparse entries: 43749/38650959
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
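As a quick sanity check (plain arithmetic, not part of the original analysis), the sparsity figure reported above follows directly from the matrix dimensions:

```r
# Dimensions reported by the DTM: 5574 documents x 6942 terms
total_cells <- 5574 * 6942  # 38,694,708 cells in the matrix
nonsparse   <- 43749        # cells holding a nonzero term count
sparse      <- total_cells - nonsparse

sparse
# [1] 38650959
round(100 * sparse / total_cells)
# [1] 100
```

The matrix is over 99.88% zeros, which rounds to the "Sparsity: 100%" shown in the output; this is typical for text data, where each short message uses only a handful of the 6942 terms.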
# Splitting the Data
I will split the dataset into training and testing sets. The training set consists of the first 4000 messages; the testing set contains messages 4001 through 5559.
# Split the dataset into training and testing sets
sms_dtm_train <- sms_dtm[1:4000, ]
sms_dtm_test <- sms_dtm[4001:5559, ]
# Assign labels to training and testing sets
sms_train_labels <- sms[1:4000, ]$type
sms_test_labels <- sms[4001:5559, ]$type
# Display the proportion of each type in the training set
prop.table(table(sms_train_labels)) * 100
## sms_train_labels
## ham spam
## 86.65 13.35
# Display the proportion of each type in the testing set
prop.table(table(sms_test_labels)) * 100
## sms_test_labels
## ham spam
## 86.46568 13.53432
# Exploratory Data Analysis
To visualize the most common words in the SMS messages, I will create a word cloud.
# Create a word cloud of the cleaned corpus
wordcloud(sms_corpus_clean, min.freq = 40, random.order = FALSE)
# Frequent Terms
Next, I will find the most frequent terms in the training set.
# Find frequent terms in the training DTM
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_freq_words)
## chr [1:1145] "£wk" "abiola" "abl" "abt" "accept" "access" "account" ...
# Create new DTM with only frequent terms
sms_dtm_freq_train <- sms_dtm_train[, sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[, sms_freq_words]
# Data Conversion
I will convert the term frequencies into binary values (Yes/No) to prepare the data for classification.
# Function to convert counts to Yes/No
convert_counts <- function(x) {
ifelse(x > 0, "Yes", "No")
}
# Apply the conversion function to the training and testing sets
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
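As a quick illustration of what convert_counts() does (the counts below are made up, not taken from the dataset), any positive count becomes "Yes" and a zero becomes "No":

```r
# Maps term counts to "Yes"/"No" presence indicators (same logic as above)
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Toy vector of counts for one term across four documents
convert_counts(c(0, 2, 0, 1))
# [1] "No"  "Yes" "No"  "Yes"
```

Discarding the counts in favor of presence/absence suits Naive Bayes here, which models each term as a categorical feature rather than a frequency.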
# Training the Classifier
Now, I will train a Naive Bayes classifier using the training data.
# Train the Naive Bayes classifier
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
# Make predictions on the testing set
sms_test_pred <- predict(sms_classifier, sms_test)
# Evaluating the Model
Finally, I will evaluate the model’s performance by comparing the predicted labels with the actual labels.
# Create a cross-table to evaluate the model's predictions
CrossTable(sms_test_pred, sms_test_labels, prop.t = FALSE, dnn = c('Predicted', 'Actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1559
##
##
## | Actual
## Predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1339 | 24 | 1363 |
## | 21.851 | 139.595 | |
## | 0.982 | 0.018 | 0.874 |
## | 0.993 | 0.114 | |
## -------------|-----------|-----------|-----------|
## spam | 9 | 187 | 196 |
## | 151.951 | 970.756 | |
## | 0.046 | 0.954 | 0.126 |
## | 0.007 | 0.886 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1348 | 211 | 1559 |
## | 0.865 | 0.135 | |
## -------------|-----------|-----------|-----------|
##
##
# Display the confusion matrix
table(sms_test_pred, sms_test_labels)
## sms_test_labels
## sms_test_pred ham spam
## ham 1339 24
## spam 9 187
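The headline metrics can be read off the confusion matrix directly; the following is plain arithmetic on the counts above, not additional model output:

```r
# Counts taken from the confusion matrix above
tp <- 187   # spam correctly flagged as spam
tn <- 1339  # ham correctly passed through as ham
fp <- 9     # ham wrongly flagged as spam
fn <- 24    # spam that slipped through as ham

accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # overall hit rate
precision <- tp / (tp + fp)                   # how trustworthy a spam flag is
recall    <- tp / (tp + fn)                   # how much spam is caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
#  accuracy precision    recall
#    0.9788    0.9541    0.8863
```

Note the asymmetry: only 9 of 1348 legitimate messages were misfiled as spam, while about 11% of spam still got through, a reasonable trade-off for a spam filter, where losing real mail is the costlier error.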
# Conclusion
In this analysis, I built a Naive Bayes classifier to detect spam messages in the SMS dataset. On the held-out test set the model classified 1526 of 1559 messages correctly (about 97.9%), letting 24 spam messages through and wrongly flagging 9 legitimate ones.
Note: the packages tm, SnowballC, e1071, wordcloud, and gmodels must be installed (and loaded with library()) before running the code. Adjust the file path passed to read.csv() as needed.