# Load necessary libraries
library(tm)         # text mining: VCorpus, tm_map, DocumentTermMatrix
library(SnowballC)  # stemming: stemDocument
library(wordcloud)  # word cloud visualization
library(e1071)      # naiveBayes classifier
library(gmodels)    # CrossTable

# Introduction

In this analysis, I will work with the SMS Spam Collection dataset. The goal is to build a classifier that accurately identifies spam messages. I will perform data cleaning, exploratory analysis, and then train a Naive Bayes classifier.

First, I will load the dataset and explore its structure.

# Load the dataset
sms <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

# Display the structure of the dataset
str(sms)
## 'data.frame':    5574 obs. of  2 variables:
##  $ type: chr  "ham" "ham" "spam" "ham" ...
##  $ text: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...
# Convert the type variable to a factor
sms$type <- factor(sms$type)

# Display the distribution of message types
table(sms$type)
## 
##  ham spam 
## 4827  747
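The class split above already suggests a baseline: a trivial classifier that labels every message "ham" would be right about 86.6% of the time, so the Naive Bayes model needs to clear that bar. A quick check using the counts reported above:

```r
# Class counts from table(sms$type) above
n_ham  <- 4827
n_spam <- 747

# Accuracy of a degenerate "always ham" classifier
baseline <- n_ham / (n_ham + n_spam)
round(baseline * 100, 1)  # 86.6
```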

# Data Preprocessing

Next, I will clean the text data to prepare it for analysis. This involves creating a text corpus, converting the text to lowercase, removing numbers, punctuation, and stop words, and applying stemming.

# Create a text corpus from the SMS text
sms_corpus <- VCorpus(VectorSource(sms$text))
print(sms_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5574
# Display the first SMS text as a character string
as.character(sms_corpus[[1]])
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
# Display the first three SMS texts
lapply(sms_corpus[1:3], as.character)
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## 
## $`2`
## [1] "Ok lar... Joking wif u oni..."
## 
## $`3`
## [1] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
# Clean the text data
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords("en"))

# Apply stemming to the cleaned corpus
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)

# Display the cleaned text
as.character(sms_corpus_clean[[1]])
## [1] "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"

# Document-Term Matrix

Now, I will create a Document-Term Matrix (DTM) to represent the frequency of words in the SMS messages.

# Create a Document-Term Matrix
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
sms_dtm
## <<DocumentTermMatrix (documents: 5574, terms: 6942)>>
## Non-/sparse entries: 43749/38650959
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)
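The reported "Sparsity: 100%" is a rounded figure; the exact value can be reconstructed from the entry counts in the summary above:

```r
# Figures from the DTM summary above
non_sparse <- 43749
sparse     <- 38650959
total      <- 5574 * 6942          # documents x terms

stopifnot(non_sparse + sparse == total)
round(100 * sparse / total, 2)     # 99.89 -- which tm rounds up to 100%
```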

# Splitting the Data

I will split the dataset into training and testing sets. The training set consists of the first 4,000 messages, and the testing set contains messages 4,001 through 5,559; note that the last 15 rows of the dataset (5,560 through 5,574) are not used in either set.

# Split the dataset into training and testing sets
sms_dtm_train <- sms_dtm[1:4000, ]
sms_dtm_test <- sms_dtm[4001:5559, ]

# Assign labels to training and testing sets
sms_train_labels <- sms[1:4000, ]$type
sms_test_labels <- sms[4001:5559, ]$type

# Display the proportion of each type in the training set
prop.table(table(sms_train_labels)) * 100
## sms_train_labels
##   ham  spam 
## 86.65 13.35
# Display the proportion of each type in the testing set
prop.table(table(sms_test_labels)) * 100
## sms_test_labels
##      ham     spam 
## 86.46568 13.53432

# Exploratory Data Analysis

To visualize the most common words in the SMS messages, I will create a word cloud.

# Create a word cloud of the cleaned corpus
wordcloud(sms_corpus_clean, min.freq = 40, random.order = FALSE)

# Frequent Terms

Next, I will find the most frequent terms in the training set.

# Find frequent terms in the training DTM
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_freq_words)
##  chr [1:1145] "£wk" "abiola" "abl" "abt" "accept" "access" "account" ...
# Create new DTM with only frequent terms
sms_dtm_freq_train <- sms_dtm_train[, sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[, sms_freq_words]

# Data Conversion

I will convert the term frequencies into binary values (Yes/No) to prepare the data for classification.

# Function to convert counts to Yes/No
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
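As a quick sanity check, here is the conversion applied to a small example vector (the function is repeated so the snippet is self-contained):

```r
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

convert_counts(c(0, 1, 5, 0))  # "No" "Yes" "Yes" "No"
```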

# Apply the conversion function to the training and testing sets
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)

# Training the Classifier

Now, I will train a Naive Bayes classifier using the training data.

# Train the Naive Bayes classifier
sms_classifier <- naiveBayes(sms_train, sms_train_labels)

# Make predictions on the testing set
sms_test_pred <- predict(sms_classifier, sms_test)

# Evaluating the Model

Finally, I will evaluate the model’s performance by comparing the predicted labels with the actual labels.

# Create a cross-table to evaluate the model's predictions
CrossTable(sms_test_pred, sms_test_labels, prop.t = FALSE, dnn = c('Predicted', 'Actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1559 
## 
##  
##              | Actual 
##    Predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |      1339 |        24 |      1363 | 
##              |    21.851 |   139.595 |           | 
##              |     0.982 |     0.018 |     0.874 | 
##              |     0.993 |     0.114 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         9 |       187 |       196 | 
##              |   151.951 |   970.756 |           | 
##              |     0.046 |     0.954 |     0.126 | 
##              |     0.007 |     0.886 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1348 |       211 |      1559 | 
##              |     0.865 |     0.135 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
# Display the confusion matrix
table(sms_test_pred, sms_test_labels)
##              sms_test_labels
## sms_test_pred  ham spam
##          ham  1339   24
##          spam    9  187

# Conclusion

In this analysis, I successfully built a Naive Bayes classifier to detect spam messages in the SMS dataset. The model’s predictions were evaluated using a confusion matrix, providing insights into its performance.
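To put numbers on that performance, the headline metrics can be recomputed directly from the cell counts of the confusion matrix above, treating spam as the positive class:

```r
# Cell counts from the confusion matrix above (spam = positive class)
TP <- 187   # spam correctly flagged
TN <- 1339  # ham correctly passed
FP <- 9     # ham incorrectly flagged as spam
FN <- 24    # spam that slipped through as ham

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
# accuracy 0.979, precision 0.954, recall 0.886
```

Overall accuracy is about 97.9%; the main weakness is spam recall, since 24 spam messages were misclassified as ham.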
