Singular Value Decomposition, First 2 Components:
Feature Importance:
My baseline accuracy was 53.8%, obtained by predicting the positive class for every sample.
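As a minimal sketch of how a majority-class baseline like this can be computed, the snippet below always predicts the most common label and reports the resulting accuracy. The function name `majority_baseline_accuracy` and the toy labels are hypothetical and are not taken from the project data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the accuracy of always predicting the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    # most_common(1) yields [(label, count)] for the largest class.
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical toy labels (7 positive, 6 negative), not the project's data.
toy_labels = [1] * 7 + [0] * 6
baseline = majority_baseline_accuracy(toy_labels)
print(f"Baseline accuracy: {baseline:.1%}")
```

Any model worth keeping should beat this number on the same data split.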
The Random Forest model performed strongly on the training data but poorly on the validation data, a sign of overfitting. Its accuracy on unseen data was only 70.5%, lower than that of the Naive Bayes model. This is likely a result of fitting with a small number of trees, and a larger number would probably produce a more generalizable result.
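One way to check for this kind of overfitting is to compare training and validation accuracy side by side for different forest sizes. The sketch below assumes scikit-learn and uses synthetic data from `make_classification` as a stand-in. The helper `accuracy_gap`, the dataset shape, and the tree counts are all illustrative assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the project's real features are not reproduced here.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=8, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)


def accuracy_gap(n_estimators):
    """Fit a forest and return (training accuracy, validation accuracy)."""
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)


# Compare a small forest with a larger one (hypothetical sizes).
for n in (10, 200):
    train_acc, val_acc = accuracy_gap(n)
    print(f"{n:>3} trees: train={train_acc:.3f}, validation={val_acc:.3f}")
```

A large difference between the two numbers suggests the model is memorizing the training set rather than generalizing.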
The Multinomial Naive Bayes model performed consistently across the training and validation datasets, predicting classes with 72.4% accuracy. Considering the simplicity of the model, and that it was fit to Term Frequency-Inverse Document Frequency (TF-IDF) vectorized data, this is a strong improvement over the 53.8% baseline.
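For readers who want to see the shape of such a model, here is a minimal sketch of a TF-IDF plus Multinomial Naive Bayes pipeline. It assumes scikit-learn, and the toy corpus and labels are hypothetical. The vectorizer and model settings used in the project itself are not shown here, so defaults are used throughout.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive class, 0 = negative class.
train_texts = [
    "great product works well",
    "love it excellent quality",
    "great quality love the design",
    "terrible broke after a day",
    "awful waste of money",
    "terrible quality awful support",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text with TF-IDF, then fits the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["excellent design love it", "awful broke terrible"])
print(predictions)
```

Wrapping both steps in one pipeline keeps the vectorizer from being fit on validation text, which would leak information into the evaluation.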
Fit additional models
Use GridSearch to optimize over a larger space of hyperparameters
Clean the data more aggressively to streamline computation and reduce compute time
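As a rough sketch of the grid search step, the snippet below tunes a Random Forest with scikit-learn's `GridSearchCV`. That this is the grid search tool intended above is my assumption. The parameter grid, the synthetic data, and the fold count are hypothetical placeholders rather than values from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; swap in the real feature matrix and labels.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Hypothetical search space: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [25, 100],
    "max_depth": [None, 5],
    "max_features": ["sqrt", 0.5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation for each candidate
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

The cost grows with the product of the grid sizes and the number of folds, which is one reason faster preprocessing matters before widening the search.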
This preliminary investigation showed positive results and laid the groundwork for further work. Evaluating a wider range of model algorithms and allocating more compute to tuning them is likely to produce a generalizable model with much higher accuracy. For initial studies, the Multinomial Naive Bayes model may serve as the minimum viable product for putting this classification system into production.
The full project repo is available on GitHub.
I'm a Data Scientist living in Santa Monica, CA. Driven by curiosity, I'm eager to apply ML and Data Science techniques to create scalable, robust solutions to complex problems.