Chris Sinatra

Political Subgroup Identification on Reddit

Accurate classification models have applications in assessing journalistic sources as well as in optimizing communications to build a deeper connection with a target audience. To explore this space, I used Natural Language Processing techniques to identify linguistic differences between closely related but ideologically distinct political subgroups. In this case, the groups of interest are Conservatives and Libertarians.

Problem Statement

This project aims to identify semantic and syntactic markers associated with two similar but distinct political subgroups: Conservatives and Libertarians. An understanding of the ideological underpinnings that define communication dynamics within a group can be applied by modeling the baseline tone of a given source and alerting audiences to texts that diverge significantly from it. It may also be used to identify bots or other inauthentic sources that attempt to influence group sentiment. To produce the data used for model training and validation, 100,000 Reddit comments were scraped from each subreddit.

Executive Summary

My best model classified the validation data with 72.4% accuracy against a baseline of 53.8%, a positive proof of concept. That model was Multinomial Naive Bayes; the fact that such a relatively simple model performs well suggests there is significant room for improvement as I move toward more sophisticated models.

Background

As access to information becomes more distributed across web sources, the ability to verify journalistic integrity becomes a pressing concern. This is made clear by ongoing issues surrounding social media platforms’ promotion of ‘fake news’ designed to influence political outcomes, as well as by the broader public conversation around bias in media.

Outside of propaganda detection, accurate modeling may also be used to score articles or press releases for relevance to a target audience. By scoring words that carry weight within a group, a communication can be optimized for relatability, impact, scope, and demographics.

A text corpus was built by querying Reddit and compiling comments from the Conservative and Libertarian subreddits. The Pushshift API was used to collect 100,000 samples from each subreddit, and the data was cleaned using regular expressions and word vectorization prior to Natural Language Processing (NLP). The cleaned data was divided into training and validation sets before model fitting, and classifications were then predicted on the validation set. Models evaluated included CART methods, Support Vector Machines (SVM), K-Nearest Neighbors with Latent Semantic Analysis (KNN / LSA), and others before a final production model was selected.
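A minimal sketch of the collection and cleaning step is shown below. The Pushshift endpoint, query parameters, and regex rules are assumptions for illustration rather than the exact queries used in the project; the full scraping code is in the repo.

```python
import re

import pandas as pd
import requests

# Public Pushshift comment-search endpoint (parameters assumed for illustration).
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/comment/"

def fetch_comments(subreddit, n_total=100_000, batch_size=500):
    """Page backwards through a subreddit's comment history via Pushshift."""
    comments, before = [], None
    while len(comments) < n_total:
        params = {"subreddit": subreddit, "size": batch_size}
        if before is not None:
            params["before"] = before
        batch = requests.get(PUSHSHIFT_URL, params=params).json().get("data", [])
        if not batch:
            break
        comments.extend(batch)
        before = batch[-1]["created_utc"]  # continue from the oldest comment seen
    return pd.DataFrame(comments[:n_total])

def clean_text(text):
    """Basic regex cleanup: drop URLs and non-letters, collapse whitespace."""
    text = re.sub(r"http\S+", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

conservative = fetch_comments("Conservative").assign(label=1)
libertarian = fetch_comments("Libertarian").assign(label=0)
df = pd.concat([conservative, libertarian], ignore_index=True)
df["body"] = df["body"].map(clean_text)
```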

Findings

During EDA I found that the Libertarian subreddit generally had higher word and character counts per document than the Conservative subreddit. While Count Vectorization produced similar top-word lists for the two groups, Tf-idf Vectorization was able to identify words with a stronger association to each group's individual ideology. As such, Tf-idf data was used for modeling.
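The comparison between the two vectorizers can be sketched as below; the stop-word setting and number of top terms are illustrative choices, not necessarily those used in the project.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# df with cleaned "body" text and a binary "label" comes from the collection sketch above.

def top_terms(vectorizer, docs, n=15):
    """Return the n terms with the highest total weight across a set of comments."""
    matrix = vectorizer.fit_transform(docs)
    totals = np.asarray(matrix.sum(axis=0)).ravel()
    terms = np.array(vectorizer.get_feature_names_out())
    return terms[totals.argsort()[::-1][:n]]

conservative_docs = df.loc[df["label"] == 1, "body"]
libertarian_docs = df.loc[df["label"] == 0, "body"]

# Raw counts tend to surface vocabulary shared by both groups...
print(top_terms(CountVectorizer(stop_words="english"), conservative_docs))
print(top_terms(CountVectorizer(stop_words="english"), libertarian_docs))

# ...while tf-idf down-weights terms that appear across many documents,
# surfacing more distinctive vocabulary for each subreddit.
print(top_terms(TfidfVectorizer(stop_words="english"), conservative_docs))
print(top_terms(TfidfVectorizer(stop_words="english"), libertarian_docs))
```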

Figure: Most frequent words for each category (word frequencies by subreddit)

Figure: Singular Value Decomposition, first two components

Figure: Feature importance

The baseline prediction accuracy was 53.8%, obtained by predicting the positive class for every sample.
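As a sketch, this baseline can be reproduced with scikit-learn's DummyClassifier; the split parameters here are illustrative.

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hold out a validation set from the cleaned corpus built above.
X_train, X_val, y_train, y_val = train_test_split(
    df["body"], df["label"], stratify=df["label"], random_state=42
)

# Always predicting the most frequent class sets the floor any real model must beat.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_val, y_val):.1%}")
```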

The Random Forest model showed strong performance on the training data, but it performed poorly on the validation data, showing signs of overfitting. Its predictive accuracy on unseen data was only 70.5%, below that of the Naive Bayes model. This was likely a result of fitting a small number of trees; using more trees and considering a larger number of features at each split would likely produce a more generalizable result.
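The kind of configuration that should generalize better is sketched below; the specific values for n_estimators, max_features, and min_samples_leaf are assumptions for illustration, not the settings used in this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# X_train, X_val, y_train, y_val come from the split in the baseline sketch above.
rf = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=20_000),
    RandomForestClassifier(
        n_estimators=500,      # more trees reduce variance
        max_features="sqrt",   # random subset of features considered at each split
        min_samples_leaf=5,    # keep leaves from memorizing individual comments
        n_jobs=-1,
        random_state=42,
    ),
)
rf.fit(X_train, y_train)
print(f"Train accuracy:      {rf.score(X_train, y_train):.1%}")
print(f"Validation accuracy: {rf.score(X_val, y_val):.1%}")
```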

The Multinomial Naive Bayes model performed consistently between the training and validation datasets, predicting classes with 72.4% accuracy. Considering the simplicity of the model, and that it was fit to Term Frequency / Inverse Document Frequency (tf-idf) vectorized data, this is a strong improvement over the baseline model.
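A minimal sketch of such a pipeline, assuming scikit-learn's TfidfVectorizer and MultinomialNB with default Laplace smoothing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# X_train, X_val, y_train, y_val come from the split in the baseline sketch above.
nb = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MultinomialNB(),  # alpha=1.0 smoothing by default
)
nb.fit(X_train, y_train)
print(f"Train accuracy:      {nb.score(X_train, y_train):.1%}")
print(f"Validation accuracy: {nb.score(X_val, y_val):.1%}")
```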

Recommendations

  • Fit additional models

  • Use GridSearch to optimize over a larger space of hyperparameters (see the sketch after this list)

  • Apply more aggressive data cleaning to streamline computation and reduce compute time
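As an example of the second recommendation, a GridSearchCV run over both vectorizer and model hyperparameters might look like the sketch below; the grid values are illustrative, not a tuned configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])

# Search jointly over vectorizer settings and the model's smoothing strength.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [2, 5],
    "tfidf__max_features": [10_000, 50_000, None],
    "clf__alpha": [0.1, 0.5, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)  # X_train, y_train from the baseline sketch above
print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.1%}")
```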

Conclusions

This preliminary investigation showed strong positive results and set the groundwork for further work. Continuing with a wider set of model algorithms, and allocating more compute to tuning the models, is likely to produce a generalizable model that predicts with much higher accuracy. For initial deployments, the Multinomial Naive Bayes model may serve as the minimum viable product to put this classification system into production.

Full project repo is available on GitHub


I'm a Data Scientist living in Santa Monica, CA. Driven by curiosity, I'm eager to apply ML and Data Science techniques to create scalable, robust solutions to complex problems.