(Source: CrisisLex Disaster)
Test: Using the Twitter standard search API, we collected tweets using search terms and geo-coordinates relevant to current disasters. $Xnumber$ tweets were collected from January 6, 2019 to January 16, 2019. We extracted the tweet text and bounding-box coordinates.
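A rough sketch of this collection step, assuming Tweepy and valid Twitter API credentials; the search terms, geocode, and count below are illustrative, not the exact values we used:

```python
import tweepy

# Hypothetical credentials -- replace with your own keys.
auth = tweepy.OAuth1UserHandler("CONSUMER_KEY", "CONSUMER_SECRET",
                                "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

tweets = []
# Standard search: disaster-related terms near an illustrative geo-coordinate.
for status in tweepy.Cursor(api.search_tweets,
                            q="flood OR wildfire OR earthquake",
                            geocode="34.0259,-118.7798,50km",  # example: Malibu, CA
                            tweet_mode="extended").items(500):
    bbox = None
    if status.place is not None:
        bbox = status.place.bounding_box.coordinates  # bounding-box polygon
    tweets.append({"text": status.full_text, "bounding_box": bbox})
```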
2. Exploratory Data Analysis and Cleaning
For a better understanding of the data, we performed exploratory data analysis. On the text data we applied standard natural language processing techniques such as tokenizing and lemmatizing. We then split the data into train and test sets and applied TF-IDF vectorization.
We use TF-IDF to determine which words are most discriminating between tweets: words that occur frequently across the corpus are penalized, while rare words are given more influence in our model.
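A minimal sketch of this step, assuming scikit-learn and NLTK, with a pandas DataFrame `df` holding a `text` column and a binary `on_topic` label (the column names are illustrative):

```python
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizer = RegexpTokenizer(r"\w+")
lemmatizer = WordNetLemmatizer()  # requires nltk.download("wordnet")

def clean(text):
    # Tokenize, lowercase, and lemmatize each tweet.
    return " ".join(lemmatizer.lemmatize(tok.lower()) for tok in tokenizer.tokenize(text))

df["clean_text"] = df["text"].apply(clean)

X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"], df["on_topic"], stratify=df["on_topic"], random_state=42)

# TF-IDF: frequent words are down-weighted, rare words gain influence.
tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)  # fit on the training set only
X_test_tfidf = tfidf.transform(X_test)
```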
3. Preprocessing and Modeling
Sandy Hurricane
Oklahoma Tornado
Alberta Floods and Queensland Flood
All Disasters
4. Mapping Simulation
Geo-coordinates are randomly generated for the tweets. These tweets are then mapped, with red representing on-topic and blue representing off-topic, to better visualize the disaster location.
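A minimal sketch of this simulation, assuming matplotlib and numpy; the bounding box and the `preds` array of model predictions are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = len(preds)  # preds: array of 0/1 predictions (1 = on-topic) from the model

# Randomly generate geo-coordinates inside an illustrative bounding box.
lats = rng.uniform(33.9, 34.2, size=n)
lons = rng.uniform(-118.9, -118.5, size=n)

colors = np.where(preds == 1, "red", "blue")  # red = on-topic, blue = off-topic
plt.scatter(lons, lats, c=colors, alpha=0.6, s=15)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Simulated tweet locations (red = on-topic, blue = off-topic)")
plt.show()
```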
5. Proof of Concept
We queried tweets in Malibu, CA and Riverside, CA with flood-related search terms, then cleaned, preprocessed, and modeled these raw tweets using the pipeline outlined in the previous notebooks.
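A sketch of that end-to-end pipeline applied to new raw tweets, assuming scikit-learn and the `clean` helper from the earlier sketch; the model choice and the `new_tweets` examples are illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Bundle vectorizing and classification so new tweets go through the same steps.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(df["clean_text"], df["on_topic"])

# Classify freshly collected flood-related tweets from Malibu and Riverside.
new_tweets = ["Street flooding near PCH, cars stuck in Malibu",
              "Great brunch spot in Riverside this weekend!"]
preds = pipeline.predict([clean(t) for t in new_tweets])
print(preds)  # 1 = on-topic (disaster-related), 0 = off-topic
```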
Advantages of Random Forest: the model is more sophisticated and may perform better on less obvious data, which may mean less data preprocessing is required.
Advantages of Logistic Regression: we can easily determine how likely a tweet is to be classified as on-topic by taking the exponential of the log-odds of its words. Even though Random Forest informs us about feature importance, we cannot as quickly determine how much each word contributes to classifying a tweet as on-topic.
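As a worked example of that interpretability claim (a sketch assuming the fitted pipeline above; the variable names are illustrative): exponentiating a word's logistic-regression coefficient gives the multiplicative change in the odds of a tweet being on-topic when that word's TF-IDF weight increases.

```python
import numpy as np
import pandas as pd

tfidf = pipeline.named_steps["tfidf"]
clf = pipeline.named_steps["clf"]

# exp(coefficient) = odds ratio: how much one unit of a word's TF-IDF weight
# multiplies the odds that the tweet is classified on-topic.
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=tfidf.get_feature_names_out())
print(odds_ratios.sort_values(ascending=False).head(10))  # most on-topic-indicative words
```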
Given a training set, we were able to classify disasters with high confidence based on the context of an individual tweet. This matters because, given an unseen set of tweets, we can identify whether a disaster is occurring. We randomly assigned geo-coordinates to the training set to simulate a real-life scenario in which first responders would benefit from seeing the specific locations on a map in order to contain the disaster at hand.
We simulated this process when we gathered real-life flood-related tweets. Applying the pipeline previously built on our training data, we classified these real tweets into their respective disaster categories with high accuracy.
There were three areas that we wanted to improve on:
Lack of recent real-world tweet data: We were unable to find enough recent real-world disaster tweets on which to test our trained models. The raw data we were able to collect were not labeled as relevant or not, which limited the models we could use.
Lack of real geo-coordinates: We were unable to gather real geo-coordinates. Instead, we generated hypothetical geo-coordinates for different tweets to mimic a natural disaster. Had we had access to the actual geo-coordinates, we could have accurately simulated a disaster through our plot.
Lack of compute: With more computing power, we could have scaled our analysis to gather more tweets, run our models on larger datasets, and produce a more confident model.
The full project repo is available on GitHub.
I'm a Data Scientist living in Santa Monica, CA. Driven by curiosity, I'm eager to apply ML and Data Science techniques to create scalable, robust solutions to complex problems.