The Beauty of API Wrappers and the Ugliness of Bad Data
During a wonderful week-long vacation to Boise, filled with countless brewery trips and hikes, I had the chance to talk to many of the locals. One theme pervaded the conversations: people were unhappy with the spike in housing prices and the huge influx of out-of-staters (they gave me a pass since I was just visiting!). With current supply shortages and people moving out of urban areas, housing prices across the nation have sharply increased. Boise, and Idaho in general, have been hit the hardest, with more than a 20% increase in housing and rental prices in just one year. In this article, I’ll look at whether Twitter/Reddit sentiment correlates with housing price index changes, focusing on how to narrow down the data so that it makes sense for the question at hand.
Take a few seconds and think of a popular website that you use; perhaps you have the mobile application on your phone or tablet. Whether you pictured the aforementioned Twitter or Reddit, Facebook, YouTube, or even a smaller site such as Etsy, you can grab all sorts of useful data through their Application Programming Interface (API). For those unfamiliar with APIs, the best analogy I’ve heard is to imagine that a website is a restaurant. As a customer or user, you sit in the front of the house. A waiter is available to take your order and deliver it to the kitchen in the back. The specifics of how the food is cooked are irrelevant as long as you get what you requested, and frankly, the restaurant does not want its customers to gain access to the kitchen. In this example, the waiter carrying the order is the request sent over the API, the dish served is the result returned to the user, and the kitchen is the back-end data architecture in which the request gets fulfilled. This analogy is most apt for the GET method, but APIs generally have others such as POST, which allows a user (or even a bot) to make changes and post to a site.
Python has a requests library that can handle API calls. For this, as for most API calls, you will generally need a developer account for the site.
Continuing with the Twitter example, we can search for all mentions of “Boise”, with the API documentation usually giving the format of the URL for the query. Here there are parameters such as “q” for query, and “result_type” for getting the most recent and/or most popular tweets.
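As a rough sketch of what such a query could look like, the snippet below builds the parameters for a search on “Boise” and previews the resulting URL. The endpoint is assumed to be Twitter’s v1.1 standard search endpoint with bearer-token authentication; the function names and the count parameter are my own choices for illustration, and the current API documentation should be checked since access terms change.

```python
from urllib.parse import urlencode

# Assumption: the v1.1 standard search endpoint (check the current API docs).
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_params(city, result_type="mixed", count=100):
    """Return the query parameters for a search on a city name."""
    if result_type not in {"recent", "popular", "mixed"}:
        raise ValueError(f"unsupported result_type: {result_type!r}")
    return {"q": city, "result_type": result_type, "count": count}


def search_tweets(city, bearer_token, **kwargs):
    """Send the GET request and return the decoded JSON response."""
    # Third-party library; imported here so the helper above has no dependencies.
    import requests

    response = requests.get(
        SEARCH_URL,
        params=build_search_params(city, **kwargs),
        headers={"Authorization": f"Bearer {bearer_token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


# Preview the URL that would be requested, without touching the network.
params = build_search_params("Boise", result_type="recent")
print(f"{SEARCH_URL}?{urlencode(params)}")
```

Calling search_tweets("Boise", token) would then send the request, where token is the bearer token from your developer account.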
At the end of June 2021, Boise (and much of the West) experienced an unprecedented heat wave that dominated much of the media. With the free developer account tier, Twitter limits searches to tweets from the past week, so we would expect tweets from the affected regions to be overwhelmingly negative. This is the first hint of bad data: not knowing how to filter out current events that heavily skew sentiment one way. Luckily, I had grabbed my data a few weeks earlier, avoiding the weather-related skew.
To build the dataset, we need to choose which cities to look at, find the housing price index changes, and get text from Twitter and Reddit. To get the cities, we use Wikipedia’s list of most populous U.S. cities and take the top 100. Ideally, we want to use API calls for everything.
To make APIs easier to use, developers have created API wrapper libraries, which greatly simplify the code needed to interact with the API. Wrappers are awesome for other reasons too, such as reducing the number of calls required. RapidAPI has a great article about wrappers, but for now we just need to know that a wrapper means ease of use, and no more hand-written requests calls. Let’s start with, in my opinion, the easiest wrapper to use: wikipedia. This library allows us to get data from Wikipedia without downloading the database or scraping the site, AND we don’t even need a developer account!
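A minimal sketch of how the city list might be pulled with the wikipedia wrapper follows. The page title, the use of pandas to parse the HTML table, and the assumption that the ranking table is the first one containing the word “City” are all mine, not confirmed details of the page’s current layout.

```python
import re
from io import StringIO

# Assumed page title for Wikipedia's list of most populous U.S. cities.
PAGE_TITLE = "List of United States cities by population"


def clean_city_name(raw):
    """Strip footnote markers such as '[c]' that Wikipedia appends to some names."""
    return re.sub(r"\[[^\]]*\]", "", raw).strip()


def fetch_top_cities(n=100):
    """Return the first n rows of the city ranking table as a DataFrame."""
    # Third-party libraries, imported here so clean_city_name stays dependency-free.
    import pandas as pd
    import wikipedia

    html = wikipedia.page(PAGE_TITLE, auto_suggest=False).html()
    # Assumption: the ranking table is the first table that mentions "City".
    tables = pd.read_html(StringIO(html), match="City")
    # Column names depend on the page's current layout, so cleaning the
    # city column with clean_city_name is left to the caller.
    return tables[0].head(n)


print(clean_city_name("New York[c]"))
```

The table still needs inspecting by hand after the call, since Wikipedia editors can reorder or rename columns at any time.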
So there’s already a bit of trouble brewing, in that the states had to be manually labeled. More trouble is ahead in finding the housing development index (HDI). Zillow has all the info we need (NOTE: the correlations in the data or lack thereof are predicated upon Zillow’s data being accurate) and even has an API, and even better, pyzillow is a wrapper for it! Unfortunately, this API costs money to use and requires approval; the wrapper also doesn’t support the function that we want to use. We can’t resort to scraping either because Zillow can detect this and sends a canned HTML response basically accusing us of being a bot. So, here is the sneaky improvisation, which uses the adj-city column created above:
This code prepares the URLs for opening, and then every 10 seconds, opens the next city’s URL. On each Zillow page, the HDI is on full display in big font, which makes it easy to nab, paste into a list, and close the tab before the next one opens. This avoids APIs and scraping, but took 16 minutes to pull off, so don’t use this method for larger datasets.
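The timed tab-opening trick can be sketched with Python’s built-in webbrowser module. The URL pattern and the “city-state” slug format below are hypothetical stand-ins, since the exact URLs and the contents of the adj-city column are not shown here.

```python
import time
import webbrowser

# Hypothetical URL pattern, used only for illustration.
URL_TEMPLATE = "https://www.zillow.com/{slug}/home-values/"


def make_slug(city, state_abbr):
    """Turn ('Boise', 'ID') into 'boise-id' (assumed shape of the adj-city column)."""
    return f"{city.strip().lower().replace(' ', '-')}-{state_abbr.strip().lower()}"


def build_urls(cities, template=URL_TEMPLATE):
    """Build one URL per (city, state) pair."""
    return [template.format(slug=make_slug(city, state)) for city, state in cities]


def open_in_turn(urls, delay_seconds=10,
                 opener=webbrowser.open_new_tab, sleep=time.sleep):
    """Open each URL in a new tab, pausing so a value can be copied by hand."""
    for i, url in enumerate(urls):
        opener(url)
        if i < len(urls) - 1:  # no need to wait after the last tab
            sleep(delay_seconds)


urls = build_urls([("Boise", "ID"), ("New York", "NY")])
# open_in_turn(urls)  # uncomment to actually open the tabs
```

At 10 seconds per tab, 100 cities works out to roughly 16 to 17 minutes, which is why this only makes sense for small datasets.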
Next, we can use the tweepy API wrapper to get data from Twitter. Thinking about what kind of data we want, the tweet should contain the city name, and we should collect a good number of tweets so that the average polarity is not dominated by a handful of them (the more tweets, the smaller the standard error of the mean). We also want a better chance that the tweet is relevant to the city, so we restrict the search to 100 km around the city: for example, if a New Yorker tweets about “Houston”, it may easily be about Houston Street. Since the maximum number of tweets for one query is 100 and we want 1000, we need to update the maximum ID each time so that we don’t get repeats.
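The maximum-ID bookkeeping is the fiddly part, so here is a library-agnostic sketch of it. The fetch_page callable is a hypothetical stand-in for the actual search call; with tweepy it would presumably wrap the search method (named search in tweepy 3.x and search_tweets in 4.x) with the q, geocode, count, and max_id arguments.

```python
def collect_tweets(fetch_page, target=1000, page_size=100):
    """Page backwards through search results until `target` tweets are collected.

    `fetch_page(max_id, count)` is a hypothetical callable returning a list of
    tweets (dicts with an integer "id"), newest first, with id <= max_id
    (or simply the newest tweets when max_id is None).
    """
    tweets = []
    max_id = None
    while len(tweets) < target:
        page = fetch_page(max_id, min(page_size, target - len(tweets)))
        if not page:
            break  # no older tweets left
        tweets.extend(page)
        # max_id is inclusive, so step just below the oldest tweet seen so far.
        max_id = min(t["id"] for t in page) - 1
    return tweets


def geocode_string(lat, lon, radius_km=100):
    """Format the 'lat,long,radius' string used by the search geocode parameter."""
    return f"{lat},{lon},{radius_km}km"
```

Subtracting one from the oldest ID is what prevents the last tweet of one page from reappearing as the first tweet of the next.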
For sentiment analysis, we will try two different packages: the TextBlob sentiment analyzer and VADER sentiment with vaderSentiment. The latter uses the whole sentence with context, while the former is based on individual words.
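Scoring with either package comes down to one call per text, so a small sketch can treat the scorer as a plug-in. The helper names are my own; the TextBlob polarity and VADER compound scores both range from -1 to 1, and summarizing them by mean and standard deviation is one reasonable choice rather than the only one.

```python
from statistics import mean, stdev


def textblob_polarity(text):
    """Polarity in [-1, 1] from TextBlob's default analyzer."""
    from textblob import TextBlob  # third-party

    return TextBlob(text).sentiment.polarity


def make_vader_scorer():
    """Build a scorer returning VADER's compound score in [-1, 1]."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # third-party

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def summarize(texts, scorer):
    """Score every text and report the count, mean, and sample standard deviation."""
    scores = [scorer(t) for t in texts]
    if not scores:
        return {"n": 0, "mean": None, "stdev": None}
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"n": len(scores), "mean": mean(scores), "stdev": spread}
```

For one city this would be called as summarize(tweet_texts, textblob_polarity) and summarize(tweet_texts, make_vader_scorer()), giving two comparable rows per city.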
Right away we can see that for the top 5 cities, the HDI and sentiment scores don’t correlate at all. But is this true for all 100 cities?
Absolutely true! There is a clear near-Gaussian distribution of the HDI change, but the outliers of the sentiment scores occur near the mean HDI. Boise can be seen all the way on the right side of the graph, but is not even in the top 5 cities by sentiment. Could it be that negative tweets by Idahoans about the housing market have the reverse of the desired effect? Perhaps the choice of name also correlates with sentiment: “New York City” may read as more neutral than “NYC”, and this search did not capture such variants.
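To put a number on what the plot suggests, one option is the Pearson correlation coefficient between the HDI changes and the mean sentiment scores. This is a plain-Python sketch of that statistic, not necessarily the method used to produce the plots; a value near 0 would match the visual impression of no linear relationship.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Passing the HDI column and a sentiment column (one value per city, in the same order) returns a single number between -1 and 1.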
Reddit and the Data Problem
Twitter proved to be fruitless for supporting our hypothesis, but perhaps Reddit can provide. Reddit also requires a developer account but limits calls to one per second, so no waiting around is necessary. Since we are just pulling data and not posting anything, authentication is simple with the
The wrapper gives us control through a few parameters. We can choose which subreddits to look in; to look in all subreddits, just choose r/all. We can limit the results to the newest X or top X posts, as well as by time period, such as the past week, year, or all-time. We can also limit ourselves to top-level comments; otherwise we traverse child comment trees and may get off-topic. I wanted to limit the number of posts for a city to 200, and the number of comments from one post to 20. A post should also have a score above 5, so that its sentiment is at least somewhat agreed upon.
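These limits can be sketched as follows, assuming the praw library as the wrapper. The function names are mine, and whether the 200-post cap applies before or after the score filter is a design choice; here the search itself is capped at 200 results and low-scoring posts are then skipped.

```python
from itertools import islice


def collect_comments(submissions, min_score=5, max_posts=200, max_comments=20):
    """Return top-level comment bodies from posts scoring above min_score."""
    texts = []
    kept = 0
    for submission in submissions:
        if kept >= max_posts:
            break
        if submission.score <= min_score:
            continue
        kept += 1
        # Drop "load more comments" placeholders instead of fetching them.
        submission.comments.replace_more(limit=0)
        # Iterating the comment forest directly yields top-level comments only.
        texts.extend(c.body for c in islice(submission.comments, max_comments))
    return texts


def search_city(city, client_id, client_secret, user_agent, subreddit="all"):
    """Search one subreddit (or r/all) for a city name over the past year."""
    import praw  # third-party

    # Read-only access: no username or password needed just to pull data.
    reddit = praw.Reddit(
        client_id=client_id, client_secret=client_secret, user_agent=user_agent
    )
    results = reddit.subreddit(subreddit).search(
        city, sort="top", time_filter="year", limit=200
    )
    return collect_comments(results)
```

Keeping the filtering in collect_comments separate from the network call makes it easy to try different thresholds on data that has already been downloaded.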
After checking the subreddits that these comments came from, a big problem emerged. For New York, a lot of content was coming from the Bravo Real Housewives subreddit, while for many cities, the sports team subreddits dominated the discussion. Sentiment about sports teams is a poor proxy for sentiment about the city itself: a nice city can have a bad team, and vice versa. Another problem was the usage of the city name in other contexts. For Tacoma, the majority of posts were from subreddits mentioning the Toyota Tacoma truck, and for Washington, D.C., the comments were split between the capital and the state.
To get more relevant data, we can use the principal subreddit for the city. Most cities just use their name as the subreddit, but some might include the state, or use an abbreviation such as r/NYC. Data sparsity becomes an issue for the smaller cities such as Hialeah, FL, so I decided to only use the top 50 cities and find their subreddits manually (sorry Boise!).
Chicago seems to have a much higher sentiment score than the other cities, even though the HDI didn’t change as drastically. We can also see Phoenix’s first post is about a pandemic death, so if more posts are like that, the polarity would skew down.
It’s About the Journey
Reddit’s sentiment also doesn’t seem to correlate with the housing prices. This analysis thus finds no evidence that housing prices correlate with online sentiment about cities. Even stranger, the sentiments of Twitter and Reddit don’t even seem to correlate with each other! Although I didn’t draw a y=x line, it is visually apparent that Twitter sentiment is simply lower than Reddit sentiment for the same cities. Does this mean Twitter users are more negative people than Redditors? Who’s to say; because the data were pulled in different ways and were limited to different time periods (a week and a year, respectively), a direct comparison cannot truly be made.
Although this seemed like an exercise in futility, I hope it prompts you to think twice about the assumptions you make, how you find appropriate data, and how to access that data with ease.