Using Twitter to collect data on individuals

In this blog post I will illustrate how to collect Twitter data on individuals for further analysis. I will approach this topic through the lens of an environmental economist, although, to the best of my knowledge, there are no publications in this field to date that use Twitter data.

The R package I will be using is rtweet, a very practical package for extracting Twitter data through the Twitter API.

The first step is to create a developer account. For a very detailed explanation of how to set up a Twitter developer account, you can check the rtweet package documentation. I will nevertheless replicate some steps here, because the process has changed slightly since 2018.

To set up your own developer account, you first need a regular Twitter account if you don’t already have one. You should then log in to your developer account and click the button “Create an app”. You must specify the name of the app, your website (which is not important for our purposes), and describe how the Twitter data you collect will be used. All of this information is mandatory. This is what the page to create an app looks like:

It goes without saying that using Twitter data must be done in a responsible manner (with great power comes great responsibility!).

Once you have filled out the form above, you will have created your own Twitter app, which should show up in the Apps list under whatever name you assigned it. If you click on Details and then on Keys and tokens, you should be able to see the API keys and access tokens you need to collect Twitter data. This is what they look like:

Again, it goes without saying that you should keep your keys secret. If someone else gets hold of them and uses them unlawfully, it is your responsibility!

Once you have obtained these keys, you are ready to use the rtweet package. As usual, the first step with a new package is to install it and load it with the library function:

install.packages("rtweet")
library(rtweet)

After generating your keys and secret keys, you should pass them to the create_token function in your R script:

twitter_token <- create_token(
  app             = "NAME OF YOUR APP",
  consumer_key    = "KEY",
  consumer_secret = "KEY",
  access_token    = "KEY",
  access_secret   = "KEY",
  set_renv        = TRUE)

You should replace the name of your app and the keys accordingly in the code above. You are now all set to start “harvesting” tweets.
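If you want to verify that rtweet picked up your token, a quick sanity check (assuming rtweet 0.7, where create_token with set_renv = TRUE caches the token for future sessions) is:

# Returns the token rtweet will use for API calls;
# an error here means no token was created or found
get_token()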

rtweet has several different functions for getting data from Twitter. Keep in mind, though, that the Twitter API imposes several restrictions, such as rate limits and a limited search window, on how much data you can extract.

I will start with perhaps the simplest task: extracting tweets based on a keyword. Since I am an environmental economist, I will extract tweets containing the keyword “Nature” to find out what people are talking about when they use that word. To do so, I use the search_tweets function from the rtweet package.

Inside the search_tweets function, I first specify that I want tweets with the “Nature” keyword, as well as the maximum number of tweets I want, which I set to 1000 (n = 1000). The rtweet documentation specifies that you can only extract tweets from the past six to nine days.

tweets_nature <- search_tweets("Nature", n = 1000)

The resulting tweets_nature dataframe has 999 observations and 90 variables. If there are fewer than 1000 tweets with the keyword “Nature”, the function will return only the tweets available from the last six to nine days.
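search_tweets also accepts several other useful arguments. As a sketch (the exact argument set depends on your rtweet version), you can exclude retweets and restrict the language of the results:

# Exclude retweets and keep only English-language tweets
tweets_nature_en <- search_tweets("Nature", n = 1000,
                                  include_rts = FALSE, lang = "en")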

Just to double-check which tweets we got, I will look at the oldest tweet I extracted:

min(tweets_nature$created_at)
[1] "2020-05-22 11:39:53 UTC"

A lot of people use “Nature” as a keyword or in the body of their tweets, so all of the tweets I extracted were published today.
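To see how quickly such tweets accumulate, you can also tabulate the tweets by hour (a simple base R sketch):

# Count tweets per hour of the day (times are in UTC)
table(format(tweets_nature$created_at, "%H"))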

There are several things I can do to clean up the data. For example, I can check how many unique users are behind the 999 tweets about Nature.

user_info <- lookup_users(unique(tweets_nature$user_id))
dim(user_info)
[1] 905 90

Out of 999 tweets, there are 905 unique Twitter accounts.
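A related check (again a base R sketch) is whether a few accounts dominate the sample, since heavy tweeters can skew any individual-level analysis:

# Number of tweets contributed by each account, most active first
tweets_per_account <- sort(table(tweets_nature$screen_name), decreasing = TRUE)
head(tweets_per_account)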

An alternative way to extract Twitter information is to collect live tweets for a certain period of time. For example, I would like to get all tweets with the keyword “Nature” published over a three-minute window. In this case, I use the stream_tweets function and specify the keyword (“Nature”) and the duration (180 seconds):

tweets_nature2 <- stream_tweets(q = "Nature", timeout = 180)
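For longer streams it is safer to write the raw data to disk and parse it afterwards, so a connection hiccup does not cost you the whole session. A sketch using rtweet’s parse_stream function (the file name is arbitrary):

# Save the raw stream to a JSON file instead of parsing on the fly
stream_tweets(q = "Nature", timeout = 180,
              file_name = "nature_stream.json", parse = FALSE)

# Parse the saved stream into a dataframe
tweets_nature2 <- parse_stream("nature_stream.json")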

Once you have extracted the tweets you are interested in, you might be compelled to start analyzing them right away. However, one important consideration is to double-check whether all of the tweeting accounts are actually legitimate users. In fact, many Twitter accounts are bots, and it might be important to remove them before you start analyzing your Twitter data. To do so, I use the tweetbotornot package:

install.packages("tweetbotornot")
library(tweetbotornot)

Note: I had some trouble installing the tweetbotornot package, and had to try the different installation commands specified in the package’s documentation.

# Estimate the probability that each account is a bot
data <- tweetbotornot(tweets_nature$user_id)

The result of the command above is a new dataframe that gives the probability of each user being a bot. Here are the first six observations:

> head(data)
# A tibble: 6 x 3
  screen_name    user_id             prob_bot
  <chr>          <chr>                  <dbl>
1 NAME        1161936535945658370   0.678 
2 NAME        1073430589011841025   0.988 
3 NAME        1106262646272069633   0.680 
4 NAME        516093297             0.971 
5 NAME        21749305              0.0631
6 NAME        219988196             0.867 

According to this package, the second and fourth users in our data have a very high probability (more than 95%) of being Twitter bots. One should consider deleting these observations before doing any analysis of the data, as they do not represent individuals.
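As a concrete illustration (the 0.95 cutoff here is an arbitrary choice, not a recommendation from the package), likely bots can be dropped as follows:

# Flag accounts whose estimated bot probability exceeds the cutoff
likely_bots <- data$user_id[data$prob_bot > 0.95]

# Keep only tweets from accounts below the cutoff
tweets_clean <- tweets_nature[!(tweets_nature$user_id %in% likely_bots), ]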

There are many possibilities when it comes to Twitter data analysis. For example, one of the most common analyses performed on Twitter data is sentiment analysis. In our application, one could analyze which sentiments are associated with Nature when people publish a tweet (see the sketch after the variable list below). Many other analyses are possible with the wide variety of variables that rtweet extracts. In our tweets_nature dataframe, we have the following variables:

> colnames(tweets_nature)
 [1] "user_id"                 "status_id"               "created_at"             
 [4] "screen_name"             "text"                    "source"                 
 [7] "display_text_width"      "reply_to_status_id"      "reply_to_user_id"       
[10] "reply_to_screen_name"    "is_quote"                "is_retweet"             
[13] "favorite_count"          "retweet_count"           "quote_count"            
[16] "reply_count"             "hashtags"                "symbols"                
[19] "urls_url"                "urls_t.co"               "urls_expanded_url"      
[22] "media_url"               "media_t.co"              "media_expanded_url"     
[25] "media_type"              "ext_media_url"           "ext_media_t.co"         
[28] "ext_media_expanded_url"  "ext_media_type"          "mentions_user_id"       
[31] "mentions_screen_name"    "lang"                    "quoted_status_id"       
[34] "quoted_text"             "quoted_created_at"       "quoted_source"          
[37] "quoted_favorite_count"   "quoted_retweet_count"    "quoted_user_id"         
[40] "quoted_screen_name"      "quoted_name"             "quoted_followers_count" 
[43] "quoted_friends_count"    "quoted_statuses_count"   "quoted_location"        
[46] "quoted_description"      "quoted_verified"         "retweet_status_id"      
[49] "retweet_text"            "retweet_created_at"      "retweet_source"         
[52] "retweet_favorite_count"  "retweet_retweet_count"   "retweet_user_id"        
[55] "retweet_screen_name"     "retweet_name"            "retweet_followers_count"
[58] "retweet_friends_count"   "retweet_statuses_count"  "retweet_location"       
[61] "retweet_description"     "retweet_verified"        "place_url"              
[64] "place_name"              "place_full_name"         "place_type"             
[67] "country"                 "country_code"            "geo_coords"             
[70] "coords_coords"           "bbox_coords"             "status_url"             
[73] "name"                    "location"                "description"            
[76] "url"                     "protected"               "followers_count"        
[79] "friends_count"           "listed_count"            "statuses_count"         
[82] "favourites_count"        "account_created_at"      "verified"               
[85] "profile_url"             "profile_expanded_url"    "account_lang"           
[88] "profile_banner_url"      "profile_background_url"  "profile_image_url"      

What I am looking into is whether we can apply a travel cost model using Twitter data. Given how few tweets share coordinates, applying such a model is challenging. It is also not clear what kind of keywords people use when publishing a tweet about a recreational visit.
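To see just how sparse the location data are, rtweet’s lat_lng function collects whatever coordinate information a tweet carries into two columns (for most tweets these will simply be NA):

# Append lat/lng columns extracted from the tweets' geo fields
tweets_geo <- lat_lng(tweets_nature)

# Share of tweets with usable coordinates
mean(!is.na(tweets_geo$lat))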

While I do not yet have an answer for how we can use Twitter data to apply non-market valuation methods, the potential and availability of Twitter data open the door to new ways of valuing environmental services that would be impractical otherwise.