Tweet Collection

In this tutorial I will show you how to collect more than 18K tweets and save them to your computer in an organized, systematic way.

Let’s start with organizing! Define your data folder location, create folders if needed.

One good R scripting practice is to define your main data folder as your ‘bucket’ from the very beginning. Ideally, this is a directory separate from the one where you keep your R code: one folder for code, another for your datasets. You might be working in a completely different directory in R [this is what you see when you type getwd()], but when you save your data, you will always save it to a file or folder inside your bucket, regardless of which working directory you happen to be in.

Setting your bucket in your R code also makes it easy to change the bucket location later (for instance, if you switch from one computer to another).

Ok, let’s set your bucket:

bucket <- 'type your bucket location here'
#for me, it is the following, so you will need to change this for your own use:
bucket <- 'D:/Work_20200326/datasets/github-tutorials'

So, this is your bucket name. Maybe you have already created this folder, or maybe you have not. The code below checks whether the folder already exists and, if not, creates the bucket for you:

if(dir.exists(bucket)==FALSE){
  dir.create(bucket)
}

Next is to specify a new folder inside this bucket. This way I know the data is coming from this particular RMarkdown file!

file.path(bucket, 'twitter', 'tweet_collection_18Kplus') #the file.path function takes all the elements in it and merges them in a way that you can use it as a file or folder location!
## [1] "D:/Work_20200326/datasets/github-tutorials/twitter/tweet_collection_18Kplus"
tmp <- 'twitter' #the upper folder
export_loc <-  file.path(bucket, 'twitter', 'tweet_collection_18Kplus') #the folder where we will save our tweet data

#check whether the twitter folder (i.e., the upper folder) exists first
if(dir.exists(file.path(bucket,tmp))==FALSE){
  dir.create(file.path(bucket,tmp))
}
rm(tmp)
#and then create the folder under the upper folder -- this is going to be our "export location"
if(dir.exists(file.path(export_loc))==FALSE){
  dir.create(file.path(export_loc))
}

Great! So now we have a new folder for saving the data we will collect.

Let’s practice tweet data collection first

Ok, so before we think too much about how to save the files, let’s learn how to collect tweets first! I am assuming you will use your personal standard Twitter API. I will not go into the details of how to get a Twitter Developer API; I am pretty sure search engines are full of instructions for that.

Since we will use the rtweet package for tweet collection, we will set our token first, following the code below. Keep in mind: never share your tokens or passwords with others. That is why you will not see mine written here, but I am using that information in the background:

## load rtweet package and other packages you need
library(rtweet)
library(stringr)

##
#fill in the five values below with the information from the Twitter developer website

appname = 'your-app-name-here'
key = 'your-key-here'
secret = 'your-secret-here'
access_token = 'your-access-token-here'
access_secret= 'your-access-secret-here'

twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret,
  access_token = access_token,
  access_secret = access_secret)

There are different ways of collecting tweets with the rtweet package using the standard Twitter API.

For more options, check this website: https://rtweet.info/articles/intro.html
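For instance, besides search_tweets(), rtweet also has stream_tweets() for sampling the live stream and get_timeline() for pulling a specific account’s recent tweets. The sketch below is not run in this tutorial, and the object names (and the example account) are mine:

#sample tweets matching a query from the live stream for 30 seconds
live_tweets <- stream_tweets(q = "#rladies", timeout = 30)
#pull the most recent tweets posted by one account
timeline_tweets <- get_timeline("RLadiesGlobal", n = 100)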

EDIT: I wrote this code before the Academic API was released. Please check the limits for the Academic API if you have one, and also take a look at the academictwitteR package: https://github.com/cjbarrie/academictwitteR

We will use search_tweets() today.

Let’s search for a hashtag! How about… ‘#rladies’ ? And for our practice let’s collect 5 tweets:

query <- "#rladies"
collect_count <- 5

tweets <- search_tweets(query,
                        include_rts=FALSE, #decide whether you want to include retweets or not.
                        n=collect_count)

This dataset should include many columns. Let’s check its dimensions and the names of the columns to see what kind of information we are collecting about these tweets (and about the users posting them!)

dim(tweets)
## [1]  5 90
names(tweets)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "quote_count"             "reply_count"            
## [17] "hashtags"                "symbols"                
## [19] "urls_url"                "urls_t.co"              
## [21] "urls_expanded_url"       "media_url"              
## [23] "media_t.co"              "media_expanded_url"     
## [25] "media_type"              "ext_media_url"          
## [27] "ext_media_t.co"          "ext_media_expanded_url" 
## [29] "ext_media_type"          "mentions_user_id"       
## [31] "mentions_screen_name"    "lang"                   
## [33] "quoted_status_id"        "quoted_text"            
## [35] "quoted_created_at"       "quoted_source"          
## [37] "quoted_favorite_count"   "quoted_retweet_count"   
## [39] "quoted_user_id"          "quoted_screen_name"     
## [41] "quoted_name"             "quoted_followers_count" 
## [43] "quoted_friends_count"    "quoted_statuses_count"  
## [45] "quoted_location"         "quoted_description"     
## [47] "quoted_verified"         "retweet_status_id"      
## [49] "retweet_text"            "retweet_created_at"     
## [51] "retweet_source"          "retweet_favorite_count" 
## [53] "retweet_retweet_count"   "retweet_user_id"        
## [55] "retweet_screen_name"     "retweet_name"           
## [57] "retweet_followers_count" "retweet_friends_count"  
## [59] "retweet_statuses_count"  "retweet_location"       
## [61] "retweet_description"     "retweet_verified"       
## [63] "place_url"               "place_name"             
## [65] "place_full_name"         "place_type"             
## [67] "country"                 "country_code"           
## [69] "geo_coords"              "coords_coords"          
## [71] "bbox_coords"             "status_url"             
## [73] "name"                    "location"               
## [75] "description"             "url"                    
## [77] "protected"               "followers_count"        
## [79] "friends_count"           "listed_count"           
## [81] "statuses_count"          "favourites_count"       
## [83] "account_created_at"      "verified"               
## [85] "profile_url"             "profile_expanded_url"   
## [87] "account_lang"            "profile_banner_url"     
## [89] "profile_background_url"  "profile_image_url"
tweets[1,1:10] #show the first 10 columns of the first tweet
## # A tibble: 1 x 10
##   user_id status_id created_at          screen_name text  source
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 870664~ 13955021~ 2021-05-20 22:10:14 RLadiesMVD  "\U0~ Twitt~
## # ... with 4 more variables: display_text_width <dbl>,
## #   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
## #   reply_to_screen_name <lgl>

Now it is time for the real deal: collect more than 18K!

So, there is this thing called a ‘rate limit,’ and you need to be aware of it. When using Twitter’s standard API to collect data from the past, you are limited to collecting 18K tweets per 15-minute window. This means you need to wait 15 minutes before you can request the 18,001st tweet. The search_tweets() function does include a parameter called retryonratelimit, and when you set it to TRUE it will do the sleeping for you automatically. However, if you are aiming to collect, say, 1 million tweets, there is a good chance that your connection will drop at some point, and you could end up losing everything you have collected so far. Therefore, my suggestion is to write your code so that even if you lose your connection or your function breaks for some unforeseen reason, you still keep (and save) the data you have collected up to that point.
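If you want to see where you stand against that limit at any point, the rate_limit() function in rtweet reports it. Below is a minimal sketch (the rl object name is mine, and the exact columns returned may vary a bit across rtweet versions):

#check how many search requests are left in the current 15-minute window
rl <- rate_limit(query = "search/tweets")
rl$remaining #number of requests we can still make before hitting the limit
rl$reset #time left until the window resets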

So this is what we will focus on today.

Remind your future self what you searched for in your query

The search_tweets() function allows you to search for multiple terms at a time. For example, it is possible to search not only for “#rladies” but also for “#rstats” in one go (a quick sketch of such a combined query is shown right below). You will know what your query included today, but when you look at your data, say, three months from now, will you remember what exactly you searched for? So, create a reminder file for yourself. A very simple text file.
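Here is that sketch. The multi_query and tweets_multi names are mine, just for illustration, and OR is a standard Twitter search operator:

#a single query that matches tweets containing either hashtag
multi_query <- "#rladies OR #rstats"
tweets_multi <- search_tweets(multi_query,
                              include_rts = FALSE,
                              n = 5)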

library(data.table)
library(fst)

q <- list()
q[[1]] <- paste("search query:",query)
fwrite(q, file = file.path(export_loc,"readme.txt"))

So now our folder contains the “readme.txt” file we just created (the .csv files you see in the listing below are left over from an earlier run of this tutorial):

files <- list.files(path = export_loc)
files
## [1] "readme.txt"                           
## [2] "search_tweets_1394804998115143682.csv"
## [3] "search_tweets_1395059182924087299.csv"

Great. Now, here is what we want to do. We will collect the data in chunks, say 18K tweets at a time. In every round, we will store the collected tweets in a variable and save them to a file, and then sleep for 15 minutes so that we respect the rate limit before starting the next round. This way, each file holds a big enough batch of tweets to be useful, but no single file becomes a giant one that slows R down when we need to open it in the future. We will keep doing rounds of this until we reach the total number of tweets we want to collect (say, 1M), or until there are no older tweets left to collect.

Let’s start with setting up those numbers

n_search <- 18000
n_total <- 1000000
round_total <- 10 #the maximum number of rounds per call; 10 rounds of 18K adds up to 180K collected tweets

We can write a loop function to do this for us. Before we do so, however, keep in mind that the search_tweets() function collects tweets in order from newest to oldest. This means that if you do not set your parameters right, you will end up collecting the same set of tweets in your subsequent searches instead of moving further back in time.

The parameter you need for this purpose is called max_id. Say that in your first search you collected five tweets with the following tweet id numbers: 3046, 2034, 1984, 832, 824 (newer tweets have higher id numbers, so this is how they will be ordered: the newest tweet is collected first, then an older one, and so forth). In your second search, if you set max_id = “824”, the search will return tweets with id number 824 or lower that fit your search criteria. This way, you know for sure that you are collecting tweets older than the ones you have collected so far.
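To make this concrete with the five practice tweets we collected above, a follow-up search could look like the sketch below. The oldest_id and older_tweets names are mine, and I am assuming (as the collection function later in this tutorial does) that the last row of the results is the oldest tweet:

#the last row of our practice data should be the oldest of the five tweets
tail(tweets$created_at, 1)
#use that tweet's status_id as max_id to continue the search further back in time
oldest_id <- tail(tweets, 1)$status_id
older_tweets <- search_tweets(query,
                              include_rts = FALSE,
                              n = collect_count,
                              max_id = oldest_id)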

We also need to be careful about naming our files. Each data file should carry a number, and we can use max_id for that purpose, giving file names such as “tweets_3478563047560131.csv” and “tweets_3476730928467093.csv”.
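For instance, with the oldest_id object from the sketch above, a file name could be put together like this (just an illustration; the collection function below does the same thing with paste0()):

#build a file name that carries the id of the oldest tweet it will contain
paste0('search_tweets_', oldest_id, '.csv')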

The dataset collected with the search_tweets() function will need some column adjustments before it can be saved as a .csv, because some of its columns are lists. The function below makes that adjustment:

#unlist the list columns in the twitter data so they can be written to .csv
adjust_data= function(list){
  df_tweets <- rbindlist(list) #bind all the data frames in the list into one data.table. a data.table function.
  #these columns come back from search_tweets() as list columns; collapse each entry into a single string
  list_cols <- c("hashtags", "symbols", "urls_url", "urls_t.co", "urls_expanded_url",
                 "media_url", "media_t.co", "media_expanded_url", "media_type",
                 "ext_media_url", "ext_media_t.co", "ext_media_expanded_url", "ext_media_type",
                 "mentions_user_id", "mentions_screen_name",
                 "geo_coords", "coords_coords", "bbox_coords")
  df_tweets[, (list_cols) := lapply(.SD, function(col) sapply(col, toString)), .SDcols = list_cols]
  df_tweets
}

Now, we can start collecting tweets! Let’s write the function:

#search tweets with 15 min. sleeping + data org method.
my_search_tweets = function(q, # query
                                rt=FALSE, #whether to include retweets or not
                                n_search=18000, #how many tweets to search at a time
                                maxid=NULL, # default is NULL
                                round_total=100, #max. rounds of search
                                since=NULL,
                                until=NULL,
                                loc_export){ #location to save file
  #create an empty list
  tweetList <- list() 
  isit="NO"
  for(i in 1:round_total){
    print(paste("data number:",i,"out of",round_total))
    tweets <- search_tweets(q=q,
                            include_rts=rt, 
                            n=n_search,
                            max_id = maxid,
                            since=since,
                            until=until)
    #stop right away if the search returned no tweets at all (e.g., nothing left in the requested window)
    if(nrow(tweets)==0){
      print('no tweets returned. Search complete.')
      break}
    tweetList[[1]] <- tweets #store this round's tweets (slot 1 is overwritten each round, since we save every round)
    df_tweets=adjust_data(list=tweetList)
    if(is.null(maxid)==FALSE){
      #set isit to 'YES' if the search collects no new tweets anymore.
      if(maxid==tail(tweets,1)$status_id){isit='YES'} 
    }
    #stop the loop if maxid does not update anymore, after sleeping for 2 minutes.
    if(isit=="YES"){
      print('search complete. Will close after 2 minutes of sleep time.')
      Sys.sleep(60*2)
      break}
    #reset the maxid to use it in the next round of tweet collection
    maxid <- tail(tweets,1)$status_id 
    #save data
    filename=paste0('search_tweets_',maxid,'.csv')
    fwrite(df_tweets,file.path(loc_export,filename))
    print('Data saved.')
    rm(tweets)
    #sleep for 15 minutes before collecting the next dataset
    print('sleeping for 15 minutes')
    Sys.sleep(60*15)
  }
}

And let’s collect all tweets with the hashtag ‘#coronavirus’ posted in the last two days:

my_search_tweets(q = '#coronavirus',
                 since='2021-05-19',
                 until='2021-05-21',
                 loc_export = export_loc)

Now, let’s check the list of files in our folder:

files <- list.files(path = export_loc)
files
## [1] "readme.txt"                           
## [2] "search_tweets_1394804998115143682.csv"
## [3] "search_tweets_1395059182924087299.csv"

It worked!! Let’s also check whether we were able to collect older tweets (i.e., whether the max_id parameter adjustment worked):

df1 = fread(file.path(export_loc,files[2]))
df2 = fread(file.path(export_loc,files[3]))
nrow(df1)
## [1] 11693
nrow(df2)
## [1] 17756
new = length(unique(c(df1$status_id,df2$status_id))) 
new
## [1] 29446

That also worked!!!