Part Two: Scraping Tweets from Time Frames and Locations
In a previous post, I introduced how to scrape tweets from Twitter. Specifically, I covered how to connect R to the Twitter API and how to extract tweets related to a specific Twitter account (or Twitter handle) and a specified hashtag.
In this post, I am going to delve into further details regarding scraping tweets for a hashtag. First, we will take a look at tweets from a specific time frame versus extracting just the latest tweets posted with a hashtag. Second, we will look at tweets shared from a specific location. Since we have already covered the setup and the basics for scraping tweets, we can get right into the details.
Scraping Tweets from a Specific Time Frame
Sometimes, you might want to extract tweets regarding an event or a campaign from the day that it was happening. By the time you go to scrape tweets, the event may have passed, and the latest tweets may not be what was tweeted during the day of the event.
Luckily, when you scrape tweets, you can pass additional arguments to the searchTwitter() function to scrape date-specific tweets. These arguments are “since” and “until”, and each takes a date in YYYY-MM-DD format. Keep in mind that the standard Twitter search API only reaches back about a week, so this works only if you scrape soon after the event. To put this into practice, I wanted to extract tweets about the Democratic Debates on 9/12/19 that had #democraticdebate. Let’s take a look at what that code would look like:
Tweets <- searchTwitter("#democraticdebate exclude:retweets", n = 5000, since = '2019-09-11', until = '2019-09-13')
Since I wanted tweets from 9/12/19, I set the “since” argument to the day before and “until” to the day after. This extracts the most recent “n” tweets posted before the “until” date, which here means the last “n” tweets from 9/12. The debates were at night, so I guessed that 5000 tweets (n = 5000) would be enough to reach back to the start. Finding the right number to set “n” to may take some trial and error, but as you will see shortly, R will sometimes help you with determining what “n” should be.
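If you scrape several events, the day-before and day-after pattern can be wrapped in a small helper. This is only a minimal sketch: day_window() is a name I made up for illustration and is not part of twitteR.

```r
# Hypothetical helper: build the "since"/"until" strings around one target day
day_window <- function(day) {
  d <- as.Date(day)
  list(
    since = format(d - 1),  # the day before, as "YYYY-MM-DD"
    until = format(d + 1)   # the day after, as "YYYY-MM-DD"
  )
}

w <- day_window("2019-09-12")
w$since
w$until

# The values can then be passed straight to searchTwitter(), for example:
# searchTwitter("#democraticdebate exclude:retweets", n = 5000,
#               since = w$since, until = w$until)
```

The searchTwitter() call is left commented out because it needs an authenticated Twitter session.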
Once I turned these tweets into a data frame, I could see that the earliest tweets were from roughly 7:30pm, just as the debates were starting. For a quick refresher on how to turn your tweets into a data frame and CSV file:
Tweets_df <- as_tibble(map_df(Tweets, as.data.frame)) # create data frame (needs dplyr and purrr; as_tibble() replaces the deprecated tbl_df())
write.csv(Tweets_df, "DemDebateTweets.csv") # create CSV file
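Once the tweets are in a data frame, you can also narrow them down to the event day in local time. The sketch below uses a tiny made-up data frame in place of the real Tweets_df, and it assumes the “created” column holds UTC timestamps; the three rows and the choice of the America/New_York time zone are my own illustration.

```r
# Hypothetical stand-in for Tweets_df; `created` is assumed to be in UTC
Tweets_df <- data.frame(
  text = c("opening statements", "mid-debate", "after the debate"),
  created = as.POSIXct(
    c("2019-09-12 23:30:00", "2019-09-13 01:15:00", "2019-09-13 05:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Calendar date of each tweet in the event's local time zone
local_day <- as.Date(Tweets_df$created, tz = "America/New_York")

# Keep only the tweets posted on the day of the debate, local time
debate_day <- Tweets_df[local_day == as.Date("2019-09-12"), ]
nrow(debate_day)
```

Filtering on the local calendar date avoids dropping evening tweets whose UTC timestamp has already rolled over to the next day.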
Scraping Tweets from a Specific Location
Beyond scraping tweets from a specific day or time frame, you might also be interested in scraping tweets that were tweeted from a specific location. This can be done with the “geocode” argument. A geocode is simply the latitude and longitude of a location, which you can look up with any online map. For searchTwitter(), the argument is a single string of the form "latitude,longitude,radius", where the radius is given in miles (mi) or kilometers (km).
To continue with the #democraticdebate, I then wanted to extract tweets specifically from New York:
TweetsNYC <- searchTwitter("#democraticdebate exclude:retweets", n=5000, since ='2019-09-11', until= '2019-09-13', geocode = "40.730610,-73.935242,5mi")
As you can see, I added the coordinates to the “geocode” argument, along with 5mi as a radius.
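Typing the geocode string by hand makes it easy to swap the coordinates or forget the unit, so here is a minimal sketch of a helper that builds it. make_geocode() is a hypothetical name of my own, not a twitteR function, and the range checks are just a sanity guard.

```r
# Hypothetical helper: build a "latitude,longitude,radius" string
make_geocode <- function(lat, lon, radius, unit = c("mi", "km")) {
  unit <- match.arg(unit)  # only "mi" or "km" are accepted
  stopifnot(
    lat >= -90, lat <= 90,     # valid latitude
    lon >= -180, lon <= 180,   # valid longitude
    radius > 0                 # the radius must be positive
  )
  sprintf("%.6f,%.6f,%g%s", lat, lon, radius, unit)
}

nyc <- make_geocode(40.730610, -73.935242, 5, "mi")
nyc
# The result can then be passed as geocode = nyc in the searchTwitter() call above
```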
An important note here: I kept “n = 5000”, but R returned the following message: “5000 tweets were requested but the API can only return 313”. In other words, only 313 tweets matched the hashtag, the dates and the location. I then re-ran the above code with “n = 313” and it ran without the message. See, R does help us out sometimes!
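As far as I can tell, that message is a warning rather than an error, so the first call should already hand back the tweets it found. The sketch below shows how you could capture such a warning and keep going. Since the real call needs a Twitter session, fake_search() is a made-up stand-in that only mimics the message quoted above; it is not part of twitteR.

```r
# Hypothetical stand-in for searchTwitter(): mimics the warning quoted above
fake_search <- function(n, available = 313) {
  if (n > available) {
    warning(n, " tweets were requested but the API can only return ", available)
  }
  seq_len(min(n, available))  # placeholder for the list of tweets
}

captured <- NULL
tweets <- withCallingHandlers(
  fake_search(5000),
  warning = function(w) {
    captured <<- conditionMessage(w)  # keep the message for later inspection
    invokeRestart("muffleWarning")    # carry on; the result is still returned
  }
)

length(tweets)  # how many items actually came back
captured        # the text of the warning
```

With this pattern, length(tweets) tells you how many tweets were available, so there is no need to guess “n” a second time.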
Now you know how to scrape time- and location-specific tweets. If you want to practice, click here for the reference code.