Twitter is an online microblogging and social networking platform where users can post short messages of up to 280 characters, called Tweets. The limited length encourages direct, concise language, which makes Tweets easy to scan and read, but it can also make them challenging to write. In addition to text, Tweets can include pictures, audio, or videos.
Nowadays, politicians, celebrities, and many other influential personalities use Twitter as a channel to communicate with the public and post about their lives and thoughts. This diverse and rich online presence of famous people makes Twitter a great place to find out about events in the community, broaden general knowledge, and connect with people from all around the world. Moreover, companies, universities, organizations, and governments use Twitter to post about their events, plans, and products.
Twitter content is widely used to study people’s opinions and attitudes toward various topics or products, which can help us better understand social problems, develop better solutions, and improve regulations.
In this short post, I will show how to extract datasets from the Twitter API based on keywords and terms, using a Python library called Tweepy. The output is a set of metadata for each Tweet in an easy-to-use CSV file.
Before starting with the code, you will need to follow the instructions provided by the Twitter Developer Platform to get Twitter API access credentials. Twitter offers multiple account types depending on the purpose of downloading the data, and each account type may have different features. After your request is approved, you will receive a consumer_key, consumer_secret, access_token, and access_token_secret, which are mandatory for accessing the Twitter API.
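If you prefer not to paste the keys directly into the script (as I do in Step 2 below for simplicity), one option is to read them from environment variables. This is only a sketch; the variable names below are my own placeholders, not names defined by Twitter or Tweepy.
# Optional sketch: reading the credentials from environment variables
# instead of hard-coding them (placeholder variable names)
import os
consumer_key = os.environ.get("TWITTER_CONSUMER_KEY", "")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET", "")
access_token = os.environ.get("TWITTER_ACCESS_TOKEN", "")
access_token_secret = os.environ.get("TWITTER_ACCESS_TOKEN_SECRET", "")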
Step 1: Importing Libraries
Three Python libraries are used to extract and prepare the dataset: Tweepy to connect to the Twitter API, Pandas to organize the data into a dataframe, and csv to create the output CSV file.
import tweepy
import pandas as pd
import csv
Step 2: Accessing Twitter API
In this step, I provide the Twitter API credentials to connect to the API through Tweepy using the following script. The API supports access to several of Twitter’s functionalities (tweets, retweets, mentions, likes, etc.).
# Setting Twitter API credentials
consumer_key= "" # Insert your consumer key
consumer_secret= "" # Insert your consumer secret
access_token="" # Insert your access token
access_token_secret="" # Insert your access token secret
# Authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# Creating an object from Twitter API
api = tweepy.API(auth)
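As an optional sanity check (my addition, not part of the original script), Tweepy’s verify_credentials() call can be used to confirm that the keys and tokens are valid before requesting any data.
# Optional: confirm the credentials work before pulling data
try:
    user = api.verify_credentials()
    print("Authenticated as:", user.screen_name)
except Exception as err:  # the exception class name differs across Tweepy versions
    print("Authentication failed:", err)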
Step 3: Setting Parameters and Extracting Tweets
Parameters can be used as filters to target specific Tweets and customize the search. The following parameters will be used in this exercise:
- q: the keywords or terms used in the search query. To search for multiple keywords, define them in a single quoted string separated by ‘OR’ (see the short sketch after this list).
- count: the number of Tweets to request per page.
- lang: the language of the Tweets, for example “ar” for Arabic or “en” for English. For more information about the languages supported by Twitter and their codes, please check this page.
- since: the start date of the search, formatted as YYYY-MM-DD.
- until: the end date of the search, formatted as YYYY-MM-DD.
- result_type: the type of search results, which takes one of three values: “recent” for the most recent results, “popular” for the most popular results, and “mixed” for both popular and real-time results.
- include_entities: entities provide additional contextual information, including hashtags, user mentions, links, stock tickers (symbols), Twitter polls, and attached media. They are included in the results as an array when this parameter is set to True.
- tweet_mode: this parameter was added after the Tweet length was extended from 140 to 280 characters. When tweet_mode is set to “extended”, it returns the full, untruncated Tweet text; when set to “compat”, it returns Tweets truncated to 140 characters.
- encoding: this parameter can be helpful when the extracted Tweets are not in English; for example, setting the encoding to ‘utf-8-sig’ works well for Arabic Tweets. For more details about character encoding in Twitter, you can refer to this webpage.
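To make the q parameter more concrete, here is a small sketch of how a multi-keyword query can be assembled from a Python list. The keywords are placeholders, and the ‘-filter:retweets’ operator (a standard Twitter search operator) is optional; drop it if you want retweets included.
# Sketch: building a multi-keyword query string (placeholder keywords)
terms = ['#dust', '#storm', 'sandstorm']
query = ' OR '.join(terms) + ' -filter:retweets'
print(query)  # '#dust OR #storm OR sandstorm -filter:retweets'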
Twitter uses pagination to iterate through timelines, returning the requested data in a series of pages. Thus, a page or cursor parameter needs to be provided with each request to manage the pagination loop. I used the Cursor object in Tweepy to handle pagination easily.
Two methods are used with Cursor instances (illustrated in the short sketch after this list):
- items: takes the maximum number of items to iterate over across the returned pages
- pages: processes the results page by page and takes the maximum number of pages to iterate over
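Here is a minimal sketch of the two iteration styles, using a placeholder query; the actual search used in this post follows in the next step.
# Sketch: iterating Tweet by Tweet, stopping after at most 50 items
for tweet in tweepy.Cursor(api.search, q='#example', count=100).items(50):
    print(tweet.id)

# Sketch: iterating page by page, stopping after at most 2 pages
for page in tweepy.Cursor(api.search, q='#example', count=100).pages(2):
    print(len(page), 'tweets on this page')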
Now let’s look at the code and see how it works. There was a severe dust storm when I started writing this post, and I decided to see what people were sharing about it on Twitter. I will use some keywords related to the dust storm in Kuwait as the main parameter to extract Tweets.
# Setting parameters
limit = 10000
language = 'ar' # others could be 'en', 'fa', 'tr'
keywords = '#غبار_الكويت OR #الكويت OR #كويت'
startDate = "2022-05-20"
endDate = "2022-05-24"
# Passing the parameters into the Cursor constructor method
public_tweets = tweepy.Cursor(api.search,
                              q=keywords,
                              result_type='recent',
                              since=startDate,
                              until=endDate,
                              count=100,
                              include_entities=True,
                              lang=language,
                              tweet_mode="extended",
                              encoding='utf-8-sig').items(limit)
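A couple of hedged notes on this call: the standard search endpoint is rate limited, and tweepy.API accepts a wait_on_rate_limit argument that makes Tweepy pause automatically whenever the limit is reached. Also, on Tweepy 4.x the search method was renamed from api.search to api.search_tweets, so the Cursor call would need to reference that name instead.
# Optional: when creating the API object in Step 2, let Tweepy wait
# automatically whenever the rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True)

# On Tweepy 4.x, pass api.search_tweets to the Cursor instead of api.search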
Step 4: Saving Results into a CSV File
After pulling the dataset, an array is created for each attribute we want to save. Then, all arrays are arranged into a Pandas dataframe to create a Comma-Separated Values (CSV) file that is easy to use for further processing and analysis.
# Defining arrays to save the results for each attribute separately
tweet_id_list = []
tweet_text_list = []
tweet_location_list = []
tweet_geo_list = []
user_screen_name_list = []
tweet_created_list = []
tweet_contributors_list = []
tweet_entities_list =[]
tweet_retweet_count_list = []
tweet_source_list = []
tweet_username_list = []
tweet_followers_count_list = []
friends_count_List = []
user_url_list = []
user_desc_list = []
# Iterating through the results to extract the required attributes
for tweet in public_tweets:
    tweet_id_list.append(tweet.id)
    tweet_text_list.append(tweet.full_text)
    tweet_location_list.append(tweet.user.location)
    tweet_geo_list.append(tweet.geo)
    user_screen_name_list.append(tweet.user.screen_name)
    user_url_list.append(tweet.user.url)
    user_desc_list.append(tweet.user.description)
    tweet_source_list.append(tweet.source)
    tweet_created_list.append(tweet.created_at)
    tweet_contributors_list.append(tweet.id_str)
    tweet_entities_list.append(tweet.entities)
    tweet_retweet_count_list.append(tweet.retweet_count)
    tweet_username_list.append(tweet.user.name)
    tweet_followers_count_list.append(tweet.user.followers_count)
    friends_count_List.append(tweet.user.friends_count)
# Creating a Pandas dataframe to organize the data into a table
df = pd.DataFrame({
    'tweet_id': tweet_id_list,
    'tweet_text': tweet_text_list,
    'tweet_location': tweet_location_list,
    'tweet_geo': tweet_geo_list,
    'user_screen': user_screen_name_list,
    'url': user_url_list,
    'user_desc': user_desc_list,
    'tweet_source': tweet_source_list,
    'tweet_created': tweet_created_list,
    'tweet_contributors': tweet_contributors_list,
    'tweet_entities': tweet_entities_list,
    'tweet_retweet_count': tweet_retweet_count_list,
    'tweet_username': tweet_username_list,
    'tweet_followers_count': tweet_followers_count_list,
    'friends_count': friends_count_List})
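As a side note, the same table can be built more compactly by collecting one dictionary per Tweet and passing the list to Pandas. This is only an alternative sketch with a subset of the attributes; if you use it, it replaces the loop above, since the cursor can only be iterated once.
# Alternative sketch: one dictionary per Tweet, then a single DataFrame call
rows = []
for tweet in public_tweets:
    rows.append({
        'tweet_id': tweet.id,
        'tweet_text': tweet.full_text,
        'tweet_location': tweet.user.location,
        'tweet_created': tweet.created_at,
        'tweet_retweet_count': tweet.retweet_count,
        'tweet_username': tweet.user.name})
df = pd.DataFrame(rows)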
Finally, the dataframe is converted to a CSV file so it can be downloaded. If you are using Google Colab, I also include the code needed to push the file to Google Drive and save it there. You will need to provide your Google account credentials to permit saving the file to your Google Drive.
# Converting the dataframe to CSV file
df.to_csv('Kuwait_Dust_Storm_2022.csv', sep=',', index=False, encoding='utf-8-sig')
# Saving the file to Google drive
file_name = "Kuwait_Dust_Storm_2022.csv"
from googleapiclient.http import MediaFileUpload
from googleapiclient.discovery import build
from google.colab import auth
auth.authenticate_user()
drive_service = build('drive', 'v3')
def save_file_to_drive(name, path):
    file_metadata = {'name': name, 'mimeType': 'application/octet-stream'}
    media = MediaFileUpload(path, mimetype='application/octet-stream', resumable=True)
    created = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()
    return created
save_file_to_drive(file_name, file_name)
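If you only need the file in your Drive, a simpler alternative in Colab is to mount the drive and write the CSV to it directly. The folder path below is just an example; adjust it to your own Drive layout.
# Alternative: mount Google Drive in Colab and write the CSV directly to it
from google.colab import drive
drive.mount('/content/drive')
df.to_csv('/content/drive/MyDrive/Kuwait_Dust_Storm_2022.csv',
          sep=',', index=False, encoding='utf-8-sig')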
I hope this post was informative and helpful. You can find the full code in this Colab project.