Eight Techniques for Exploratory Data Analysis of Arabic Text

Before selecting and working with a dataset, it is important to get to know it and understand its content. An initial investigation of the dataset is crucial before using it in research or further analysis. It can help detect trends and patterns in the data, identify outliers, and find valuable relations among variables. This investigation is called Exploratory Data Analysis (EDA). EDA applies statistical methods and data visualization tools to support the exploration process. Its main goals are to deeply understand the content and structure of the data and to find out whether there are any problems in it.

Knowing Your Data

In this article, the data explored is mostly textual. I will present an overview of some EDA tools for the Arabic language using Python, such as the number of characters per token, the number of tokens per post, and the word cloud graph. Conducting an in-depth investigation supports understanding the content of the dataset from multiple dimensions. Several visualization tools are used to better understand the content and context of the data. This article covers the following steps:

  • Step 1: importing libraries
  • Step 2: reading the dataset
  • Step 3: basic filtering and cleaning 
  • Step 4: tokens frequencies
  • Step 5: stop words frequencies
  • Step 6: number of tokens per tweet
  • Step 7: number of characters per token
  • Step 8: word cloud graph

Dataset

For the purpose of this exercise, I extracted a set of Arabic tweets that contain the Arabic word “مجبوس / Machboos”. Machboos is a traditional Kuwaiti dish; you can check this video if you are curious to know more about it. While writing this article, I was thinking about what to cook for tomorrow’s lunch. I decided to prepare the ingredients for chicken Machboos, so I used the word to extract the dataset and explore how people use it on Twitter. The dataset can be downloaded from here.
Chicken Machboos

 

Please cite these references if you are using the code discussed in this article for publications and academic purposes:

    • Husain, F., & Uzuner, O. (2021). Exploratory Arabic Offensive Language Dataset Analysis. arXiv preprint arXiv:2101.11434.
    • Husain, F. (2021). Arabic Offensive Language Detection in Social Media (Doctoral dissertation, George Mason University).

You might also check the above-mentioned references to see how I used the same code discussed in this article to study offensive language detection across several datasets. Please feel free to contact me if you would like to get a copy of the references.

Examples of Basic EDA Techniques for Arabic Text in Python

Step 1: Importing Libraries:

				
import pandas as pd
import numpy as np
import csv
import string
import re
import itertools
import collections
import codecs
import requests
import nltk
from nltk import word_tokenize
nltk.download('punkt')
				
			

Step 2: Reading the Dataset:

First I uploaded the dataset to the Colab project, then I fetched the file as follows:

				
Dataset = pd.read_csv('/content/Machboos.csv', sep=',', encoding='utf-8-sig')
Dataset
				
			
Screenshot of the dataset

Since the exploration focuses only on the tweets, I will extract the tweet text and save it into an array to be used in the following steps.

				
tweets = Dataset.loc[:,"Text"]
tweets
				
			
Sample from tweets

The dataset contains 774 tweets before cleaning and filtering.
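If you want to reproduce this count, a quick check on the tweets series loaded above is enough; a minimal sketch:

# Quick sanity check on the raw tweet count (before cleaning and filtering)
print("Number of raw tweets:", len(tweets))  # reported above as 774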

Step 3: Basic Filtering and Cleaning:

The main goal of this step is to remove any unnecessary content that might affect the results of the analysis or that might violate users’ privacy. For example, it is recommended to remove usernames from the data to preserve their privacy.

In this script, I first replace username mentions with the string/keyword “@USER” to anonymize the data:

				
# Define an array for the filtered tweets list
filteredTweets = []

for tweet in tweets:
  tweet = str(tweet)
  # Replace the specific part of the string based on pattern
  replacedText = re.sub('@[a-zA-Z0-9_.-]*', '@USER', tweet)
  # uncomment the next line if you want to totally remove usermentions
  #replacedText = re.sub('@USER:', '', replacedText)
  # Print the original tweet
  print("Original Text:", tweet)
  # Print the replaced tweet
  print("\nReplaced Text:", replacedText)
  # Adding the replaced tweet to the filtered tweet list
  filteredTweets.append(replacedText)

				
			
Sample of the output

Duplicated tweets might affect the results, so I first remove the retweet keyword “RT” and then remove the duplicates.

				
# removing RT
# Define an array for the tweets without RT character list
filteredCleanedTweet = []

for tweet in filteredTweets:
  tweet = str(tweet)
  # Replace the specific part of the string based on pattern
  replacedText = re.sub('RT', '', tweet)
  # Print the original tweet
  print("Original Text:", tweet)
  # Print the replaced tweet
  print("\nReplaced Text:", replacedText)
  # Adding the replaced tweet to the filtered tweet list
  filteredCleanedTweet.append(replacedText)
				
			
Sample of the output

Now let’s check whether we still need to filter some of the content.

				
filteredCleanedTweet
				
			
Screenshot of sample tweets

The newline character “\n” needs to be removed as well, before removing duplicates.

				
# removing the newline character "\n"

filteredCleanedLineTweet = []

for tweet in filteredCleanedTweet:
  tweet = str(tweet)
  # Remove new line character
  replacedText = re.sub('\n', ' ', tweet)
  # Adding the replaced tweet to the filtered tweet list
  filteredCleanedLineTweet.append(replacedText)

filteredCleanedLineTweet
				
			
Sample of tweets

Once we are done with the basic filtering and cleaning, the list of cleaned tweets will be converted back into a Pandas dataframe. A dataframe makes it easier to work with the data and to apply multiple tools to explore its content.

				
# Convert the tweets to a Pandas dataframe
df = pd.DataFrame({'tweet':filteredCleanedLineTweet})
				
			

I will check the size of the dataframe first, then remove the duplicated tweets and check the size again to calculate the percentage of duplicated tweets in the dataset.

				
# Checking the size of the dataframe before removing duplicated tweets
print("Number of tweets before removing duplicates:", df.shape)

# Removing duplicates
df = df.drop_duplicates()

# Checking the size of the dataframe after removing duplicated tweets
print("Number of tweets after removing duplicates:", df.shape)

				
			
Screenshot of the output

Thus, about 26.7% of the dataset consists of duplicated tweets. Only the 567 unique tweets will be used in the following analysis.
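For reference, the duplicate percentage can be recomputed from the two counts printed above; a minimal sketch:

# Recompute the share of duplicated tweets from the counts reported above
n_before = 774  # tweets before drop_duplicates()
n_after = 567   # unique tweets after drop_duplicates()
duplicate_pct = (n_before - n_after) / n_before * 100
print(f"Duplicated tweets: {duplicate_pct:.1f}%")  # about 26.7%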

Now I can start with some simple statistical analysis. It includes finding the frequencies of words, the frequencies of stop words, statistical measurements of text length based on the number of tokens, and statistical measurements of token length based on the number of characters, all related back to the dataset content. To extract the most frequently used words accurately, I removed a list of stop words from the text. Stop words can still help in defining the context of the posts, so I also run a simple frequency analysis to find the top stop words; a stop word that appears only in a particular context or dataset might be better treated in the analysis as a regular word rather than as a stop word. I also investigate the complexity of the text in the dataset to check whether there is any pattern or relationship between text complexity and the context of the dataset. I used two measures to pursue this goal: the number of characters per token and the number of tokens per post. These analyses are usually more informative when comparing multiple datasets or texts from different categories/labels.

Step 4: Tokens Frequencies:

First, the text needs to be tokenized to break all tweets down into a single list of tokens.

				
# text tokenizer function
def my_tokenizer(text):
    text = str(text)
    return text.split() if text is not None else []

# transform list of documents into a single list of tokens
tokens = df.tweet.map(my_tokenizer).sum()
print(tokens)
				
			
Sample output

 

Now, let’s check out the 20 most commonly occurring words.

				
from collections import Counter

counter = Counter(tokens)
counter.most_common(20)

				
			
Output

A very quick check of the results shows that most tweets contain username mentions, which means that most tweets are conversations among users. Punctuation affects the results: the token “@USER” is treated as if it were different from “@USER:”. It would be better to remove punctuation to get cleaner results. Among the most frequently used words are stop words, such as و/and and ما/no. Stop words are extremely common words in any language that are of little value to the analysis, and it is recommended to filter them out of the text before developing any Natural Language Processing (NLP) system. After addressing all these points, I will check the top 20 tokens again.

To remove punctuation, a customized set of Arabic punctuation marks is defined, and the English punctuation list from Python’s string module is used as well, so that the tweets are filtered from all punctuation marks.

				
# This function removes all punctuation marks from the tokens

def removingPunctuation(text):
    # customized Arabic punctuation marks
    arabicPunctuations = [".","`","؛","<",">","(",")","*","&","^","%","]","[",",","ـ","،","/",":","؟",".","'","{","}","~","|","!","”","…","“","–"]
    # English punctuation marks from Python's string module
    englishPunctuations = list(string.punctuation)
    # combined list of all punctuation marks
    punctuationsList = arabicPunctuations + englishPunctuations
    cleanTweet = ''
    for i in text:
      if i not in punctuationsList:
        cleanTweet = cleanTweet + i
    return cleanTweet

tokenFiltered = []
for i in tokens:
  token = removingPunctuation(i)
  tokenFiltered.append(token)

				
			
Sample from the output

Two stop word lists were combined to create the exclusion list. The lists available at Nuha Albadi’s github repository and Mohamed Taher Alrefaie’s github repository are used to further filter and clean the tokens from unnecessary words.

				
# This function cleans the tokens from all stop words

# Stop words can be downloaded from https://github.com/nuhaalbadi/Arabic_hatespeech for list 1 and
# from https://github.com/mohataher/arabic-stop-words for list 2
path_1 = "/content/stop_words.csv" # List 1
path_2 = "/content/stop-words-list.txt" # List 2
with codecs.open(path_1, "r", encoding="utf-8", errors="ignore") as myfile:
    stop_words_1 = [word.strip() for word in myfile.readlines()]
with codecs.open(path_2, "r", encoding="utf-8", errors="ignore") as myfile:
    stop_words_2 = [word.strip() for word in myfile.readlines()]
stop_words = stop_words_1 + stop_words_2

def removingStopwords(intweet):
    cleanTweet = ''
    temptweet = word_tokenize(intweet)
    for i in temptweet:
        if i not in stop_words:
            cleanTweet = cleanTweet + ' ' + i
    return cleanTweet

tokenFilteredCleaned = []
for i in tokenFiltered:
  token = removingStopwords(i)
  tokenFilteredCleaned.append(token)

				
			

You might notice some extra spaces. So, I will remove the spaces, then check the token frequencies again and draw a frequency bar chart for the top 20 tokens.

				
# This function removes spaces from the tokens
def removingSpace(text):
    cleanTweet = ''
    temptweet = word_tokenize(text)
    for i in temptweet:
        #remove more than one space
        i = re.sub(r"\s+"," ", i)
        cleanTweet = cleanTweet + '' + i 
    return cleanTweet

tokenFilteredCleanedExtra = []
for i in tokenFilteredCleaned:
  token = removingSpace(i)
  tokenFilteredCleanedExtra.append(token)

				
			
				
# checking the top tokens again after filtering and cleaning
counter = Counter(tokenFilteredCleanedExtra)
counter.most_common(20)
				
			
Output
				
# This function creates a bar chart of the most frequent 20 tokens

import plotly.offline as po
from plotly.offline import iplot
import plotly.graph_objs as go

freq_df = pd.DataFrame.from_records(counter.most_common(20),
                                    columns=['token', 'count'])


def topTokensBarchart(df):
  data = [go.Bar(x = df['token'], y = df['count'], name = 'Most Common Tokens in the Dataset')]
  layout = go.Layout(title = 'Most Common Tokens in the Dataset', xaxis_title="Tokens",
      yaxis_title="Counts")
  fig = go.Figure(data = data, layout = layout)
  # po.plot() also saves the chart as an HTML file in the working directory
  po.plot(fig)
  fig.show()


topTokensBarchart(freq_df)
				
			
Top 20 Most Frequent Tokens

The results also show some unresolved tokens, which need customized preprocessing and cleaning. I will write a separate post to address these issues related to preprocessing Arabic text from user-generated content. For the purpose of this data exploration exercise, I will continue the analysis using this token list.
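As an illustration only, and not part of the pipeline used in this article, a few normalization steps that are commonly applied to Arabic user-generated text might look like the following sketch; the exact rules are assumptions on my part and should be validated against your own data:

# Illustrative sketch of common Arabic normalization steps (not used in this article's pipeline)
def normalize_arabic(token):
    token = re.sub(r'[\u064B-\u0652]', '', token)  # strip diacritics (tashkeel)
    token = re.sub(r'\u0640', '', token)           # strip tatweel (ـ)
    token = re.sub(r'[إأآ]', 'ا', token)            # unify alef variants
    token = re.sub(r'ى', 'ي', token)               # unify alef maqsura and ya
    return token

normalizedTokens = [normalize_arabic(t) for t in tokenFilteredCleanedExtra]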

Step 5: Stop Words Frequencies:

The following bar chart plots the most commonly used stop words. In this step, I used the same two stop word lists as in the previous step to maintain consistency across the analysis steps.

				
# This function creates a bar chart for the most frequently used stop words

def stopwordsBarchart(text):
    # Stop words can be downloaded from https://github.com/nuhaalbadi/Arabic_hatespeech for list 1 and
    # from https://github.com/mohataher/arabic-stop-words for list 2
    path_1 = "/content/stop_words.csv" # List 1
    path_2 = "/content/stop-words-list.txt" # List 2
    with codecs.open(path_1, "r", encoding="utf-8", errors="ignore") as myfile:
        stop_words = myfile.readlines()
    stop_words_1 = [word.strip() for word in stop_words]
    with codecs.open(path_2, "r", encoding="utf-8", errors="ignore") as myfile:
        stop_words = myfile.readlines()
    stop_words_2 = [word.strip() for word in stop_words]
    stop_words = stop_words_1 + stop_words_2

    # build a flat list of all tokens in the tweets
    new = text.str.split()
    new = new.values.tolist()
    corpus = [word for i in new for word in i]

    # count how often each stop word appears
    from collections import defaultdict
    dic = defaultdict(int)
    for word in corpus:
        if word in stop_words:
            dic[word] += 1

    # keep the 10 most frequent stop words
    top = sorted(dic.items(), key=lambda x: x[1], reverse=True)[:10]
    x, y = zip(*top)

    data = [go.Bar(x = x, y = y, name = 'Most Common Stop Words in the Dataset')]
    layout = go.Layout(title = 'Most Common Stop Words in the Dataset', xaxis_title="Stop Words",
     yaxis_title="Counts")
    fig = go.Figure(data = data, layout = layout)
    po.plot(fig)
    fig.show()


stopwordsBarchart(df['tweet'])
				
			
Bar chart of the top 10 stop words

Step 6: Number of Tokens per Tweet:

I will calculate the number of tokens per tweet and the number of characters per token, as these can give some indication of the complexity of the text in the tweets. The following histogram plots the number of tokens per tweet.

				
# This function draws a histogram of tweet lengths based on the number of tokens per tweet

def tokenPerTweetHistogram(text):
    text.str.split().\
        map(lambda x: len(x)).\
        hist()

tokenPerTweetHistogram(df['tweet'])
				
			
Number of Token Per Tweet Histogram
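A numeric summary of the token counts can complement the histogram; here is a minimal sketch, assuming the same df dataframe built earlier:

# Summary statistics (count, mean, quartiles, max) for the number of tokens per tweet
tokenCounts = df['tweet'].str.split().map(len)
print(tokenCounts.describe())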

As can be seen from the histogram (and the quick summary above), most tweets contain fewer than 10 tokens. The next histogram shows the number of characters per token.

Step 7: Number of Characters per Token:

				
# This function creates a histogram of the average number of characters per token in each tweet

def characterPerTokenHistogram(text):
    text.str.split().\
        apply(lambda x : [len(i) for i in x]). \
        map(lambda x: np.mean(x)).\
        hist()

characterPerTokenHistogram(df['tweet'])
				
			
Histogram of the Number of Character per Token

Step 8: Word Cloud Graph:

A word cloud graph provides a visual representation of the text, which can be very helpful for highlighting keywords and the most important terms in the textual data.

				
# This block generates the word cloud

!pip install wordcloud-fa==0.1.4
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
from wordcloud_fa import WordCloudFa

wordcloud = WordCloudFa(persian_normalize=False)
# handcrafted terms/characters to further filter the text, selected based on a manual inspection of the data
stop = ['@USER:','@USER','.', ':', '?', '؟','...', '..', '"','!!','']
# words in which "ال" is part of the word itself rather than a prefix
no_prefix = ['الله', 'اللهم']
tokens_clean = []
for word in tokenFilteredCleanedExtra:
  if word not in stop:
    if word.startswith('ال') and word not in no_prefix:
      word = word[2:] # remove the definite-article prefix "ال"
    tokens_clean.append(word)

# printing the list of tokens
print(tokens_clean)

# generating the word cloud graph and saving it as an image file
data = str(tokens_clean)
wc = wordcloud.generate(data)
image = wc.to_image()
image.save('wordcloud.png')
image.show()
				
			
Word Cloud Graph

In addition to the techniques discussed in this article, there are many other possible ways to analyze the textual data, for example investigating the use of hashtags or emojis in tweets or, if possible, the frequencies of different part-of-speech tags (nouns, verbs, etc.).
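For instance, a minimal sketch of a hashtag frequency count over the cleaned tweets, assuming the df dataframe built earlier, could look like this:

# Illustrative sketch: counting hashtag frequencies in the cleaned tweets
from collections import Counter

hashtags = []
for tweet in df['tweet']:
    hashtags.extend(re.findall(r'#\S+', str(tweet)))

print(Counter(hashtags).most_common(10))  # top 10 hashtags, if any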

In this article, I have presented eight basic EDA steps that support Arabic text. I hope you find them valuable and informative. Please let me know if you have any comments. You can find the full code in this Google Colab project or on my GitHub.
