Extracting Quranic Lexicons

Automatically extracting lexicons from raw text enables the creation of valuable linguistic resources that can serve multiple tasks. Depending on the extraction method applied, lexicons provide insights into the context of a text; they can serve as a source of contextual knowledge when developing systems and applications (e.g. question answering, text classification, translation, etc.); they can be used to find collocations; and they can help in creating taxonomies.

We will explore five statistical lexicon extraction methods in this article:

  1. Simple frequency count.
  2. Pointwise Mutual Information (PMI).
  3. T-test.
  4. Chi-square test.
  5. Likelihood ratio.

Parts of this article are based on the content of the following paper:

Husain, F. & Uzuner, O. (2021). SalamREPO: an Arabic Offensive Language Knowledge Repository. Future Technologies and Innovations (FTI) Proceedings: 4th International Conference on Computer Applications & Information Security (ICCAIS’2021). March 19-21, Tunisia. https://fti-tn.net/iccais-2021-list-of-papers

Importing Dataset and Python Libraries

We download the full text of the Holy Quran from the Tanzil website. We select the "simple clean" text format, without Tatweel and without signs of pause, sajda, or rub-el-hizb, to make the analysis simpler. In addition, we choose the plain-text (.txt) file format.

Once you are done downloading the file to your device, you can upload it to the Colab project directory by selecting the upward-arrow icon on the left to upload it to session storage. Then, choose the Quran text file from your device. The file we are using is named quran-simple-clean.txt.

We start coding as usual by importing the Python libraries:

# importing libraries
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('punkt')      # models required by nltk.word_tokenize
nltk.download('stopwords')  # stop word lists, including the Arabic list
from nltk.corpus import stopwords
import spacy
import string
import os

To read the Quran text file and print its content on the screen, use the following script:

# reading the txt file
a_file = open("/content/quran-simple-clean.txt")

# print file content
lines = a_file.readlines()
for line in lines:
    print(line)
A sample from the output

If you scroll down the output screen, you will notice that the file contains some English comments at the end that do not belong to the Quran content. They include term-of-use conditions and version information, which are not relevant to our analysis. Thus, we will delete these lines using the following script:

# delete the last lines in the txt file that contain comments,
# then save the Quran text in another file called quran.txt
with open("temp.txt", "w") as output:
    # iterate over all lines from the file
    for line in lines:
        # if a line starts with the symbol '#', don't write it to the temp file
        if not line.strip("\n").startswith('#'):
            output.write(line)

# rename the temp file to quran.txt
os.replace('temp.txt', 'quran.txt')

Now, let’s print the content of the new quran.txt file to check that the comments are gone and that the file contains only the Quranic text.

b_file = open("/content/quran.txt")

b_lines = b_file.readlines()
for line in b_lines:
    print(line)
A sample of the output

You can scroll down the output screen and check that it ends with the chapter/Sura “Surat Al Naas”. Thus, all text in the file is from the Quran text.

I prefer to convert the txt file to a dataframe before starting the analysis, as that makes it easy to add a column for each preprocessing step performed on the text, which supports comparison among the different versions of the text produced by the analysis procedures.

# convert txt file to dataframe
data = pd.read_csv('quran.txt', sep=',', header=None)
print(data)
A sample of the output

To process the text easily, we extract the Quranic text from the dataframe into an array as follows:

# selecting the input array from the dataframe to process
raw_data = data.iloc[:, 0]
print(raw_data)
A sample of output

We use the NLTK library to tokenize the text. We need to split the text based on the words/string, which can support the extraction of lexicons. We save the tokenized text into a new column “tokens” in the same dataframe table that we created before. The full dataframe is shown below under the script to check differences between the raw text and the tokenized text.  

# tokenize the text and save the output to a new column "tokens"
data['tokens'] = raw_data.apply(lambda x: nltk.word_tokenize(x))

# print the entire dataframe including the tokenized text
print(data)
A sample of output

All tokens are then merged into one list to prepare the text for lexicon extraction:

# turn all tokens into one single list
list_tokens = [item for items in data['tokens'] for item in items]

# check the tokens list
print(list_tokens)
A sample of output

“He created the heavens and earth in truth and formed you and perfected your forms; and to Him is the [final] destination”

Sūrat l-taghābun [Quran 64:3]

Lexicon Extraction Methods

1. Simple Frequency Count of Adjacent Words

This method is very simple: it ranks word pairs by their frequency in the text. The main issue with its output is that it is overly sensitive to very frequent pairs, which often consist of pronouns, articles, and prepositions that add no value to the context of the text.
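As a minimal sketch of the idea (using a toy English sentence rather than the Quran text), counting adjacent word pairs is just a tally over consecutive tokens:

```python
from collections import Counter

# toy tokens; the article itself uses the tokenized Quran text instead
tokens = "in the name of the lord in the beginning".split()

# pair each token with its right neighbor and count the pairs
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts[('in', 'the')])  # 2
```

NLTK's collocation finders, used below, build exactly this kind of frequency distribution internally.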

In all examples below, we will apply the lexicon extraction methods for two lengths: bigrams for two-word lexicons and trigrams for three-word lexicons. Initialize NLTK's bigram and trigram finders as follows:

# initialize bigram and trigram measures and extract them from the text
bigrams = nltk.collocations.BigramAssocMeasures()
trigrams = nltk.collocations.TrigramAssocMeasures()

bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(list_tokens)
trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(list_tokens)

# create frequency table
bigram_freq = bigramFinder.ngram_fd.items()
bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)

# print table head
print(bigramFreqTable.head())
A sample of output
# print the top 15 most frequent bigrams
print(bigramFreqTable[:15])
A sample of output

You might notice that most of the top-frequency bigrams include particles and pronouns, which are called stop words in Natural Language Processing (NLP). For more detail about stop words, please refer to our previous blog.

The following method removes n-grams that contain stop words.

# function to filter out n-grams that contain stop words
ar_stopwords = set(stopwords.words('arabic'))

def rightTypes(ngram):
    # return False if any word in the n-gram is an Arabic stop word
    for word in ngram:
        if word in ar_stopwords:
            return False
    return True

We apply the filtering method to the frequency-based bigrams and print the top 50 filtered bigrams.

# filter bigrams
filtered_bi = bigramFreqTable[bigramFreqTable.bigram.map(lambda x: rightTypes(x))]

# print top 50 filtered bigrams
print(filtered_bi[:50])
A sample of output
# extract trigrams and save the results into a table
trigram_freq = trigramFinder.ngram_fd.items()
trigramFreqTable = pd.DataFrame(list(trigram_freq), columns=['trigram','freq']).sort_values(by='freq', ascending=False)

# check the head of the table
print(trigramFreqTable.head())
A sample of output
# print the top 50 frequency-based trigrams
print(trigramFreqTable[:50])
A sample of output
# filter the content
filtered_tri = trigramFreqTable[trigramFreqTable.trigram.map(lambda x: rightTypes(x))]

# print the top 50 filtered frequency-based trigrams
print(filtered_tri[:50])
A sample of output

At the end of the article, we will save all output to a CSV file that can be downloaded for further analysis. Thus, in the following script, we save all outputs into arrays:

# saving the resulting bigrams and trigrams into arrays to organize the overall output later
freq_bi = filtered_bi[:50].bigram.values
freq_tri = filtered_tri[:50].trigram.values

# saving the resulting frequencies into arrays to organize the overall output later
freq_bi_no = filtered_bi[:50].freq.values
freq_tri_no = filtered_tri[:50].freq.values

“Indeed, in the creation of the heavens and earth, and the alternation of the night and the day, and the [great] ships which sail through the sea with that which benefits people, and what Allah has sent down from the heavens of rain, giving life thereby to the earth after its lifelessness and dispersing therein every [kind of] moving creature, and [His] directing of the winds and the clouds controlled between the heaven and the earth are signs for a people who use reason.”

Sūrat l-baqarah [Quran 2:164]

2. Pointwise Mutual Information (PMI)

The main concept of Pointwise Mutual Information (PMI) is borrowed from information theory. PMI measures how much more often two (or three) words co-occur than they would if they were independent.
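As a hedged illustration of the formula (toy counts, our own helper, not NLTK's implementation), PMI for a bigram (w1, w2) is the log ratio of the observed co-occurrence probability to the probability expected under independence:

```python
import math

def pmi(pair_count, w1_count, w2_count, n):
    """PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated from counts over a corpus of n bigram positions."""
    p_pair = pair_count / n
    p_w1 = w1_count / n
    p_w2 = w2_count / n
    return math.log2(p_pair / (p_w1 * p_w2))

# toy counts: the pair occurs 10 times among 10,000 positions,
# while each word occurs 100 times on its own
print(round(pmi(10, 100, 100, 10_000), 3))  # 3.322
```

This also shows PMI's well-known bias toward rare events: a pair of words that each occur exactly once, and only together, scores the maximum value log2(n).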

To extract bigrams based on the PMI method, we use the NLTK library as it provides modules for most lexicon extraction methods. The following script is used:

# extract bigrams based on PMI and create a table
bigramPMITable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.pmi)), columns=['bigram','PMI']).sort_values(by='PMI', ascending=False)
print(bigramPMITable)
A sample of the output

Looking at the output above, you might notice multiple stop words, such as من/from, في/in, كانوا/were, etc. Thus, applying the filtering method, as the following script shows, is very important.

# filter the PMI bigrams and print the top 50
filtered_biPMI = bigramPMITable[bigramPMITable.bigram.map(lambda x: rightTypes(x))]
print(filtered_biPMI[:50])
A sample of output

We also extract trigrams as we did with the previously discussed method. The following script extracts the trigrams, organizes them into a dataframe table, and prints the top trigrams:

# extract the trigrams, save them into a dataframe table, and print the top 50
trigramPMITable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.pmi)), columns=['trigram','PMI']).sort_values(by='PMI', ascending=False)
print(trigramPMITable[:50])
A sample of output

To remove stop words from the output, we also apply the filtering method defined before as the following:

# filter the output to remove stop words and print the top 50
filtered_triPMI = trigramPMITable[trigramPMITable.trigram.map(lambda x: rightTypes(x))]
print(filtered_triPMI[:50])
A sample of output

As we did before, all results are saved to arrays as the following: 

# saving the resulting bigrams and trigrams into arrays to organize the overall output later
pmi_bi = filtered_biPMI[:50].bigram.values
pmi_tri = filtered_triPMI[:50].trigram.values

# saving the resulting PMI scores into arrays to organize the overall output later
pmi_bi_no = filtered_biPMI[:50].PMI.values
pmi_tri_no = filtered_triPMI[:50].PMI.values

3. T-test

The t-test takes a sample from the data and assumes that the sample is drawn from a normal distribution with mean 𝜇. In the following script, we apply the student_t method from the NLTK library to extract bigrams, then create a dataframe table and print the top results:
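For illustration only (a hand-rolled sketch with toy counts, not NLTK's student_t), the collocation t-score compares the observed bigram probability with the one expected under independence, using the common approximation that the sample variance s² equals the sample mean x̄:

```python
import math

def t_score(pair_count, w1_count, w2_count, n):
    """t = (x̄ - μ) / sqrt(s² / N), where x̄ is the observed bigram
    probability, μ is the product of the unigram probabilities, and
    the variance is approximated by s² ≈ x̄."""
    observed = pair_count / n
    expected = (w1_count / n) * (w2_count / n)
    return (observed - expected) / math.sqrt(observed / n)

# toy counts: a t value at or above about 1.96 would reject
# independence at the 95% confidence level
print(round(t_score(8, 15828, 4675, 14_307_668), 2))  # 1.0
```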

# bigram extraction and table creation
bigramTtable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.student_t)), columns=['bigram','t']).sort_values(by='t', ascending=False)

We apply the filtering method defined before to remove stop words from the output, and print the top 50 bigram results to check the output:

# filtering the bigrams
filteredT_bi = bigramTtable[bigramTtable.bigram.map(lambda x: rightTypes(x))]

# print the top 50 filtered bigrams
print(filteredT_bi[:50])

As we did earlier, trigrams are extracted for the t-test as well:

# extracting trigrams
trigramTtable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.student_t)), columns=['trigram','t']).sort_values(by='t', ascending=False)

We filter the trigram output list to remove any stop words:

# filtering the trigrams list
filteredT_tri = trigramTtable[trigramTtable.trigram.map(lambda x: rightTypes(x))]

To add the results to the overall result CSV file at the end, we create arrays for the results as the following:

# saving the resulting bigrams and trigrams into arrays to organize the overall output later
t_bi = filteredT_bi[:50].bigram.values
t_tri = filteredT_tri[:50].trigram.values

# saving the resulting t-scores into arrays to organize the overall output later
t_bi_no = filteredT_bi[:50].t.values
t_tri_no = filteredT_tri[:50].t.values

4. Chi-Square

The chi-square test (χ²) does not assume that the sample is normally distributed. It compares observed frequencies to expected frequencies: the larger the difference, the greater the confidence that the words do not co-occur merely by chance.
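As a sketch (toy 2×2 counts, our own helper rather than NLTK's chi_sq), the statistic can be computed directly from a contingency table of the bigram's counts:

```python
def chi_square(o11, o12, o21, o22):
    """Chi-square for a 2x2 contingency table of a bigram (w1, w2):
    o11 = count(w1 w2), o12 = count(w1 followed by a word other than w2),
    o21 = count(a word other than w1 followed by w2), o22 = the rest."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# toy counts: the pair co-occurs far more often than chance predicts,
# so the statistic is large
print(round(chi_square(20, 30, 50, 9900)))  # 1117
```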

We apply the chi_sq from the NLTK library to find the bigrams and trigrams based on the chi-square test as the following script shows:

# extract bigrams and organize them into a table
bigramChiTable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.chi_sq)), columns=['bigram','chi_sq']).sort_values(by='chi_sq', ascending=False)
print(bigramChiTable)
A sample of the output

As earlier, the output list should be filtered:

# filter the output to remove stop words
filtered_biChi = bigramChiTable[bigramChiTable.bigram.map(lambda x: rightTypes(x))]
print(filtered_biChi[:50])
A sample of the output

Now, we will extract the trigrams, save them to a dataframe table, and filter them to remove stop words as follows:

# extract trigrams and save the result to a dataframe table
trigramChiTable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.chi_sq)), columns=['trigram','chi_sq']).sort_values(by='chi_sq', ascending=False)
# filter the trigram output
filtered_triChi = trigramChiTable[trigramChiTable.trigram.map(lambda x: rightTypes(x))]

To save the overall results in one result CSV file, we create arrays for the output as the following:

# saving output to arrays to be used at the end of the project
chi_bi = filtered_biChi[:50].bigram.values
chi_tri = filtered_triChi[:50].trigram.values

chi_bi_no = filtered_biChi[:50].chi_sq.values
chi_tri_no = filtered_triChi[:50].chi_sq.values

5. Likelihood Ratio

The likelihood ratio is more appropriate for sparse data than the previous approaches. It is a number that tells us how much more likely it is that the words in a bigram/trigram depend on each other than that they occur independently.
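As a hedged sketch of Dunning's formulation (toy counts; the helper names are ours, not NLTK's internals), the score compares the likelihood of the observed counts under a dependence hypothesis against an independence hypothesis:

```python
import math

def log_l(k, n, x):
    # binomial log-likelihood of k successes in n trials with probability x
    # (the constant binomial coefficient cancels out and is omitted)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def likelihood_ratio(c12, c1, c2, n):
    """Dunning's log-likelihood ratio for a bigram (w1, w2):
    c12 = count(w1 w2), c1 = count(w1), c2 = count(w2), n = corpus size."""
    p = c2 / n                  # independence: P(w2) is the same everywhere
    p1 = c12 / c1               # dependence: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)  # dependence: P(w2 | not w1)
    return 2 * (log_l(c12, c1, p1) + log_l(c2 - c12, n - c1, p2)
                - log_l(c12, c1, p) - log_l(c2 - c12, n - c1, p))

# a pair occurring exactly as often as independence predicts scores 0
print(likelihood_ratio(2, 100, 200, 10_000))  # 0.0
```

Because the score is built from log-likelihoods rather than a normality assumption, it stays well-behaved even for the rare n-grams that dominate a corpus like the Quran.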

We apply the NLTK function likelihood_ratio to extract bigrams and trigrams based on the likelihood ratio. The following script shows how it is applied:

# extract bigrams to a dataframe table
bigramLikTable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.likelihood_ratio)), columns=['bigram','likelihood_ratio']).sort_values(by='likelihood_ratio', ascending=False)
print(bigramLikTable)
A sample of the output

We then apply the filtering function defined before to remove stop words:

# filter the bigram list
filteredLik_bi = bigramLikTable[bigramLikTable.bigram.map(lambda x: rightTypes(x))]
print(filteredLik_bi[:50])
A sample of the output

Similar to the previous methods, we also extract trigrams and filter them as follows:

# extract trigrams
trigramLikTable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.likelihood_ratio)), columns=['trigram','likelihood_ratio']).sort_values(by='likelihood_ratio', ascending=False)
print(trigramLikTable)
A sample of the output
# filter the trigram list
filteredLik_tri = trigramLikTable[trigramLikTable.trigram.map(lambda x: rightTypes(x))]

Arrays are created to save the results and to create a large CSV file that contains all methods’ outputs.

# creating arrays to save the results
lik_bi = filteredLik_bi[:50].bigram.values
lik_tri = filteredLik_tri[:50].trigram.values

lik_bi_no = filteredLik_bi[:50].likelihood_ratio.values
lik_tri_no = filteredLik_tri[:50].likelihood_ratio.values


Creating Comparison Tables

The Bigrams Table

Having the results from all methods next to each other in one large table supports better comparison to find patterns among them. Thus, we aggregate all results to one large dataframe table and convert it to a CSV file to be downloaded and saved for further analysis as the following:

# aggregate results from all bigram extraction methods into one table
bigramsCompare = pd.DataFrame([freq_bi, freq_bi_no, pmi_bi, pmi_bi_no, t_bi, t_bi_no, chi_bi, chi_bi_no, lik_bi, lik_bi_no]).T
bigramsCompare.columns = ['Frequency', 'Freq.', 'PMI', 'PMI Score', 'T-test', 't', 'Chi-Sq Test', 'Chi-Sq', 'Likelihood Ratio Test', 'LR']

# saving the resulting table to a csv file
bigramsCompare.to_csv('bigramsCompare.csv', sep=',', index=False, encoding='utf-8-sig')
The Trigram Table

As we just did for the bigram results, we also follow the same steps for the trigram outputs.

# aggregate results from all trigram extraction methods into one table
trigramsCompare = pd.DataFrame([freq_tri, freq_tri_no, pmi_tri, pmi_tri_no, t_tri, t_tri_no, chi_tri, chi_tri_no, lik_tri, lik_tri_no]).T
trigramsCompare.columns = ['Frequency', 'Freq.', 'PMI', 'PMI Score', 'T-test', 't', 'Chi-Sq Test', 'Chi-Sq', 'Likelihood Ratio Test', 'LR']

# saving the resulting table to a csv file
trigramsCompare.to_csv('trigramsCompare.csv', sep=',', index=False, encoding='utf-8-sig')


In this article, we reviewed several lexicon extraction methods. To increase the accuracy of the output, we recommend adding filtering techniques based on Part-of-Speech tagging tools to enforce specific patterns in the extracted lexicons. The concept of "collocation" is also very close to our topic; we suggest this draft chapter from the Stanford NLP group to learn about the available tools and methods for extracting terms.

The entire code used in this exercise is available in this GitHub repository and in this Colab project. We hope you find this article useful and informative. Please let us know if you have any questions or comments by commenting on this article or by email.


For attribution in academic contexts, please cite this work as:

Husain F., “Extracting Quranic Lexicons”, The Information Science Lab, 2 December 2022.


Note: All photos presented in this article are captured by the author at the Arabic Islamic Science Museum in Sheikh Abdullah Al Salem Cultural Centre in Kuwait.
