Natural Language Processing (NLP) supports computers to understand natural languages and interact with human or other machines in a similar way to human beings. It is a subfield of linguistics, computer science, and artificial intelligence. Applications of NLP enable computers to read text, hear speech, understand it, analyze it, generate text, speak, and identify important parts of texts based on defined criteria. Twitter, WhatsApp, Instagram, Google search engine, Facebook, Amazon Alexa, email filters, Siri, automated translators, spell checker are all examples of NLP tools and applications among others. According to the latest NLP market forecast, worldwide NLP market revenue is predicted to increase from approximately three billion U.S. dollars in 2017 to over 43 billion in 2025. The bar chart below explains the rapid increase in NLP revenue forecast per year from 2017 until 2025 in U.S. dollar.
Considering this great importance and impact of NLP on our daily life, it is very important to cover low-resource languages when developing NLP methods and tools, such as the Arabic language. Arabic has very limited NLP applications that are still not as efficient as those that are developed to process the English language. This article focuses in introducing the Arabic language for NLP applications and discussing the challenges of processing Arabic text.
It is very important to consider the characteristics and properties of the Arabic language when designing a solution for Arabic text as that can directly impact the procedures taken. The This blog highlights some essential characteristics of Arabic, including dialectal Arabic. It includes examples from Arabic text to illustrate the challenges associated with the language.Arabic Language Background
Approximately 300 million people worldwide use Arabic as their primary language. Arabic is the official language for the majority of countries in the Middle East region. It is also the original language of the Quran and the Hadith; thus, it is widely studied by Muslims around the world.
Arabic is part of the Semitic language family. Scripts in Arabic are written in the opposite direction to English scripts; thus, Arabic text is read and written from right to left. The Arabic alphabet consists of 28 letters. It has only two vowels, “أ/alif” and “ي/yaa”. Arabic has two genders for nouns; feminine and masculine. The base form is the same as the masculine form. For example, the word “Qetta/قطة” refers to female cat and the word “Qett/قط” refers to male cat. In addition to the gender distinction in nouns, Arabic has singular, dual, and plural forms of nouns, verbs, pronouns, and adjectives. For instance, if we want to use the word cat to refer to two female cats in Arabic, the word “Qettatan/قطتان” is used, for two male cats, the word “Qettan/قطان” is used, and for plural cats, the word “Qettat/قطط” is used.
Arabic words are justified by using elongation or Tatweel, also called Kasheeda. The following picture shows an example of the use of elongation in Arabic words.There are different forms of Arabic: Classical Arabic Language (CAL) is the oldest and is often used in the Islamic manuscripts (e.g., the holy Quran); Modern Standard Arabic (MSA) is the official language for Arabic countries and is in use in schools, media, books, etc.; and Arabic dialects are the native language form of daily communication. The Arabic dialects can differ in form, primarily based on geography and social classes. Moreover, Arabic dialects are often used in user-generated content such as Twitter, Facebook, and Instagram. Nizar Habash in his book “Introduction to Arabic Natural Language Processing” divides the Arabic dialects into the following seven categories:
1. Egyptian Arabic covers Egypt and Sudan.
2. Levantine Arabic covers Lebanon, Syria, Jordan, Palestine, and Israel.
3. Gulf Arabic covers Kuwait, United Arab Emirates, Bahrain, Qatar, Saudi Arabia (wide range of sub-dialects), and Oman (sometimes included).
4. North African Arabic covers Morocco, Algeria, Tunisia, Mauritania, and Libya (sometimes included).
5. Iraqi Arabic is a mixture of both Levantine and Gulf.
6. Yemenite Arabic is often considered its own dialect.
7. Maltese Arabic, which is not always considered one of the Arabic dialects.
Challenges of Processing the Arabic Language
Multiple Letter Shapes
The letters in Arabic can be written in multiple shapes depending on the letter’s location within the word. For example, the letter “ ف/faa’” can take the shapes “ـفـ/ فـ / ف” based on whether it is located at the beginning, in the middle, or at the end of the word. The picture below shows an example.
Ambiguity
Multiple words can have the same spelling but have different pronunciations and meanings depending on diacritical marks and punctuation. Diacritics are often called Tashkīl or Harakāt in Arabic, and they are used as a phonetic guide. Most available Arabic text is written without these marks, which creates ambiguity. For example, the Arabic word ” شَعْرٌ”, ” شْعرٌ” means hair, ” شَعَرَ ” means he felt, and ” شِعْرٌ” means poetry. Below picture shows an example.
Ambiguity in Arabic can also be contextual. For example, the word “قتل /killing” in Arabic can be used in violent contexts as well as non-violent contexts, as it is in this statement “تستطيع قتل الازھار ولكن لا تستطيع أن تمنع قدوم الربيع”, which means “You may kill the flowers but cannot prevent the arrival of spring”.
Diversity of Arabic Dialect
The Arabic world’s official language is the MSA, but the spoken Arabic is the dialectal Arabic. There are multiple Arabic dialects that vary among different regions and even within the same country. Figure 5 shows an example for the variation among multiple Arabic dialects for the Arabic word cat, which is Qitah/ قطة in MSA.
Multiple dialects may share some words with the same spellings and pronunciations, while their meanings may differ. For instance, the word “/ناصح Nasih” means “overweight” in Levantine, “smart” in Egyptian, and “advisor” in Gulf.
Diversity of Arabic User-Generated Content
Arabic text in user-generated content is more diverse than Arabic dialects. For example, Arabic text used in social media is usually the informal dialectal form of Arabic, including local words and slang words that differ among regions within Arabic-speaking countries.
Posts in user-generated content are often written in Latin characters representing Arabic words in a transliterated form called Arabizi, Aralish, Arabish, or Franco-Arab. Arabizi utilizes Latin letters, and in some cases, when the Arabic letter does not have some equivalent phonetic letters in English, it utilizes numeric and punctuation characters. For example, the number “2” represents the letter “أ” in Arabic (that sounds like “a” as in apple), and the number “3” represents the letter “ع” in Arabic (that is a guttural “aa’”). Examples of words written in Arabizi are “3a2albik/عالبك/on your heart” and “raw3a/روعة/wonderful”. Moreover, code-switching among the various forms of Arabic and sometimes English can be observed in the same post in user-generated content. The following picture demonstrates the diversity of user-generated content in the Arabic language.
Some English terms are commonly used among Arabic speakers can be represented using Arabic letters, as shown below.
The Use of Urdu and Farsi Alphabets
In some Arabic dialects, the pronunciation of the Arabic alphabet differs. Thus, users use other non-Arabic languages to represent the dialectal sound of the letter to better describe the phonetic sound of the dialectal words. In this scenario, Urdu and Farsi alphabets are commonly used. The following picture shows an example of a tweet that is written in Iraqi Arabic, which replaces the letter “ك/kaa’” in Arabic with the letter “گ/Qa’” in Farsi and Urdu.
Conclusion
In this article, I shortly introduce the Arabic language to understand language-specific challenges that come in to play in developing effective Arabic natural language processing system. This introduction contains some general characteristics of Arabic and also discusses issues related to Arabic used in user-generated content.
References
Husain, F. & Uzuner, O. (2022). Investigating the Effects of Preprocessing Arabic Text for Offensive Language Detection and Hate Speech Detection Systems. The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). http://dx.doi.org/10.1145/3501398
Husain, F. (2021). Arabic Offensive Language Detection in Social Media. George Mason University, ProQuest Dissertations Publishing. 28494043.
Habash, N. (2010). Introduction to Arabic Natural Language Processing. Synthesis Lectures on Human Language Technologies, 3(1), 1-187. Morgan & Claypool Publishers.
Darwish, K. (2013, October). Arabizi Detection and Conversion to Arabic. Proceedings of the (EMNLP) 2014 Workshop on Arabic Natural Language Processing (ANLP) (pp. 217-224). Association for Computational Linguistics (ACL). Doha, Qatar.
Abdelfatah, K., Terejanu, G., & and Alhelbawy, A. (2017). Unsupervised Detection of Violent Content in Arabic Social Media. Comput. Sci. Inf. Technol. (CS IT) (pp. 1–7).
are not there any data we can use to train for different dialect ?