Evaluation of Topic Identification Methods on Arabic Corpora

Abbas, M.; Smaili, K.; Berkanis, D.
October 2011
Journal of Digital Information Management;Oct2011, Vol. 9 Issue 5, p185
Academic Journal
Topic identification is a key component of many applications. However, few studies address it for Arabic, largely because standard corpora are lacking. In this study, we provide directly comparable results for six text categorization methods on a new Arabic corpus, Alwatan-2004. We evaluated the Topic Unigram Language Model (TULM), Term Frequency/Inverse Document Frequency (TF/IDF), a neural network, SVM, M-SVM and the TR-Classifier. The TR-Classifier proved the most efficient of these classifiers with one exception: binary SVM outperformed it, thanks to its characteristics. Moreover, the Alwatan-2004 corpus used in our experiments is considered larger than any other Arabic corpus used for topic identification experiments until now. In addition, we use small vocabularies to reduce computation time. This is important for adaptive language modeling, particularly topic adaptation, which is required in real-time applications such as speech recognition and machine translation systems. Our experiments indicate that the results are better than those of other works dealing with Arabic text categorization.


Related Articles

  • A Bayesian Network Nearest K-labels Method for Multi-label Classification. Xin Xia; Xiaohu Yang; Shanping Li; Chao Wu // Advances in Information Sciences & Service Sciences;May2012, Vol. 4 Issue 8, p27 

    Multi-label classification refers to the task of assigning one or more labels from a label set to an instance. Nowadays, it is increasingly required by real-world applications, such as text categorization, functional genomics and semantic scene classification. The main challenge for...

  • Chi Square Feature Extraction Based Svms Arabic Language Text Categorization System. Mesleh, Abdelwadood Moh'd A. // Journal of Computer Science;2007, Vol. 3 Issue 6, p430 

    This paper aims to implement a Support Vector Machines (SVMs) based text classification system for Arabic language articles. This classifier uses CHI square method as a feature selection method in the pre-processing step of the Text Classification system design procedure. Comparing to other...

  • Bayesian Learning for Automatic Arabic Text Categorization. Kadhim, Mahmood H.; Omar, Nazlia // Journal of Next Generation Information Technology;May2013, Vol. 4 Issue 3, p1 

    Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem. The complex morphology of...

  • The story of the tortoise and the hare - Speech rate in simultaneous interpretation and its influence on the quality of trainee-interpreters performance. Vančura, Alma // Jezikoslovlje;2013, Vol. 14 Issue 1, p85 

    This paper analyzes speech rate of the source text (ST) speaker and its implications on the rendition of the target text (TT) done by trainee-interpreters. Various studies (Gerver 1969, Pio 2003, Vik-Touvinen 2002) have shown that an increase in the presentation rate has an effect on the quality...

  • Linguistic Landscape as a Translational Space: The Case of Hervanta, Tampere. Koskinen, Kaisa // Collegium;2013, Vol. 13, p73 

    In this article, the linguistic landscape of the suburb of Hervanta in Tampere, Finland is studied from the perspective of translation studies. The data, collected in 2011, consists of 22 cases of translated signage. This data was analysed by using categorisations previously developed by Reh...

  • Noun Classification System in Mizo. Saha, Atanu // Language in India;Dec2008, Vol. 8 Issue 12, p7 

    This paper investigates the Noun classification system of Mizo language. After the initial analysis a lot of interesting things have been found which are described in the paper. Like any other classifier language, the Mizo classifier system is highly productive. In case of borrowing from other...

  • Using hybrid associative classifier with translation (HACT) for studying imbalanced data sets. Cleofas Sánchez, Laura; Guzmán Escobedo, M.; María Valdovinos Rosas, Rosa; Yáñez Márquez, Cornelio; Camacho Nieto, Oscar // Revista Ingeniería e Investigación;Jan-Apr2012, Vol. 32 Issue 1, p53 

    Class imbalance may reduce classifier performance in several pattern recognition problems. This negative effect is more notable for patterns of the least represented class (minority class). A strategy for handling this problem consisted of treating the classes included in this problem separately...

  • Blinded From a Sniper Bullet and Shortchanged by the System (Arabic). Miller, T. Christian // Pro Publica;1/17/2010, p10 

    The article presents information on an Arabic translation of the article titled "Blinded From a Sniper Bullet and Shortchanged by the System" on January 19, 2010.

  • Our Articles on Wounded Iraq and Afghan Interpreters-Now in Arabic. Miller, T. Christian // Pro Publica;1/17/2010, p16 

    The article presents information on ProPublica's articles on Iraq and Afghan interpreters translated to Arabic language. The story of Iraqi and Afghan citizens wounded while working as interpreters for U.S. soldiers facing health care problems has been translated and published by an Arabic...

  • An early attestation of the Arabic definite article. Livingstone, Alasdair // Journal of Semitic Studies;Autumn97, Vol. 42 Issue 2, p259 

    Presents a history and classification of languages of pre-Islamic Arabia with respect to the contents of the Arabic definite article. In-depth look at the article; Conclusion of the article.

