My Research
I am a dedicated researcher specializing in Natural Language Processing (NLP), with a particular focus on named-entity disambiguation, ontology learning and population, semantic relation extraction, and text analysis for under-resourced languages. My research journey has revolved around developing novel methods and techniques to enhance the understanding and processing of natural language text.
During my doctoral studies, I conducted extensive research on building representations for natural language text elements, such as words, named-entities, and contexts, to tackle the challenging task of named-entity disambiguation for ontology population. To address this problem, I employed a predictive approach to generate continuous vector representations, or embeddings, which effectively captured the semantic and syntactic properties of these text elements. Moreover, I aimed to establish semantic relationships between the representations of named-entities and their corresponding contexts, enabling disambiguation through similarity measurements. To accomplish this, I leveraged machine learning tools, particularly TensorFlow, to construct and train specialized deep neural networks known as Autoencoders. These networks played a crucial role in generating the accurate representations required for named-entity disambiguation. Additionally, my research explored the utilization of these embeddings for semantic relation extraction between named-entities. By measuring the semantic similarity between candidate relations and their respective contexts, I aimed to enhance the extraction of meaningful relationships from natural language text.
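The disambiguation-by-similarity step described above can be sketched in a few lines. This is a minimal illustration only: the toy embeddings and entity names below are hand-picked assumptions standing in for the vectors my trained autoencoders would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings (illustrative values only): two candidate senses for the
# mention "Jaguar" and the embedding of its surrounding context.
senses = {
    "Jaguar (animal)": [0.9, 0.1, 0.0],
    "Jaguar (car)":    [0.1, 0.9, 0.2],
}
context = [0.15, 0.85, 0.1]  # context talks about engines and dealerships

# Disambiguate: pick the sense whose embedding is most similar to the context.
best = max(senses, key=lambda s: cosine(senses[s], context))
print(best)  # -> Jaguar (car)
```

In practice the candidate senses come from the ontology and both kinds of vectors are produced by the trained networks; the decision rule, however, is exactly this nearest-sense comparison.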
Beyond my doctoral research, I have worked on under-resourced languages, focusing primarily on Malay and Banjar. I have contributed to the development of NLP resources for these languages, including tagged datasets that facilitate the training and construction of NLP models. Furthermore, my work in the research and development industry centered on Malay text understanding: I designed and implemented methods for Malay word sense disambiguation and co-reference resolution, contributing to improved NLP applications for the Malay language.
Looking ahead, I am motivated to explore cutting-edge techniques that push the boundaries of Natural Language Processing. In particular, I am excited about the potential of multi-modal approaches to enhance NLP systems. By integrating information from multiple modalities, such as text, images, audio, and video, we can create a more comprehensive and nuanced representation of natural language data. For instance, incorporating image data can aid tasks like named-entity recognition by providing visual cues, while audio data can be leveraged for sentiment analysis or speaker identification. Such multi-modal fusion captures rich and diverse information, allowing NLP models to make more informed decisions.
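One simple form such integration can take is late fusion: normalize each modality's embedding and concatenate them into a joint representation. The sketch below uses hypothetical feature vectors and is only one of several possible fusion strategies.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so no modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse(*modality_vectors):
    """Late fusion: normalize each modality's embedding, then concatenate."""
    fused = []
    for vec in modality_vectors:
        fused.extend(l2_normalize(vec))
    return fused

# Hypothetical embeddings for one news item (values are illustrative).
text_emb  = [0.4, 0.2, 0.4]   # e.g. from a text encoder
image_emb = [3.0, 4.0]        # e.g. from an image encoder

joint = fuse(text_emb, image_emb)
print(len(joint))  # 5: the joint vector keeps every modality's dimensions
```

Normalizing before concatenation is a deliberate choice here: without it, a modality with larger raw magnitudes (the image vector above) would dominate any downstream similarity computation.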
Another avenue of exploration involves leveraging advanced deep learning architectures. I aim to investigate and employ state-of-the-art models such as transformer-based architectures, graph neural networks, and hierarchical recurrent neural networks to effectively capture and process the intricate relationships and dependencies within multi-modal data. These architectures have shown promising results in other domains, and adapting them to NLP tasks holds great potential for improving the performance and understanding of natural language.
Furthermore, I believe that incorporating external knowledge sources can significantly enrich the representations used in NLP systems. By leveraging existing ontologies, knowledge graphs, or semantic networks, we can provide contextual information and semantic relationships that enhance the comprehension of text elements and their interactions. These external knowledge sources can serve as valuable priors, guiding the learning process and enabling more accurate disambiguation, relation extraction, and ontology population.
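As a toy illustration of a knowledge source acting as a prior, the snippet below filters candidate senses of a mention to those connected by some relation to an entity already identified in the context. The miniature knowledge graph and entity names are hypothetical, not drawn from any real ontology.

```python
# A miniature knowledge graph: (subject, relation, object) triples.
kg = {
    ("Paris", "capital_of", "France"),
    ("Paris", "located_in", "France"),
    ("Paris, Texas", "located_in", "United States"),
}

def compatible_senses(candidates, context_entities):
    """Keep candidate senses linked by any KG relation to a context entity."""
    return [s for s in candidates
            if any(subj == s and obj in context_entities
                   for (subj, _rel, obj) in kg)]

# "Paris" in a sentence that also mentions "France": the knowledge-graph
# prior rules out the Texas sense before any embedding similarity is computed.
print(compatible_senses(["Paris", "Paris, Texas"], {"France"}))  # -> ['Paris']
```

In a full system such a filter would be combined with, rather than replace, the embedding-based similarity scores.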
By venturing into these innovative techniques, I hope to contribute to the advancement of NLP research and foster the development of intelligent systems that can better understand and process human language. Ultimately, my goal is to improve the overall performance of NLP systems by harnessing the power of advanced deep learning architectures, integrating external knowledge sources, and exploring the potential of multi-modal approaches.
Publications:
PhD thesis: Named-Entity Disambiguation for Ontology Population Using Embedding-Based Context Entity Semantic Relatedness / Mohamed Lubani
The National University of Malaysia (Universiti Kebangsaan Malaysia), 2020.
[Abstract]
Ontology population is the task of updating an ontology with new realizations of concepts and relations extracted from unstructured text. To maintain a sound and useful ontology, multiple instance assertions should be avoided and instances should be asserted to their correct places in the ontology. For this reason, entities need to be disambiguated and linked to their correct senses in the ontology based on their contexts. In order to be used for context-entity semantic relatedness, text elements need to be associated with representations that reflect their properties. The distributed vector representations capture the semantic and syntactic properties of text elements and embed them in numerical vector representations referred to as embeddings. Building these vectors for the task of entity disambiguation requires a tagged text corpus in which entities are detected and linked to their correct senses. However, manually building such a corpus is too expensive and should be avoided. In addition, the corpus should include entities that already exist in the ontology as well as other entities that co-occur with them. Furthermore, to assess context-entity semantic relatedness, a context vector representation (embedding) is needed. Most existing methods either use pre-annotated text corpora or utilize the hyperlinks in Wikipedia pages to construct the training corpus. Such methods are unable to automatically annotate new unstructured plain text when Wikipedia hyperlinks are not present. For context representation, most existing methods simply combine the vector representations of the context's elements without considering the specific nature of the entity disambiguation task. This study aims to propose a method of building a tagged text corpus from unstructured plain text related to the entities in the ontology. In addition, a method of building context vector representations is proposed to enhance the features related to the entity disambiguation task. 
To achieve this objective, Wikidata knowledge base is utilized to link entity mentions in the text to their correct senses. Entity acronyms, aliases and semantic relations are used to detect and link entities to their senses. Once generated, the corpus is used to build the vector representations of words and entities using an extended skip-gram model. A modified autoencoder is used to build the final context and entity representations as well as map the related representations to close points in the vector space. This assists in the building of dedicated context vector representations in which entity disambiguation related features are enhanced and noisy-irrelevant features are eliminated. Based on the similarities between the built vector representations as well as other entity context-independent features, an entity disambiguation model is proposed. The proposed model achieved near state-of-the-art disambiguation accuracy of 93.76% and outperformed recent embedding-based disambiguation methods when tested using the AIDA CoNLL-YAGO dataset.
Text Relation Extraction Using Sentence-Relation Semantic Similarity
Mohamed Lubani and Shahrul Azman Mohd Noah
International Conference on Multi-disciplinary Trends in Artificial Intelligence, pp. 3-14, 2019.
🏆 Best Paper Award
[Abstract]
There is a huge amount of available information stored in unstructured plain text. Relation Extraction (RE) is an important task in the process of converting unstructured resources into machine-readable format. RE is usually considered as a classification problem where a set of features are extracted from the training sentences and thereafter passed to a classifier to predict the relation labels. Existing methods either manually design these features or automatically build them by means of deep neural networks. However, in many cases these features are general and do not accurately reflect the properties of the input sentences. In addition, these features are only built for the input sentences with no regard to the features of the target relations. In this paper, we follow a different approach to perform the RE task. We propose an extended autoencoder model to automatically build vector representations for sentences and relations from their distinctive features. The built vectors are high abstract continuous vector representations (embeddings) where task related features are preserved and noisy irrelevant features are eliminated. Similarity measures are then used to find the sentence-relation semantic similarities using their representations in order to label sentences with the most similar relations. The conducted experiments show that the proposed model is effective in labeling new sentences with their correct semantic relations.
Ontology population: approaches and design aspects
Mohamed Lubani, Shahrul Azman Mohd Noah and Rohana Mahmud
Journal of Information Science 45 (4), 502-515, 2019.
[Abstract]
Ontologies provide a means to store knowledge in a machine-readable format. Ontology population is the task of updating an ontology with new facts from an input knowledge resource. These facts are represented in a structured format and integrated thereafter into the existing knowledge in the ontology. Textual resources are the dominant online knowledge resources that contain a large number of facts expressed either explicitly or implicitly. Hence, the automatic processing of the extensive knowledge available in these resources has recently gained increasing interest. This study discusses the major components of ontology population process and the different design aspects to be considered when building ontology population systems. In addition, this research explains the different approaches and techniques adopted to carry out the task of ontology population. The possible choices of the design aspects and the related issues are identified and analysed using a set of representative ontology population systems. This study concludes by describing the remaining open issues that should be further explored in ontology population.
A Method and System for Co-Reference Resolution for Multi-Lingual Text Understanding
Benjamin Chu Min Xian, Mohammad Arshi Saloot, Mohamed Lubani, Khalil Bouzekri, Dickson Lukose.
MY Patent PI 2016002112, 2018.
Building Compact Entity Embeddings Using Wikidata
Mohamed Lubani and Shahrul Azman Mohd Noah.
International Journal on Advanced Science, Engineering and Information Technology, vol. 8, no. 4-2, pp. 1437-1445, 2018.
[Abstract]
Representing natural language sentences has always been a challenge in statistical language modeling. Atomic discrete representations of words make it difficult to represent semantically related sentences. Other sentence components such as phrases and named-entities should be recognized and given representations as units instead of individual words. Different entity senses should be assigned different representations even though they share identical words. In this paper, we focus on building the vector representations (embeddings) of named-entities from their contexts to facilitate the task of ontology population, where named-entities need to be recognized and disambiguated in natural language text. Given a list of target named-entities, Wikidata is used to compensate for the lack of a labeled corpus to build the contexts of all target named-entities as well as all their senses. Description text and semantic relations with other named-entities are considered when building the contexts from Wikidata. To avoid noisy and uninformative features in the embeddings generated from artificially built contexts, we propose a method to build compact entity representations that sharpen entity embeddings by removing irrelevant features and emphasizing the most descriptive ones. An extended version of the Continuous Bag-of-Words (CBOW) model is used to build the joint vector representations of words and named-entities using Wikidata contexts. Each entity context is then represented by a subset of elements that maximizes the chances of keeping the most descriptive features about the target entity. The final entity representations are built by compressing the embeddings of the chosen subset using a deep stacked autoencoder model. Cosine similarity and the t-SNE visualization technique are used to evaluate the final entity vectors. Results show that semantically related entities are clustered near each other in the vector space. Entities that appear in similar contexts are assigned similar compact vector representations based on their contexts.
Master's thesis: Self-tuned deep learning model for text relation extraction / Mohamed Lubani
University of Malaya Library, 2015.
[Abstract]
The amount of available textual information stored in databases or available on the Web is rapidly increasing due to the accelerating rate of scientific progress and the use of modern advanced technologies to publish and share information. Therefore, new automatic techniques are required to facilitate the task of seeking new knowledge or to extract meaningful patterns from the text. Relation Extraction (RE) which is part of Information Extraction (IE) is the task of determining semantic relationships expressed in sentences of unstructured free natural text to populate a knowledge base with new discovered facts. A self-tuned RE paradigm is proposed to extract semantic relationships using a deep learning model and ontology learning techniques namely fuzzy clustering without the need for any labeled corpus. The sole input for the system is the unlabeled corpus and the output is a trained model that is able to extract semantic relationships between entities in the text. To build the model, the system first preprocesses the corpus and automatically chooses sentences that are possible candidates for the RE task. Then fuzzy clustering is used to build training examples where each set of sentences in a cluster represents a semantic relationship. The performance of the model is evaluated and compared with other relation extractors. A set of English historical articles extracted from Wikipedia is used to test the model. Results show that the proposed model balances between precision and recall and has 74% F-measure value in terms of the quality of the extracted relations.
Generating Conversion Rules for Malay-Banjar Translation
Mohamed Lubani, Rohana Mahmud
Proceedings of the 16th International Conference on Translation (ICT-16), University of Malaya, Kuala Lumpur, pp. 671-683, 2017.
Benchmarking Mi-POS: Malay Part-of-Speech Tagger
Benjamin Chu Min Xian, Mohamed Lubani, Liew Kwei Ping, Khalil Bouzekri, Rohana Mahmud, and Dickson Lukose.
International Journal of Knowledge Engineering vol. 2, no. 3, pp. 115-121, 2016.
[Abstract]
A part-of-speech tagger assigns the correct grammatical category to each word in a given text based on the context surrounding the word. This paper presents Mi-POS, a Malay language Part-of-Speech tagger that is developed using a probabilistic approach with information about the context. The results of benchmarking Mi-POS against several similar systems are also presented in this paper and the lessons learnt from it are highlighted. The dataset used for evaluation consists of manually annotated texts. The authors used accuracy and time to measure the results of this evaluation. The final results show that Mi-POS outperforms other Malay Part-of-Speech taggers in terms of accuracy, with an accuracy of 95.16% obtained by tagging new words from the same training corpus type and 81.12% for words from different corpora types.
Building A Dictionary of Malay Language Part-of-Speech Tagged Words Using Bahasa WordNet and Bahasa Indonesia Resources
Mohamed Lubani, Rohana Mahmud
International Conference on Malay Heritage and Civilisation (ICOMHAC 2015), Langkawi, Kedah, Malaysia, 2015.
Optical flow based dynamic curved video text detection
Palaiahnakote Shivakumara, Mohamed Lubani, KokSheik Wong, Tong Lu
The 21st IEEE International Conference on Image Processing (ICIP 2014), Paris, France, 2014.