Scope-Anna University collaborative work on an ML- and NLP-based algorithm for keyphrase extraction from unstructured text, presented at NLDB 2018
In conventional information retrieval systems, keywords play a major role: a document's keywords are indexed, and the index is then used to retrieve the document. However, different keywords are sometimes used to describe the same document, which makes it harder to retrieve every relevant document. A conference paper presented at the 23rd International Conference on Natural Language & Information Systems (June 13–15, 2018, Paris, France) describes a concept-based approach to overcoming this challenge.
Concept-based indexing and retrieval, which identifies semantically similar documents, overcomes the challenge by mapping document phrases to a domain repository. The paper, “A Supervised Learning to Rank Approach for Dependency Based Concept Extraction and Repository Based Boosting for Domain Text Indexing,” describes an approach that ranks concepts (key phrases) based on statistical and cue-phrase features. In addition, the concepts are ranked based on the dependency relations in which the candidate concept occurs.
A vector is formed with the phrase weight and dependency relations for each concept. Cue features, which are present in the title and abstract of a document, and the C-value in the case of multiple keywords are used to re-rank and weight the vector corresponding to the concept. Additionally, the frequency with which the keywords occur and the types of dependency relations they have with the candidate concept are considered.
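As an illustration of the C-value mentioned above, the following is a minimal sketch of the commonly cited C-value formula for multi-word terms: a term's frequency is scaled by the log of its length and reduced by the average frequency of the longer candidate terms that contain it. The function names, the tuple-of-words representation, and the sample frequencies are assumptions made for this example; the paper's exact variant and weighting may differ.

```python
import math
from collections import defaultdict


def is_contiguous_subterm(short, long):
    """Return True if `short` appears as a contiguous word run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_values(term_freqs):
    """Compute a C-value style score for each candidate term.

    `term_freqs` maps a candidate term (a tuple of words) to its frequency.
    Single-word terms score 0 because log2(1) == 0; the measure targets
    multi-word terms.
    """
    # For each term, collect the longer candidates in which it is nested.
    containers = defaultdict(list)
    for short in term_freqs:
        for long in term_freqs:
            if len(long) > len(short) and is_contiguous_subterm(short, long):
                containers[short].append(long)

    scores = {}
    for term, freq in term_freqs.items():
        length_weight = math.log2(len(term))
        nested_in = containers[term]
        if nested_in:
            # Subtract the mean frequency of the containing terms.
            penalty = sum(term_freqs[t] for t in nested_in) / len(nested_in)
            scores[term] = length_weight * (freq - penalty)
        else:
            scores[term] = length_weight * freq
    return scores


# Illustrative frequencies only, not data from the paper.
sample = {("soft", "contact", "lens"): 5, ("contact", "lens"): 8}
print(c_values(sample))
```

In this sample, "contact lens" is penalized because five of its eight occurrences are accounted for by the longer term "soft contact lens".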
The ranking process uses the machine learning algorithm RankingSVM to rank the candidate concepts based on their feature vectors. To make the ranking domain sensitive and to determine domain relevance, the candidate concepts are partially or fully matched against the domain repository.
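The idea behind a ranking SVM can be sketched as follows: candidate pairs from the same document are turned into difference vectors, and a linear model is trained with a hinge loss so that the more relevant candidate in each pair scores higher. This is a simplified, standard-library-only sketch trained by plain subgradient descent, not the RankingSVM implementation used in the paper; the two features, the relevance labels, and all hyperparameters are assumptions chosen for illustration.

```python
from itertools import combinations


def pairwise_differences(vectors, relevance):
    """Build (difference vector, label) pairs from candidates of one document."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        if relevance[i] == relevance[j]:
            continue  # ties carry no ordering information
        diff = [a - b for a, b in zip(vectors[i], vectors[j])]
        label = 1 if relevance[i] > relevance[j] else -1
        pairs.append((diff, label))
    return pairs


def train_linear_ranker(pairs, dims, epochs=200, lr=0.1, reg=0.01):
    """Minimize an L2-regularized hinge loss over the pairs by subgradient descent."""
    w = [0.0] * dims
    for _ in range(epochs):
        for diff, label in pairs:
            margin = label * sum(wi * di for wi, di in zip(w, diff))
            for k in range(dims):
                grad = reg * w[k]
                if margin < 1:  # pair is misordered or inside the margin
                    grad -= label * diff[k]
                w[k] -= lr * grad
    return w


def score(w, vector):
    """Linear ranking score; higher means ranked earlier."""
    return sum(wi * vi for wi, vi in zip(w, vector))


# Hypothetical features: [phrase weight, dependency-relation feature].
vectors = [[0.5, 0.4], [0.9, 0.8], [0.1, 0.2]]
relevance = [1, 2, 0]  # assumed graded labels; higher is more relevant

w = train_linear_ranker(pairwise_differences(vectors, relevance), dims=2)
ranking = sorted(range(len(vectors)), key=lambda i: score(w, vectors[i]), reverse=True)
print(ranking)
```

With several documents, pairs would be formed within each document separately, so that candidates are only compared against others from the same text.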
Based on the depth of the concept and the presence of a parent and siblings, the domain-relevant concepts are boosted up the order. The results indicate that the use of a dependency-based context vector and a domain repository provides significant advantages compared with other keyword extraction methods.
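One way to picture repository matching and boosting is the sketch below. The article does not give the paper's boosting formula, so everything here is hypothetical: the repository is modeled as a simple child-to-parent map, partial matching uses word overlap, and the boost is an arbitrary linear combination of depth and sibling count.

```python
def depth(concept, parent_of):
    """Number of ancestors of `concept` in the repository tree."""
    d = 0
    while parent_of.get(concept) is not None:
        concept = parent_of[concept]
        d += 1
    return d


def match_strength(candidate, parent_of):
    """Return (strength, matched concept): 1.0 for a full match, otherwise
    the best word-overlap ratio (Jaccard) as a partial match."""
    if candidate in parent_of:
        return 1.0, candidate
    cand_words = set(candidate.split())
    best, best_concept = 0.0, None
    for concept in parent_of:
        words = set(concept.split())
        overlap = len(cand_words & words) / len(cand_words | words)
        if overlap > best:
            best, best_concept = overlap, concept
    return best, best_concept


def boosted_score(candidate, base_score, parent_of,
                  depth_weight=0.1, sibling_weight=0.05):
    """Raise `base_score` for candidates that match the repository.
    The weights are illustrative assumptions, not values from the paper."""
    strength, concept = match_strength(candidate, parent_of)
    if concept is None:
        return base_score  # no domain evidence, score unchanged
    parent = parent_of[concept]
    siblings = 0
    if parent is not None:
        siblings = sum(1 for c, p in parent_of.items()
                       if p == parent and c != concept)
    bonus = depth_weight * depth(concept, parent_of) + sibling_weight * siblings
    return base_score * (1 + strength * bonus)


# Hypothetical mini-repository: concept -> parent (None marks the root).
repository = {
    "machine learning": None,
    "supervised learning": "machine learning",
    "unsupervised learning": "machine learning",
    "support vector machine": "supervised learning",
}

for phrase in ["support vector machine", "vector machine", "weather report"]:
    print(phrase, boosted_score(phrase, 1.0, repository))
```

Here a full match deep in the hierarchy receives the largest boost, a partial match receives a proportionally smaller one, and a phrase with no overlap keeps its original score.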