How does a semantic indexing system based on titles perform against one based on full texts?
Can a semantic indexing system trained only on titles match the performance of one trained on full texts, if far more titles than full texts are available for training? The answer matters because it shapes how automatic semantic indexing systems should be built for digital libraries of scientific content. Florian Mai, Lukas Galke and Ansgar Scherp set out to find it.
The researchers, from Kiel University, compared models trained on increasing amounts of title data with models trained on a constant number of full texts. To this end, they developed three deep learning methods and evaluated them on two large-scale scientific datasets: one from the medical domain (PubMed) and one from economics (EconBiz). On one dataset, the title-based system was competitive with the full-text system; on the other, it achieved considerably better scores.
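
The shape of this comparison (a fixed full-text baseline against title models trained on ever larger samples) can be sketched in a few lines of Python. This is only an illustration of the experimental design, not the authors' code: the function names, the doubling schedule for the title subsets, and the `train_and_score` callback are all assumptions made here, and the deep learning methods themselves are left out.

```python
from typing import Callable, List, Sequence, Tuple


def title_subset_sizes(n_fulltext: int, n_titles: int, factor: int = 2) -> List[int]:
    """Return increasing title-training sizes.

    Starts at the number of full texts and multiplies by `factor`
    (an assumed schedule) until all available titles are used.
    """
    if n_fulltext <= 0 or n_titles < n_fulltext or factor < 2:
        raise ValueError("need 0 < n_fulltext <= n_titles and factor >= 2")
    sizes = []
    size = n_fulltext
    while size < n_titles:
        sizes.append(size)
        size *= factor
    sizes.append(n_titles)  # always finish with every available title
    return sizes


def compare(
    titles: Sequence,
    fulltexts: Sequence,
    train_and_score: Callable[[Sequence], float],
    factor: int = 2,
) -> Tuple[float, List[Tuple[int, float]]]:
    """Score the fixed full-text set once, then each growing title subset.

    `train_and_score` is a hypothetical callback that trains a model on
    the given samples and returns its evaluation score.
    """
    baseline = train_and_score(fulltexts)
    results = []
    for size in title_subset_sizes(len(fulltexts), len(titles), factor):
        results.append((size, train_and_score(titles[:size])))
    return baseline, results
```

Plotting the title scores against the subset size, with the full-text baseline as a horizontal line, would show at which amount of title data (if any) the title-based system catches up.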