A free fortnightly newsletter on Taxonomy, Thesauri & Ontology and Semantic Publishing
TF-IDF algorithms and eCommerce search
In information retrieval systems, algorithms based on probability and statistical measures decide which documents are relevant to a given search term and how those documents should be ranked. One of these statistical measures is TF-IDF, which combines Term Frequency (TF) and Inverse Document Frequency (IDF). According to David Argüello Sánchez, editor of EmpathyBroker, TF-IDF is a good relevancy algorithm that works in many use cases. However, it poses challenges when used in eCommerce search.
In an eCommerce platform, products are tagged or listed according to marketing priorities, not specifically for a search engine. TF-IDF, however, assumes that term repetition makes a document more relevant and that finding a term in a field where it is less commonly used is a more accurate signal. On product data, these assumptions can produce unusual results. They make sense from an engineering perspective but not from a customer’s point of view.
Moreover, a search engine built for an eCommerce platform has to do more than retrieve documents that contain a certain term. It also has to retrieve documents that are relevant to that term from a commercial perspective. In other words, the search engine needs to find both the products the user is looking for and the products the retailer wants to sell.
TF-IDF is used in information retrieval algorithms, along with other normalization values, to infer document relevance for a given term. The approach makes sense from two perspectives. From the TF point of view, every document that contains the given term can be relevant, but if the term occurs 10 times in a document, the probability that the document is about that term is much higher. From the IDF point of view, a term that is uncommon in a field carries more information than a common one, so a match on it is treated as more relevant.
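The two perspectives above can be illustrated with a minimal Python sketch. It assumes raw term counts for TF and one common smoothed form of IDF; real search engines add further normalizations that are not shown here. The function names and the tiny corpus are hypothetical and chosen only for illustration.

```python
import math

def tf(term, tokens):
    # Term frequency: raw count of the term in one document (assumed variant).
    return tokens.count(term)

def idf(term, corpus):
    # Inverse document frequency with add-one smoothing (assumed variant):
    # terms found in fewer documents get a higher weight.
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log((1 + len(corpus)) / (1 + df)) + 1

def tf_idf(term, tokens, corpus):
    # Repetition (TF) multiplied by rareness (IDF).
    return tf(term, tokens) * idf(term, corpus)

# Hypothetical corpus: each document is a list of tokens.
corpus = [
    ["red", "shoes", "red", "laces"],
    ["blue", "shoes"],
    ["red", "hat"],
]

for i, tokens in enumerate(corpus):
    print(i, round(tf_idf("red", tokens, corpus), 3))
```

In this sketch the first document scores higher for "red" than the third because the term is repeated, and "laces" receives a higher IDF than "shoes" because it appears in fewer documents.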
To summarize, TF-IDF infers relevance based on the principles of repetition and rareness. When it is used for information retrieval in an eCommerce platform, different values (TF, IDF, normalizations…) are combined to infer relevance. This can cause description matches to appear above name matches, and the results may confuse customers.
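The following Python sketch shows how a description match could outrank name matches. It is a simplified illustration, not the scoring of any particular search engine: it assumes IDF is computed separately per field, that field scores are simply summed, and that no field boosts or length normalization are applied. The catalog and the helper names are hypothetical.

```python
import math

# Hypothetical catalog: "dress" is common in names but rare in descriptions.
catalog = [
    {"name": "summer dress", "description": "light cotton with floral print"},
    {"name": "evening dress", "description": "long silk gown"},
    {"name": "wrap dress", "description": "soft jersey fabric"},
    {"name": "leather belt",
     "description": "dress belt that suits any dress a dress essential"},
]

def field_idf(term, products, field):
    # Smoothed IDF computed per field (assumed): rarer in the field, higher weight.
    df = sum(1 for p in products if term in p[field].split())
    return math.log((1 + len(products)) / (1 + df)) + 1

def score(term, product, products, fields=("name", "description")):
    # Sum of TF * IDF over all fields, with no field boosts (assumed).
    total = 0.0
    for field in fields:
        term_count = product[field].split().count(term)
        total += term_count * field_idf(term, products, field)
    return total

ranked = sorted(catalog, key=lambda p: score("dress", p, catalog), reverse=True)
for product in ranked:
    print(product["name"], round(score("dress", product, catalog), 3))
```

Under these assumptions the belt ranks first for the query "dress": the term is repeated in its description (high TF) and is rare in the description field (high IDF), while the actual dresses match only once in the name field, where the term is common.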
Click here to read the in-depth analysis of the TF-IDF algorithm in an eCommerce environment.