This year we have made some significant improvements to the Summon algorithm, designed to improve the researcher’s experience in a variety of situations, including exploratory searching, searching for specific topics, and searching for a known item. As a result of these changes, we believe researchers will be able to find the best information more quickly and easily.
Better Exploratory Searching; Better Topical Searching
Summon’s relevance ranking algorithm uses two types of relevance factors: the dynamic rank and the static rank. The dynamic rank factors describe how well a given query matches each record. The static rank factors represent a record’s importance based on its characteristics, independent of the query. A common relevance issue we have observed involves cases where the influence of the dynamic rank is too strong, such that records with low static ranks appear among the top search results. Another common type of relevance issue involves a bias toward short titles.
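Conceptually, the two factor types combine into a single relevance score. The sketch below is purely illustrative — the helper formulas, field choices, and weights are our own assumptions for explanation, not Summon’s actual implementation:

```python
# Illustrative sketch only: the helpers, weights, and formulas below are
# assumptions for explanation, not Summon's actual implementation.

def dynamic_rank(query, record):
    """Query-dependent factor: fraction of query terms found in the title."""
    terms = query.lower().split()
    title = record["title"].lower()
    return sum(term in title for term in terms) / len(terms)

def static_rank(record, current_year=2016):
    """Query-independent factor: a toy blend of recency and citation count."""
    recency = max(0.0, 1.0 - (current_year - record["year"]) / 100.0)
    citations = min(1.0, record["citations"] / 1000.0)
    return 0.5 * recency + 0.5 * citations

def final_score(query, record, w_dynamic=0.7, w_static=0.3):
    """Blend the two factor types; the weights set the balance between them."""
    return w_dynamic * dynamic_rank(query, record) + w_static * static_rank(record)
```

In a blend like this, two records that match the query equally well are separated by their static ranks — which is exactly where the balance of the two weights matters.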
For example, consider short topical queries such as linguistics or global warming. Many of the top results Summon returns have short titles (not counting subtitles), and many have titles that exactly match the query. Items with longer titles are often ranked lower even when they are more important to users (e.g., newer publications, higher citation counts, more important content types).
The New Algorithm
The new algorithm has two primary changes from the current algorithm.
- More emphasis on the static rank – The new algorithm places more emphasis on the static rank factors than the current algorithm does. This fixes many of the issues mentioned above. However, it’s important to find the right balance between the dynamic rank and the static rank, as putting too much emphasis on the static rank would cause new problems. We have experimented with various weights to find a balance that we believe will work best.
- Less influence from field length normalization and the exact title match boost – With the new algorithm, both field length normalization and exact title (and exact title+subtitle) matching have less of an impact. These changes help reduce the bias toward short titles and allow longer titles with high static ranks to appear among the top results.
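The short-title bias stems from standard field length normalization (which penalizes matches in long fields) combined with a strong exact-title boost. A minimal sketch of softening both — the function name, damping exponent, and boost value are all invented for illustration:

```python
def title_match_score(query_terms, title_terms, damping=0.25, exact_boost=1.2):
    """Toy title scoring: term overlap, a softened length penalty, and a
    reduced exact-match boost. damping=0.5 would be the usual 1/sqrt(len)
    normalization; a lower value lets longer titles stay competitive."""
    overlap = len(set(query_terms) & set(title_terms))
    # Softened length normalization: long titles are penalized less.
    score = overlap / (len(title_terms) ** damping)
    if query_terms == title_terms:
        # A smaller boost than before, so records whose title exactly
        # equals the query no longer crowd out everything else.
        score *= exact_boost
    return score
```

With these invented values, an exact short-title match still scores highest, but a relevant longer title is close enough that a strong static rank can lift it into the top results.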
In addition, based on our relevance analyses and experimentation, we made a variety of tweaks that both complement and extend the two primary changes described above. These tweaks include, but are not limited to, adjusting the influence of term frequency, improving the influence of phrase matching, adjusting the weights of various content types (e.g., Book, eBook, Journal), modifying the impact of recency, and adjusting the influence of citation counts.
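To make that list of tweaks concrete, a record-importance blend might weight content type, recency, and citation counts along these lines. Every weight and formula here is a hypothetical value chosen for illustration; Summon’s real values are not public:

```python
# Invented content-type weights for illustration only.
CONTENT_TYPE_WEIGHT = {
    "Book": 1.0,
    "eBook": 1.0,
    "Journal": 0.9,
    "Journal Article": 0.8,
    "Magazine Article": 0.4,
    "Book Review": 0.3,
}

def importance_score(record, current_year=2016):
    """Toy blend of the static-rank signals the article lists:
    content type, recency, and citation count."""
    type_weight = CONTENT_TYPE_WEIGHT.get(record["type"], 0.5)
    recency = max(0.0, 1.0 - (current_year - record["year"]) / 50.0)
    citations = min(1.0, record.get("citations", 0) / 500.0)
    return 0.5 * type_weight + 0.3 * recency + 0.2 * citations
```

Under a blend like this, a recent book outranks a book review of the same vintage even when both match the query equally well — the behavior the next section describes.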
What does this mean for the end user?
With these improvements, we expect users to see:
- Fewer records with low static ranks appearing among the top results, such as:
  - Old publications (especially journal articles and newspaper articles)
  - Less important content types, such as non-scholarly magazine articles, book reviews, etc.
- More records with long titles included among top results if they are relevant to the query
- Fewer problems of book reviews appearing above the book records themselves
- Fewer problems of older editions of a book appearing before the newest edition
Overall, with the new algorithm, short and general topical queries (e.g., linguistics, global warming) will tend to return more books, eBooks and journals among the top results, while long and specific topical queries (e.g., linguistics universal grammar, global warming Kyoto protocol) will tend to return more journal articles among the top results.
Our ultimate goal is to improve the overall quality of our relevance ranking algorithm. However, given the complexity of the algorithm, fixing existing relevance issues may cause new relevance issues elsewhere. To ensure the new algorithm was an improvement, we submitted it to the Summon advisory board and other external testers for feedback. Overall, the testers reported that the new algorithm was as good as, if not better than, the current algorithm, with the majority rating it “Better” or “Much better.” As one tester put it: “The results ‘speak better’ to the user and provides them with more help on how to expand their searches by the context of results.”
Known Item Searching
We’ve also released an improvement targeting known item searching—a type of search where a user knows the title, author, and/or other information about an item and searches for that item. Even though there are variations of this concept, known item searching is considered to be "one of the most widely deployed concepts in the field of library and information science" (Lee et al., 2007). There seems to be a general consensus among librarians that known item searching is a weakness of web-scale discovery systems.
While title+author queries, citation queries, and other known item searches work well from Summon’s Advanced Search Interface, most casual users would probably not use these methods and would instead simply use the basic search box. As such, known item search queries via the basic search, especially title+author and title+subtitle+author queries, are very common.
Such queries are visible in Summon’s autocomplete suggestions and query suggestions, which are based on Summon’s query logs. For example, if a user types common sense in Summon’s search box, the user would see common sense thomas paine as an autocomplete suggestion. However, with Summon’s old relevance algorithm, the query common sense thomas paine might not have returned the book record among the top results.
Our known item search work improves the relevance of known item queries when the user does not use the Advanced Search Interface or the special query syntax. The improvements should work especially well for title+author and title+subtitle+author searches for books, eBooks, and journal articles. Other combinations of fields, such as title, subtitle, author, publication title (for journal articles), or edition (for books and eBooks), also benefit. As a result, users should be able to find known items more easily than before.
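The behavior described above can be sketched as a multi-field match that rewards queries whose terms span several fields. The field names, boosts, and cross-field multiplier below are all assumptions made for illustration, not Summon’s actual scoring:

```python
# Hypothetical per-field boosts; not Summon's actual values.
FIELD_BOOSTS = {"title": 3.0, "subtitle": 1.5, "author": 2.0}

def known_item_score(query, record):
    """Score a record by how many query terms hit each field, with an
    extra multiplier when the hits span multiple fields -- the pattern
    typical of known-item queries like 'common sense thomas paine'."""
    terms = set(query.lower().split())
    score = 0.0
    fields_hit = 0
    for field, boost in FIELD_BOOSTS.items():
        field_terms = set(record.get(field, "").lower().split())
        hits = len(terms & field_terms)
        if hits:
            fields_hit += 1
            score += boost * hits
    if fields_hit >= 2:
        score *= 1.5  # cross-field matches signal a known-item search
    return score
```

In this sketch, a query like common sense thomas paine scores highest against the record whose title and author fields both match, which is how the intended book can rise above records that match on title alone.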
Relevance Improvements Going Forward
While we believe the recent changes to our relevance algorithm show a demonstrable improvement, relevance is an ongoing, challenging problem to tackle. We are continually researching new use cases and relevance challenges. As we solve current problems, new forms of content, changes in metadata practices, evolving standards, and shifting user expectations and behaviors ensure there will always be new use cases to address.