Advanced Apache Solr boosting: a case study
Recently Triquanta built a specialized search engine for the Gemeentemuseum Den Haag (www.gemeentemuseum.nl). The goal of this project was to develop a search engine for finding museum objects within large collections. The search engine is based on research done by Marijn Koolen, who works as an assistant professor at the University of Amsterdam (www.uva.nl). His research was part of the Catch Project (http://www.catchplus.nl/en).
Marijn Koolen showed that the traditional database systems used to describe these objects are not the best option for finding objects if you don't know exactly what you are looking for.
The main reason is that museum objects are generally described in more than 50 different fields, and a search must be done on one or more of these fields to actually find what you are looking for. Performing such searches requires an expert who knows exactly how the objects have been stored in the database over the last 20 years. Marijn stated in his research that if all fields were mapped and brought back to only 5 fields, and those were injected into a modern search engine, many more relevant documents could be found. You no longer need the help of an expert, as the 5 fields themselves are very descriptive:
- Who
- What
- Where
- How
- When
The requirements
The requirements for this search engine were quite straightforward: it should be able to do an advanced search query based on these 5 fields.
For example, if we were to search for a 'violoncello made in Italy by Antonio Stradivarius', the query would be:
- Who: Antonio Stradivarius
- What: Violoncello
- Where: Italy
In order to get other relevant objects, an additional search query had to be done through all fulltext fields. The query would then yield:
- Who: Antonio Stradivarius
- What: Violoncello
- Where: Italy
- Fulltext: Antonio Stradivarius Violoncello Italy
The first implementation
For this type of search we used Apache Solr. For those who don't know Solr, here is an excerpt from the homepage of Apache Solr:
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest Internet sites.
For the first implementation of the search engine the standard Solr scoring was used. We got some interesting results: objects made in Italy were always at the top of the list. It occurred, for instance, that a painting made in Italy and also collected in Italy ranked above the violoncello made in Italy. This was unwanted behavior and was caused by (amongst others) the so-called 'length norm': matches on a smaller field score higher than matches on a larger field.
In our case the large descriptions indexed in the 'what' fields were the culprit: the 'what' fields contained many more terms than the 'where' fields, so the length norm gave the 'what' fields a lower per-term score. As a result, a match in a 'where' field always scored higher than a match in a 'what' field.
We could have set omitNorms to true on all the fields, but then the length of the field would not be used in the score calculation at all. We still wanted field length to matter: when comparing two documents, the document whose 'where' field contains only the word Italy should score higher than a document in which Italy is just one of several words in the 'where' field.
So we needed another way of scoring.
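To see why the length norm favors short fields: in Lucene's classic TF-IDF similarity the norm is roughly the inverse square root of the number of terms in the field. A simplified sketch (the actual implementation also compresses the norm into a single byte):

```python
import math

def length_norm(num_terms: int) -> float:
    # Simplified classic Lucene/Solr length norm: shorter fields get a
    # larger norm, so each matching term counts for more.
    return 1.0 / math.sqrt(num_terms)

norm_where = length_norm(2)   # short 'where' field, e.g. two terms
norm_what = length_norm(50)   # long descriptive 'what' field
# norm_where is roughly five times norm_what
```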
Scoring model
The default scoring model of Solr, which we thought would be sufficient, turned out not to be suitable for this kind of search query. During the research another, more appropriate, scoring model was used:
Each hit in one of the specific fields adds a fixed value to the score (for simplicity we used 1, but any value is possible as long as it is equal for all fields). Besides the fixed value, each hit also adds a smaller value, which is the relative score of the document compared to other documents that have a hit in this field (let's call this the normalized field score value for now).
For example:
Suppose that a query in a specific field matches (say who: antonio stradivarius); the score stemming from this field is 1 plus the normalized field score value, which lies between 0 and 0.2.
If we use the above mentioned example, a possible outcome would be:
- Who: Antonio Stradivarius: Match. The score is 1 + 0.11
- What: Violoncello: No match. The score is 0
- Where: Italy: Match. The score is 1 + 0.16
- Fulltext: Antonio Stradivarius Violoncello Italy: Match. The score is 1 + 0.09
The total score would be 3.36.
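The arithmetic of this scoring model can be sketched in a few lines of Python (the field names and scores are taken from the example above; the function itself is just an illustration):

```python
def field_score(matched: bool, normalized_field_score: float) -> float:
    # A matching field contributes a fixed 1 plus its normalized field
    # score (a value between 0 and 0.2); a non-matching field adds 0.
    return 1.0 + normalized_field_score if matched else 0.0

total = (field_score(True, 0.11)     # who: Antonio Stradivarius
         + field_score(False, 0.0)   # what: Violoncello (no match)
         + field_score(True, 0.16)   # where: Italy
         + field_score(True, 0.09))  # fulltext
# total == 3.36
```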
Using this scoring model, a document which matches on 2 fields will always rank above a document which matches on only 1 field. Research showed that this scoring model is highly effective in retrieving documents from museum collections.
Solr implementation
Now that we have this scoring model, we have to implement it in Solr. This was a non-trivial exercise, because it requires fine-grained control over the scoring of each field independently. We did the following to achieve this behavior (if you have never used Solr, just skip the next section):
We used the edismax query parser. To disable the normal Solr scoring we gave the fields a very low boost factor: 1e-10. We did not set it to zero, because this has the undesired effect that the search snippet cannot be generated. The qf parameters are:
qf=who^1e-10&qf=what^1e-10&qf=where^1e-10&qf=how^1e-10&qf=when^1e-10&qf=fulltext^1e-10
Now the normal Solr scoring is disabled and we have to add the new scoring model. To implement it we use the so-called boost function parameter (bf). Using boost functions, we can calculate the score for each field independently. The calculation of the score is done in five steps:
- Create a subquery for the field. This can be done using the 'query' function, which returns the score of the subquery.
- Determine if there is a match. If there is a match, the query function returns a positive value.
- Using the map function, we assign a value of 1 to the score of the field if it is higher than 0: we map a value between 1e-10 and 1e3 to 1.
- Using a scale function, we scale the score of the subquery between 0 and 0.2.
- We add the results of the map and the scale functions.
In Solr parameters this is written as follows (for the 'what' query):
bf=sum(map(query($qwhat),1e-10,1e3,1,0),scale(query($qwhat),0,0.2))&qwhat={!dismax qf=what_search mm=1 ps=100 bf=1 pf=what_search}Violoncello
We used parameter dereferencing to put the subquery in a separate parameter. This subquery uses the dismax query parser (!dismax) with its own parameters (like qf and mm).
The last thing to do is to parse the original query so we can assemble the boost function parameters for the 6 different fields.
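Assembling these parameters for all 6 fields can be sketched as follows. This is a simplified illustration, not our actual code: build_params is a hypothetical helper, the `*_search` field names follow the example above, and the subquery omits the ps, bf and pf parameters shown earlier.

```python
from urllib.parse import urlencode

FIELDS = ["who", "what", "where", "how", "when", "fulltext"]

def build_params(parsed_query):
    """Build edismax parameters for the per-field scoring model.

    parsed_query maps field names to query terms, e.g.
    {"who": "Antonio Stradivarius", "where": "Italy"}.
    """
    params = [("defType", "edismax")]
    # Near-zero boosts disable the default scoring without breaking
    # snippet generation.
    params += [("qf", f"{field}^1e-10") for field in FIELDS]
    for field, terms in parsed_query.items():
        # map(...) adds the fixed 1 on a match; scale(...) adds the
        # normalized field score between 0 and 0.2.
        params.append(("bf", f"sum(map(query($q{field}),1e-10,1e3,1,0),"
                             f"scale(query($q{field}),0,0.2))"))
        # Parameter dereferencing: the subquery lives in its own parameter.
        params.append((f"q{field}",
                       f"{{!dismax qf={field}_search mm=1}}{terms}"))
    return urlencode(params)

query_string = build_params({"where": "Italy"})
```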
Conclusion
We created this advanced search engine and the results are promising: it is now much easier to retrieve relevant objects from the museum database. One of the benefits of this approach (performing searches based on the 5 fields) is that it is no longer necessary to clean up the database to get relevant objects: the new search engine will always outperform the search in the database.
Another benefit of this approach is that it is possible to get related objects you weren't expecting. An example of an unexpected result is a painting on which an instrument is depicted. Suppose you want to make an exhibition about violins: if the description of a painting record contains the word 'violin', the search engine will return it as a result. You would never have found this painting with a search query on 'violin' in the (fictive) field object_type.
This example also shows the power of a modern search engine like Apache Solr: using only configuration, highly specialized solutions are possible.