Mastering Solr part II: A Solr request and the schema.xml
Now that we have Solr running in a Drupal site, let's have a look at what actually happens when a request is sent to Solr. We created a basic page titled 'Fox', so let's try to find this page!
If we go to the search page (in our case '/solrsearch') and search for fox, we should see the following result:
In order to get this page, a request has been sent to Solr. Let's have a look at this request. The easiest way to do this is to look at the Solr log. You should see something like this:
We see that a request is made to Solr (webapp=/solr), on the path select (path=/select), using several params (params={start=0&q=fox ......}).
The parameters are the most important part of this log entry. Much of the tweaking can be done using these parameters. In the very simple query for fox, the following parameters were sent:
- start=0
- q="fox"
- qf=t_title^5.0
- qf=t_body:value^1.0
- json.nl=map
- wt=json
- fq=index_id:default_node_index
- rows=10
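To inspect such a parameter string without reading it by eye, it can be split with Python's standard library. This is only a sketch: the string below is assembled from the list above and is not a verbatim log line, and the variable names are made up for illustration.

```python
from urllib.parse import parse_qs

# Hypothetical params string, assembled from the parameter list above.
# A real log line carries additional fields around it.
logged_params = (
    'start=0&q="fox"&qf=t_title^5.0&qf=t_body:value^1.0'
    "&json.nl=map&wt=json&fq=index_id:default_node_index&rows=10"
)

# parse_qs groups repeated names, so both qf values end up in one list.
params = parse_qs(logged_params)
for name, values in params.items():
    print(name, values)
```

Note that qf appears twice in the request, which is why every value comes back as a list.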
The parameters start and rows deal with pagination. Rows is the number of results returned per request and start is the offset of the first result to return. The wt parameter tells Solr in which format to return the results. In the case of Search API, this will always be json. The json.nl parameter controls how Solr writes its ordered name/value lists in the JSON output (with map they are written as JSON objects), which is apparently what the Search API expects. The most important parameters, however, are q, qf and fq. The q parameter, as you may have guessed, is the actual query. The qf parameters define which fields are searched. In this simple case, the search has been performed in the t_title and the t_body fields. The fq parameter is used for filtering. Apparently Search API is able to store several indexes in one Solr instance, and uses a filter query to get only the results for a specific index (in this case the default_node_index).
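The relation between a result page and the start and rows parameters can be written out as a small calculation. The function name page_params and the 1-based page numbering are assumptions of this sketch and are not part of Solr or the Search API.

```python
def page_params(page: int, rows: int = 10) -> dict:
    """Return the start/rows pair for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # start is an offset: page 1 starts at 0, page 2 at rows, and so on.
    return {"start": (page - 1) * rows, "rows": rows}


print(page_params(1))
print(page_params(3))
```

With rows=10, the first page corresponds to the start=0 seen in the request above.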
The first technique that is very useful in tweaking Solr is to send a request to Solr directly from a browser. That way we can simply add or delete parameters to see the effect on the query. So let's do this request right now! The base URL is localhost:8983, then we need to go to the webapp solr (localhost:8983/solr) and to the path select (localhost:8983/solr/select). Finally we need to add the parameters (localhost:8983/solr/select?start=0&q=fox etc). The complete URL now reads:
http://localhost:8983/solr/select?start=0&q=%22fox%22&qf=t_title^5.0&qf=t_body:value^1.0&json.nl=map&wt=json&fq=index_id:default_node_index&rows=10
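If you prefer to assemble this URL in a script instead of typing it, a minimal sketch with Python's standard library could look like the following. The host and port are the ones used in this tutorial. Note that urlencode also percent-encodes the characters ^ and :, so the printed URL looks slightly different from the one above while carrying the same parameters.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"

# A list of pairs (not a dict) because qf occurs twice.
params = [
    ("start", "0"),
    ("q", '"fox"'),
    ("qf", "t_title^5.0"),
    ("qf", "t_body:value^1.0"),
    ("json.nl", "map"),
    ("wt", "json"),
    ("fq", "index_id:default_node_index"),
    ("rows", "10"),
]

url = BASE_URL + "?" + urlencode(params)
print(url)
```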
If you do this request in the browser, you should see the following result:
As expected, the result is JSON. Unindented JSON is not easy for humans to read. You can do two things to fix this: either add '&indent=true' to the request, or remove the JSON-related parameters (wt=json and json.nl=map), in which case you get XML. I prefer to get the result back as XML, but you might prefer JSON.
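Adding and removing parameters by hand gets tedious after a few attempts. The helper below is a rough sketch of how the two options could be scripted. The function name tweak is made up for this example, and the URL is a shortened version of the request above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tweak(url, drop=(), add=()):
    """Return url without the parameters named in drop, plus the pairs in add."""
    parts = urlsplit(url)
    pairs = [(k, v) for k, v in parse_qsl(parts.query) if k not in drop]
    pairs.extend(add)
    return urlunsplit(parts._replace(query=urlencode(pairs)))


url = "http://localhost:8983/solr/select?q=fox&json.nl=map&wt=json&rows=10"

# Option 1: keep JSON, but ask Solr to indent it.
print(tweak(url, add=[("indent", "true")]))
# Option 2: drop the JSON-related parameters.
print(tweak(url, drop=("wt", "json.nl")))
```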
So let's add and alter some parameters to see the effect. First of all, only two fields are returned: score and item_id. When tweaking Solr, you will often want all fields to be returned. You can achieve this by adding fl=*,score to the request. You now see all the fields that are stored in the Solr index. Let's have a look at what happens if we remove the parameter 'qf=t_title^5.0'. If your article about foxes is the same as suggested in part I of this tutorial, the article should now be gone from the results. This reveals a problem: in the main body text we use the word foxes, which is the plural of fox. In the real world, we would like to find an article containing the word foxes if we search for fox! A very easy fix in the schema.xml can solve this problem for us. So what's the schema.xml?
The schema.xml
The schema.xml is used to define how Solr should handle fields. If we look at the fields returned from a document, we see for instance that each document has an id and that this is a string (see image, the field at the bottom).
If a document is sent to Solr for indexing and the document contains a field with the name 'id', Solr will treat this field as a string, because it is defined as a string in the schema.xml.
Let's have a look at the schema which is supplied by the Search API module. You can find the schema.xml in the admin interface of Solr (http://localhost:8983/solr/admin/file/?contentType=text/xml;charset=utf-...).
The schema.xml starts with the definition of all the field types. After the field type definitions, the fields are defined. Now scroll down in the schema.xml until you find the field named id:
What we can see from this definition is the following:
- The type is string
- The field is indexed
- The field is stored
- The field is required
(As you can see, a field might be indexed but not stored! This means that the field is analyzed and indexed, but that its original value is not stored in the index. A field that is not stored can be searched, but you cannot retrieve its value.)
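To make the attributes concrete, here is a small Python sketch that reads them from a field definition. The XML fragment is reconstructed from the four points listed above and is not copied from the Search API schema.xml. The helper flag and its rule that a missing attribute counts as false are simplifications of this example.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment built from the attributes listed above,
# not copied from the Search API schema.xml.
fragment = (
    '<field name="id" type="string" indexed="true" '
    'stored="true" required="true"/>'
)

field = ET.fromstring(fragment)


def flag(element, name):
    """Read a true/false attribute (a missing one counts as false here)."""
    return element.get(name, "false") == "true"


print(field.get("name"), field.get("type"))
print("can be searched:", flag(field, "indexed"))
print("value can be retrieved:", flag(field, "stored"))
print("must be present:", flag(field, "required"))
```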
The important attribute here is the type. The field type is string, and the definition of this type can be found higher in the schema.xml:
The definition of string is very simple and you can leave it as it is.
Now try to find out what the type of the field t_body is. Are you able to find the field definition of t_body? You won't find it! This is because the field t_body is a so-called dynamic field. Go to the section in the schema.xml in which the dynamic fields are defined:
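A dynamic field pairs a name pattern with a type, so a field that has no explicit definition can still be resolved. The lookup can be sketched in Python as follows, assuming a pattern t_* that maps to the type text. The two tables are a heavily reduced, hypothetical stand-in for the real schema, and taking the first matching pattern is a simplification of this sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical, heavily reduced schema: one explicit field and one
# dynamic field pattern. The real schema.xml defines many more of both.
EXPLICIT_FIELDS = {"id": "string"}
DYNAMIC_FIELDS = [("t_*", "text")]


def field_type(name):
    """Resolve a field name: explicit fields first, then the patterns."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, type_name in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return type_name
    return None  # no definition found in this sketch


for name in ("id", "t_title", "t_body", "unknown"):
    print(name, "->", field_type(name))
```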
As you can see, each field starting with the string 't_' is a text field. Now have a look at the definition of the text type:
The definition of the text type is different from all the other types. This is because text is analyzed. Several analysis steps are defined on a text field, including:
- tokenizer: breaking the text into words (in this case the words are defined by whitespaces)
- html filter: filter html tags
- remove duplicates: remove duplicate words.
This analysis is done on the text when a document is indexed (<analyzer type="index">) and on the keywords when a search is performed in a text field (<analyzer type="query">).
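The idea of running the same steps at index time and at query time can be illustrated with a toy chain in Python. This is not how Solr implements its filters (those work on token streams and are configured in the schema.xml). It only mimics the three steps listed above on plain strings, and the sample sentence is made up.

```python
import re


def analyze(text):
    """Toy chain: strip HTML tags, split on whitespace, drop repeated words."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    tokens = without_tags.split()
    seen = set()
    unique = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            unique.append(token)
    return unique


# The same chain runs on the document text and on the search keywords.
indexed = analyze("<p>The quick fox meets the other fox</p>")
query = analyze("fox")
print(indexed)
print(query)
print("match:", any(term in indexed for term in query))
```

Because both sides go through the same steps, the terms produced at query time can be compared directly with the terms stored at index time.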
Remember the problem that we couldn't find fox in the t_body field while the word foxes was present? This can be fixed by adding so-called 'stemming' (http://en.wikipedia.org/wiki/Stemming). We can enable this for the text field by uncommenting the filter with class SnowballPorterFilterFactory in both the index analyzer and the query analyzer. Edit your schema.xml file, save it, restart Solr, and then delete and reindex from the search index status page. Now search for fox in the t_body field only. If all went well, you should be able to find the fox article now!
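Why stemming helps can be shown with a deliberately crude stand-in. The function naive_stem below is not the Snowball algorithm. It only strips a plural ending, which is enough to show that fox and foxes end up as the same term once both the indexed text and the query are stemmed. The sample terms are made up.

```python
def naive_stem(word):
    """Crude stand-in for a stemmer: lowercase, strip a plural 'es' or 's'."""
    word = word.lower()
    for suffix in ("es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


body_terms = ["foxes", "are", "clever"]
query_term = "fox"

# Without stemming the exact term is not in the body.
print(query_term in body_terms)
# With stemming on both sides the terms meet.
print(naive_stem(query_term) in {naive_stem(t) for t in body_terms})
```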
As you can see, several other language-specific features are commented out in the schema.xml. Your first tweaking of Solr starts by adjusting these filters! Don't be afraid to do this. Just test it locally and, once you're satisfied, deploy the schema.xml to your production environment.