Alternative Indexing/Search Solutions

alm.solrindex

alm.solrindex is another addon for connecting Plone search to Solr.

It takes a different approach:

  • collective.solr wraps the Zope catalog. Each item is indexed both in the ZCatalog and in Solr, typically with many indexes in both. When a search is performed, it decides, based on the indexes used, to query either the ZCatalog or Solr, but never both.
  • alm.solrindex operates as an index within the Zope catalog, replacing the standard SearchableText index. Solr only needs to index the fulltext, and the ZCatalog no longer needs to do so. When a search includes a SearchableText criterion, alm.solrindex first queries Solr for results, and those results are then further filtered by the other ZCatalog indexes.
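
The alm.solrindex flow can be pictured as a set intersection over catalog record ids. The following is a toy model only, not alm.solrindex code: `toy_solr`, `catalog_indexes` and `alm_style_search` are hypothetical names, and real Solr matching and ranking are far richer than a word lookup.

```python
# Toy content: record id -> fulltext (illustrative data only).
texts = {1: "annual report pdf", 2: "news item", 3: "pdf manual"}

# Stand-ins for ordinary ZCatalog indexes: value -> set of record ids.
catalog_indexes = {
    "portal_type": {"File": {1, 3}, "News Item": {2}},
    "review_state": {"published": {1, 2}},
}


def toy_solr(term):
    """Pretend fulltext engine: return ids of records containing the term."""
    return {rid for rid, text in texts.items() if term in text.split()}


def alm_style_search(query, solr_index, indexes):
    """Answer the fulltext criterion from 'Solr', then narrow with other indexes."""
    rids = None
    for name, value in query.items():
        if name == "SearchableText":
            matches = solr_index(value)
        else:
            matches = indexes[name].get(value, set())
        rids = matches if rids is None else rids & matches
    return rids or set()


result = alm_style_search(
    {"SearchableText": "pdf", "review_state": "published"},
    toy_solr,
    catalog_indexes,
)
print(result)
```

Only the fulltext criterion ever reaches the Solr stand-in; every other criterion is resolved from the catalog side, which is why Solr needs no extra fields in this design.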

Pros:

  • Solr is more efficient than ZCTextIndex at indexing and querying fulltext.
  • Avoids duplication of index storage.
  • Less data needs to be sent between Plone and Solr when indexing.
  • No need to add new indexes to Solr and reindex.

Cons:

  • No admin UI in Plone control panel.
  • Customizations can require monkey patching.
  • Potential for missing some results. (see below)

Setup

We set up Solr in our buildout in a similar way, using the hexagonit.recipe.download and collective.recipe.solrinstance buildout recipes.

The solr-instance buildout part looks a bit different.

[solr-instance]
recipe = collective.recipe.solrinstance
solr-location = ${solr-download:location}
host = ${settings:solr-host}
port = ${settings:solr-port}
basepath = /solr
max-num-results = 500
default-search-field = SearchableText
unique-key = docid
index =
    name:docid          type:integer  stored:true     required:true
    name:SearchableText type:text     stored:false
    name:Title          type:text     stored:false
    name:Description    type:text     stored:false

  • We set the unique-key identifying the record to docid. alm.solrindex will pass the ZCatalog’s internal integer record id (rid) in this field.
  • We set the default-search-field to SearchableText, so that Solr queries which don’t specify a field will use SearchableText.
  • We configure fields for docid and each of the standard Plone fulltext indexes, but not any other fields.
  • We set stored:false on the fulltext fields so that Solr will only store the docid.

We also need to reference the Solr URI in an environment variable for the Plone instance part, so that alm.solrindex knows where to connect.

[instance]
environment-vars =
    SOLR_URI http://${settings:solr-host}:${settings:solr-port}/solr
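
To check that the variable actually reaches the Plone process, a small helper like the one below can be run from a debug prompt. This is a minimal sketch: `solr_select_url` is a hypothetical name, not part of alm.solrindex, and it assumes the search handler lives at Solr's conventional `/select` path.

```python
import os


def solr_select_url(environ=os.environ):
    """Build the Solr search URL from SOLR_URI, failing loudly if it is unset."""
    base = environ.get("SOLR_URI")
    if not base:
        raise RuntimeError(
            "SOLR_URI is not set; check environment-vars in the [instance] part"
        )
    # Tolerate a trailing slash in the configured URI.
    return base.rstrip("/") + "/select"


# Example with an explicit mapping instead of the real process environment.
url = solr_select_url({"SOLR_URI": "http://localhost:8983/solr/"})
print(url)
```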

After running buildout, we can start Plone and activate alm.solrindex in the Add-ons control panel.

Note

The default installation profile removes the existing SearchableText, Title, and Description indexes, but does not automatically reindex existing content.

If you have existing content in the site, you’ll need to do a full reindex of the ZCatalog to get them indexed in Solr.

Why Are Results Missing?

There is a limitation to this approach.

Solr is configured with a maximum limit on the number of results it will return (max-num-results in the buildout configuration). The limit exists because performance suffers when a query matches many thousands of documents: Solr has to serialize all of them and Plone has to deserialize all of them.

For queries that only use indexes that are in Solr (i.e. the fulltext indexes), this is not a big problem.

Solr ranks the results so the limited set it returns should be the most relevant results, and most users are not going to navigate past more than a few pages of results anyway.

It can be a problem when the search term is very generic (so there are many results and it’s hard for Solr to determine the most relevant ones) and the results are also going to be filtered by other indexes (such as in a faceted search solution).

In this case the limited result set from Solr is fairly arbitrary, the other filters only get to operate on this limited set, and we might end up missing results that should be there.

Example: Consider a site where there are 10,000 items with the term ‘pdf’, including one in a folder “/annual-reports/2015”. If a search is performed for ‘pdf’ within the path ‘/annual-reports/2015’:

  1. First Solr finds all documents matching ‘pdf’, and ranks them.
  2. Next it returns the top 500 results to Plone.
  3. Next Plone filters those results by path. There is a good chance that our target document was not included in the 500 that Solr returned, so this filters down to no results.
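
The three steps can be reproduced with plain sets. This is a simulation using the numbers from the example, not real Solr or ZCatalog calls: the record ids, the position of the target document (7321) and the `search` helper are all made up for illustration.

```python
MAX_NUM_RESULTS = 500

# 10,000 items contain 'pdf'. Here the record id doubles as the rank
# (1 = most relevant), which is an arbitrary simplification.
ranked_pdf_hits = list(range(1, 10001))

# Exactly one of those items lives in /annual-reports/2015.
path_index = {"/annual-reports/2015": {7321}}


def search(term_hits, path, limit):
    """Rank and truncate on the Solr side, then filter by path on the Plone side."""
    from_solr = set(term_hits[:limit])   # steps 1 and 2
    return from_solr & path_index[path]  # step 3


truncated = search(ranked_pdf_hits, "/annual-reports/2015", MAX_NUM_RESULTS)
complete = search(ranked_pdf_hits, "/annual-reports/2015", len(ranked_pdf_hits))
print(truncated, complete)
```

With the limit in place the path filter never sees the target document; only when the limit covers every match does it survive to the final result.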

There are a couple of workarounds for this problem, each with its own trade-off:

  1. Increase max-num-results above the total number of documents (but this will hurt performance for queries that return many results).
  2. Make sure that other indexes that are likely to narrow down the results a lot are also included in Solr (but this detracts from the main advantages of using alm.solrindex over collective.solr).

Customization

Each type of field has its own handler which takes care of translating between ZCatalog and Solr queries. These can be overridden to handle advanced customization:

Example: Monkey patch the TextFieldHandler to use an edismax query that allows boosting some fields:

from Products.PluginIndexes.common.util import parseIndexRequest
from alm.solrindex.handlers import TextFieldHandler
from alm.solrindex.quotequery import quote_query

def parse_query(self, field, field_query):
    # Normalize the ZCatalog query into a record whose .keys holds the terms.
    name = field.name
    request = {name: field_query}
    record = parseIndexRequest(request, name, ('query',))
    if not record.keys:
        return None

    query_str = ' '.join(record.keys)
    if not query_str:
        return None

    if name == 'SearchableText':
        q = quote_query(query_str)
    else:
        q = u'+%s:%s' % (name, quote_query(query_str))

    return {
        'q': q,
        'defType': 'edismax',
        'qf': 'Title^10 Description^2 SearchableText^0.2',  # boost fields
        'pf': 'Title~2^20 Description~5^5 SearchableText~10^2',  # boost phrases
    }


# Monkey patch: replace the stock method when this module is imported.
TextFieldHandler.parse_query = parse_query
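
To see what such a parameter dictionary looks like on the wire, it can be encoded with the standard library. This is only a sketch of the URL encoding: how alm.solrindex itself sends the parameters is not shown here, and the search text "annual report" is a made-up example.

```python
from urllib.parse import parse_qs, urlencode

# Same keys as the dictionary returned by the patched parse_query.
params = {
    "q": "annual report",
    "defType": "edismax",
    "qf": "Title^10 Description^2 SearchableText^0.2",
    "pf": "Title~2^20 Description~5^5 SearchableText~10^2",
}

# Spaces and '^' are percent-encoded so the boosts survive the HTTP request.
query_string = urlencode(params)
print(query_string)

# Decoding gives back the original values, one list entry per parameter.
decoded = parse_qs(query_string)
```

Appending the printed string to the Solr select URL in a browser is a quick way to try different boost values before putting them into the handler.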

Example: Add a path index that works like Zope’s ExtendedPathIndex (i.e. it’ll find anything whose path begins with the query value):

solr.cfg

[solr-instance]
...
index =
    ...
    name:path           type:descendent_path stored:false

handlers.py

from alm.solrindex.handlers import DefaultFieldHandler

class PathFieldHandler(DefaultFieldHandler):

    def parse_query(self, field, field_query):
        query = super(PathFieldHandler, self).parse_query(field, field_query)
        if query == {'fq': 'path:""'}:
            return {}
        return query

    def convert_one(self, value):
        # avoid including the site path in the index data
        if value.startswith('/Plone'):
            value = value[6:]
        return super(PathFieldHandler, self).convert_one(value)

ZCML:

<utility component=".handlers.PathFieldHandler"
         provides="alm.solrindex.interfaces.ISolrFieldHandler"
         name="path" />
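
The convert_one method above hard-codes the '/Plone' prefix and would also trim a sibling path such as '/PloneOther'. A slightly more defensive variant might look like the sketch below, where `strip_site_path` is a hypothetical helper and the site path is passed in rather than looked up from the portal.

```python
def strip_site_path(value, site_path="/Plone"):
    """Return value relative to site_path, or unchanged if it lies outside the site."""
    if value == site_path:
        # The site root maps to an empty path, as in the handler above.
        return ""
    if value.startswith(site_path + "/"):
        return value[len(site_path):]
    return value


print(strip_site_path("/Plone/annual-reports/2015"))
print(strip_site_path("/PloneOther/docs"))
```

Inside the handler, convert_one could delegate to this helper before calling the superclass method.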

DIY Solr

If both collective.solr and alm.solrindex are too much for you, or you have special needs, you can access Solr from custom code. This might be the case if you:

  • need to access a Solr server with a newer version or a multicore setup, and you don’t have access to the Solr configuration
  • only want a fulltext search page for a small site, with no need for full realtime support

You can find a full-fledged example of a custom Solr integration in Ploneintranet (advanced!):

https://github.com/ploneintranet/ploneintranet/pull/299

collective.elasticsearch

Another option for an advanced search integration is the younger Elasticsearch project. As with Solr, its technical foundation is Lucene, the indexing library written in Java.

Pros of Elasticsearch

  • It uses JSON instead of an XML schema for (field) configuration, which might be easier to work with.
  • Clustering and replication have been built in from the beginning and are easier to configure, especially ad-hoc clusters that can (re)configure themselves automatically.
  • The project and community are agile and active.

Cons of Elasticsearch

  • JSON is abused as a query DSL. This can lead to queries nested up to 10 levels deep, which can be annoying, especially if you write them programmatically.
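
To give a feel for the nesting, here is a small fulltext query with one filter, built as a Python dictionary. It is a sketch under assumptions: the field names (Title, Description, SearchableText, portal_type) and the `build_query` and `depth` helpers are illustrative and not taken from collective.elasticsearch; only the bool, multi_match and term clauses are standard Elasticsearch query DSL.

```python
import json


def build_query(text, portal_type):
    """Fulltext match over boosted fields, restricted to one portal_type."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": ["Title^10", "Description^2", "SearchableText"],
                        }
                    }
                ],
                "filter": [{"term": {"portal_type": portal_type}}],
            }
        }
    }


def depth(node):
    """Count container levels (dicts and lists) along the deepest branch."""
    if isinstance(node, dict):
        return 1 + max((depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((depth(v) for v in node), default=0)
    return 0


body = build_query("annual report", "File")
print(json.dumps(body, indent=2))
print(depth(body))
```

Even this modest query is seven containers deep, so wrapping the construction in small builder functions like `build_query` tends to keep calling code readable.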

The integration of Elasticsearch with Plone is done with https://pypi.python.org/pypi/collective.elasticsearch/