One of the nice things about working with open source software is that sometimes you get to contribute back to a project and (hopefully) add some useful features or improvements. We’ve just spent about 15 months working with OpenSource Connections on a project for a major client in the legal search area, who were transitioning from a legacy search engine to a Lucene-based system. Part of this process involved evaluating and improving relevancy, and we identified early on that Sease Ltd’s Rated Ranking Evaluator (RRE) was a great tool for doing this.
RRE encourages a versioned approach to relevance improvement: the intention is to keep a static corpus (data set) and modify one factor at a time, whether that is the analysis process or the structure of the query in use – a common approach in relevance tuning. Each time it is run, it executes a set of queries against each versioned configuration, generating metrics from the results.
When we first used it, it would take a fixed data set and spin up either Solr or Elasticsearch internally, indexing the data for each iteration. This is a fine approach if your data set is limited to a few thousand documents, but we were looking at nearly four million – enough that re-indexing all of them, possibly multiple times, during each run-through was not very practical.
As a result of this, our first contribution was to give RRE the option to persist the indexed data between runs, meaning that, if the user chose to, the documents only needed to be indexed on the first run-through. RRE uses Maven to execute each run, so passing the option in was simply a matter of setting it in the pom.xml or supplying it on the command line.
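As a rough illustration of the mechanism (the property name below is made up for the example and is not RRE’s actual setting – check the project documentation for the real one), a Maven property like this can be passed on the command line or fixed in the pom.xml:

```bash
# Illustrative only: the property name is hypothetical, not RRE's real setting;
# the same value could equally be set under <properties> in the pom.xml.
mvn clean package -Drre.persistent.corpus=true
```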
This was a solid improvement and cut down run times, but indexing a large data set was still going to take a long time on that first run. Additionally, the indexed data couldn’t be used anywhere else. Wouldn’t it be better if RRE could talk to an external search engine where the data had already been indexed? Thankfully, RRE’s search engine framework is designed to be pluggable, so writing connectors for external Elasticsearch and Solr instances was relatively straightforward and became our next pull request.
Another change we contributed, and the one of which I’m personally most proud, was to allow RRE to store (or persist) the results of each run-through in an external data store. By default, RRE writes the results to a JSON file, which is then sent to a separate server process and/or converted to a spreadsheet. With the number of queries we’d pulled together for our test set, that JSON output became far too large to POST over HTTP: it would be more useful to us if we could store the results in Elasticsearch as the queries were executed. This led to the creation of a pluggable persistence framework, to which the query results are sent during the execution process. Our first step was to replicate the existing behaviour with a JSON persistence handler; we then added an Elasticsearch handler, which batches up the results and asynchronously writes them to a configured destination index (creating the index if it doesn’t already exist).
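To give a flavour of the idea – this is a hypothetical sketch, not RRE’s actual interfaces or method names – a persistence framework along these lines boils down to a small handler contract plus implementations that decide where each query result ends up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of a pluggable persistence handler - not RRE's real API.
interface PersistenceHandler {
    void start();                                  // open connections, create indexes, etc.
    void record(Map<String, Object> queryResult);  // called as each query is executed
    void stop();                                   // flush anything still buffered
}

// Illustrative Elasticsearch-style handler: batches results and writes them
// out asynchronously. The actual bulk call to Elasticsearch is elided.
class ElasticsearchHandler implements PersistenceHandler {
    private static final int BATCH_SIZE = 500;
    private final List<Map<String, Object>> buffer = new ArrayList<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    @Override
    public void start() {
        // The real handler would check for the destination index here
        // and create it (with mappings) if it doesn't already exist.
    }

    @Override
    public synchronized void record(Map<String, Object> queryResult) {
        buffer.add(queryResult);
        if (buffer.size() >= BATCH_SIZE) {
            List<Map<String, Object>> batch = new ArrayList<>(buffer);
            buffer.clear();
            executor.submit(() -> sendBulk(batch)); // asynchronous write
        }
    }

    @Override
    public synchronized void stop() {
        if (!buffer.isEmpty()) {
            sendBulk(new ArrayList<>(buffer));
            buffer.clear();
        }
        executor.shutdown();
    }

    private void sendBulk(List<Map<String, Object>> batch) {
        // In a real handler this would issue a bulk request via the Elasticsearch client.
    }
}
```

A JSON handler implementing the same interface simply writes each batch to a file instead, which is how the original behaviour was preserved.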
We made some further changes after this, in particular adding the option to run queries asynchronously, to improve execution speeds on decently specced machines, and making it possible to pass parameters to generic metric implementations rather than needing a separate class for each variant (for example, Precision at 5 and Precision at 10 can share a single implementation instead of requiring two classes). OSC colleagues added a number of additional metric implementations that took advantage of this.
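As an illustration of the parameterised-metric idea – again using made-up names rather than RRE’s actual Metric classes – a single precision-at-k implementation can cover Precision@5, Precision@10 and so on, with k supplied from configuration:

```java
import java.util.List;

// Hypothetical parameterised metric - not RRE's actual Metric API.
// One class covers Precision@5, Precision@10, etc., with k read from configuration.
class PrecisionAtK {
    private final int k;

    PrecisionAtK(int k) {
        this.k = k;
    }

    /**
     * @param relevant relevance judgements for the returned results, in rank order
     *                 (true = judged relevant)
     * @return the fraction of the top k positions filled by relevant results
     */
    double compute(List<Boolean> relevant) {
        int limit = Math.min(k, relevant.size());
        long hits = relevant.subList(0, limit).stream().filter(r -> r).count();
        return (double) hits / k;
    }
}
```

Instantiating `PrecisionAtK(5)` and `PrecisionAtK(10)` from configuration then yields two metrics from one class, which is the saving the change was aiming for.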
It was very rewarding to be able to contribute back to a great project, not to mention fun to just dig in and get on with it – thankfully RRE was still a relatively small codebase, so not too terrifying to jump into. Thanks should also go to Andrea Gazzarini at Sease for being so accepting of the (sometimes quite large) changes we made to their code!