One of the nice things about working with open source software is that sometimes you get to contribute back to a project and (hopefully) add some useful features or improvements. We’ve just spent about 15 months working with OpenSource Connections on a project for a major client in the legal search area, who were transitioning from a legacy search engine to a Lucene-based system. Part of this process involved evaluating and improving relevancy, and we identified early on that Sease Ltd’s Rated Ranking Evaluator (RRE) was a great tool for doing this.
RRE encourages a versioned approach to relevance improvement, with the intention of using a static corpus (data set), and modifying one factor at a time, whether that is the analysis process or the structure of the query in use – a common approach in relevance tuning. Each time it runs, it executes a set of queries against the various versioned data sets, generating metrics from the results.
When we first used RRE, it took a fixed data set, spun up either Solr or Elasticsearch internally, and indexed the data for each iteration. This is a fine approach if your data set is only a few thousand documents. However, ours held nearly four million, far too many to re-index, possibly several times over, on every run-through.
As a result, our first contribution was to allow RRE to persist the indexed data between runs. Choosing this option means the first run-through indexes the documents, and later runs re-use the same index. RRE uses Maven to execute each run, so passing the option in is a matter of modifying the pom.xml or using a command line argument.
This was a solid improvement and cut down run times, but indexing a large data set was still going to take a long time on the first run. Additionally, the indexed data couldn’t be used anywhere else. Wouldn’t it be better if RRE could talk to an external search engine, where the data had already been indexed? Thankfully, RRE’s search engine framework is pluggable, so writing connectors for external Elasticsearch and Solr instances was relatively straightforward, and became our next pull request.
Another change we contributed, and the one of which I’m personally most proud, was to allow RRE to store (or persist) the results of each run-through in an external data store. By default, RRE writes the results to a JSON file, then sends it to a separate server process and/or converts it to a spreadsheet. With the number of queries we’d pulled together for our test set, the resulting file was much too unwieldy to POST over HTTP: storing the results in Elasticsearch would be much more useful. As a result, we created a pluggable persistence framework, which receives the query results during the execution process. Our first step was to reproduce the existing behaviour with a JSON persistence handler; we then added an Elasticsearch handler. This batches up the results and asynchronously writes them out to a configured destination index (creating the index if it doesn’t already exist).
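To give a flavour of the batching idea, here is a minimal sketch in Java. It is not RRE's actual code: the `PersistenceHandler` interface, the `BatchingHandler` class and the `sink` callback are all hypothetical names, and the sink stands in for whatever would really write a batch to Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Hypothetical plug-in point: receives query results as they are produced. */
interface PersistenceHandler {
    void record(String queryResult);

    /** Flushes anything still buffered and waits for outstanding writes. */
    void close();
}

/** Buffers results and hands each full batch to a sink on another thread. */
final class BatchingHandler implements PersistenceHandler {
    private final int batchSize;
    private final Consumer<List<String>> sink; // assumption: does the real write
    private final List<String> buffer = new ArrayList<>();
    private final List<CompletableFuture<Void>> pending = new ArrayList<>();

    BatchingHandler(int batchSize, Consumer<List<String>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    @Override
    public synchronized void record(String queryResult) {
        buffer.add(queryResult);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    @Override
    public synchronized void close() {
        flush();
        // Block until every asynchronous write has completed.
        pending.forEach(CompletableFuture::join);
        pending.clear();
    }

    /** Copies the buffer and submits the copy for asynchronous writing. */
    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        pending.add(CompletableFuture.runAsync(() -> sink.accept(batch)));
    }
}
```

With a batch size of 2, recording three results and then calling `close()` would hand the sink two batches: one of two results and one holding the remainder. The sink runs on a background thread, so it needs to be thread-safe.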
We continued making changes after this. In particular, we added the option to run queries asynchronously, improving execution speeds on decently specced machines. We also added generic metric implementations, parameterising common options to reduce the number of classes required (for example, we can implement Precision at 5 and Precision at 10 with a single class, rather than two). OSC colleagues added a number of additional metric implementations that took advantage of this.
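As an illustration of the parameterised approach, the sketch below shows how a single class can cover Precision at 5, Precision at 10 and any other cut-off. The class name `PrecisionAtK` and its `evaluate` method are hypothetical and are not taken from RRE's codebase; documents are identified by plain string IDs purely to keep the example short.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical metric: precision over the top k results of a ranked list. */
final class PrecisionAtK {
    private final int k;

    PrecisionAtK(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /**
     * Returns the number of relevant documents found in the first k
     * positions, divided by k. A list shorter than k is treated as if the
     * missing positions held non-relevant documents.
     */
    double evaluate(List<String> rankedIds, Set<String> relevantIds) {
        long hits = rankedIds.stream()
                .limit(k)
                .filter(relevantIds::contains)
                .count();
        return (double) hits / k;
    }
}
```

For a ranked list `a, b, c, d, e, f` where `a`, `c` and `f` are judged relevant, `new PrecisionAtK(5)` gives 2/5 and `new PrecisionAtK(10)` gives 3/10, with no second class needed.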
It was very rewarding to be able to contribute back to a great project, not to mention fun to just dig in and get on with it – thankfully RRE was still a relatively small codebase, so not too terrifying to jump into. Thanks should also go to Andrea Gazzarini at Sease for being so accepting of the (sometimes quite large) changes we made to their code!