February 24, 2011

ElasticSearch vs SOLRCloud

For an upcoming work project I need a scalable search platform - scalable to tens or hundreds of millions of documents (news articles), and millions of queries per day. We're a (mostly) Java shop, and have a lot of experience with Lucene, so two solutions that pique my curiosity are SOLRCloud (SOLR + ZooKeeper) and ElasticSearch.

Initial Impressions - ElasticSearch

ElasticSearch is impressive. It's clean, simple, and elegant. For those who are familiar with Compass, ElasticSearch can be considered as Compass 3.0 (quoting Shay Banon, author of Compass). ElasticSearch has been under development for about 9 months at the time of writing, and is currently at version 0.15. It appears to be very actively developed, with new features and fixes flowing steadily.

My main worry at this point is that there appears to be only one "resource" active on the project - Shay Banon (@kimchy) himself, who seems to be architect, developer, documentation-writer, and a prolific commenter on forums.

Noteworthy features include:

Document-oriented / Schema-free (JSON documents)
Store, retrieve, index and search multiple versions of documents
Self-hosting RESTful web-service API
Exposes the full power of Lucene queries
Multiple Indexes in one cluster (described as Multi-Tenancy)
Built from the ground-up with scalability and distributed-operation in mind - supporting distributed search, automatic fail-over and re-balancing, with no single point of failure
Support for async write/backup to shared storage (Gateway, in ElasticSearch parlance)
"Percolator" (aka. prospective search)

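To make the REST and JSON items in the list above a little more concrete, here is a minimal Java sketch of indexing and searching over plain HTTP. It is an illustration under assumptions, not a recommended client: the index name "news", the type name "article", the document fields, and the class and method names are all made up for this example, and the node is assumed to be listening on ElasticSearch's default HTTP port (9200) on localhost. It relies on the `/{index}/{type}/{id}` document URL and the `_search?q=` query-string form of the REST API.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EsRestSketch {

    // URL of a single document: {base}/{index}/{type}/{id}
    public static String documentUrl(String base, String index, String type, String id) {
        return base + "/" + index + "/" + type + "/" + id;
    }

    // URL of a Lucene query-string search against one index
    public static String searchUrl(String base, String index, String luceneQuery) {
        return base + "/" + index + "/_search?q="
                + URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8);
    }

    // PUT a JSON body to the given URL and return the HTTP status code
    public static int putJson(String url, String json) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }

    public static void main(String[] args) throws IOException {
        // Assumptions: local node on the default port, hypothetical index/type names
        String base = "http://localhost:9200";
        String docUrl = documentUrl(base, "news", "article", "1");
        // No schema is declared up front; the document is just JSON
        String doc = "{\"title\": \"Lucene in the news\", \"published\": \"2011-02-24\"}";

        System.out.println("PUT " + docUrl);
        System.out.println("GET " + searchUrl(base, "news", "title:lucene"));

        // Only touch the network when explicitly asked to
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println("HTTP status: " + putJson(docUrl, doc));
        }
    }
}
```

Run without arguments, this only prints the two request lines; passing `--send` attempts the PUT against a node that is assumed to be running locally.
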
Initial Impressions - SOLRCloud

SOLR is a project from the same (Apache) stable as Lucene itself, and the projects have recently merged to some degree. SOLRCloud is an extension that integrates ZooKeeper with SOLR with the express aim of "enabling and simplifying the creation and use of Solr clusters."

SOLRCloud is described as "still under development", i.e., not yet a GA release.

Currently proclaimed features include:

Central configuration of the entire cluster
Automatic load-balancing and fail-over for queries
ZooKeeper integration for cluster coordination and configuration (not sure I would have listed that as a feature personally!)
I'll add that SOLRCloud is part of the SOLR code-base, and is being developed by core Lucene and SOLR committers including Mark Miller and Yonik Seeley. This can only be a good thing :). On top of all that, SOLR has been around for a good long time now, so it is battle-tested and there's lots of information available (including numerous books).

That said, I still have two big worries about SOLRCloud:

Setup/deployment just sounds fiddly - it is recommended not to deploy ZooKeeper embedded with SOLR (though I cannot find any explanation to back up that recommendation), which means you need both a ZooKeeper ensemble - multiple ZooKeeper instances - and a SOLRCloud ... er ... cloud.
No GA release as yet, and no roadmap that I can find (this is the closest I got).

Next Steps

My next steps are to dive into both technologies to see which best suits our needs, and how difficult each is likely to be to manage in a medium/large-scale deployment.
