The origins of Lucene
Lucene began development in 1997 as Doug Cutting’s side project during his time at the web search engine Excite. Anticipating the burst of the internet bubble, Cutting reduced his working hours to teach himself Java and begin work on a set of search tools. These tools would become the popular search library Lucene, which takes its name from his wife’s middle name.
The operations performed by Lucene boil down to two key steps. First, it builds indexes of the content to be made searchable. Second, it queries those indexes to find matching content.
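Both steps can be seen in miniature with Lucene’s core classes. The sketch below is a minimal, hedged example: the index directory “lucene-index”, the field name “body” and the sample text are placeholders, and exact constructor signatures vary between Lucene versions.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class HelloLucene {
    public static void main(String[] args) throws Exception {
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             Directory dir = FSDirectory.open(Paths.get("lucene-index"))) {

            // Step 1: index the content to be made searchable.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("body", "Lucene is a library for indexing and searching text", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Step 2: query the index for matching content.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("body", analyzer).parse("searching text");
                TopDocs hits = searcher.search(query, 10);
                System.out.println("Matching documents: " + hits.totalHits);
            }
        }
    }
}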
The structure of indexes
Within a Lucene index, Lucene documents are constructed. These Lucene documents contain fields with associated values, which are essentially key and value pairs. To create an index, whatever document is being indexed must be parsed and its fields extracted. Because the Lucene document stored in the index is independent of file type, any document that can be parsed into fields and values can be made searchable. This advantage is indispensable, giving Lucene the ability to index structured database objects as well as unstructured or semi-structured documents, such as Word or PDF files.
When indexes are queried, Lucene looks for matches between the indexed values and the query terms. Deciding how fields are extracted from documents is therefore an important stage in setting up the search: to return good results in response to queries, the fields must be extracted appropriately for the target data type.
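As a rough illustration of those extraction decisions, the sketch below builds a Lucene document from fields that could have come from any parsed source, whether a database row or a PDF. The helper and its field names are hypothetical: exact-lookup values go into an unanalysed StringField, free text into an analysed TextField, and a stored-only field simply travels with the document so it can be returned alongside hits.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldExtractionSketch {
    // Hypothetical fields, e.g. extracted from a parsed PDF or a database row.
    static Document toLuceneDocument(String author, String body, String sourcePath) {
        Document doc = new Document();
        // Exact-match field: indexed as a single, unanalysed token.
        doc.add(new StringField("author", author, Field.Store.YES));
        // Full-text field: analysed into terms so partial matches can be found.
        doc.add(new TextField("body", body, Field.Store.YES));
        // Stored-only field: returned with hits but not itself searchable.
        doc.add(new StoredField("sourcePath", sourcePath));
        return doc;
    }
}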
Indexing unstructured content
For some documents, in particular structured documents, field and value pairs will be fairly trivial to associate. When searching for an author name or contact details, such as an address or phone number, the desired result is typically an exact match to the query. Such a field can simply be indexed with its value and return a hit for a matching query.
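In Lucene terms, an unanalysed field of this kind is typically looked up with a term-level query that must match the stored value in full. The sketch below reuses the hypothetical “author” field and index directory from the earlier examples.

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class ExactMatchSketch {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("lucene-index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Matches only documents whose "author" field holds exactly "Doug Cutting";
            // a partial value such as "Cutting" would not match the unanalysed field.
            TopDocs hits = searcher.search(new TermQuery(new Term("author", "Doug Cutting")), 10);
            System.out.println("Exact matches: " + hits.totalHits);
        }
    }
}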
For more generic content, it is likely that matches and partial matches within the text will be desired, rather than a strictly identical match to the query term. To support this, generic text is analysed by Lucene and broken down before indexing. In the analysis stage, common filler words (stop words) are discarded, and the analyser applies a stemming algorithm to reduce each word to its root form.
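To see what the analysis stage produces, text can be run through an analyser and the resulting token stream inspected. The sketch below uses Lucene’s EnglishAnalyzer, which drops common English stop words and applies Porter stemming; the field name and sample sentence are placeholders.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisSketch {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream tokens = analyzer.tokenStream("body", "The indexes are consolidated quickly")) {
            CharTermAttribute term = tokens.addAttribute(CharTermAttribute.class);
            tokens.reset();
            while (tokens.incrementToken()) {
                // Stop words such as "the" and "are" are dropped; the remaining words are
                // stemmed, e.g. "indexes" -> "index", "consolidated" -> "consolid".
                System.out.println(term.toString());
            }
            tokens.end();
        }
    }
}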
How indexes are stored
When Lucene was developed, a key difference between it and other search engine libraries was the way it handled index storage. Index storage was one of the hurdles preventing search engines from scaling in size while maintaining speed. In Lucene, when the number of indexes reaches three, they are consolidated into one; these larger indexes are in turn grouped and consolidated in threes, and so on. This scheme keeps the majority of indexed material consolidated in larger collections, in the manner of a progressive merge sort. Because the bulk of the data stays grouped and sorted, search times remain efficient, whilst new indexes can still be added “on the fly”.
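Segment consolidation is still configurable in today’s Lucene through the IndexWriter’s merge policy. The sketch below is only an illustration of the idea: it sets a log-style merge policy with a merge factor of three to mirror the scheme described above, whereas modern Lucene defaults to a tiered merge policy with a larger merge factor.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class MergePolicySketch {
    public static void main(String[] args) throws Exception {
        // Consolidate segments in groups of three, echoing the scheme described above;
        // Lucene's current default is a tiered merge policy with a larger merge factor.
        LogDocMergePolicy mergeByThrees = new LogDocMergePolicy();
        mergeByThrees.setMergeFactor(3);

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
                .setMergePolicy(mergeByThrees);

        try (FSDirectory dir = FSDirectory.open(Paths.get("lucene-index"));
             IndexWriter writer = new IndexWriter(dir, config)) {
            // Documents added through this writer are stored in segments that merge by threes.
        }
    }
}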
Lucene is behind some of the largest searches on the web, including LinkedIn, Twitter and Wikipedia. As evidenced by the project’s “powered by” page, Lucene has had a far-reaching impact on search. The simplicity of the Lucene library and its ease of implementation have contributed to its overwhelming success. Cutting notes that a key point in Lucene’s unexpectedly widespread adoption was his decision to make it available as open source. As such, Lucene has grown into a fast and robust search tool that remains competitive and will undoubtedly hold its place in the history of search development.
Watch Lucene creator Doug Cutting give a talk about its creation and development. Courtesy of Twitter University.