Apache Lucene - the ghetto search engine

Posted 2008-01-15 in Java by Johann.

The background

Back in 2002, I started using Lucene as the search engine on my site.

I tried the little demo application to index my pages and – to my surprise – it couldn’t even index JavaServer Pages. Think about it: a Java application that cannot index JavaServer Pages.

Anyway, I wrote a working parser for JSPs and emailed it to Doug Cutting, Lucene’s inventor. I never heard back from Doug, which I found rather rude.

Doug…

…Cutting used to work for Excite. I don’t know about you, but I have problems remembering anything about Excite. Maybe because they weren’t really good.

Why Lucene?

In the new layout of johannburkard.de, I will integrate search results in the pages directly. This is caused by a move away from tree-like site structures and towards “relatedness”-based linking.

In other words, instead of forcing a tree-based navigation structure (Home -> Blog -> Programming -> Java), I will link to pages that are related to the currently viewed page, regardless of their location within the site.

For this to work, search results must obviously be really good.

Unfortunately, Lucene consistently ranked all blog entries about inc above the original entry, which made me take a look at Lucene’s ranking formula.

An example of just how much Lucene sucks

To illustrate the obvious ranking problems of Lucene, here are two example documents:

Document 1

Hallo welt hallo, hallo!

Document 2

Ein Hallo-Welt-Programm ist ein kleines Computerprogramm und soll auf möglichst einfache Weise zeigen, welche Anweisungen oder Bestandteile für ein vollständiges Programm in einer Programmiersprache benötigt werden und somit einen ersten Einblick in die Syntax geben. Aufgabe des Programms ist, den Text Hallo Welt! oder auf Englisch Hello, world! auszugeben. Ein solches Programm ist auch geeignet, die erfolgreiche Installation eines Compilers für die entsprechende Programmiersprache zu überprüfen. Aufgrund der einfachen Aufgabenstellung kann ein Hallo-Welt-Programm aber nicht als Einführung in die Sprache selbst dienen, denn es folgt zumeist nur dem Programmierparadigma der imperativen Programmierung und demonstriert somit nur einen Bruchteil der Möglichkeiten der meisten Sprachen.

The question: Which one of these ranks first for “hallo Welt”?

If you guessed document 2, you are wrong. It’s document 1. It ranks better than document 2 by a large margin.

Why? Simply because the frequency of “hallo” relative to the length of the document is much higher in document 1 than in document 2 – and Lucene’s scoring rewards exactly that.
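Roughly how that plays out, using the factors of Lucene’s classic DefaultSimilarity (tf = sqrt(freq), lengthNorm = 1/sqrt(numTerms)). This is a back-of-the-envelope sketch, not the exact score: idf, coord and query normalization are left out because they hit both documents about equally, and the term counts for document 2 are eyeballed.

    // Back-of-the-envelope scores for the query "hallo welt" with Lucene's
    // classic DefaultSimilarity: tf = sqrt(freq), lengthNorm = 1/sqrt(numTerms).
    // idf, coord and queryNorm are omitted; they affect both documents roughly equally.
    public class HalloWeltScores {
        public static void main(String[] args) {
            double doc1 = Math.sqrt(3) / Math.sqrt(4)     // "hallo": 3 occurrences, 4 terms in total
                        + Math.sqrt(1) / Math.sqrt(4);    // "welt": 1 occurrence
            double doc2 = Math.sqrt(3) / Math.sqrt(100)   // "hallo": ~3 occurrences, ~100 terms in total
                        + Math.sqrt(3) / Math.sqrt(100);  // "welt": ~3 occurrences
            System.out.println(doc1 + " vs. " + doc2);    // prints roughly 1.37 vs. 0.35
        }
    }

With these eyeballed counts, the four-word document wins by roughly a factor of four.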

Sounds stupid? Remember keyword stuffing? Search engines in the ’90s were vulnerable to exactly the same problem.
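If you want to reproduce this yourself, here is a rough sketch against the Lucene 2.x API of the time – treat it as an outline rather than copy-and-paste code, since signatures have moved between versions:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.RAMDirectory;

    public class HalloWeltDemo {
        public static void main(String[] args) throws Exception {
            String[] texts = {
                "Hallo welt hallo, hallo!",                                     // document 1
                "Ein Hallo-Welt-Programm ist ein kleines Computerprogramm ..."  // document 2, shortened
            };
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            for (String text : texts) {
                Document doc = new Document();
                doc.add(new Field("text", text, Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            Query query = new QueryParser("text", new StandardAnalyzer()).parse("hallo welt");
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                // the four-word document comes out on top by a wide margin
                System.out.println(hits.score(i) + "  " + hits.doc(i).get("text"));
            }
            searcher.close();
        }
    }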

Learning from Lucene’s epic fail

Even if you do not use Lucene, you can still learn from the massive mistakes in Lucene’s design:

  • Document numbers are stored as signed ints. This means that Lucene will never be able to index more than about 2 billion documents. Two billion documents is just ridiculous for Internet search. Exalead say they index 8 billion documents – an amount that might have been a lot in 2000 or so – and consequently their results aren't great. Now imagine what their results would look like with one fourth of that index size. Still, lots of people are trying to become the next Google using Nutch (which uses Lucene).
  • Text is stored in one field. With Lucene, it is impossible to increase or decrease the weight of individual terms in a document. For example, linking to inc with the anchor text “inc” should decrease the weight of inc in the current document.
  • Horribly inconsistent API. If I remember correctly, the first versions of Lucene had no interfaces and consisted of final classes or something like that. In recent years, someone must have done some half-assed refactoring, so there is now a Fieldable interface that – uhm – does the same thing as the Field class. W00t. The lesson to learn here is that a bad API can ruin your project for years, so don’t let people who are new to Java design Java APIs.
  • Using tf/idf. I admit I’m not an information retrieval expert but in all text books that I read (Ricardo Baeza-Yates’ books come to mind), tf/idf was always presented as a basic, “beginner’s” ranking formula that only yields inferior results.
  • Locking down APIs. When I experimented with Lucene, I thought to myself, “Maybe there is a CrappyRanking that I can change to a GoodRanking by calling IndexReader#setRanking(Ranking).” Instead, I got lost writing wrappers for Query, tried to access the Weight instance from the wrapper (which didn’t work because it is package private), tried to find the ranking formula, found out that by default, 50 results are fetched (hardcoded)… and gave up. (The one ranking hook that does exist, Similarity, is sketched right after this list.)
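For completeness: the hook Lucene does expose is the Similarity class. You can subclass it and hand it to the writer and the searcher, but all it lets you do is re-weight the factors of the same tf/idf formula, not plug in a different ranking model. A rough sketch against the 2.x-era API (method names have moved around in later versions):

    import org.apache.lucene.search.DefaultSimilarity;

    // Damps raw term frequency and removes the short-document bonus, so that
    // repeating "hallo" three times in a four-word document no longer wins.
    public class FlatterTfSimilarity extends DefaultSimilarity {
        @Override
        public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;   // count each term at most once per document
        }
        @Override
        public float lengthNorm(String fieldName, int numTerms) {
            return 1.0f;                     // no reward for very short documents
        }
    }

    // Usage: writer.setSimilarity(new FlatterTfSimilarity()) at index time (norms are
    // baked into the index) and searcher.setSimilarity(new FlatterTfSimilarity()) at search time.

That works, but you are still stuck inside the same tf/idf formula, which is exactly the problem.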

What Lucene does well

Surprisingly, Lucene does a few things really well.

  1. Indexing and index access performance. I believe indexing performance is something that keeps getting better all the time.
  2. Query analysis. There’s a variety of query parsers available, so even complex queries can be parsed (see the snippet after this list).
  3. Resource usage. I have never noticed any excessive RAM or disk usage.
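For example, the standard QueryParser copes with phrases, boolean operators, field prefixes and boosts out of the box. A quick sketch (2.x-era constructor – newer versions want a Version argument – and the “title” field is made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class QuerySyntaxDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser("text", new StandardAnalyzer());
            // required phrase, grouped clause with a boost and a field prefix, prohibited term
            Query q = parser.parse("+\"hallo welt\" (programm^2 OR title:einführung) -compiler");
            System.out.println(q);   // prints the parsed query structure
        }
    }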

The verdict

If you plan on using Lucene for anything other than simple site search (“enter keywords, return documents that contain keywords”), you should look somewhere else.

6 comments

#1 2008-04-13 by JP

What version of Lucene were you using? Any suggestions on alternatives?

#2 2008-04-19 by Johann

JP,

I used several versions over the years. I'm pretty satisfied with ht://Dig right now. For site search, it's good and customizable.

#3 2008-12-24 by softwarevisualization

I thought 1 (one) would/should rate better, and you say it did. That makes sense, since it's *about* one thing and one thing only – your searched-for term.

Regarding keyword stuffing: the second text is just as vulnerable to keyword stuffing as the first, just stuff more kw in.

To reliably defeat the evil kw stuffers requires the engine to read and understand the page as a human would. Computers aren't there and they're not going to be there in our lifetimes.

Google does this using PageRank, which is a proxy for understanding; presumably, real people have read (and understood) the page well enough to want to link to it. This means they judge it to be what it appears to be about, and that it is not just kw stuffed onto a page.

Link farms are the stuffers' reply, and Google has a way to deal with those too – too complex to detail here accurately, but it doesn't have so much to do with the document itself as with the document's relationship to other web pages and links, and the accumulated knowledge Google keeps about those pages and links. In fact, that accumulated knowledge represents a company's competitive advantage and also a barrier to entry into the marketplace for search engines. No algorithm I know of is going to replace accumulated, specialized knowledge about specific entities in the real world (pages) and their relationship to other pages considered longitudinally over time – and that is just the tip of the iceberg with regards to the meta-knowledge that's brought to bear to defeat would-be keyword stuffers.

It's a cat (search engine) and mouse (stuffers) game. There is no magic bullet. Judging 1 (one) to be more *about* hallo welt than 2 is *right*, as far as right can be out of the box, which is what Lucene can rationalize and provide to its users.

cheers!

softwarevisualization@gmail.com

#4 2009-11-03 by dumbfounder

@softwarevisualization PageRank is not the fundamental algorithm that helps with relevance. It is not a "proxy for understanding", it is a proxy for popularity. It's all about the anchor text; without it, PageRank is absolutely useless. It isn't just that I am linking to a document, it's that I am linking to a document, say http://twicsy.com, and saying that it is a "twitter picture search engine". So when you search using those words, the engine can say someone used those words to describe that doc, and I trust that person (PageRank), so therefore I will weight those words highly.

Unfortunately, Lucene's approach of "index a document and then it's done with" is not compatible with the idea of using anchor text for relevance. What's needed is an engine that can accommodate extra pointers to documents that are indexed at different times and then rolled up, so that they can be merged when searched. Basically, it would need to be a database index that allows potentially hundreds or thousands of pointers for a single word to a document. Lucene can't do it, and that is the reason I have never used Lucene for any kind of search engine where good relevance is desired.

Unfortunately, the depth and breadth of features in Lucene have spread its adoption to the point where everyone uses it without thinking. Ignorance is bliss, but it unfortunately makes for terrible search engines. What we need is a movement to either allow multiple types of storage engines in Lucene (a la MySQL), or to make it generic enough that new data can be inserted to add relevance to a document. It is a fundamental change, so it would need support from the top to gain any traction. There is a problem with merging in new relevance data: it is inherently slower. To most people that is a blocker, but it shouldn't be. Performance needs to take a back seat to relevance. Machines are getting faster; relevance seems to be staying its same sucky self.

Does anyone know how Nutch indexes anchor text for relevance? Or does it not do that? I can't imagine that it doesn't, but who knows.

@Johann - I think the doc id's of signed ints isn't a problem because it just means that any one server can only index 2 billion documents. That is way beyond reasonable, and therefore not an issue. Also, there are no unsigned ints in Java, so they would need to move to longs, which would be an issue for performance.

dumbfounder

#5 2009-11-04 by Johann

dumbfounder (nice nick, BTW),

I don't know about Nutch, but I can only think of adding some extra fields to each document that contain the terms that were present in (or near) anchors.
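Roughly like this – a sketch against the 2.x Field API; the "anchortext" field name and the boost value are made up:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class AnchorTextDocuments {
        // pageBody is the page itself, incomingAnchors is the concatenated anchor text
        // of links pointing at the page, collected in a separate crawl pass.
        public static Document build(String url, String pageBody, String incomingAnchors) {
            Document doc = new Document();
            doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("text", pageBody, Field.Store.NO, Field.Index.TOKENIZED));
            Field anchors = new Field("anchortext", incomingAnchors, Field.Store.NO, Field.Index.TOKENIZED);
            anchors.setBoost(2.0f);  // made-up number: count anchor terms for more than body terms
            doc.add(anchors);
            return doc;
        }
    }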

And I do think that 2 billion documents isn't much. :-)

#6 2009-11-05 by dumbfounder

Adding extra fields isn't sufficient because it means you need the anchor text at index time. That is only the case with a static index that you compile once; otherwise it necessitates recompiling the entire index, which is very much less than ideal. It also doesn't take into account the hundreds or thousands of links that may point to a document, or the fact that this number varies from document to document.

2 billion documents PER SERVER is a very massive amount. Unless you are talking about documents with only a few words, in which case it is merely a very unreasonable amount. Either way, the advice is to build your own specialized engine, or to split the index across multiple machines. Probably you need to do both.
