Wednesday, February 27, 2008

Why can you not use a Wildcard as the first character in a Lucene search?

Keywords:
apache lucene wildcard search index "why not" leading first

Problem:
The Lucene query parser syntax documentation for the latest release contains the line "Note: You cannot use a * or ? symbol as the first character of a search."

Why not? Sure, it may be inefficient, but what if you're willing to wear that cost? Is it technically impossible, is it something that users of the library have to work around, or will it eventually be implemented in Lucene?

Solution:
Curious about why this limitation exists, I found notes in the Lucene Wiki indicating that it is possible after all.

It seems the Query Syntax guide in the Lucene release simply needs updating. Leading wildcards are off by default and can be switched on in the query parser:

queryParser.setAllowLeadingWildcard(true);


Notes:
The Wiki says this feature is available as of 2.1, but it seems there was a bug in that release. I've tested it with Lucene 2.3.1 and it works fine.

1 comment:

Rob Young said...

The reason it's not advisable can be seen in the Lucene file formats documentation. The term dictionary file is sorted by term and prefix-compressed: each entry stores only its difference from the previous term. A lookup can seek quickly to a known prefix, but a wildcard at the start of a query gives it no prefix to seek to, so every term has to be examined, which is very expensive.
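
To make the cost difference concrete, here is a rough sketch in plain Java. It does not use Lucene at all: a sorted TreeSet stands in for the term dictionary, and the class name, method names and sample terms are all made up for illustration. The assumption is only that terms are kept in sorted order, so a trailing wildcard can seek to its prefix while a leading wildcard has to walk everything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a sorted set stands in for a sorted term dictionary.
// None of these names come from Lucene.
public class LeadingWildcardSketch {

    // Number of terms examined by the most recent lookup.
    static int termsExamined;

    // Trailing wildcard such as "luc*": seek to the prefix, then read
    // forward until the first term that no longer starts with it.
    static List<String> trailingWildcard(TreeSet<String> terms, String prefix) {
        termsExamined = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            termsExamined++;
            if (!term.startsWith(prefix)) {
                break; // sorted order: nothing further can match
            }
            hits.add(term);
        }
        return hits;
    }

    // Leading wildcard such as "*ene": there is no prefix to seek to,
    // so every term in the dictionary has to be checked.
    static List<String> leadingWildcard(TreeSet<String> terms, String suffix) {
        termsExamined = 0;
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            termsExamined++;
            if (term.endsWith(suffix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "apache", "index", "lucene", "lucid", "query", "search", "wildcard"));

        System.out.println(trailingWildcard(terms, "luc")
                + " after examining " + termsExamined + " terms");
        System.out.println(leadingWildcard(terms, "ene")
                + " after examining " + termsExamined + " terms");
    }
}
```

With seven terms the difference is trivial, but the trailing-wildcard lookup only touches the matching terms plus one, while the leading-wildcard lookup always touches the whole set, however large the index grows.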