htsearch

ht://Dig © 1995-1998 Andrew Scherpbier
Please see the file COPYING for license information.


Search Method Used

The way htsearch performs its search and applies its ranking rules is fairly complicated. This is an attempt to explain, in general terms, what goes on when htsearch searches.

htsearch gets a list of words from the HTML form that invoked it. If htsearch was invoked with boolean expression parsing enabled, it will do a quick syntax check on the input words. If there are syntax errors, it will display the syntax error file that is specified with the syntax_error_file attribute.

If the boolean parser was not enabled, the list of words is converted into a boolean expression by inserting either "and" or "or" between the words, depending on the search type.
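
As a rough illustration of this conversion, the following Python sketch joins a word list into an expression. The function name and the match_all flag are hypothetical stand-ins for the search type; this is not htsearch's actual code.

```python
def words_to_expression(words, match_all):
    """Join plain search words into a boolean expression string.

    match_all is a hypothetical flag standing in for the search type:
    True inserts "and" between the words, False inserts "or".
    """
    operator = " and " if match_all else " or "
    return operator.join(words)


if __name__ == "__main__":
    print(words_to_expression(["red", "apple"], match_all=True))
    print(words_to_expression(["red", "apple"], match_all=False))
```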

In both cases, each of the words in the list is now expanded using the search algorithms that were specified in the search_algorithm attribute. For example, the endings algorithm will convert a word like "person" into "person or persons". In this fashion, all the specified algorithms are used on each of the words and the result is a new boolean expression.
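
The expansion step can be pictured with the Python sketch below. It is only an illustration: the endings function is a toy that appends "s" (the real endings algorithm is more involved), and the names expand_word and expand_expression are assumptions made for this example.

```python
OPERATORS = {"and", "or", "(", ")"}


def endings(word):
    # Toy stand-in for the endings algorithm: it only adds a plural "s".
    return [word + "s"]


def expand_word(word, algorithms):
    """Return the word followed by every new variant the algorithms produce."""
    variants = [word]
    for algorithm in algorithms:
        for variant in algorithm(word):
            if variant not in variants:
                variants.append(variant)
    return variants


def expand_expression(tokens, algorithms):
    """Replace each word token with an "or" group of its variants."""
    parts = []
    for token in tokens:
        if token in OPERATORS:
            parts.append(token)
            continue
        variants = expand_word(token, algorithms)
        if len(variants) > 1:
            parts.append("(" + " or ".join(variants) + ")")
        else:
            parts.append(variants[0])
    return " ".join(parts)


if __name__ == "__main__":
    print(expand_expression(["person", "and", "house"], [endings]))
```

Each word is wrapped in its own parenthesized "or" group so that the expansion does not change how the surrounding "and" and "or" operators combine.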

The next step is to perform database lookups on the words in the expression. The results of these lookups are then passed to the boolean expression parser.

The boolean expression parser is a simple recursive descent parser with an operand stack. It knows how to deal with "and", "or" and parentheses. The result of the parser is a single set of matches.
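
A minimal Python sketch of such a parser is shown below. It makes several assumptions that the text above does not confirm: the word database is modelled as a plain dictionary from word to a set of document ids, "and" binds more tightly than "or", and intermediate results are returned from the recursive calls rather than kept on an explicit operand stack.

```python
def evaluate(tokens, index):
    """Evaluate a tokenized boolean expression.

    tokens: list such as ["(", "person", "or", "persons", ")", "and", "house"]
    index:  hypothetical word database, mapping word -> set of document ids
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        # or_expr := and_expr { "or" and_expr }
        result = parse_and()
        while peek() == "or":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        # and_expr := factor { "and" factor }
        result = parse_factor()
        while peek() == "and":
            advance()
            result = result & parse_factor()
        return result

    def parse_factor():
        # factor := word | "(" or_expr ")"
        token = peek()
        if token is None or token in ("and", "or", ")"):
            raise SyntaxError(f"expected a word or '(' at position {pos}")
        advance()
        if token == "(":
            result = parse_or()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return result
        # Database lookup: unknown words match no documents.
        return set(index.get(token, ()))

    result = parse_or()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result


if __name__ == "__main__":
    index = {"person": {1, 2}, "persons": {3}, "house": {2, 3}}
    print(evaluate("( person or persons ) and house".split(), index))
```

In this sketch "and" maps to set intersection and "or" to set union, so the final value is one set of matching documents.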

At this point, the matches are ranked. The rank of a match is determined by the weight of the words that caused the match and the weight of the algorithm that generated the word. Word weights are generally determined by the importance of the word in a document. For example, words in the title of a document have a much higher weight than words at the bottom of the document.
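
The ranking idea can be sketched in Python as follows. How htsearch actually combines the two weights is not stated here, so the sketch simply assumes that each hit contributes word weight times algorithm weight and that a document's rank is the sum of its hits. The weights in the example are made up.

```python
def rank_matches(hits):
    """Score and sort documents.

    hits: iterable of (doc_id, word_weight, algorithm_weight) tuples,
          one per word occurrence that caused a match (hypothetical layout).
    Returns (doc_id, score) pairs, highest score first; ties are broken
    by doc_id so the order is deterministic.
    """
    scores = {}
    for doc_id, word_weight, algorithm_weight in hits:
        # Assumed combination rule: multiply the weights, sum per document.
        scores[doc_id] = scores.get(doc_id, 0.0) + word_weight * algorithm_weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    hits = [
        ("a.html", 10, 1.0),  # exact word found in the title
        ("b.html", 1, 1.0),   # exact word found in the body text
        ("b.html", 1, 0.5),   # expanded variant found in the body text
    ]
    for doc_id, score in rank_matches(hits):
        print(doc_id, score)
```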

Finally, when the document ranks have been determined and the documents sorted, the resulting matches are displayed. If paged output is required, only a subset of all the matches will be displayed.
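
Paged output amounts to taking one slice of the sorted match list. The Python sketch below assumes 1-based page numbers and a fixed number of matches per page; both parameter names are hypothetical.

```python
def page_of(matches, page, per_page):
    """Return the subset of sorted matches shown on the given 1-based page."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    return matches[start:start + per_page]


if __name__ == "__main__":
    matches = [f"doc{i}" for i in range(1, 26)]  # 25 sorted matches
    print(page_of(matches, page=3, per_page=10))  # the last, partial page
```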


Andrew Scherpbier <andrew@contigo.com>
Last modified: Wed Jan 1 20:39:21 PST