Frequently Asked Questions

ht://Dig © 1995-1998 Andrew Scherpbier
Please see the file COPYING for license information.


This FAQ is compiled by Geoff Hutchison <ghutchis@wso.williams.edu> and the most recent version is available at <http://www.htdig.org/FAQ.html>. Questions (and answers!) are greatly appreciated.

Questions

1. General

1.1. Can I search the internet with ht://Dig?
1.2. Can I index the internet with ht://Dig?
1.3. What's the difference between htdig and ht://Dig?
1.4. I sent mail to Andrew but I never got a response!
1.5. I sent a question to the mailing list but I never got a response!
1.6. I have a great idea/patch for ht://Dig!
1.7. Is ht://Dig Y2K compliant?

2. Getting ht://Dig

2.1. What's the latest version of ht://Dig?
2.2. Are there binary distributions of ht://Dig?
2.3. Are there mirror sites for ht://Dig?
2.4. Is ht://Dig available by ftp?
2.5. Are patches around to upgrade between versions?

3. Compiling

3.1. When I compile ht://Dig I get an error about libht.a.
3.2. I get an error about -lg

4. Configuration

4.1. How come I can't index my site?
4.2. How can I change the output format of htsearch?
4.3. How do I index pages that start with '~'?
4.4. Can I use multiple databases?
4.5. OK, I can use multiple databases. Can I merge them into one?

6. Troubleshooting

6.1. I can't seem to index more than X documents in a directory.
6.2. I can't index PDF files.
6.3. When I run "rundig," I get a message about "DATABASE_DIR" not being found.
6.4. When I run htmerge, it stops with an "out of diskspace" message.
6.5. I have problems running rundig from cron under Linux.
6.6. When I run htmerge, it stops with an "Unexpected file type" message.

Answers

1. General

1.1. Can I search the internet with ht://Dig?

No, ht://Dig is a system for indexing and searching a small set of sites or an intranet. It is not meant to replace any of the many internet-wide search engines.

1.2. Can I index the internet with ht://Dig?

No, as above, ht://Dig is not meant as an internet-wide search engine. While there is theoretically nothing to stop you from indexing as much as you wish, practical considerations (e.g. time, disk space, memory, etc.) will limit this.

1.3. What's the difference between htdig and ht://Dig?

ht://Dig is the name of the complete package, which consists of several programs. One of these programs is called "htdig." It performs the "digging," or indexing, of the web pages. Of course an index doesn't do you much good on its own, so the package also includes programs to sort it, search through it, etc.

1.4. I sent mail to Andrew but I never got a response!

Andrew works on ht://Dig on the side. Since he is often busy with work and gets a lot of e-mail, it can take him a while to respond.

1.5. I sent a question to the mailing list but I never got a response!

As with Andrew, the members of the mailing list have jobs too! Don't worry, someone will almost certainly get back to you.

1.6. I have a great idea/patch for ht://Dig!

Great! Development of ht://Dig continues through suggestions and improvements from users. If you have an idea (or even better, a patch), please send it to the ht://Dig mailing list so others can use it. For suggestions on how to submit patches, please check the "Guidelines for Patch Submissions."

1.7. Is ht://Dig Y2K compliant?

ht://Dig should be Y2K compliant since it never stores dates as two-digit years. However, under ht://Dig's license (GPL), there is no warranty whatsoever. If you would like an iron-clad, legally-binding guarantee, feel free to check the source code itself.

2. Getting ht://Dig

2.1. What's the latest version of ht://Dig?

The latest version is 3.1.0b1 as of this writing. Development is beginning on htdig4 as well as a few interim releases of htdig3.

2.2. Are there binary distributions of ht://Dig?

Not at the moment. As a test, the release of ht://Dig 3.0.8b2 saw a pre-compiled SPARC/Solaris package in addition to the standard source release. Binary distributions are on the TODO list.

2.3. Are there mirror sites for ht://Dig?

Currently two sites exist for ht://Dig: www.htdig.org and htdig.sdsu.edu. The www.htdig.org site contains the most up-to-date information and releases. No other mirrors exist at the moment.

2.4. Is ht://Dig available by ftp?

Not at the moment.

2.5. Are patches around to upgrade between versions?

Most versions are also distributed as a patch to the previous version's source code. The most recent exception to this was version 3.1.0b1. Since this version switched from the GDBM database to DB2, the new database package needed to be shipped with the distribution. A patch would therefore have been almost as large as the regular distribution.

3. Compiling

3.1. When I compile ht://Dig I get an error about libht.a

This usually indicates that libg++ is either not installed or installed incorrectly. To get libg++ or any other GNU tool, check ftp://prep.ai.mit.edu/pub/gnu/

3.2. I get an error about -lg

This is due to a bug in the Makefile.config.in of version 3.1.0b1. Remove all "-ggdb" flags from Makefile.config.in. Then type "./config.status" to rebuild the Makefiles, and recompile.
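
The edit can also be scripted. The following is only a sketch, not part of the distribution: the function name strip_ggdb is made up for illustration, and it assumes a sed that understands POSIX character classes.

```shell
# strip_ggdb FILE: remove every "-ggdb" flag from FILE,
# keeping the unmodified original as FILE.orig.
# (Hypothetical helper; the name is not part of ht://Dig.)
strip_ggdb() {
    cp "$1" "$1.orig" &&
    sed 's/[[:space:]]*-ggdb//g' "$1.orig" > "$1"
}
```

From the top of the source tree you would run 'strip_ggdb Makefile.config.in', then "./config.status" and recompile as described above.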

4. Configuration

4.1. How come I can't index my site?

4.2. How can I change the output format of htsearch?

Answer contributed by: Malka Cymbalista

You can change the output format of htsearch by creating different header, footer and result files that specify how you want the output to look. You then create a configuration file that specifies which files to use. In the html document that links to the search, you specify which configuration file to use.

So the configuration file would have the lines:

search_results_header: ${common_dir}/ccheader.html
search_results_footer: ${common_dir}/ccfooter.html
template_map: Long long builtin-long \
              Short short builtin-short \
              Default default ${common_dir}/ccresult.html
template_name: Default

You would also put into the configuration file any other lines from the default configuration file that apply to htsearch.

The files ${common_dir}/ccheader.html, ${common_dir}/ccfooter.html and ${common_dir}/ccresult.html would be tailored to give the output in the desired format.

Assuming your configuration file is called cc.conf, the HTML file that links to the search has to set the config parameter to cc. The following line would do it:

<input type=hidden name=config value="cc">

4.3. How do I index pages that start with '~'?

ht://Dig requests pages whose paths start with '~' just as a web browser would, so it should index them like any other page. If you are having problems with this, check your server log files to see what file the server is attempting to return.

4.4. Can I use multiple databases?

Yes, though you may find it easier to have one larger database and use restrict or exclude fields on searches. To use multiple databases, you will need a config file for each database. Each config file then sets the "database_base" option to give its databases a different name.
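
As a rough sketch of that setup, the script below writes one config file per database. Everything site-specific is an assumption: the database names (docs, support), the /var/lib/htdig path and the example.com URLs are placeholders, and a temporary directory stands in for your real config directory.

```shell
# Write one ht://Dig config file per database (sketch only).
# conf_dir is a scratch directory here; in practice it would be
# the directory where your ht://Dig config files live.
conf_dir=$(mktemp -d)

for site in docs support; do    # hypothetical database names
    cat > "$conf_dir/$site.conf" <<EOF
# Hypothetical paths and URLs; adjust for your installation.
database_dir:   /var/lib/htdig
database_base:  \${database_dir}/$site
start_url:      http://www.example.com/$site/
EOF
done
```

Each file would also need any other lines from the default configuration file that apply to your site. You would then dig and merge each database separately, passing its file with the -c option (for example 'htdig -c docs.conf' followed by 'htmerge -c docs.conf'), and select one at search time through the config parameter described in question 4.2.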

4.5. OK, I can use multiple databases. Can I merge them into one?

Not at the moment. This is on the TODO list.

6. Troubleshooting

6.1. I can't seem to index more than X documents in a directory.

This usually has to do with the default document size limit. If you set "max_doc_size" in your config file to something large enough to read in the whole directory index (try 100000 for 100K), this should fix the problem. Of course this will require more memory to read the larger file.

6.2. I can't index PDF files.

As above, this usually has to do with the default document size. What happens is ht://Dig will read in part of a PDF file and try to index it. This usually fails. Try setting "max_doc_size" in your config file to a larger value than your largest PDF file.
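
To pick a value, it helps to know how large your biggest document really is. The helper below is a sketch under assumptions: the function name largest_file_size is made up, and it only works when the documents are readable on the local file system (ht://Dig itself fetches them over HTTP).

```shell
# largest_file_size DIR PATTERN: print the size in bytes of the largest
# file under DIR whose name matches PATTERN (prints 0 if none match).
# (Hypothetical helper; not part of ht://Dig.)
largest_file_size() {
    find "$1" -type f -name "$2" -exec wc -c {} \; |
        awk '$1 > max { max = $1 } END { print max + 0 }'
}

# Example use (the document root is an assumption):
#   largest_file_size /var/www/html '*.pdf'
# Then set max_doc_size in your config file to something larger.
```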

6.3. When I run "rundig," I get a message about "DATABASE_DIR" not being found.

This is due to a bug in the Makefile.in file in version 3.1.0b1. The easiest fix is to edit the rundig file and change the line "TMPDIR=@DATABASE_DIR@" to set TMPDIR to a directory with a large amount of temporary disk space for htmerge.
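
If you would rather not edit the file by hand, the same change can be made with sed. This is a sketch only: the function name set_rundig_tmpdir is made up, and it assumes the directory name does not contain a '|' character.

```shell
# set_rundig_tmpdir RUNDIG_FILE TMP_DIRECTORY
# Replace the unexpanded "@DATABASE_DIR@" placeholder with a real
# directory, keeping the original script as RUNDIG_FILE.orig.
# (Hypothetical helper; not part of ht://Dig.)
set_rundig_tmpdir() {
    cp "$1" "$1.orig" &&
    sed "s|TMPDIR=@DATABASE_DIR@|TMPDIR=$2|" "$1.orig" > "$1"
}
```

For example, 'set_rundig_tmpdir rundig /var/tmp' would point TMPDIR at /var/tmp, where /var/tmp is only a stand-in for whatever directory has enough space on your system.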

6.4. When I run htmerge, it stops with an "out of diskspace" message.

This means that htmerge has run out of temporary disk space for sorting. Either in your "rundig" script (if you run htmerge through that) or before you run htmerge, set the variable TMPDIR to a temp directory with lots of space.
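
One way to set the variable for a single run is a small wrapper like the one below. It is a sketch: the function name with_tmpdir is made up, and the directory and config file path in the example are placeholders.

```shell
# with_tmpdir DIR COMMAND [ARGS...]: run COMMAND with TMPDIR set to DIR,
# after checking that DIR exists and is writable.
# (Hypothetical helper; not part of ht://Dig.)
with_tmpdir() {
    dir=$1
    shift
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "with_tmpdir: $dir is not a writable directory" >&2
        return 1
    fi
    env TMPDIR="$dir" "$@"
}

# Example use (both paths are assumptions):
#   with_tmpdir /var/tmp htmerge -c /path/to/htdig.conf
```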

6.5. I have problems running rundig from cron under Linux.

This problem seems to be fixed by upgrading to a recent version of cron or vixie-cron. If this doesn't completely fix the problem, change the first line of rundig to "#!/bin/csh", which will run the script through the csh shell.

6.6. When I run htmerge, it stops with an "Unexpected file type" message.

Often this is because the databases are corrupt. Try removing them and rebuilding. If this doesn't work, some have found that the solution for question 3.2 works for this as well.


Geoff Hutchison <ghutchis@wso.williams.edu>
Last modified: Sun Oct 11 23:30:34 EDT