htdig-3.1.6
htdig is a handy tool for indexing all the documents available on your Web server.
Here, step by step, is everything you need to know, install and configure.
- install htdig 3.1.6
htdig 3.1.6 is already installed on my Slackware 8.1,
as on many distributions, so I will skip this step.
If it is not installed on your machine, install the corresponding package.
- install catdoc and xls2csv
catdoc is needed by some parsers (see below);
it converts a Word .DOC file to readable text.
Get it from:
http://www.45.free.net/~vitus/ice/catdoc/
http://www.45.free.net/~vitus/ice/catdoc/ver-0.9.html
# cp catdoc-0.92.tar.gz /usr/local/src/
# cd /usr/local/src/
# tar zxvf catdoc-0.92.tar.gz
# cd catdoc-0.92
# ./configure
# make
# make install
This installs the following programs in /usr/local/bin/:
catdoc
wordview
xls2csv
- install rtf2html
rtf2html, available from the same Web site as catdoc,
is also needed by some parsers (see below);
it converts a .RTF file to readable HTML:
http://www.45.free.net/~vitus/ice/catdoc/
# cp rtf2html /usr/local/src/
# cd /usr/local/src/
# tar zxvf rtf2html
# cd rtf2html
# make
# cp rtf2html /usr/local/bin/
- install xlHtml
xlHtml is also needed by some parsers (see below);
it converts an Excel .XLS file to readable HTML.
Get it from:
http://chicago.sourceforge.net/xlhtml/
# cp xlhtml-0.5.tgz /usr/local/src/
# cd /usr/local/src/
# tar zxvf xlhtml-0.5.tgz
# cd xlhtml-0.5
# ./configure
# make
# make install
This installs the following programs in /usr/local/bin/:
xlhtml
nsopen
nspptview
nsxlview
ppthtml
- install doc2html parser
The main parser, doc2html, is available from:
http://www.htdig.org/files/contrib/parsers/
As the README file says:
Readme for doc2html
External converter scripts for ht://Dig (version 3.1.4 and later), that
convert Microsoft Word, Excel and Powerpoint files, and PDF,
PostScript, RTF, and WordPerfect files to text (in HTML form) so they
can be indexed. Uses a variety of conversion programs:
wp2html - to convert Wordperfect and Word7 & 97 documents to HTML
catdoc - to extract text from Word documents
catwpd - to extract text from WordPerfect documents [alternative to wp2html]
rtf2html - to convert RTF documents to HTML
pdftotext - to extract text from Adobe PDFs
ps2ascii - to extract text from PostScript
pptHtml - to convert Powerpoint files to HTML
xlHtml - to convert Excel spreadsheets to HTML
xls2csv - to extract data from Excel spreadsheets [alternative to xlHtml]
swfparse - to extract links from Shockwave flash files.
The main script, doc2html.pl, is easily edited to include the available
utilities, and new utilities are easily incorporated.
Written by David Adams (University of Southampton), and based on the
conv_doc.pl script by Gilles Detillieux.
For more information see the DETAILS file.
and in the DETAILS file:
The set of files is:
DETAILS - this file
doc2html.pl - the main Perl script
doc2html.cfg - configuration file for use with wp2html
doc2html.sty - style file for use with wp2html
pdf2html.pl - Perl script for converting PDF files to HTML
swf2html.pl - Perl script for extracting links from Shockwave flash files.
README - brief description
doc2html.pl is a Perl5 script for use as an external converter with
htdig 3.1.4 or later. It takes as input the name of a file containing a
document in a number of possible formats and its MIME type. It uses
the appropriate conversion utility to convert it to HTML on standard
output.
doc2html.pl was designed to be easily adapted to use whatever conversion
utilities are available, and although it has been written around the
"wp2html" utility, it does not require wp2html to function.
NOTE: version 3.0 has only been tested on Unix.
pdf2html.pl is a Perl script which uses a pair of utilities (pdfinfo and
pdf2text) to extract information and text from an Adobe PDF file and
write HTML output. It can be called directly from htdig, but you are
recommended to call it via doc2html.pl.
swf2html.pl is a Perl script which calls a utility (swfparse) and
outputs HTML containing links to the URL's found in a Shockwave flash
file. It can be called directly from htdig, but you are recommended to
call it via doc2html.pl.
# tar zxvf doc2html.tar.gz
# cd doc2html
# cp doc2html.cfg doc2html.pl doc2html.sty pdf2html.pl swf2html.pl /usr/local/bin/
# cp DETAILS /usr/local/bin/doc2html_DETAILS
# cp README /usr/local/bin/doc2html_README
- configure doc2html.pl
Edit doc2html.pl and set the paths like this:
my $RTF2HTML = '/usr/local/bin/rtf2html';
my $CATDOC = '/usr/local/bin/catdoc';
my $CATPS = "/usr/bin/ps2ascii";
# add to search path the directory which contains gs
# (edit for your environment)
$ENV{PATH} .= ":/usr/bin"; # where "gs" is
my $PDF2HTML = '/usr/local/bin/pdf2html.pl'; # do not forget the '.pl' !!
my $XLS2HTML = '/usr/local/bin/xlhtml';
my $CATXLS = '';
my $PPT2HTML = '/usr/local/bin/ppthtml';
#my $SWF2HTML = '/usr/local/bin/swf2html'; # not really needed
my $SWF2HTML = '';
and also at line 170:
$ED = "/usr/bin/sed -e"; # very important !
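With the paths set, doc2html.pl can be tried by hand before htdig is involved. The sketch below is only an illustration: it assumes the converter is called with a file name, a MIME type and the document URL, in that order, and the function name try_converter and the sample file are made up for this example.

```shell
# Run an external converter by hand and show the start of its output.
# Arguments: converter path, input file, MIME type, URL of the document.
# (Assumed calling convention; check the DETAILS file for your version.)
try_converter() {
    converter=$1 file=$2 mime=$3 url=$4
    if [ ! -x "$converter" ]; then
        echo "cannot execute $converter" >&2
        return 1
    fi
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # Only the first lines are shown: enough to see whether HTML comes out.
    "$converter" "$file" "$mime" "$url" | head -n 20
}

# Example call (hypothetical sample file):
# try_converter /usr/local/bin/doc2html.pl /tmp/sample.doc \
#     application/msword http://zejack/doc/sample.doc
```

If the output is HTML containing the text of the document, the converter chain for that file type is probably fine.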
- configure pdf2html.pl
Edit pdf2html.pl and set the paths like this:
my $PDFTOTEXT = "/usr/X11R6/bin/pdftotext";
my $PDFINFO = "/usr/X11R6/bin/pdfinfo";
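A wrong path in either script tends to show up only later, as documents silently missing from the index, so it may be worth checking that every configured program exists. This is a minimal sketch; the function name check_helpers is made up, and the paths are simply the ones used in this guide.

```shell
# Report which of the given helper programs are missing or not executable.
# Prints one "missing: PATH" line per problem; returns non-zero if any.
check_helpers() {
    rc=0
    for tool in "$@"; do
        if [ ! -x "$tool" ]; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return $rc
}

# Paths as configured above; adjust them to your system.
check_helpers \
    /usr/local/bin/catdoc \
    /usr/local/bin/rtf2html \
    /usr/local/bin/xlhtml \
    /usr/local/bin/ppthtml \
    /usr/local/bin/pdf2html.pl \
    /usr/bin/ps2ascii \
    /usr/X11R6/bin/pdftotext \
    /usr/X11R6/bin/pdfinfo \
    || echo "fix the paths above before running htdig"
```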
- htdig summary
This is how htdig handles each file type:
.HTML --> htdig
.TXT --> htdig
.DOC --> htdig --> doc2html.pl --> catdoc
.RTF --> htdig --> doc2html.pl --> rtf2html
.XLS --> htdig --> doc2html.pl --> xlhtml
.PPT --> htdig --> doc2html.pl --> ppthtml
.PDF --> htdig --> doc2html.pl --> pdf2html.pl --> pdftotext & pdfinfo
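The table above can also be written as a small lookup, which is handy for checking what should happen to a given file. Keep in mind that htdig itself chooses the converter from the MIME type sent by the Web server, not from the file name; the function converter_chain below is only an illustration keyed on the extension.

```shell
# Print the converter chain for a file name, following the table above.
# Returns non-zero for extensions the table does not list.
converter_chain() {
    # Lower-case the extension so .DOC and .doc are treated alike.
    ext=$(printf '%s' "${1##*.}" | tr 'A-Z' 'a-z')
    case "$ext" in
        html|txt) echo "htdig" ;;
        doc) echo "htdig -> doc2html.pl -> catdoc" ;;
        rtf) echo "htdig -> doc2html.pl -> rtf2html" ;;
        xls) echo "htdig -> doc2html.pl -> xlhtml" ;;
        ppt) echo "htdig -> doc2html.pl -> ppthtml" ;;
        pdf) echo "htdig -> doc2html.pl -> pdf2html.pl -> pdftotext & pdfinfo" ;;
        *)   echo "no converter listed for .$ext"; return 1 ;;
    esac
}
```

For example, `converter_chain report.DOC` prints the same chain as the .DOC line of the table.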
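Note that the sample htdig.conf below lists .ppt in bad_extensions and leaves the PowerPoint parser commented out, so .PPT files are skipped until both are changed.
All right? Good, let's configure htdig now.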
- configure htdig
On Slackware 8.1, htdig is installed in the /opt/www/htdig/ directory.
1) create a directory for the index database:
# cd /opt/www/htdig/
# mkdir db
2) edit /opt/www/htdig/conf/htdig.conf and set:
database_dir: /opt/www/htdig/db
# several URLs can be listed here
start_url: http://zejack/doc/ http://zejack/doc2/
limit_urls_to: ${start_url}
# ?N=A ?N=D etc. are created by Apache in directory listings
exclude_urls: /cgi-bin/ .cgi ?N=A ?N=D ?M=A ?M=D ?S=A ?S=D ?D=A ?D=D
# index numbers as well as words
allow_numbers: true
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi \
.png .ppt .swf .sw2 .wp .wp1 .wp2 .wp3 .wp4 .wp5 .wp6 \
.css .dll .cmp .chm .ctg .dl_ .dsc .dtd .js .xsl \
.mht .mdb .req \
.rep .wqy .unv .key .wis
external_parsers:
application/msword->text/html /usr/local/bin/doc2html.pl \
application/rtf->text/html /usr/local/bin/doc2html.pl \
text/rtf->text/html /usr/local/bin/doc2html.pl \
application/msexcel->text/html /usr/local/bin/doc2html.pl \
application/vnd.ms-excel->text/html /usr/local/bin/doc2html.pl \
application/pdf->text/html /usr/local/bin/doc2html.pl \
application/postscript->text/html /usr/local/bin/doc2html.pl
# disabled entries; trailing backslashes removed so they cannot continue onto the next line
#application/wordperfect5.1->text/html /usr/local/bin/doc2html.pl
#application/vnd.ms-powerpoint->text/html /usr/local/bin/doc2html.pl
#application/x-shockwave-flash->text/html /usr/local/bin/doc2html.pl
#application/x-shockwave-flash2-preview->text/html /usr/local/bin/doc2html.pl
maintainer: xxx@xxx.com
max_head_length: 10000
# very important: by default htdig does not read more than 100000 bytes of a document
#max_doc_size: 200000
max_doc_size: 100000000
no_excerpt_show_top: true
search_algorithm: exact:1 synonyms:0.5 endings:0.1
...
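Since max_doc_size decides how much of each document htdig reads, it can help to know which files are larger than a given limit before choosing a value. A minimal sketch, assuming the documents behind http://zejack/doc/ live in a local directory; the function name oversized_docs and the directory in the example call are guesses, not part of the setup above.

```shell
# List regular files under a directory that are larger than a byte limit.
# Arguments: directory, limit in bytes (compare it with max_doc_size).
oversized_docs() {
    # "-size +Nc" matches files of more than N bytes.
    find "$1" -type f -size +"$2"c
}

# Example call (hypothetical document root, default htdig limit):
# oversized_docs /var/www/htdocs/doc 100000
```

Any file listed for the limit you set would not be read in full by htdig.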
- create the htdig index
# rundig
as simple as that!
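The index is a snapshot, so it has to be rebuilt when documents change. The wrapper below is one possible way to do that from cron while keeping a log. It is a sketch: the function name reindex, the rundig location and the log file are assumptions, and -v is the verbose switch of the stock rundig script as far as I know.

```shell
# Run rundig and append its output, with a timestamp, to a log file.
# Arguments: path to rundig, path to the log file.
reindex() {
    rundig=$1 log=$2
    {
        echo "== reindex started: $(date)"
        if "$rundig" -v; then
            echo "== rundig finished"
        else
            # $? here is the exit status of the rundig call above.
            echo "== rundig failed with status $?"
        fi
    } >> "$log" 2>&1
}

# Example call (assumed paths):
# reindex /opt/www/htdig/bin/rundig /var/log/rundig.log
```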
- install the CGI tool
# cd /opt/www/cgi-bin/
# cp htsearch /var/www/cgi-bin/
- copy htdig files and images
# cd /var/www/htdocs/
# mkdir htdig
# cp /opt/www/htdocs/htdig/* htdig/
- create a search.html file with a form
A sample form is given in the ht://Dig FAQ:
http://www.htdig.org/FAQ.html#q4.9
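For a quick start, a very small form can be written straight from the shell. This is a stripped-down sketch, not the form from the FAQ: it assumes htsearch is reachable as /cgi-bin/htsearch (where it was copied above) and that the configuration file is htdig.conf, hence config=htdig. The function name write_search_form is made up.

```shell
# Write a minimal search.html into the given directory.
# Argument: target directory (must already exist).
write_search_form() {
    cat > "$1/search.html" <<'EOF'
<html>
<head><title>Search</title></head>
<body>
<form method="post" action="/cgi-bin/htsearch">
<input type="hidden" name="config" value="htdig">
<input type="hidden" name="restrict" value="">
<input type="hidden" name="exclude" value="">
Search for: <input type="text" size="30" name="words" value="">
<input type="submit" value="Search">
</form>
</body>
</html>
EOF
}

# Example call, using the directory created in the previous step:
# write_search_form /var/www/htdocs/htdig
```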
Full list of parsers, for reference:
external_parsers:
application/rtf->text/html /usr/local/bin/doc2html.pl \
text/rtf->text/html /usr/local/bin/doc2html.pl \
application/pdf->text/html /usr/local/bin/doc2html.pl \
application/postscript->text/html /usr/local/bin/doc2html.pl \
application/msword->text/html /usr/local/bin/doc2html.pl \
application/wordperfect5.1->text/html /usr/local/bin/doc2html.pl \
application/msexcel->text/html /usr/local/bin/doc2html.pl \
application/vnd.ms-excel->text/html /usr/local/bin/doc2html.pl \
application/vnd.ms-powerpoint->text/html /usr/local/bin/doc2html.pl \
application/x-shockwave-flash->text/html /usr/local/bin/doc2html.pl \
application/x-shockwave-flash2-preview->text/html /usr/local/bin/doc2html.pl
The hardware: PC Shuttle SS51G
# lspci
00:00.0 Host bridge: Silicon Integrated Systems [SiS]: Unknown device 0651 (rev 01)