Documentation for linklint version 1.35 May 24, 1997
by Jim Bowlin (bowlin@sirius.com).

Updates available at: http://www.goldwarp.com/bowlin/linklint/

INSTALLATION

Installation is straightforward. All you really need is the linklint
program (and Perl).  No other files are needed.  These instructions
are for people who are not familiar with the standard procedures.  If
you run into trouble you can always just download the linklint.txt
file and run it via "perl linklint.txt [options]".

Unix Installation

1. Download the linklint.tar or linklint.tar.gz file
2. un-gzip (if needed)    > gzip -d linklint.tar.gz
3. untar                  > tar -xf linklint.tar
4. make executable        > chmod a+x linklint
5. copy linklint to a directory on your path

If your perl program is not located at /usr/bin/perl and you want to
run linklint as a command then you will have to change the first line
of linklint to point to your perl program.  Use the command "which
perl" to find out where your perl program resides.


Windows/Dos Installation

1. download the linklint.zip file
2. unzip                  > pkunzip linklint.zip
3. copy linklint and batch files to a place on your path.
4. Edit batch files changing "\bin" to the directory containing linklint.

Windows users need to run linklint in a Dos window.

You can always run linklint as a perl script:

    > perl linklint [options]

On a Unix machine you can also run linklint as a command:

    > linklint [options]

On a Windows/Dos machine you can run linklint via the batch file
linklint.bat.  Unfortunately batch files don't allow you to redirect
the output so I've provided another batch file, linklog.bat, that
redirects all the output to the file linklint.log. You will have to
edit these one line batch files if you put linklint in a directory
other than \bin.


FILE NAME CONVENTIONS

Linklint makes the following assumptions about file names.  Html
files are assumed to have an extension of .html, .shtml, or .htm
(case insensitive).  Only these files will be parsed. 

If one of your links is to a directory instead of a file, linklint
looks for: home.html, index.html, index.shtml, index.htm, index.cgi,
wwwhome.html, or welcome.html in that directory.  If none of these
files exist then a missing default file error is given.  You can
change this behavior only by modifying the $htmlexts variable and the
@DefaultFiles array.  Please feel free to do so.  Perhaps I should
add a .linklintrc file someday.
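
The two rules above can be sketched as follows.  This is only an
illustrative Python sketch, not linklint's own Perl code: the names
HTML_EXTS, DEFAULT_FILES, is_html and find_default_file are made up
for the example, and it assumes the default files are tried in the
order listed above.

```python
import os
import re

# Hypothetical stand-ins for linklint's $htmlexts and @DefaultFiles.
HTML_EXTS = re.compile(r"\.(html|shtml|htm)$", re.IGNORECASE)
DEFAULT_FILES = ["home.html", "index.html", "index.shtml", "index.htm",
                 "index.cgi", "wwwhome.html", "welcome.html"]

def is_html(path):
    """Return True if path ends in .html, .shtml, or .htm (any case)."""
    return HTML_EXTS.search(path) is not None

def find_default_file(directory, exists=os.path.isfile):
    """Return the first default file found in directory, or None.

    None corresponds to the "missing default file" error case.
    """
    for name in DEFAULT_FILES:
        candidate = os.path.join(directory, name)
        if exists(candidate):
            return candidate
    return None
```

The exists argument is only there so the lookup can be tried without
touching the disk.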


OPERATION

Linklint reads in the command line arguments and exits quickly if
any of them do not make sense.  Local files are checked starting with
the seed file(s) from the command line.  If no filelist is specified
then linklint checks for an index file in the current (or -root)
directory. The rest of the files on the site are found by recursion.

If an @@file is specified, site checking is skipped and the data in
the @@file is used instead of actually re-checking the site.  If -u
(-unused) is specified linklint checks all directories found in its
site search (or @@file) for files not used in the site.  Lists of
found files and/or errors are printed.

If -n (-net) is specified, or if a remote link is given in the
filelist or an @file, all remote http:// links are checked.  Finally
linklint prints a brief summary of the results and exits.


CHECKING REMOTE LINKS

All remote links are sorted in alphabetical order before checking.
The first time a remote host is contacted, linklint looks for a
/robots.txt file (if it is not already cached) and obeys the robot
exclusion rules contained therein for all requests to that host.  The
IP addresses for each named host are cached in memory.

Consecutive requests to the same host are spaced 2 seconds apart.
Use the -delay option to change the amount of time between requests.
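
The spacing rule can be pictured with a small Python sketch.  It is
not taken from linklint (which is written in Perl); the class name
HostThrottle and its clock and sleep parameters are assumptions made
so the logic can be exercised without really waiting.

```python
import time

class HostThrottle:
    """Space consecutive requests to the same host `delay` seconds apart."""

    def __init__(self, delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self.last_host = None
        self.last_time = None

    def wait(self, host):
        """Call just before contacting host; sleeps only if needed."""
        if host == self.last_host and self.last_time is not None:
            remaining = self.delay - (self.clock() - self.last_time)
            if remaining > 0:
                self.sleep(remaining)
        # Remember who was contacted and when, for the next call.
        self.last_host = host
        self.last_time = self.clock()
```

Because remote links are sorted before checking, links on the same
host end up next to each other, so remembering only the previous
host is enough for this sketch.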

In order to increase performance and decrease network traffic the
first attempt at checking a file is done with the "HEAD" method.  If
this method fails (which often happens with CGI scripts) we try again
with the "GET" method.  When "GET" is used we close the connection
after the header information is read.
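
A minimal Python sketch of the HEAD-then-GET fallback is shown here.
It is an illustration only: request is a hypothetical callable that
sends one request and returns the HTTP status code after reading the
headers, and treating any status of 400 or above as "HEAD failed" is
an assumption, not something this document specifies.

```python
def check_remote(url, request):
    """Return the final status code for url.

    request(method, url) is a hypothetical callable that returns an
    HTTP status code after reading only the response headers.
    """
    status = request("HEAD", url)
    if status >= 400:
        # HEAD was refused (common with CGI scripts), so retry with GET.
        status = request("GET", url)
    return status
```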

If a server error (-4, -5, 500 or 502) is received or if we cannot
connect to the host then alternate IP addresses will be checked if
they were provided by gethostbyname(). If one of the other servers
works we generate a warning instead of an error.
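
The alternate-address rule can be sketched in Python as follows.  The
function name check_host, the try_address callable and the returned
strings are all invented for the example; in Python the address list
could come from socket.gethostbyname_ex(), whose third return value
is the list of IP addresses.

```python
def check_host(addresses, try_address):
    """Classify a host as "ok", "warning", or "error".

    addresses   -- list of IP addresses, primary address first
    try_address -- hypothetical callable: True if that address answered
    """
    for index, ip in enumerate(addresses):
        if try_address(ip):
            # Only an alternate address worked: report a warning.
            return "ok" if index == 0 else "warning"
    return "error"
```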

If a "URL has moved" (301 or 302) code is received, the new location
will be checked and a warning is generated.  The -r (-redirect) flag
will cause linklint to parse (<head> ... </head>) html headers for
possible redirects.

On Unix platforms, if it takes longer than 15 seconds to connect to a
host, linklint will give up and give a timeout error.  Use the
-timeout option to change how long linklint will wait. On Windows/Dos
linklint could possibly wait forever for a response.


ROBOT EXCLUSION PROTOCOL

The robot exclusion protocol asks that programs that automatically
retrieve web pages (robots) first look for the file robots.txt in the
root directory of the server.  This file tells where robots are
disallowed.  Linklint obeys this protocol.  Unfortunately this slows
down link checking since a robots.txt file must be downloaded for
each host visited.  As a compromise, linklint caches the relevant
exclusion information for each host in the file linklint.bot.  The
information for each host is automatically discarded after 30 days.
You can adjust the expiration time with the -expire option.
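
To show what obeying a robots.txt file means in practice, here is a
short Python sketch using the standard urllib.robotparser module.  It
does not reflect how linklint parses the file internally; the robot
name "linklint" and the example.com URLs are assumptions chosen for
the illustration.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, agent="linklint"):
    """Return True if robots_txt permits agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

With this file, a link under /private/ would be skipped while any
other link on the same host would still be checked.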

The default location for linklint.bot is your home directory
(obtained from the HOME environment variable).  If this variable does
not exist (as is the case on many Windows/Dos systems) the file is
stored in the current working directory.  You can override this
behavior by setting the environment variable LINKLINTBOT to the path
and filename of the cache file.  You can also turn off disk caching
with the -nocache flag.  The -nobot flag disables the checking of
robots.txt entirely.
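
The cache location and expiry rules can be summarized by the
following Python sketch.  The function names cache_path and
is_expired are hypothetical, and measuring age in seconds since the
entry was stored is an assumption; only the LINKLINTBOT and HOME
variables, the linklint.bot file name and the 30 day default come
from the description above.

```python
import os

def cache_path(environ=os.environ, cwd=None):
    """Pick the cache file: LINKLINTBOT, else HOME, else current dir."""
    if "LINKLINTBOT" in environ:
        return environ["LINKLINTBOT"]
    base = environ.get("HOME") or cwd or os.getcwd()
    return os.path.join(base, "linklint.bot")

def is_expired(stored_at, now, expire_days=30):
    """True if an entry stored at stored_at (seconds) is too old at now."""
    return now - stored_at > expire_days * 86400
```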


DOS/UNIX DIFFERENCES

There are a few places in the program that operate differently under
Dos and Unix.  I try to detect a Dos/Windows machine by looking for
the environment variable "windir".  If your Dos/Windows machine does
not have this environment variable set then either 1) set it to
something in your autoexec.bat or 2) change the code to look for
something else.  Likewise if your Unix machine has this environment
variable set then you may have to change the code to get it to work
properly.  The parts of the code that differ are:

  1) Getting the hostname for checking remote URLS.
  2) The timeout is disabled on Dos/Windows.
  3) \ is changed to / when getting current working directory (CWD).
  4) The leading C: is stripped from CWD under Dos.
  5) CWD is obtained with 'cd' command under Dos ('pwd' is used in Unix).


HINTS FOR LARGE SITES

1) Redirect the output of linklint to a file:

   linklint -A > linklint.log

   This file will contain all the information about your local site.

2) Then either examine the file with a text editor or with linklint:

   linklint -s @@linklint.log      produce a 1 page summary
   linklint    @@linklint.log      just show errors
   linklint -l @@linklint.log      list all files and links found
   linklint -x @@linklint.log      show cross references to errors
   linklint -u @@linklint.log      show all unused files
   etc.

3) Check remote links separately after the local site is fixed.

   linklint -wn @@linklint.log > remote.log

   This will check all the remote links that were found on your site.


INPUTS TO LINKLINT

Most flags have a long version for readability and a shorter
abbreviated version that is at most two characters long.  The short
versions can be strung together as in -lax.  The short option flags
can stand alone or be at the end of a list of flags as in "-laxtr
targetfile". All of the flags can be used in combination.  A few of
the combinations are silly.


DETAILED DESCRIPTION OF ALL COMMAND LINE FLAGS AND OPTIONS

-A (-All)

Print all directories, files, remote links, and named anchors found.
Also print cross references for all of the above and all warning
messages.  This is the same as "-list -anchor -xref -warn" or "-laxw".

-a (-anchor)

Print a list of all named anchors found.

-c (-case)

Check the case of each local file against the case used in html
references.  On Windows systems file names are case insensitive but
on Unix systems file names are case sensitive.  This flag is very
handy if you are developing a site on a Windows system that is to be
hosted on a Unix system.
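
One way to picture the case check is the Python sketch below.  It is
not linklint's implementation: the function name case_mismatch is
hypothetical, and the sketch only compares the final file name
against the directory listing, not every directory in the path.

```python
import os

def case_mismatch(path, listdir=os.listdir):
    """Return the on-disk spelling if only its case differs, else None."""
    directory, name = os.path.split(path)
    entries = listdir(directory or ".")
    if name in entries:
        return None                 # exact match, nothing to report
    for entry in entries:
        if entry.lower() == name.lower():
            return entry            # same name, different case
    return None                     # file is simply missing
```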

-db

You can generate a whole lot of interesting output with the -db flag.
Use it together with -p (-progress) to show header and robots.txt
info when checking remote links.

-delay

The next argument sets the delay in seconds between consecutive
requests to the same remote server.  The default value is 2 seconds.

-expire

The next argument sets the expire time in days for the contents of
the linklint.bot cache file which contains robot exclusion
information for every host that has been visited.

-g

The next argument is the name of a file to which linklint will send
all of its standard output.

-f (-forward)

Print forward links for each html file.  This is a listing of every
link in every html file.  Sort of silly, maybe I will get rid of it.

-h (-help)

Print a help page that has examples of simple ways to use linklint.

-i (-ignore)

The next argument is a literal expression. Any path/filename which
matches this expression is NOT checked and will not appear in the
output.  Any file that matches the expression will not be listed as
an error. This is useful for avoiding certain files and/or
directories or skipping remote links that you don't want to check.
Files or directories that are not linked by your site are skipped
automatically.

-ir (-ignorereg)

Same as -i but treats the next argument as a regular expression. To
ignore everything in the stage/ and weblogs/ directories and below
use -ir "(/stage|/weblogs)".  To ignore all files outside of the
server root directory use -ir ^FILE:
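
The difference between -i and -ir can be illustrated with a small
Python sketch.  It assumes that a literal expression matches when it
appears anywhere in the path, which is an interpretation of the text
above, and it uses Python regular expressions, which differ from
Perl's in some details.  The name make_ignore is hypothetical.

```python
import re

def make_ignore(expression, is_regex=False):
    """Return a function telling whether a path should be ignored."""
    if not is_regex:
        # -i: treat every character of the expression literally.
        expression = re.escape(expression)
    compiled = re.compile(expression)
    return lambda path: compiled.search(path) is not None

ignore_dirs = make_ignore("(/stage|/weblogs)", is_regex=True)
ignore_outside = make_ignore("^FILE:", is_regex=True)
ignore_gif = make_ignore(".gif")
```

Note that with -i the "." in ".gif" matches only a real dot, while
the same text given to -ir would match any character followed by
"gif".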

-l (-list)

Print lists of directories, files, and remote links found.  The files
are sorted according to file type: html, text, image, etc.  Remote
links are sorted by scheme: http:, ftp:, mailto:, etc. If the -x
(-xref) flag is set then a cross reference of which html files
referred to these resources is also printed.

-map file[=replace]

Strips off leading text on links that match "file" and replaces it
with "replace" if specified.  "File" should be the URL of your
server-side imagemap CGI program.  If supplied, this text is stripped
off of the beginning of any link checked (hopefully) leaving the path
info pointing to the map file.  If the path to the map files does not
start at the server root then add "=replace" and "replace" will be
prepended to the link.  Some servers don't need to use this flag
since server-side image maps point directly to the map file.
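
The effect of -map on a single link can be sketched in Python as
shown here.  The function name apply_map and the example paths
(/cgi-bin/imagemap, /maps/nav.map, /site) are made up for the
illustration and do not come from linklint.

```python
def apply_map(link, map_prefix, replace=""):
    """Strip map_prefix from the start of link and prepend replace."""
    if link.startswith(map_prefix):
        return replace + link[len(map_prefix):]
    return link  # links that do not match are left alone

# -map /cgi-bin/imagemap
plain = apply_map("/cgi-bin/imagemap/maps/nav.map", "/cgi-bin/imagemap")
# -map /cgi-bin/imagemap=/site
moved = apply_map("/cgi-bin/imagemap/maps/nav.map", "/cgi-bin/imagemap",
                  "/site")
```

Here plain is "/maps/nav.map" and moved is "/site/maps/nav.map".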

-n (-net)

Check remote "http://" URL's via network.  If you include any remote
links on the command line (or in an @file) then all remote links will
be checked.  All remote links entered on the command line or in an
@file must begin with "http://". There are three ways of checking
remote links without rechecking your entire site:

  1) Put the remote links on the command line:
     linklint http://www.goldwarp.com

  2) Make a list of links (one per line) and then use
     linklint @links.txt

  3) have linklint read its own output (see @@ below for detail)
     linklint -n @@linklint.log

If you are running linklint on a Windows/Dos machine then you should
use the -p (-progress) flag when you check external links.

-nobot

This flag turns off all checking of robots.txt files.  The disk cache
is also turned off.  This can speed up the checking of external links.


-nocache

By default the file linklint.bot is created in your home directory
to store robot exclusion information for each host that had a remote
URL checked.  Use this flag to prevent linklint from reading or writing
this file.  This will slow down remote link checking.

-o (-one)

Only check files on command line (or in an @file).  Use this flag
to check one or more files without checking the entire site.
Linklint relies on Perl or your OS for expanding (globbing) filenames
such as subdir/*.html into a list of files.  You can also put a list of
filenames (one per line) into a single file and then check only these
files with "linklint -o @list.txt".

-p (-progress)

Print "Checking: file" as each html file is parsed or remote link is
checked.  Most useful for debugging or for checking remote links to
see who are the slowpokes.

-r (-redirect)

When checking remote links, linklint reads the header of each html
file and looks for a redirection of the form <meta http-equiv="refresh"
content="any; url=...">.  This can slow down the checking of remote
links but is useful for warning you if the link has been moved.
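
Finding such a redirect could look roughly like the Python sketch
below.  It is an approximation for illustration, not linklint's
parser: the regular expressions and the name find_redirect are
assumptions, and the sketch expects http-equiv to come before
content inside the tag.

```python
import re

HEAD = re.compile(r"<head\b[^>]*>(.*?)</head>", re.IGNORECASE | re.DOTALL)
META_REFRESH = re.compile(
    r"""<meta\s+http-equiv\s*=\s*["']?refresh["']?\s+"""
    r"""content\s*=\s*["'][^"']*?url\s*=\s*([^"'>\s]+)""",
    re.IGNORECASE)

def find_redirect(html):
    """Return the redirect target found in <head>, or None."""
    head = HEAD.search(html)
    if head is None:
        return None
    match = META_REFRESH.search(head.group(1))
    return match.group(1) if match else None
```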

-root

The next argument is an absolute path to the server root directory
used for evaluating links starting with "/".  If no server root is
specified the default value is the current working directory.
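
How the server root enters into link resolution can be sketched in
Python as follows.  The function name resolve_local and the example
paths are hypothetical; resolving other links relative to the
directory of the referring file is the usual web convention and is
assumed here rather than stated in this document.

```python
import posixpath

def resolve_local(link, referrer_dir, root):
    """Map a local link to a path on disk."""
    if link.startswith("/"):
        # Links starting with "/" are evaluated from the server root.
        return posixpath.normpath(root + link)
    # Assumed: other links are relative to the referring file.
    return posixpath.normpath(posixpath.join(referrer_dir, link))
```

For example, with -root /var/www the link "/images/a.gif" resolves
to /var/www/images/a.gif no matter which file refers to it.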

-s (-summary)

Print out a summary of found and missing files instead of listing
each file explicitly.  This flag overrides the -list and -xref flags.
Very handy for a quick check of a site. Very unhandy if you just checked
a massive site and wanted some details. If used in conjunction with
-t (-target) or -tr (-targetreg) then only a summary of the files that
match the target will be given.  Thus "-st zzzzzz" will result in a
minimalist printout even if there are errors.

-server

The next argument is the hostname of the server in the form
"http://www.server.com" or "www.server.com".  This is only needed if
you have "http://" references to documents on your server that you
want linklint to check as local files.  Otherwise they will be
treated as remote links and only checked via the network when the
-n (-net) flag is set.
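
The idea behind -server can be illustrated with this Python sketch.
The function name to_local_link is invented, the host comparison is
assumed to be case insensitive, and port numbers are not handled.

```python
def to_local_link(link, server):
    """Return the local "/path" for a link on server, else None."""
    host = server[len("http://"):] if server.startswith("http://") else server
    prefix = "http://" + host
    if link.lower().startswith(prefix.lower()):
        rest = link[len(prefix):]
        # Guard against hosts that merely begin with the same text.
        if rest == "" or rest.startswith("/"):
            return rest or "/"
    return None  # not on our server: treat as a remote link
```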

-t (-target)

The next argument is a literal expression that a path/filename must
match in order to be printed.  Example: use "-lt .gif" to only see
gif files. It is most effective when used with the -l (-list) and
optionally the -x (-xref) flags.  This option only affects the
printout; it does not alter the searching.

-tr (-targetreg)

Same as -t (-target) but treats the next argument as a regular
expression. Example: use "-ltr \.(gif|jpg)$" to see only gif and jpg
files.  This flag is most useful to those who know Perl regular
expressions.

-timeout

The next argument is the amount of time (in seconds) to wait to hear
back from a remote host.  This code is currently disabled in Dos
(Windows) mode.  The default value is 15 seconds. If you set timeout
to 0 then the timeout code is not executed and linklint could wait
forever.

-u (-unused)

Print all unused (orphan) files in every directory that has
files that are used in the site.  After the site is checked we go
back to every directory that contained at least one file used in the
site and check for any other files that are not used.  This is useful
for finding old versions of pages and images that are no longer
needed.
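
The orphan search can be sketched in Python as shown here.  This is
an illustration of the idea, not linklint's code: the function names
are hypothetical, paths are assumed to use "/" separators, and only
files directly inside each directory are considered.

```python
import os
import posixpath

def list_files(directory):
    """Names of the regular files directly inside directory."""
    return [name for name in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, name))]

def find_unused(used_files, list_dir=list_files):
    """Return files that share a directory with a used file but are unused."""
    used = set(used_files)
    orphans = []
    # Visit only directories that contain at least one used file.
    for directory in sorted({posixpath.dirname(path) for path in used}):
        for name in sorted(list_dir(directory or ".")):
            path = posixpath.join(directory, name)
            if path not in used:
                orphans.append(path)
    return orphans
```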

-w (-warn)

Print out warning messages for the following conditions:

    1) no trailing "/" for a local link to a directory,
    2) any file not world readable,
    3) the use of "\" instead of "/" in any link,
    4) any local file that does not reside under the server root,
    5) redirects to relative URL's,
    6) a remote URL has been moved,
    7) a server is down but we connected with an alternate,
    8) a remote URL is ignored due to -i* flag,
    9) unable to write the linklint.bot cache file,
   10) case of file name does not match case used in reference,
   11) unterminated comments in html files

-x (-xref)

Print cross references for each file and remote link found.  This can
be useful to find out which files are referencing missing files or
bad links.


FILELIST INPUT TO LINKLINT

filelist

For normal operation the filelist is just the name of one "seed" file
which links to other files in your site. If no files are specified
then linklint looks for an index file in the server root directory.
You can also use the filelist to specify remote URL's.  These inputs
are interpreted just as if they were inside an "<a href=...>" tag, so
remote links must start with "http://".

@listfile

If you have a long list of files or URL's that you want to check you
can put them all in a file (one per line) and then use "@list.txt" as
the filelist.  NEW: If a line in the file contains "http://" then
linklint will try to extract just that link from the line.  This
simplifies the processing of (NCSA) server side image map files.
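
Extracting a link from a line of an @file might work along these
lines.  The Python sketch below is an assumption about the behavior,
not a copy of it: the name parse_list_line is hypothetical, and the
link is taken to end at the first whitespace character.

```python
import re

def parse_list_line(line):
    """Return the http:// link on the line, or the whole line if none."""
    line = line.strip()
    match = re.search(r"http://\S+", line)
    return match.group(0) if match else line

# A line as it might appear in an NCSA image map file:
link = parse_list_line("rect http://www.goldwarp.com/a.html 0,0 10,10")
```

Here link is "http://www.goldwarp.com/a.html", while a plain line
such as "docs/index.html" is returned unchanged.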

One reason to use an @file is to check a list of remote URL's using
"linklint -net @list.txt".  Another reason would be if you want to
check a list of files on your site without checking the entire site.
Then you would use "-o @list.txt" which would only check the files in
list.txt.  On most Unix machines you can generate a list of remote
links with "linklint -l | grep 'http:/' > list.txt"

@@outputfile

Linklint can read back its own output.  You generate the output file
with the -l (-list) flag and optionally the -x (-xref) and -a
(-anchor) flags thus:

     linklint -A  > linklint.log

then you can check all your remote links without rechecking the local
files with

     linklint -n @@linklint.log

Or if you are in the process of cleaning out your orphan files and
you want to recheck them without rechecking your entire site use

     linklint -u @@linklint.log

If you have a large site (or lots of errors) and you don't want to
sift through a large output you can generate a complete cross
referenced output file

    linklint -A > linklint.log

and then go back and selectively print portions of the output using
the -x (-xref) flag or the -t (-target) or -tr (-targetreg) flags:

   linklint -lt .gif  @@linklint.log
   linklint -lxt file1.html @@linklint.log

Currently the only parts of the output file that are read back in are:

  1) directories,
  2) remote links,
  3) found or missing files,
  4) found or missing named anchors,
  5) cross references to all of the above.


OUTPUT OF LINKLINT

The output of linklint is for the most part self explanatory.  The -s
(-summary) flag overrides the -l (-list) and -x (-xref) flags and
causes a terse summary to be printed instead of long lists.

If a file contains a redirect inside a <meta http-equiv=...> tag then
in addition to the standard cross referencing the redirect is
indicated by " => newfile" at the end of the file name.

If a file is found that is not underneath the server root then this
file is listed as "FILE:/absolutepath/file".  All such files are
flagged as warnings.

To produce the least amount of output possible use "-t zzzzzz" which
will cause linklint to check your site but only print the bottom
summary (unless you have file names containing "zzzzzz").


Enjoy, Jim Bowlin