WEBGRAB(1)
NAME
webgrab - fetch web page content as files
SYNOPSIS
webgrab [ -r ] [ -v ] [ -o stem ] [ -p body ] url
DESCRIPTION
Webgrab connects to the web server named in the url. It fetches the
content of the web page identified by the url and stores it
locally in a file. If the page is written in HTML, webgrab reads it to
build a list of sub-component pages (eg, frames) and images. It
fetches those, saving the content in separate files. It adds a comment
to the end of each HTML file giving the time and the file's origin.
It automatically follows redirections offered by the server.
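The scan for sub-components described above can be illustrated with a short
Python sketch. It is not the code in webgrab.b; the class and function names
are hypothetical, and the choice of tags (frame and img), the use of the src
attribute, and the removal of duplicates are all assumptions made for the
illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubcomponentLister(HTMLParser):
    """Collect the src of frame and img tags (assumed tag set)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag in ("frame", "img"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the page's own url.
                absolute = urljoin(self.base_url, src)
                if absolute not in self.found:  # assumption: fetch each once
                    self.found.append(absolute)


def list_subcomponents(html_text, base_url):
    """Return absolute urls of sub-components, in document order."""
    lister = SubcomponentLister(base_url)
    lister.feed(html_text)
    lister.close()
    return lister.found
```

Each url returned would then be fetched and saved in its own file, as the
text describes.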
The stem of the names of the output files is normally derived from a
component of the url. If the url contains a path name, the stem is the
final component of that path, less any dot-separated suffix and prefix. For
example, given
http://www.vitanuova.com/inferno/old.index.html
the stem would be index. If there is no path name, but the url contains
a domain name, the stem is the penultimate component of the
domain name (eg, excluding trailing .com, and initial www, etc). For
example, given
www.innerhost.vitanuova.com
the stem would be vitanuova. If all else fails, webgrab uses the stem
webgrab.
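The rules for choosing a stem can be sketched in Python as follows. This is
an illustration only, not the logic of webgrab.b: the function name
guess_stem is hypothetical, and the handling of names with only one dot, of
trailing slashes, and of single-component host names is assumed, since the
text does not define those cases.

```python
from urllib.parse import urlsplit


def guess_stem(url: str) -> str:
    """Derive an output stem from a url (illustrative sketch)."""
    # A bare host name such as www.example.com has no scheme; prefix "//"
    # so that urlsplit treats it as a network location, not a path.
    if "://" not in url:
        url = "//" + url
    parts = urlsplit(url)

    # Rule 1: the last path component, less dot-separated suffix and prefix.
    segments = [s for s in parts.path.split("/") if s]
    if segments:
        pieces = [p for p in segments[-1].split(".") if p]
        if len(pieces) >= 3:
            return pieces[-2]      # old.index.html -> index
        if pieces:
            return pieces[0]       # index.html -> index (assumed)

    # Rule 2: the penultimate component of the domain name.
    labels = [label for label in (parts.hostname or "").split(".") if label]
    if len(labels) >= 2:
        return labels[-2]

    # Rule 3: nothing usable was found.
    return "webgrab"
```

With the two examples from the text, guess_stem returns index and vitanuova
respectively.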
Given a stem, the initial page is stored in stem.suffix where suffix is
the suffix (eg, .html) of the name of the original page. Subordinate
pages are saved in a similar way in files named stem_1.suffix1,
stem_2.suffix2, ... .
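The naming scheme can be illustrated with a small Python sketch. The function
name output_names and the sub-component urls in the usage note are
hypothetical, and the sketch assumes that the suffix is the final
dot-separated extension of the url's path, including the dot.

```python
import posixpath
from urllib.parse import urlsplit


def output_names(stem, page_url, sub_urls):
    """Return the file names for the page and its sub-components."""

    def suffix(url):
        # Final extension of the path, including the dot ('' if none).
        return posixpath.splitext(urlsplit(url).path)[1]

    names = [stem + suffix(page_url)]          # stem.suffix
    for i, url in enumerate(sub_urls, start=1):
        names.append(f"{stem}_{i}{suffix(url)}")  # stem_1.suffix1, ...
    return names
```

For the stem index, a page old.index.html, and two made-up sub-components
logo.gif and menu.html, this yields index.html, index_1.gif and
index_2.html.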
The options are:
-r     do not fetch subcomponents (just the `raw' source of url itself)
-v     print a progress report
-vv    print a chatty progress report
-o stem
       use the stem as given
-p body
       use HTTP POST instead of GET, posting body as the data
Webgrab reads the configuration file /services/webget/config (if it
exists), to look for the address of an optional HTTP proxy (in the
entry), and a list of domains for which a proxy should not be used (in
the noproxy or noproxydoms entry). If symbolic network and service
names might be involved, the connection server lib/cs must already be
running.
FILES
/services/webget/config
SOURCE
/appl/cmd/webgrab.b
BUGS
It should read the proxy name from the charon(1) configuration file and
not the webget configuration file.
It cannot do `secure' transfers (https).
Its HTML parsing is naive, but on the other hand, it is less likely to
trip over HTML novelties.
SEE ALSO
cs(8)