Wget: The non-interactive network downloader.
Copyright (C) 1996, 1997, 1998 Free Software Foundation, Inc.
Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.
Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided also that the sections entitled "Copying" and "GNU General Public License" are included exactly as in the original, and provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.
Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Free Software Foundation.
GNU Wget is a freely available network utility to retrieve files from the World Wide Web, using HTTP (Hyper Text Transfer Protocol) and FTP (File Transfer Protocol), the two most widely used Internet protocols. It has many useful features to make downloading easier, some of them being:
norobots convention.
REST with FTP and Range with HTTP servers that support them.
By default, Wget is very simple to invoke. The basic syntax is:
wget [option]... [URL]...
Wget will simply download all the URLs specified on the command line. URL is a Uniform Resource Locator, as defined below.
However, you may wish to change some of the default parameters of Wget. You can do it in two ways: permanently, by adding the appropriate command to `.wgetrc' (see section Startup File), or by specifying it on the command line.
URL is an acronym for Uniform Resource Locator. A uniform resource locator is a compact string representation for a resource available via the Internet. Wget recognizes the URL syntax as per RFC1738. This is the most widely used form (square brackets denote optional parts):
http://host[:port]/directory/file
ftp://host[:port]/directory/file
You can also encode your username and password within a URL:
ftp://user:password@host/path
http://user:password@host/path
Either user or password, or both, may be left out. If you leave out either the HTTP username or password, no authentication will be sent. If you leave out the FTP username, `anonymous' will be used. If you leave out the FTP password, your email address will be supplied as a default password.(1)
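The defaulting rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not Wget's own code; the function name `effective_credentials` and the placeholder email address are assumptions made for the example.

```python
from urllib.parse import urlsplit


def effective_credentials(url, email="user@example.invalid"):
    """Return the (username, password) pair implied by the rules above,
    or None when no authentication would be sent.

    `email` stands in for the address supplied as the default FTP
    password; the value used here is a placeholder (assumption).
    """
    parts = urlsplit(url)
    user, password = parts.username, parts.password
    if parts.scheme == "ftp":
        # Missing FTP username -> anonymous; missing password -> email.
        return (user or "anonymous", password or email)
    if parts.scheme == "http":
        # HTTP needs both parts; otherwise nothing is sent.
        if user and password:
            return (user, password)
        return None
    raise ValueError("unsupported scheme: %r" % parts.scheme)
```

For instance, `effective_credentials("ftp://host/path")` yields the anonymous login with the placeholder address as password, while an HTTP URL carrying only a username yields `None`.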
You can encode unsafe characters in a URL as `%xy', xy being the hexadecimal representation of the character's ASCII value. Some common unsafe characters include `%' (quoted as `%25'), `:' (quoted as `%3A'), and `@' (quoted as `%40'). Refer to RFC1738 for a comprehensive list of unsafe characters.
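If you need to build such URLs from a script, the quoting can be done mechanically. The following is a minimal sketch using Python's standard `urllib.parse` module; the helper name `encode_unsafe` is an assumption made for the example and is not part of Wget.

```python
from urllib.parse import quote, unquote


def encode_unsafe(text):
    """Quote a URL component (for example a password) as %xy escapes."""
    # safe="" makes quote() escape every reserved character, including
    # '/', ':' and '@', which would otherwise confuse the URL syntax.
    return quote(text, safe="")


# A password containing ':' and '@' can then be embedded in a URL.
password = encode_unsafe("my:pass@word")
url = "ftp://user:%s@host/path" % password
```

The reverse operation is `unquote`, which turns `%40` back into `@`.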
Wget also supports the type feature for FTP URLs. By default, FTP documents are retrieved in the binary mode (type `i'), which means that they are downloaded unchanged. Another useful mode is the `a' (ASCII) mode, which converts the line delimiters between the different operating systems, and is thus useful for text files. Here is an example:
ftp://host/directory/file;type=a
Two alternative variants of URL specification are also supported, because of historical (hysterical?) reasons and their wide-spreadedness.
FTP-only syntax (supported by NcFTP):
host:/dir/file
HTTP-only syntax (introduced by Netscape):
host[:port]/dir/file
These two alternative forms are deprecated, and may cease being supported in the future.
If you do not understand the difference between these notations, or do not know which one to use, just use the plain ordinary format you use with your favorite browser, like Lynx or Netscape.
Since Wget uses GNU getopt to process its arguments, every option has a short form and a long form. Long options are easier to remember, but take longer to type. You may freely mix different option styles, or specify options after the command-line arguments. Thus you may write:
wget -r --tries=10 http://fly.cc.fer.hr/ -o log
The space between the option accepting an argument and the argument may be omitted. Instead of `-o log' you can write `-olog'.
You may put several options that do not require arguments together, like:
wget -drc URL
This is a complete equivalent of:
wget -d -r -c URL
Since the options can be specified after the arguments, you may terminate them with `--'. So the following will try to download URL `-x', reporting failure to `log':
wget -o log -- -x
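These conventions are those of GNU-style option parsing in general, and can be tried out with Python's `getopt.gnu_getopt', which follows the same rules. The sketch below is only an illustration: the option string covers just the handful of options used in the examples above, and the helper name `parse` is an assumption.

```python
import getopt

# Only the options used in the examples above; a trailing ':' or '='
# marks an option that takes an argument.
SHORT_OPTS = "drco:t:"
LONG_OPTS = ["tries="]


def parse(argv):
    """Split argv into (options, non-option arguments), GNU style."""
    # gnu_getopt allows options to appear after non-option arguments
    # (unless the POSIXLY_CORRECT environment variable is set).
    return getopt.gnu_getopt(argv, SHORT_OPTS, LONG_OPTS)


mixed = parse(["-r", "--tries=10", "http://fly.cc.fer.hr/", "-o", "log"])
bundled = parse(["-drc", "URL"])
terminated = parse(["-o", "log", "--", "-x"])
```

Here `bundled` is identical to the result of parsing `-d -r -c URL', and in `terminated' the `-x' after `--' is treated as a URL rather than as an option.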
The options that accept comma-separated lists all respect the convention that specifying an empty list clears its value. This can be useful to clear the `.wgetrc' settings. For instance, if your `.wgetrc' sets exclude_directories to `/cgi-bin', the following example will first reset it, and then set it to exclude `/~nobody' and `/~somebody'. You can also clear the lists in `.wgetrc' (see section Wgetrc Syntax).
wget -X '' -X /~nobody,/~somebody
<base href="url"> to the documents or by specifying `--base=url' on the command line.
<base href="url"> to HTML, or using the `--base' command-line option.
wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
If there is a file named `ls-lR.Z' in the current directory, Wget will assume that it is the first portion of the remote file, and will ask the server to continue the retrieval from an offset equal to the length of the local file. Note that you need not specify this option if all you want is for Wget to continue retrieving where it left off when the connection is lost--Wget does this by default. You need this option only when you want to continue retrieval of a file already halfway retrieved, saved by another FTP client, or left by Wget being killed. Without `-c', the previous example would just begin to download the remote file to `ls-lR.Z.1'. The `-c' option is also applicable for HTTP servers that support the Range header.
In the default style each dot represents 1K, there are ten dots in a cluster and 50 dots in a line. The binary style has a more "computer"-like orientation--8K dots, 16-dot clusters and 48 dots per line (which makes for 384K lines). The mega style is suitable for downloading very large files--each dot represents 64K retrieved, there are eight dots in a cluster, and 48 dots on each line (so each line contains 3M). The micro style is exactly the reverse; it is suitable for downloading small files, with 128-byte dots, 8 dots per cluster, and 48 dots (6K) per line.
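The per-line figures quoted for the styles follow directly from multiplying the dot size by the number of dots per line. The short Python sketch below merely redoes that arithmetic from the numbers given in the text; the table name `DOT_STYLES` and the function name are assumptions made for the example.

```python
# (bytes per dot, dots per cluster, dots per line), as given in the text.
DOT_STYLES = {
    "default": (1024, 10, 50),
    "binary": (8 * 1024, 16, 48),
    "mega": (64 * 1024, 8, 48),
    "micro": (128, 8, 48),
}


def bytes_per_line(style):
    """Amount of data represented by one full line of dots."""
    dot_bytes, _dots_per_cluster, dots_per_line = DOT_STYLES[style]
    return dot_bytes * dots_per_line
```

So a binary-style line stands for 8K x 48 = 384K, a mega-style line for 64K x 48 = 3M, and a micro-style line for 128 x 48 bytes = 6K.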
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real WWW spiders.
m suffix, in hours using h suffix, or in days using d suffix.
Specifying a large value for this option is useful if the network or the
destination host is down, so that Wget can wait long enough to
reasonably expect the network error to be fixed before the retry.
No options        -> ftp.xemacs.org/pub/xemacs/
-nH               -> pub/xemacs/
-nH --cut-dirs=1  -> xemacs/
-nH --cut-dirs=2  -> .
--cut-dirs=1      -> ftp.xemacs.org/xemacs/
...
If you just want to get rid of the directory structure, this option is similar to a combination of `-nd' and `-P'. However, unlike `-nd', `--cut-dirs' does not flatten subdirectories--for instance, with `-nH --cut-dirs=1', a `beta/' subdirectory will be placed in `xemacs/beta', as one would expect.
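The table above can be reproduced with a small Python sketch. It models only the effect of `-nH' and `--cut-dirs' on the directory name, as described here; it is not Wget's implementation, and the function and parameter names are assumptions made for the example.

```python
def local_directory(host, remote_dir, no_host=False, cut_dirs=0):
    """Local directory for files found in remote_dir on host.

    no_host models `-nH'; cut_dirs models `--cut-dirs=N'.
    """
    components = [c for c in remote_dir.split("/") if c]
    # Drop the first N remote directory components.
    components = components[cut_dirs:]
    if not no_host:
        # The host-prefixed directory is kept unless -nH is given.
        components.insert(0, host)
    return "/".join(components) + "/" if components else "."
```

For example, `local_directory("ftp.xemacs.org", "pub/xemacs/beta/", no_host=True, cut_dirs=1)` gives `xemacs/beta/`, matching the `beta/' case described above.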
basic (insecure) or the digest authentication scheme.
Another way to specify username and password is in the URL itself (see section URL Format). For more information about security issues with Wget, see section Security Considerations.
Content-Length headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.
With this option, Wget will ignore the Content-Length header--as if it never existed.
wget --header='Accept-Charset: iso-8859-2' \
     --header='Accept-Language: hr' \
     http://fly.cc.fer.hr/
Specification of an empty string as the header value will clear all previous user-defined headers.
basic authentication scheme.
User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as `Wget/version', version being the current version number of Wget.
However, some sites have been known to impose the policy of tailoring the output according to the User-Agent-supplied information. While conceptually this is not such a bad idea, it has been abused by servers denying information to clients other than Mozilla or Microsoft Internet Explorer. This option allows you to change the User-Agent line issued by Wget. Use of this option is discouraged, unless you really know what you are doing.
NOTE that Netscape Communications Corp. has claimed that false transmissions of `Mozilla' as the User-Agent are a copyright infringement, which will be prosecuted. DO NOT misrepresent Wget as Mozilla.
wget ftp://gnjilux.cc.fer.hr/*.msg
By default, globbing will be turned on if the URL contains a globbing character. This option may be used to turn globbing on or off permanently. You may have to quote the URL to protect it from being expanded by your shell. Globbing makes Wget look for a directory listing, which is system-specific. This is why it currently works only with Unix FTP servers (and the ones emulating Unix ls output).
wget -r -nd --delete-after http://whatever.com/~popular/page/
The `-r' option is to retrieve recursively, and `-nd' not to create directories.
GNU Wget is capable of traversing parts of the Web (or a single HTTP or FTP server), depth-first following links and directory structure. This is called recursive retrieving, or recursion.
With HTTP URLs, Wget retrieves and parses the HTML from the given URL, retrieving the files the HTML document refers to through markup like href or src. If the freshly downloaded file is also of type text/html, it will be parsed and followed further.
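The link-gathering step can be pictured with a short Python sketch built on the standard `html.parser' module. It collects every href and src attribute and resolves it against the document's URL; this is a simplified illustration, not Wget's parser, and the class and function names are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Relative references are resolved against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links
```

A recursive retriever would then fetch each collected URL and, for those that turn out to be HTML, repeat the process until the depth limit is reached.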
The maximum depth to which the retrieval may descend is specified with the `-l' option (the default maximum depth is five layers). See section Recursive Retrieval.
When retrieving an FTP URL recursively, Wget will retrieve all the data from the given directory tree (including the subdirectories up to the specified depth) on the remote server, creating its mirror image locally. FTP retrieval is also limited by the depth parameter.
By default, Wget will create a local directory tree, corresponding to the one found on the remote server.
Recursive retrieving can find a number of applications, the most important of which is mirroring. It is also useful for WWW presentations, and any other opportunities where slow network connections should be bypassed by storing the files locally.
You should be warned that invoking recursion may cause grave overloading on your system, because of the fast exchange of data through the network; all of this may hamper other users' work. The same stands for the foreign server you are mirroring--the more requests it gets in a row, the greater is its load.
Careless retrieving can also fill your file system uncontrollably, which can grind the machine to a halt.
The load can be minimized by lowering the maximum recursion level (`-l') and/or by lowering the number of retries (`-t'). You may also consider using the `-w' option to slow down your requests to the remote servers, as well as the numerous options to narrow the number of followed links (See section Following Links).
Recursive retrieval is a good thing when used properly. Please take all precautions not to wreak havoc through carelessness.
When retrieving recursively, one does not wish to retrieve loads of unnecessary data. Most of the time users have in mind exactly what they want to download, and want Wget to follow only specific links.
For example, if you wish to download the music archive from `fly.cc.fer.hr', you will not want to download all the home pages that happen to be referenced by an obscure part of the archive.
Wget possesses several mechanisms that allow you to fine-tune which links it will follow.
When only relative links are followed (option `-L'), recursive retrieving will never span hosts. No time-expensive DNS-lookups will be performed, and the process will be very fast, with the minimum strain of the network. This will suit your needs often, especially when mirroring the output of various x2html converters, since they generally output relative links.
The drawback of following only relative links is that humans often tend to mix them with absolute links to the very same host, and the very same page.
In this mode (which is the default mode for following links) all URLs that refer to the same host will be retrieved.
The problem with this option is the aliases of the hosts and domains. From the names alone there is no way for Wget to know that `regoc.srce.hr' and `www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the same as `fly.cc.etf.hr'. Whenever an absolute link is encountered, the host is DNS-looked-up with gethostbyname to check whether we are maybe dealing with the same host. Although the results of gethostbyname are cached, it is still a great slowdown, e.g. when dealing with large indices of home pages on different hosts (because each of the hosts must be DNS-resolved to see whether it just might be an alias of the starting host).
To avoid the overhead you may use `-nh', which will turn off DNS-resolving and make Wget compare hosts literally. This will make things run much faster, but also much less reliably (e.g. `www.srce.hr' and `regoc.srce.hr' will be flagged as different hosts).
Note that modern HTTP servers allow one IP address to host several virtual servers, each having its own directory hierarchy. Such "servers" are distinguished by their hostnames (all of which point to the same IP address); for this to work, a client must send a Host header, which is what Wget does. However, in that case Wget must not try to divine a host's "real" address, nor try to use the same hostname for each access, i.e. `-nh' must be turned on.
In other words, the `-nh' option must be used to enable the retrieval from virtual servers distinguished by their hostnames. As the number of such server setups grows, the behavior of `-nh' may become the default in the future.
With the `-D' option you may specify the domains that will be followed. Hosts whose domain is not in this list will not be DNS-resolved. Thus you can specify `-Dmit.edu' just to make sure that nothing outside of MIT gets looked up. This is very important and useful. It also means that `-D' does not imply `-H' (span all hosts), which must be specified explicitly. Feel free to use this option, since it will speed things up, with almost all the reliability of checking for all hosts. Thus you could invoke
wget -r -D.hr http://fly.cc.fer.hr/
to make sure that only the hosts in `.hr' domain get DNS-looked-up for being equal to `fly.cc.fer.hr'. So `fly.cc.etf.hr' will be checked (only once!) and found equal, but `www.gnu.ai.mit.edu' will not even be checked.
Of course, domain acceptance can be used to limit the retrieval to particular domains with spanning of hosts in them, but then you must specify `-H' explicitly. E.g.:
wget -r -H -Dmit.edu,stanford.edu http://www.mit.edu/
will start with `http://www.mit.edu/', following links across MIT and Stanford.
If there are domains you want to exclude specifically, you can do it with `--exclude-domains', which accepts the same type of arguments as `-D', but will exclude all the listed domains. For example, if you want to download all the hosts from `foo.edu' domain, with the exception of `sunsite.foo.edu', you can do it like this:
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu http://www.foo.edu/
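The combined effect of `-D' and `--exclude-domains' can be pictured with the following Python sketch. It is an illustration under stated assumptions, not Wget's code: in particular, matching a host against a domain on whole-label boundaries is an assumption of this sketch, and the function names are invented for the example.

```python
def in_domain(host, domain):
    """True if host is the domain itself or lies underneath it."""
    host = host.lower().rstrip(".")
    domain = domain.lower().lstrip(".")
    # Assumption: compare on label boundaries, so that `foo.edu'
    # matches `www.foo.edu' but not `notfoo.edu'.
    return host == domain or host.endswith("." + domain)


def host_accepted(host, accept_domains, exclude_domains=()):
    """Apply the exclusion list first, then the acceptance list."""
    if any(in_domain(host, d) for d in exclude_domains):
        return False
    return any(in_domain(host, d) for d in accept_domains)
```

With `accept_domains=["foo.edu"]` and `exclude_domains=["sunsite.foo.edu"]`, the host `www.foo.edu` is accepted while `sunsite.foo.edu` and any host outside `foo.edu` are not.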
When `-H' is specified without `-D', all hosts are freely spanned. There are no restrictions whatsoever as to what part of the net Wget will go to fetch documents, other than maximum retrieval depth. If a page references `www.yahoo.com', so be it. Such an option is rarely useful by itself.
When downloading material from the web, you will often want to restrict the retrieval to only certain file types. For example, if you are interested in downloading GIFs, you will not be overjoyed to get loads of PostScript documents, and vice versa.
Wget offers two options to deal with this problem. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.
The `-A' and `-R' options may be combined to achieve even better fine-tuning of which files to retrieve. E.g. `wget -A "*zelazny*" -R .ps' will download all the files having `zelazny' as a part of their name, but not the PostScript files.
Note that these two options do not affect the downloading of HTML files; Wget must load all the HTMLs to know where to go at all--recursive retrieval would make no sense otherwise.
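The way the accept and reject lists combine can be sketched in Python as follows. This is a rough model only: treating an entry with wildcard characters as a pattern and any other entry as a suffix is an assumption based on the examples in this manual, recognizing HTML files by their file name suffix is a further simplification, and all names in the sketch are invented for the example.

```python
from fnmatch import fnmatchcase

WILDCARDS = set("*?[]")


def matches(filename, rule):
    """Match a file name against one accept/reject list entry."""
    if WILDCARDS & set(rule):
        return fnmatchcase(filename, rule)   # entry is a pattern
    return filename.endswith(rule)           # entry is a suffix


def wanted(filename, accept=(), reject=()):
    """Decide whether a file passes the accept and reject lists."""
    # Simplification: HTML is recognized by suffix and always loaded,
    # since it is needed to find further links.
    if filename.endswith((".html", ".htm")):
        return True
    if accept and not any(matches(filename, r) for r in accept):
        return False
    return not any(matches(filename, r) for r in reject)
```

With `accept=["*zelazny*"]` and `reject=[".ps"]`, a file called `zelazny-amber.txt` passes, `zelazny-amber.ps` is rejected, and a file without `zelazny` in its name is never accepted.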
Regardless of other link-following facilities, it is often useful to place the restriction of what files to retrieve based on the directories those files are placed in. There can be many reasons for this--the home pages may be organized in a reasonable directory structure; or some directories may contain useless information, e.g. `/cgi-bin' or `/dev' directories.
Wget offers three different options to deal with this requirement. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.
wget -I /people,/cgi-bin http://host/people/bozo/
wget -r --no-parent http://somehost/~luzer/my-archive/
You may rest assured that none of the references to `/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be followed. Only the archive you are interested in will be downloaded. Essentially, `--no-parent' is similar to `-I/~luzer/my-archive', only it handles redirections in a more intelligent fashion.
The rules for FTP are somewhat specific, as it is necessary for them to be. FTP links in HTML documents are often included for purposes of reference, and it is often inconvenient to download them by default.
To have FTP links followed from HTML documents, you need to specify the `--follow-ftp' option. Having done that, FTP links will span hosts regardless of `-H' setting. This is logical, as FTP links rarely point to the same host where the HTTP server resides. For similar reasons, the `-L' option has no effect on such downloads. On the other hand, domain acceptance (`-D') and suffix rules (`-A' and `-R') apply normally.
Also note that followed links to FTP directories will not be retrieved recursively further.
One of the most important aspects of mirroring information from the Internet is updating your archives.
Downloading the whole archive again and again, just to replace a few changed files is expensive, both in terms of wasted bandwidth and money, and the time to do the update. This is why all the mirroring tools offer the option of incremental updating.
Such an updating mechanism means that the remote server is scanned in search of new files. Only those new files will be downloaded in the place of the old ones.
A file is considered new if one of these two conditions is met:
To implement this, the program needs to be aware of the time of last modification of both remote and local files. Such information is called a time-stamp.
The time-stamping in GNU Wget is turned on using the `--timestamping' (`-N') option, or through the timestamping = on directive in `.wgetrc'. With this option, for each file it intends to download, Wget will check whether a local file of the same name exists. If it does, and the remote file is older, Wget will not download it.
If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the time-stamps say.
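The decision described in the last two paragraphs can be written down as a short Python sketch. It only restates the rules given here and is not Wget's code; the function name and its parameters are assumptions, and treating equal time-stamps as "no download" is likewise an assumption, since the text does not cover that case.

```python
def should_download(remote_mtime, remote_size,
                    local_mtime=None, local_size=None):
    """Decide whether a remote file should be fetched under `-N'.

    Times are seconds since the epoch; local_mtime is None when no
    local file of the same name exists.
    """
    if local_mtime is None:
        return True            # no local copy: always download
    if local_size != remote_size:
        return True            # sizes differ: download regardless
    # Assumption: equal time-stamps count as "not newer".
    return remote_mtime > local_mtime
```

For example, a remote file with the same size and an older time-stamp than the local copy is skipped, while a size mismatch forces a download no matter what the time-stamps say.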
The usage of time-stamping is simple. Say you would like to download a file so that it keeps its date of modification.
wget -S http://www.gnu.ai.mit.edu/
A simple ls -l shows that the time stamp on the local file equals the state of the Last-Modified header, as returned by the server. As you can see, the time-stamping info is preserved locally, even without `-N'.
Several days later, you would like Wget to check if the remote file has changed, and download it if it has.
wget -N http://www.gnu.ai.mit.edu/
Wget will ask the server for the last-modified date. If the local file is newer, the remote file will not be re-fetched. However, if the remote file is more recent, Wget will proceed fetching it normally.
The same goes for FTP. For example:
wget ftp://ftp.ifi.uio.no/pub/emacs/gnus/*
ls will show that the timestamps are set according to the state on the remote server. Reissuing the command with `-N' will make Wget re-fetch only the files that have been modified.
In both HTTP and FTP retrieval Wget will time-stamp the local file correctly (with or without `-N') if it gets the stamps, i.e. gets the directory listing for FTP or the Last-Modified header for HTTP.
If you wished to mirror the GNU archive every week, you would use the following command every week:
wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/
Time-stamping in HTTP is implemented by checking of the Last-Modified header. If you wish to retrieve the file `foo.html' through HTTP, Wget will check whether `foo.html' exists locally. If it doesn't, `foo.html' will be retrieved unconditionally.
If the file does exist locally, Wget will first check its local time-stamp (similar to the way ls -l checks it), and then send a HEAD request to the remote server, demanding the information on the remote file.
The Last-Modified header is examined to find which file was modified more recently (which makes it "newer"). If the remote file is newer, it will be downloaded; if it is older, Wget will give up.(2)
Arguably, HTTP time-stamping should be implemented using the If-Modified-Since request.
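The comparison of a Last-Modified value with a local time-stamp can be sketched in Python with the standard `email.utils' module, which parses HTTP-style dates. This is an illustration of the comparison only, not of Wget's internals; the function name is an assumption, and the header value in the usage note is just a sample date.

```python
from email.utils import parsedate_to_datetime


def remote_is_newer(last_modified_header, local_mtime):
    """Compare a Last-Modified header with a local modification time.

    local_mtime is in seconds since the epoch, as returned by
    os.path.getmtime() for the local file.
    """
    remote = parsedate_to_datetime(last_modified_header)
    return remote.timestamp() > local_mtime
```

For example, `remote_is_newer("Sun, 06 Nov 1994 08:49:37 GMT", os.path.getmtime("foo.html"))` would tell whether `foo.html' on the server is more recent than the local copy (after importing `os').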
In theory, FTP time-stamping works much the same as HTTP, only FTP has no headers--time-stamps must be received from the directory listings.
For each directory files must be retrieved from, Wget will use the LIST command to get the listing. It will try to analyze the listing, assuming that it is a Unix ls -l listing, and extract the time-stamps. The rest is exactly the same as for HTTP.
The assumption that every directory listing is a Unix-style listing may sound extremely constraining, but in practice it is not, as many non-Unix FTP servers use the Unixoid listing format because most (all?) of the clients understand it. Bear in mind that RFC959 defines no standard way to get a file list, let alone the time-stamps. We can only hope that a future standard will define this.
Another non-standard solution includes the use of the MDTM command that is supported by some FTP servers (including the popular wu-ftpd), which returns the exact time of the specified file. Wget may support this command in the future.
Once you know how to change default settings of Wget through command line arguments, you may wish to make some of those settings permanent. You can do that in a convenient way by creating the Wget startup file, `.wgetrc'.
Although `.wgetrc' is the "main" initialization file, it is convenient to have a special facility for storing passwords. Thus Wget reads and interprets the contents of `$HOME/.netrc', if it finds it. You can find the `.netrc' format in your system manuals.
Wget reads `.wgetrc' upon startup, recognizing a limited set of commands.
When initializing, Wget will look for a global startup file, `/usr/local/etc/wgetrc' by default (or some prefix other than `/usr/local', if Wget was not installed there) and read commands from there, if it exists.
Then it will look for the user's file. If the environment variable WGETRC is set, Wget will try to load that file. Failing that, no further attempts will be made.
If WGETRC is not set, Wget will try to load `$HOME/.wgetrc'.
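The lookup order can be summarized in a few lines of Python. The sketch below only mirrors the order described above and is not Wget's code; the function name and the injectable `exists` parameter (used so the logic can be tried without touching the file system) are assumptions made for the example.

```python
import os


def startup_files(environ, global_rc="/usr/local/etc/wgetrc",
                  exists=os.path.exists):
    """Return the startup files that would be read, in load order."""
    files = []
    if exists(global_rc):
        files.append(global_rc)          # global file comes first
    user_rc = environ.get("WGETRC")
    if user_rc is None:
        # WGETRC not set: fall back to $HOME/.wgetrc.
        home = environ.get("HOME")
        user_rc = home + "/.wgetrc" if home else None
    # If WGETRC is set but the file is missing, nothing else is tried.
    if user_rc and exists(user_rc):
        files.append(user_rc)
    return files
```

Because the user's file comes last in the returned list, its settings are applied after the global ones.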
The fact that user's settings are loaded after the system-wide ones means that in case of collision user's wgetrc overrides the system-wide wgetrc (in `/usr/local/etc/wgetrc' by default). Fascist admins, away!
The syntax of a wgetrc command is simple:
variable = value
The variable will also be called command. Valid values are different for different commands.
The commands are case-insensitive and underscore-insensitive. Thus `DIr__PrefiX' is the same as `dirprefix'. Empty lines, lines beginning with `#' and lines containing white-space only are discarded.
Commands that expect a comma-separated list will clear the list on an empty command. So, if you wish to reset the rejection list specified in global `wgetrc', you can do it with:
reject =
The complete set of commands is listed below, the letter after `=' denoting the value the command takes. It is `on/off' for `on' or `off' (which can also be `1' or `0'), string for any non-empty string or n for a positive integer. For example, you may specify `use_proxy = off' to disable use of proxy servers by default. You may use `inf' for infinite values, where appropriate.
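The syntax rules above are simple enough to capture in a short Python sketch. It is an illustration of the stated rules (comment and blank-line handling, case- and underscore-insensitive command names, `on/off' values), not Wget's parser, and the function names are assumptions made for the example.

```python
def parse_wgetrc(text):
    """Return a list of (command, value) pairs from wgetrc text."""
    settings = []
    for line in text.splitlines():
        # Empty lines, whitespace-only lines and '#' lines are discarded.
        if not line.strip() or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a 'variable = value' line: %r" % line)
        # Commands are case-insensitive and underscore-insensitive.
        name = name.strip().lower().replace("_", "")
        settings.append((name, value.strip()))
    return settings


def parse_on_off(value):
    """Interpret an on/off value (`1' and `0' are also accepted)."""
    lowered = value.lower()
    if lowered in ("on", "1"):
        return True
    if lowered in ("off", "0"):
        return False
    raise ValueError("expected on/off, got %r" % value)
```

Note that `reject =` comes out with an empty value, which is how a list would be cleared.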
Most of the commands have their equivalent command-line option (See section Invoking), except some more obscure or rarely used ones.
Content-Length header; the same as `--ignore-length'.
Content-Length.
This is the sample initialization file, as given in the distribution. It is divided into two sections--one for global usage (suitable for the global startup file), and one for local usage (suitable for `$HOME/.wgetrc'). Be careful about the things you change.
Note that all the lines are commented out. For any line to have effect, you must remove the `#' prefix at the beginning of the line.
###
### Sample Wget initialization file .wgetrc
###

## You can use this file to change the default behaviour of wget or to
## avoid having to type many many command-line options. This file does
## not contain a comprehensive list of commands -- look at the manual
## to find out what you can put into this file.
##
## Wget initialization file can reside in /usr/local/etc/wgetrc
## (global, for all users) or $HOME/.wgetrc (for a single user).
##
## To use any of the settings in this file, you will have to uncomment
## them (and probably change them).

##
## Global settings (useful for setting up in /usr/local/etc/wgetrc).
## Think well before you change them, since they may reduce wget's
## functionality, and make it behave contrary to the documentation:
##

# You can set retrieve quota for beginners by specifying a value
# optionally followed by 'K' (kilobytes) or 'M' (megabytes). The
# default quota is unlimited.
#quota = inf

# You can lower (or raise) the default number of retries when
# downloading a file (default is 20).
#tries = 20

# Lowering the maximum depth of the recursive retrieval is handy to
# prevent newbies from going too "deep" when they unwittingly start
# the recursive retrieval. The default is 5.
#reclevel = 5

# Many sites are behind firewalls that do not allow initiation of
# connections from the outside. On these sites you have to use the
# `passive' feature of FTP. If you are behind such a firewall, you
# can turn this on to make Wget use passive FTP by default.
#passive_ftp = off

##
## Local settings (for a user to set in his $HOME/.wgetrc). It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
##
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
##

# Set this to on to use timestamping by default:
#timestamping = off

# It is a good idea to make Wget send your email address in a `From:'
# header with your request (so that server administrators can contact
# you in case of errors). Wget does *not* send `From:' by default.
#header = From: Your Name <username@site.domain>

# You can set up other headers, like Accept-Language. Accept-Language
# is *not* sent by default.
#header = Accept-Language: en

# You can set the default proxy for Wget to use. It will override the
# value in the environment.
#http_proxy = http://proxy.yoyodyne.com:18023/

# If you do not want to use proxy at all, set this to off.
#use_proxy = on

# You can customize the retrieval outlook. Valid options are default,
# binary, mega and micro.
#dot_style = default

# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
# the default!
#robots = on

# It can be useful to make Wget wait between connections. Set this to
# the number of seconds you want Wget to wait.
#wait = 0

# You can force creating directory structure, even if a single is being
# retrieved, by setting this to on.
#dirstruct = off

# You can turn on recursive retrieving by default (don't do this if
# you are not sure you know what it means) by setting this to on.
#recursive = off

# To have Wget follow FTP links from HTML files by default, set this
# to on:
#follow_ftp = off
The examples are classified into three sections, for clarity. The first section is a tutorial for beginners. The second section explains some of the more complex program features. The third section contains advice for mirror administrators, as well as even more complex features (that some would call perverted).
wget http://fly.cc.fer.hr/
The response will be something like:
--13:30:45--  http://fly.cc.fer.hr:80/en/
           => `index.html'
Connecting to fly.cc.fer.hr:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 4,694 [text/html]

    0K -> ....                                                   [100%]

13:30:46 (23.75 KB/s) - `index.html' saved [4694/4694]
wget --tries=45 http://fly.cc.fer.hr/jpg/flyweb.jpg
wget -t 45 -o log http://fly.cc.fer.hr/jpg/flyweb.jpg &
The ampersand at the end of the line makes sure that Wget works in the background. To unlimit the number of retries, use `-t inf'.
$ wget ftp://gnjilux.cc.fer.hr/welcome.msg
--10:08:47--  ftp://gnjilux.cc.fer.hr:21/welcome.msg
           => `welcome.msg'
Connecting to gnjilux.cc.fer.hr:21... connected!
Logging in as anonymous ... Logged in!
==> TYPE I ... done.  ==> CWD not needed.
==> PORT ... done.    ==> RETR welcome.msg ... done.
Length: 1,340 (unauthoritative)

    0K -> .                                                      [100%]

10:08:48 (1.28 MB/s) - `welcome.msg' saved [1340]
wget ftp://prep.ai.mit.edu/pub/gnu/
lynx index.html
wget -i file
If you specify `-' as file name, the URLs will be read from standard input.
wget -r -t1 http://www.gnu.ai.mit.edu/ -o gnulog
wget -r -l1 http://www.yahoo.com/
wget -S http://www.lycos.com/
wget -s http://www.lycos.com/
more index.html
wget -P/tmp -l2 ftp://wuarchive.wustl.edu/
wget -r -l1 --no-parent -A.gif http://host/dir/
It is a bit of a kludge, but it works. `-r -l1' means to retrieve recursively (see section Recursive Retrieval), with maximum depth of 1. `--no-parent' means that references to the parent directory are ignored (see section Directory-Based Limits), and `-A.gif' means to download only the GIF files. `-A "*.gif"' would have worked too.
wget -nc -r http://www.gnu.ai.mit.edu/
wget ftp://hniksic:mypassword@jagor.srce.hr/.emacs
wget --dot-style=binary ftp://prep.ai.mit.edu/pub/gnu/README
You can experiment with other styles, like:
wget --dot-style=mega ftp://ftp.xemacs.org/pub/xemacs/xemacs-20.4/xemacs-20.4.tar.gz
wget --dot-style=micro http://fly.cc.fer.hr/
To make these settings permanent, put them in your `.wgetrc', as described before (see section Sample Wgetrc).
crontab
0 0 * * 0 wget --mirror ftp://ftp.xemacs.org/pub/xemacs/ -o /home/me/weeklog
wget --mirror -A.html http://www.w3.org/
wget -rN -Dsrce.hr http://www.srce.hr/
Now Wget will correctly find out that `regoc.srce.hr' is the same as `www.srce.hr', but will not even take into consideration the link to `www.mit.edu'.
wget -k -r URL
wget -O - http://jagor.srce.hr/ http://www.srce.hr/
You can also combine the two options and make weird pipelines to retrieve the documents from remote hotlists:
wget -O - http://cool.list.com/ | wget --force-html -i -
This chapter contains all the stuff that could not fit anywhere else.
Proxies are special-purpose HTTP servers designed to transfer data from remote servers to local clients. One typical use of proxies is lightening network load for users behind a slow connection. This is achieved by channeling all HTTP and FTP requests through the proxy which caches the transferred data. When a cached resource is requested again, proxy will return the data from cache. Another use for proxies is for companies that separate (for security reasons) their internal networks from the rest of Internet. In order to obtain information from the Web, their users connect and retrieve remote data using an authorized proxy.
Wget supports proxies for both HTTP and FTP retrievals. The standard way to specify proxy location, which Wget recognizes, is using the following environment variables:
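http_proxy
    The URL of the proxy to use for HTTP connections.
ftp_proxy
    The URL of the proxy to use for FTP connections.
no_proxy
    A comma-separated list of domain suffixes for which the proxy should not be used. For instance, if the value of no_proxy is `.mit.edu', the proxy will not be used to retrieve documents from MIT.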
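The variables above can be set in the shell before Wget is started. The following is a minimal sketch, not a description of Wget internals: the proxy host `proxy.example.com', its port, and the second no_proxy suffix are made up, and bypasses_proxy is a hypothetical helper that only illustrates suffix matching, assuming no_proxy holds a comma-separated list.

```shell
#!/bin/sh
# Hypothetical proxy host and port, chosen only for illustration.
http_proxy=http://proxy.example.com:8001/
ftp_proxy=http://proxy.example.com:8001/
no_proxy=.mit.edu,.example.org
export http_proxy ftp_proxy no_proxy

# bypasses_proxy HOST: illustrative helper (not part of Wget).
# Succeeds when HOST ends with one of the comma-separated suffixes in
# $no_proxy, i.e. when the proxy would be skipped for that host.
bypasses_proxy() {
    host=$1
    old_ifs=$IFS
    IFS=,
    for suffix in $no_proxy; do
        case $host in
            *"$suffix") IFS=$old_ifs; return 0 ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

bypasses_proxy prep.ai.mit.edu && echo "prep.ai.mit.edu: direct connection"
bypasses_proxy ftp.xemacs.org || echo "ftp.xemacs.org: via proxy"
```

Any Wget started from the same shell afterwards inherits the exported variables.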
In addition to the environment variables, proxy location and settings may be specified from within Wget itself.
Some proxy servers require authorization to enable you to use them. The authorization consists of username and password, which must be sent by Wget. As with HTTP authorization, several authentication schemes exist. For proxy authorization only the Basic authentication scheme is currently implemented.
You may specify your username and password either through the proxy URL or through the command-line options. Assuming that the company's proxy is located at `proxy.company.com' at port 8001, a proxy URL location containing authorization data might look like this:
http://hniksic:mypassword@proxy.company.com:8001/
Alternatively, you may use the `proxy-user' and `proxy-password' options, and the equivalent `.wgetrc' settings proxy_user and proxy_passwd, to set the proxy username and password.
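In the Basic scheme, the username and password are joined with a colon and base64-encoded before being sent in a `Proxy-Authorization' request header. The sketch below shows only that encoding step; basic_credentials is a hypothetical helper, the credentials are placeholders, and a `base64' utility is assumed to be installed. It is not taken from the Wget sources.

```shell
#!/bin/sh
# basic_credentials USER PASSWORD: illustrative helper (not Wget code).
# Prints the base64 form of "USER:PASSWORD" as used by the Basic scheme.
# Assumes a `base64' utility is installed.
basic_credentials() {
    printf '%s:%s' "$1" "$2" | base64
}

# Placeholder credentials; base64 is an encoding, not encryption.
echo "Proxy-Authorization: Basic $(basic_credentials user pass)"
```

Because base64 is trivially reversible, anyone who can observe the request can recover the password.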
Like all GNU utilities, the latest version of Wget can be found at the master GNU archive site prep.ai.mit.edu, and its mirrors. For example, Wget 1.5.3 can be found at ftp://prep.ai.mit.edu/pub/gnu/wget-1.5.3.tar.gz
Wget has its own mailing list at wget@sunsite.auc.dk, thanks to Karsten Thygesen. The mailing list is for discussion of Wget features and web, reporting Wget bugs (those that you think may be of interest to the public) and mailing announcements. You are welcome to subscribe. The more people on the list, the better!
To subscribe, send mail to wget-subscribe@sunsite.auc.dk with the magic word `subscribe' in the subject line. Unsubscribe by mailing to wget-unsubscribe@sunsite.auc.dk.
The mailing list is archived at http://fly.cc.fer.hr/archive/wget.
You are welcome to send bug reports about GNU Wget to bug-wget@gnu.org. Bugs that you think are of interest to the public (i.e. more people should be informed about them) can be Cc-ed to the mailing list at wget@sunsite.auc.dk.
Before actually submitting a bug report, please try to follow a few simple guidelines.
If Wget has crashed, try to run it in a debugger, e.g. gdb `which wget` core, and type where to get the backtrace.
Since Wget uses GNU Autoconf for building and configuring, and avoids using "special" ultra-mega-cool features of any particular Unix, it should compile (and work) on all common Unix flavors.
Various Wget versions have been compiled and tested under many kinds of Unix systems, including Solaris, Linux, SunOS, OSF (aka Digital Unix), Ultrix, *BSD, IRIX, and others; refer to the file `MACHINES' in the distribution directory for a comprehensive list. If you compile it on an architecture not listed there, please let me know so I can update it.
Wget should also compile on other Unix systems not listed in `MACHINES'. If it doesn't, please let me know.
Thanks to kind contributors, this version of Wget compiles and works on Microsoft Windows 95 and Windows NT platforms. It has been compiled successfully using MS Visual C++ 4.0, Watcom, and Borland C compilers, with Winsock as networking software. Naturally, it lacks some features available on Unix, but it should work as a substitute for people stuck with Windows. Note that the Windows port is neither tested nor maintained by me; all questions and problems should be reported to the Wget mailing list at wget@sunsite.auc.dk, where the maintainers will look at them.
Since the purpose of Wget is background work, it catches the hangup signal (SIGHUP) and ignores it. If the output was on standard output, it will be redirected to a file named `wget-log'. Otherwise, SIGHUP is ignored. This is convenient when you wish to redirect the output of Wget after having started it.
$ wget http://www.ifi.uio.no/~larsi/gnus.tar.gz &
$ kill -HUP %% # Redirect the output to wget-log
Other than that, Wget will not try to interfere with signals in any way. C-c, kill -TERM and kill -KILL should kill it alike.
This chapter contains some references I consider useful, like the Robots Exclusion Standard specification, as well as a list of contributors to GNU Wget.
Since Wget is able to traverse the web, it counts as one of the Web robots. Thus Wget understands the Robots Exclusion Standard (RES), that is, the contents of `/robots.txt', which server administrators use to shield parts of their systems from the wanderings of Wget.
Norobots support is turned on only when retrieving recursively, and never for the first page. Thus, you may issue:
wget -r http://fly.cc.fer.hr/
First the index of fly.cc.fer.hr will be downloaded. If Wget finds anything worth downloading on the same host, only then will it load `/robots.txt', and decide whether or not to load the links after all.
`/robots.txt' is loaded only once per host. Wget does not support the robots META tag.
The description of the norobots standard was written, and is maintained, by Martijn Koster <m.koster@webcrawler.com>. With his permission, I contribute a (slightly modified) texified version of the RES.
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.
In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.
This document represents a consensus on 30 June 1994 on the robots mailing list (robots@webcrawler.com), between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www-talk@info.cern.ch). This document is based on a previous working draft under the same title.
It is not an official standard backed by a standards body, or owned by any commercial organization. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.
The latest version of this document can be found at http://info.webcrawler.com/mak/projects/robots/norobots.html.
The format and semantics of the `/robots.txt' file are as follows:
The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form:
<field>:<optionalspace><value><optionalspace>
The field name is case insensitive. Comments can be included in the file using UNIX Bourne shell conventions: the `#' character indicates that the preceding space (if any) and the remainder of the line up to the line termination are discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.
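The comment rules can be sketched as a small filter. This is an illustration only, not how Wget itself parses the file; strip_robots_comments is a hypothetical name and the sample input is abbreviated.

```shell
#!/bin/sh
# strip_robots_comments: illustrative filter (not Wget code).
# Reads robots.txt text on standard input. Comment-only lines are dropped
# entirely, so they cannot act as record boundaries; trailing comments are
# cut together with the whitespace in front of them.
strip_robots_comments() {
    sed -e '/^[[:space:]]*#/d' -e 's/[[:space:]]*#.*$//'
}

printf '%s\n' \
    '# robots.txt for http://www.site.com/' \
    'User-agent: *' \
    'Disallow: /tmp/ # these will soon disappear' |
    strip_robots_comments
```

Blank lines pass through untouched, so record boundaries survive the filter.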
The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognized headers are ignored.
The presence of an empty `/robots.txt' file has no explicit associated semantics; it will be treated as if it was not present, i.e. all robots will consider themselves welcome.
The value of the User-agent field is the name of the robot the record is describing access policy for.
If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is `*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the `/robots.txt' file.
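A liberal match of the kind recommended above might look like the following sketch. The direction of the substring test (field value found inside the robot's name) is an assumption, since the text does not spell it out; record_applies is a hypothetical helper, not Wget code, and it does not implement the rule that `*' applies only when no other record matched.

```shell
#!/bin/sh
# record_applies FIELD_VALUE ROBOT_NAME: illustrative helper (not Wget code).
# Succeeds when a User-agent field value applies to the robot: either the
# value is `*', or (assumed direction) the value occurs, ignoring case,
# inside the robot's name with any "/version" suffix removed.
record_applies() {
    value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    name=$(printf '%s' "${2%%/*}" | tr '[:upper:]' '[:lower:]')
    [ "$value" = "*" ] && return 0
    case $name in
        *"$value"*) return 0 ;;
    esac
    return 1
}

record_applies 'WGET' 'Wget/1.5.3' && echo "record applies to Wget"
record_applies 'cybermapper' 'Wget/1.5.3' || echo "record is for another robot"
```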
The value of the Disallow field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, `Disallow: /help' disallows both `/help.html' and `/help/index.html', whereas `Disallow: /help/' would disallow `/help/index.html' but allow `/help.html'.
An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
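The prefix rule can be checked with a short sketch. is_disallowed is a hypothetical helper written for this illustration, not part of Wget; it compares a URL path against a single Disallow value and treats an empty value as disallowing nothing.

```shell
#!/bin/sh
# is_disallowed PATH PREFIX: illustrative helper (not Wget code).
# Succeeds when PATH starts with the Disallow value PREFIX. An empty
# PREFIX disallows nothing, matching "all URLs can be retrieved".
is_disallowed() {
    [ -n "$2" ] || return 1
    case $1 in
        "$2"*) return 0 ;;
    esac
    return 1
}

for path in /help.html /help/index.html; do
    for prefix in /help /help/; do
        if is_disallowed "$path" "$prefix"; then
            echo "Disallow: $prefix blocks $path"
        else
            echo "Disallow: $prefix allows $path"
        fi
    done
done
```

Only the combination of `/help/' with `/help.html' is reported as allowed, in line with the `/help' example given for this field.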
The following example `/robots.txt' file specifies that no robots should visit any URL starting with `/cyberworld/map/' or `/tmp/':
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
This example `/robots.txt' file specifies that no robots should visit any URL starting with `/cyberworld/map/', except the robot called `cybermapper':
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
This example indicates that no robots should visit this site further:
# go away
User-agent: *
Disallow: /
When using Wget, you must be aware that it sends unencrypted passwords through the network, which may present a security problem. Here are the main issues, and some solutions.
Passwords given on the command line are visible using ps. If this is a problem, avoid putting passwords on the command line; e.g. you can use `.netrc' for this.
GNU Wget was written by Hrvoje Nikšić <hniksic@srce.hr>. However, its development could never have gone as far as it has, were it not for the help of many people, either with bug reports, feature proposals, patches, or letters saying "Thanks!".
Special thanks goes to the following people (no particular order):
ansi2knr-ization.
Digest authentication.
The following people have provided patches, bug/build reports, useful suggestions, beta testing services, fan mail and all the other things that make maintenance so much fun:
Tim Adam, Martin Baehr, Dieter Baron, Roger Beeman and the Gurus at Cisco, Mark Boyns, John Burden, Wanderlei Cavassin, Gilles Cedoc, Tim Charron, Noel Cragg, Kristijan Čonkaš, Damir Džeko, Andrew Davison, Ulrich Drepper, Marc Duponcheel, Aleksandar Erkalović, Andy Eskilsson, Masashi Fujita, Howard Gayle, Marcel Gerrits, Hans Grobler, Mathieu Guillaume, Karl Heuer, Gregor Hoffleit, Erik Magnus Hulthen, Richard Huveneers, Simon Josefsson, Mario Jurić, Goran Kezunović, Robert Kleine, Fila Kolodny, Alexander Kourakos, Martin Kraemer, Tage Stabell-Kulo, Hrvoje Lacko, Dave Love, Jordan Mendelson, Lin Zhe Min, Charlie Negyesi, Andrew Pollock, Steve Pothier, Marin Purgar, Jan Prikryl, Keith Refson, Tobias Ringstrom, Heinz Salzmann, Robert Schmidt, Toomas Soome, Sven Sternberger, Markus Strasser, Szakacsits Szabolcs, Mike Thomas, Russell Vincent, Douglas E. Wegscheid, Jasmin Zainul, Bojan Ždrnja, Kristijan Zimmer.
Apologies to all who I accidentally left out, and many thanks to all the subscribers of the Wget mailing list.
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 675 Mass Ave, Cambridge, MA 02139, USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.
Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and modification follow.
If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.
one line to give the program's name and an idea of what it does.
Copyright (C) 19yy name of author

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) 19yy name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker.

signature of Ty Coon, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License.
This document was generated on 4 November 2000 using texi2html 1.56k.