wgrab is a perl script that selectively downloads parts of a foreign website and stores the results in the local filesystem. Unlike the indiscriminate way in which 'wget -r' downloads and stores everything, wgrab lets you iterate over dates and numbers, and use regular expressions to specify which references to follow.
Here is the help you get when you call 'wgrab -h' (I need to write more polished documentation - I know):
wgrab [opts] [-a saveAs] {counters} startURL {patterns}
v1.18 of Oct. 31st 2001 - make http downloads based on dates, numbers and regular expressions
(c) Heiko Hellweg (hellweg@snark.de), 2001; see http://snark.de/wgrab/

options include:
 -h: long help, -hX prints some eXamples...
 -v: verbose (multiple -v make it more verbose)
 -p: don't get last pattern - just print matches to stdout
 -P substPat: like -p, but apply % substitution (like in saveAs) before printing
 -n: noClobber - overwrite existing files instead of renaming
 -w seconds: wait between downloads (don't overrun the other host)
 -A: save all (not just the files retrieved with the last pattern)
 -r: rewrite saved (make links to unsaved docs absolute; links to saved relative)
 -H: span hosts - auto-patterns may contain ":" (like http://)
 -R: follow HREFs only
 -S: follow SRC only
 --user string: username for basic www-authentication
 --pass string: passphrase for basic www-authentication
 --userAgent string: UserAgent header transmitted to server

counters are: type start end [-s step] with these types:
 -d: iterate over dates, relative to now (start and end are integer)
 -D: iterate over absolute dates (start and end are YYYYMMDD)
 -e: enumerate over integers (may occur multiple times - use %e...%h)
 -E: enumerate over characters (may occur multiple times - use %E...%H)

All patterns following the startURL may contain everything that makes up a perl
regex. With multiple patterns, wgrab gets the startURL document, extracts all
quoted strings, tries to match the first pattern, interprets the results as
URLs, downloads them, applies the next pattern to their content... and saves
the documents downloaded via the last pattern.
If a pattern contains "(" and ")", it is left unmodified. Otherwise a new
pattern is constructed: ["']([^"']*yourpattern)["'] (details of the prefix
depend on the -H, -R and -S options). Anyway: perl's $1 [the matched part in
'()'] is used for going on.
Patterns starting with "+" are recursed into on the same level and applied to
their own result again (this way you can easily iterate thru "next" links).

additional regexp shorthand:
 \I (image) expands to "[gGjJpP][iIpPnN][eE]?[fFgG]"
 (i.e. gif or jpg or jpeg or png - with arbitrary mixes of upper/lower case).

saveAs and all the patterns may contain %[[+-]number[.fillchar]]X
('+' truncates from the right and '-' truncates from the left)
where X is one of
 d: Day of Month
 m: Month of Year
 y: Year (4 digit)
 D: Day of Week [0..6]
 w: Day of Week name lowercase
 W: Day of Week name Uppercase
 n: Month name lowercase
 N: Month name Uppercase
 T: That day (shorthand for %4y%2m%2d)
 t: today (shorthand for %4y%2m%2d with current date)
 e/f/g/h: counter 1..4 (depends on number of -e/-E parameters)
 E/F/G/H: chr(counter 1..4) (depends on number of -e/-E parameters)
 =: referenced filename in -a or -P: URL split at "/" (from right: %1= = filename)
 i: counter for saved files in saveAs
...better look at the examples with "-hX" to see how %substitution works

Tricks with the saveAs pattern (-a):
 %.= flattens the filename, replacing '/' in the URL with a '.' in the filename.
 use "-" for the saveAs pattern to print all results to stdout.
 if the saveAs pattern starts with a "|", it is executed as a shell command
 (e.g. mail each document to yourself with -a '|mutt -s "%=" myself@mail.edu').

LICENSE: use, modify and redistribute as much as you want - as long as credit
to me (hellweg@snark.de) remains in the docs. No warranty that wgrab does or
does not perform in any specific or predictable way...
It would be nice (not a condition of the license) if you told me about bugs you encounter (maybe even with a fix) and/or the uses you find for wgrab.
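To make the auto-pattern construction above a bit more concrete, here is a minimal perl sketch of the idea - the function name and the stripped-down logic are mine for illustration, not wgrab's actual internals:

  #!/usr/bin/perl
  # Sketch of the auto-pattern construction described in the help text.
  # Illustrative only - not wgrab's real code.
  use strict;
  use warnings;

  sub build_pattern {
      my ($pat) = @_;
      # expand the \I shorthand into the gif/jpg/jpeg/png alternation
      $pat =~ s/\\I/[gGjJpP][iIpPnN][eE]?[fFgG]/g;
      # a pattern containing "(" and ")" is left unmodified; otherwise
      # wrap it so that $1 captures the whole quoted reference
      # (the real prefix varies with the -H, -R and -S options)
      return $pat if $pat =~ /\(/ && $pat =~ /\)/;
      return qq{["']([^"']*$pat)["']};
  }

  my $html = q{<img src="sf0123.gif"> <a href='page2.html'>next</a>};
  my $re   = build_pattern('sf\d+.\I');
  while ($html =~ /$re/g) {
      print "following: $1\n";    # perl's $1 is what wgrab goes on with
  }

Run against the snippet above, this prints "following: sf0123.gif" - the same $1 that wgrab would download next.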
The % patterns work a bit (but only a bit) like in printf - and a bit more (but still only a bit) like the '+FORMAT' mechanism of the GNU date tool...
Here are some examples (assuming the current month is September):
wgrab -p '%m'      => 9
wgrab -p '%2m'     => 09 (left side padding to length 2 - 0 is the default fillchar)
wgrab -p '%3.#m'   => ##9 (padding with a specific char)
wgrab -p '%-3.#m'  => 9## (padding on the right)
wgrab -p '%N'      => September
wgrab -p '%3N'     => Sep (truncating on the right)
wgrab -p '%-3N'    => ber (truncating on the left)
wgrab -p '%15._N'  => ______September (padding on the left again)

If you need a real '%' sign - and wgrab might confuse it with a substitution pattern like %2d - escape it with a backslash (\%) or a second '%' (%%).
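If you want to see the mechanism itself: the core of such a %[[+-]number[.fillchar]]X expansion fits in a few lines of perl. This is only a sketch along the lines of the rules above (it handles just %m and %N, and all names are mine), not wgrab's real code:

  #!/usr/bin/perl
  # Sketch of the %[[+-]number[.fillchar]]X expansion - illustrative only.
  use strict;
  use warnings;
  use POSIX qw(strftime);

  sub expand {
      my ($fmt) = @_;
      my %value = (
          m => strftime('%m', localtime) + 0,   # month number, no leading zero
          N => strftime('%B', localtime),       # month name (locale dependent)
      );
      $fmt =~ s{%([+-]?)(\d*)(?:\.(.))?([mN])}{
          my ($sign, $width, $fill, $code) = ($1, $2, $3 // '0', $4);
          my $v = $value{$code};
          if ($width ne '' && length($v) > $width) {
              # too long: plain and '+' keep the left part, '-' keeps the right
              $v = $sign eq '-' ? substr($v, -$width) : substr($v, 0, $width);
          } elsif ($width ne '') {
              # too short: plain pads on the left, '-' pads on the right
              my $pad = $fill x ($width - length($v));
              $v = $sign eq '-' ? $v . $pad : $pad . $v;
          }
          $v;
      }ge;
      return $fmt;
  }

  print expand('%3.#m'), "\n";   # "##9" in September
  print expand('%-3N'),  "\n";   # "ber" in September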
Here are some examples of what wgrab can get you (these work today [20 Dec. 2000] - if the sites reorganize or change their naming scheme, you will have to find your own patterns):
wgrab -v -v -d -1 -28 -a ./dilbert%T.gif http://www.dilbert.com/comics/dilbert/archive/dilbert-%T.html '/archive/images/dilbert[^"]+\I'
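(-d -1 -28 iterates over the 28 days before today; %T expands to each of those dates as YYYYMMDD, both in the URL of the daily archive page and in the name of the saved gif. The pattern then picks the dilbert image reference out of each page via the \I shorthand.)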
wgrab -v -v -n http://sinfest.net/strips_page.htm '\d+.html' 'sf\d+.\I'
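(Two chained patterns: the first collects the numbered .html pages referenced on the strips page, the second grabs the sf... strip images from each of them. -n overwrites already existing files instead of renaming.)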
wgrab -H -v -v -a %3i.%1= http://www.snark.de/dela/archiv.html '\d.html' '\d.\I'
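(Again two chained patterns, from the archive page to the numbered pages to the images. -a %3i.%1= names each saved file with a three-digit running counter plus the original filename - %1= is the last '/'-separated part of the URL - and -H lets the patterns span hosts.)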
wgrab -H -v -v -a %3i.%1= http://www.snark.de/dela/first.html '+\d.html' '\d.\I'
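(Same as above, but starting from the first page instead of the archive: the leading '+' makes wgrab recurse into the matching pages on the same level and apply the pattern to their content again, so it walks the whole chain of "next" links by itself.)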
Caveat: you'd better not use this script to load really enormous files - the current download is always sucked completely into memory before being written to disk (it's not the number of files that matters, just the size of the biggest). A few MB won't hurt, but I would not advise you to get the ISO CD-ROM images for your favourite Linux distro this way...
Have fun with it and tell me if you find it useful...
btw.: another useful (but much more specific) tool is my comix collector. And there are some other tools available too.
Author:
Heiko Hellweg,
last modified: Oct. 31st 2001