trevoristall
1/14/2015 - 3:32 PM

wget

##
# Here are a few recipes for downloading and archiving an entire Web site, starting from a given page and recursing down.
# 
# Pitfalls
# As of 2008, Wget doesn't follow @import links in CSS.
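# (Later releases, 1.12 and up, reportedly parse CSS and follow @import, so this may no longer apply.)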
#
# Credit to http://lifehacker.com/software/top/geek-to-live--mastering-wget-161202.php
# And http://www.veen.com/jeff/archives/000573.html


# Get the given page and each page it links to, as well as linked assets like images and CSS.
# Rewrite hyperlinks to point to the locally downloaded pages.
# Adjust how many levels deep to recurse by changing the numeric argument after -l.
wget -pkr -l 1 http://site
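
# The same recipe with long-form options spelled out, equivalent to the command above
# (-p/--page-requisites, -k/--convert-links, -r/--recursive, -l/--level):
wget --page-requisites --convert-links --recursive --level=1 http://site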
 
 
# Same as above, but also follow links to other domains (-H spans hosts).
wget -Hpkr -l 1 http://site
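
# If spanning hosts pulls in too much, restrict recursion to specific domains with --domains.
# The domain list below is illustrative; substitute the hosts you actually want to crawl.
wget -Hpkr -l 1 --domains=site,cdn.example http://site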
 
 
# Same as the first example, but send a session cookie with the request.
# --no-cookies keeps wget from managing cookies itself, so only the supplied header is sent.
wget -pkr -l 1 --no-cookies --header "Cookie: JSESSIONID=12345" https://securesite
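
# Alternatively, reuse cookies exported from a browser in cookies.txt (Netscape) format.
# The cookies.txt filename here is just an example.
wget -pkr -l 1 --load-cookies cookies.txt https://securesite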
 
 
# Mirror an HTML site.
# Use time-stamping (-N) so existing files are only re-downloaded when the server copy is newer.
# Wait about 10 seconds between requests, randomized by --random-wait.
wget -m -N -w10 --random-wait http://site
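
# Note: -m is shorthand for -r -N -l inf --no-remove-listing, so -N above is already implied.
# The same mirror with long-form options:
wget --mirror --wait=10 --random-wait http://site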
 
 
# Behave very badly by ignoring robots.txt (-e robots=off)
# and spoofing a Mozilla user agent.
# Output is appended to site.com.log (-a) instead of being printed to the terminal.
wget -m -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009090214" http://site.com/
 
# The same stealth and logging options applied to the depth-1 recipe from the first example.
wget -pkr -l 1 -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009090214" http://site.com/
 
 
# Then, of course, you can follow wget's output as it is appended to the log:
tail -f site.com.log