Today a collaborator who is working toward more automation in her everyday computing asked me how to use the amazing command-line downloader wget to pull the images out of a web page.
wget has options to grab a whole web site: if you want a page and everything linked below it on the same server, you can use the “recursive”, “no parent” and “no clobber” options:
wget -r -np -nc http://www.gnu.org/software/wget/manual/html_node/index.html
will give you a basic mirror of that site, and it works quite well there because the site has no active content that would have to be followed heuristically. But it will not retrieve linked files or images unless they live under the top-level directory of that URL.
So this specific page is “The Illustrated Guide to a Ph.D.”, which has gone viral recently. Note that when the openculture web site republished it they separated out the images: if you look at the page’s source (Control-U in most browsers) you will see that the page itself lives under http://www.openculture.com/2010/09/ , but its images are stored at locations such as: http://www.openculture.com/images/PhDKnowledge.001.jpg
So the simple wget command will not grab the images. You might want to automate the process of downloading them all, and there is a very simple shell command sequence to do so. Start by making yourself a sandbox area:
mkdir ~/sandbox
cd ~/sandbox
Then grab the top-level URL:
wget http://www.openculture.com/2010/09/the_illustrated_guide_to_a_phd.html
Now you will have the file the_illustrated_guide_to_a_phd.html in your current directory. Next experiment with grep and sed to get the list of jpg URLs:
grep '\.jpg' the_illustrated_guide_to_a_phd.html
Now that you see what lines have a .jpg in them, use the sed command to edit out everything except for the URL. A typical line looks like:
<div class="graphic"><img alt="" src="http://www.openculture.com/images/PhDKnowledge.006.jpg" width="440" /></div>
which clearly needs to be cleaned up. The sed command below uses a regular expression that matches everything between the =” and the jpg” and prints out just that part. It uses \( and \) to make a “group” of the interesting part, and that group is what gets printed back out by the \1:
grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/'
which gives you lines like:
http://www.openculture.com/images/PhDKnowledge.001.jpg
http://www.openculture.com/images/PhDKnowledge.006.jpg
and so on. Capture that list in a shell variable, and you are ready to use wget on those individual image URLs:
JPG_LIST=`grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/'`
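As an aside, the backticks run the pipeline and capture its output into the variable; the more modern $( ... ) form does the same thing and is often easier to read:

JPG_LIST=$(grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/')

Either spelling works in bash and other modern POSIX shells.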
Then iterate through that list grabbing each URL:
for jpg_url in $JPG_LIST
do
    echo $jpg_url
    wget "$jpg_url"
done
Here is the whole inline script in one place (you can paste it in as is):
mkdir ~/sandbox
cd ~/sandbox
wget http://www.openculture.com/2010/09/the_illustrated_guide_to_a_phd.html
JPG_LIST=`grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/'`
for jpg_url in $JPG_LIST
do
    echo $jpg_url
    wget "$jpg_url"
done
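If you do not need the intermediate variable, the same pipeline can feed wget directly, because wget can read a list of URLs from standard input with -i -. Roughly:

grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/' | wget -i -

You lose the echo of each URL as it goes by, but for a quick one-off it is hard to beat.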
A final note: the original article by Matthew Might is on his own web site, and he organized his page so that the images live within the directory hierarchy of the HTML. That is a more robust web site layout, and the recursive wget command would have mirrored it nicely:
wget -r -np -nc http://matt.might.net/articles/phd-school-in-pictures/
find matt.might.net -name '*.jpg' -print
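One more wget option worth knowing about for cases like the openculture copy is “page requisites” (-p), which asks wget to fetch everything needed to render a single page, inline images included, even when they live outside the page’s own directory. A rough sketch, assuming a reasonably recent wget; the -nd option keeps everything in the current directory instead of recreating the server’s hierarchy:

wget -p -nd http://www.openculture.com/2010/09/the_illustrated_guide_to_a_phd.html

The grep-and-sed pipeline above is still worth having in your toolbox, though, for the many pages where no single wget invocation does quite what you want.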