a bit of knowledge of regular expressions

Today a collaborator who is working toward more automation in her everyday computing asked me how to use the wonderful command-line download tool wget to get the images out of a web page.

wget has options to grab a whole web site: if you want a page and everything linked below it on the same server, you can combine the “recursive” (-r), “no parent” (-np), and “no clobber” (-nc) options:

wget -r -np -nc http://www.gnu.org/software/wget/manual/html_node/index.html

will give you a basic mirror of that site, and it works quite well there because the site has no active content that would need to be followed heuristically.  But it will not retrieve linked files or images unless they live under the top-level directory of that URL.
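An aside before we continue: wget also has a --page-requisites (-p) option that fetches the images, stylesheets, and other files a single page needs to render; adding --span-hosts (-H) lets it cross to other hosts for those files, and --convert-links (-k) rewrites the saved page to point at the local copies.  A minimal sketch, not needed for what follows:

wget --page-requisites --span-hosts --convert-links \
    http://www.gnu.org/software/wget/manual/html_node/index.html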

The specific page here is “The Illustrated Guide to a Ph.D.”, which has gone viral recently.  Note that when the openculture web site republished it they separated out the images: if you look at the page’s source (control-U in most browsers) you will see that the page itself lives under http://www.openculture.com/2010/09/ , but its images are stored at locations such as http://www.openculture.com/images/PhDKnowledge.001.jpg

So the simple wget command will not grab the images.  You might want to automate the process of downloading them all, and there is a very simple shell command sequence to do so.  Start by making yourself a sandbox area:

mkdir ~/sandbox
cd ~/sandbox

Then grab the top level URL:

wget http://www.openculture.com/2010/09/the_illustrated_guide_to_a_phd.html

Now you will have the file the_illustrated_guide_to_a_phd.html in your current directory.  Next, experiment with grep and sed to get the list of jpg URLs:

grep '\.jpg' the_illustrated_guide_to_a_phd.html

Now that you see what lines have a .jpg in them, use the sed command to edit out everything except for the URL.  A typical line looks like:

<div class="graphic"><img alt="" src="http://www.openculture.com/images/PhDKnowledge.006.jpg" width="440" /></div>

which clearly needs to be cleaned up.  The sed command below uses a regular expression that matches the whole line and captures everything between the =" and the jpg" as a “group”, marked with \( and \); the \1 in the replacement then prints just that group:

grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/'

which gives you lines like:

http://www.openculture.com/images/PhDKnowledge.005.jpg
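As an aside, if your grep is GNU grep you can get the same list more directly with its -o option, which prints only the part of each line that matches; this is just an alternative to the sed pipeline above:

grep -o 'http[^"]*\.jpg' the_illustrated_guide_to_a_phd.html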

You are now ready to use wget on those individual image URLs.  First save the list in a shell variable:

JPG_LIST=`grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/'`

Then iterate through that list grabbing each URL:

for jpg_url in $JPG_LIST
do
    echo "$jpg_url"
    wget "$jpg_url"
done

Here is the whole inline script in one piece (you can paste it in as is):

mkdir ~/sandbox
cd ~/sandbox
wget http://www.openculture.com/2010/09/the_illustrated_guide_to_a_phd.html
JPG_LIST=`grep '\.jpg' the_illustrated_guide_to_a_phd.html | sed 's/.*="\(.*jpg\)".*/\1/'`

for jpg_url in $JPG_LIST
do
    echo "$jpg_url"
    wget "$jpg_url"
done
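A note on robustness: putting the list in a variable and looping over it relies on the shell splitting the variable on whitespace, which is fine here because URLs contain no spaces.  The same loop can also be written as a single pipeline, with no intermediate variable:

grep '\.jpg' the_illustrated_guide_to_a_phd.html \
    | sed 's/.*="\(.*jpg\)".*/\1/' \
    | while read -r jpg_url
do
    echo "$jpg_url"
    wget "$jpg_url"
done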

A final note: the original article by Matthew Might is on his own web site, where the images are stored in the same directory hierarchy as the HTML.  This is a more robust web site layout, and the recursive wget command mirrors it well:

wget -r -np -nc http://matt.might.net/articles/phd-school-in-pictures/
find matt.might.net -name '*.jpg' -print
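If you then want to gather those mirrored images into one place, one way (using find’s -exec option, and assuming the ~/sandbox directory from above) is:

find matt.might.net -name '*.jpg' -exec cp {} ~/sandbox/ \;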