Overview

Web scraping is the art of copying desirable content from web sites to your PC, either as-is or converted into a different format as needed. Some reasons to do this are saving pages or images for offline viewing, archiving content before it disappears, collecting files that are published piecemeal (like a daily picture) into one place, or extracting data for further processing.

And I'm sure there are many more uses as well.

On this page I will show you, with code, some examples of useful web scraping that I've done. My primary tools are Wget and Perl, although cURL and Python or Ruby could easily be used instead.

Index

Get NASA Pictures

Get NASA Pictures

This wget command will pull down all of NASA's "Astronomy Picture of the Day" (APOD) pictures into the current directory:

wget -r -p -L -nc -nH -np -A big.jpg,big.gif,big.png \
     -q http://antwrp.gsfc.nasa.gov/apod/image/

The parameters tell wget to: -r recurse through linked pages, -p fetch the files needed to display each page (which is what pulls in the images), -L follow relative links only, -nc skip files that already exist locally, -nH not create a directory named after the host, -np never ascend into parent directories, -A accept only files with these suffixes, and -q be quiet. They are followed by the base URL to fetch.
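For orientation: APOD keeps its images mostly in per-month subdirectories named YYMM, so with -nH stripping the host name the download lands in a tree like this (the file names here are made up for illustration):

apod/image/9506/earthbig.jpg
apod/image/9507/moonbig.gif
apod/image/0101/galaxybig.jpg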

Since a picture is added every day, it would be nice to have our collection updated every once in a while without having to remember to run the above command ourselves. To do this I've added the following cron job entry to root's crontab:

# get nasa pictures
0 2 1 * * (cd /data/media/nasa-pictures && \
               (nice wget -r -p -L -nc -nH -np \
                     -A big.jpg,big.gif,big.png \
                     -q http://antwrp.gsfc.nasa.gov/apod/image/ ; \
                nice /root/bin/make-nasa-slide-show)) > /dev/null 2>&1

The first line is a comment. The first five fields of the entry tell cron when to run: at minute 0 of hour 2, on the 1st day of the month, in every month, on any day of the week. This means it will run at 2 AM on the first day of each month. The fields are followed by the command to run, which is the same wget command as above with a couple of additions. It starts by changing into the directory where I store the files, then runs wget under nice, and finally runs my make-nasa-slide-show script (see below), which flattens the picture directory structure into one folder for easy use in a slide show. The commands are grouped together and followed by a redirect of STDOUT and STDERR to /dev/null so cron doesn't mail us the output each time it runs.
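One caveat: the crontab format has no line continuations, so the entry is wrapped above only for readability. What actually goes into the crontab (edit it with crontab -e as root) is the comment plus one long line:

# get nasa pictures
0 2 1 * * (cd /data/media/nasa-pictures && (nice wget -r -p -L -nc -nH -np -A big.jpg,big.gif,big.png -q http://antwrp.gsfc.nasa.gov/apod/image/ ; nice /root/bin/make-nasa-slide-show)) > /dev/null 2>&1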

Here are the guts of my make-nasa-slide-show script:

#!/bin/bash

# where wget put the pictures, and where the flattened copies go
indir="/data/media/nasa-pictures/apod/image/"
outdir="/data/media/pictures_slide_show/Nasa"

mkdir -p "${outdir}"
pushd "${indir}" > /dev/null
find . -type f | while IFS= read -r i ; do
    # flatten the path: drop the leading "./" and turn each "/" into "_"
    o=$(echo "${i}" | cut -c 3- | tr / _)
    o="${outdir}/${o}"
    cp -u "${i}" "${o}"    # -u: copy only if the target is missing or older
done
popd > /dev/null

Replace the indir and outdir values according to your setup.
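It's worth running the script once by hand before relying on the cron job. Assuming a downloaded file apod/image/9506/earthbig.jpg under indir (a made-up name, just for illustration), you would see:

/root/bin/make-nasa-slide-show
ls "/data/media/pictures_slide_show/Nasa"    # now contains 9506_earthbig.jpg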