XKCD Knockout Comic Downloader

XKCD, for those of you who don’t know, is a webcomic about (as it testifies of itself) romance, sarcasm, math, and language. In my opinion, it’s the best webcomic out there. I wanted to download the complete comics archive, for the sake of a local backup, and as an idea for a printed, coffee-table style book. However, I didn’t want just the comic’s image file – the best part of XKCD is often the alt-text that shows up as a tooltip when you mouse-over the comic. And of course there’s the comic’s title, as well.

There are several scripts out there that others have written before me that download the comic, and some even do a pretty good job of getting some of the extra data. However, I wanted more. I wanted a downloader that would get ALL the data available about the comics, and store it in an easily retrievable, transformable form.

This led me to write my own downloader, which I have dubbed the XKCD Knockout Comic Downloader (or XKCD, for short). I have relied to some degree on those who have come before me, and have modified their code.
Off the top of my head, I used some code by John Lawrence to find the latest comic number; and this discussion on Ubuntu forums got me started with getting the meta-data.

My downloader is the most complete I’ve seen so far. Some of its notable features:

  • Store meta-data, including path to image, in XML file.
  • Choice of downloading the images or not.
  • Can append to an existing XML file, and update it since you last downloaded your personal batch of XKCD.
  • Will store all data about the comic, including seldom or never before used attributes, such as href, src, etc. (more on that later.)

Sure, this script isn’t a nifty one-liner that does all the work, but instead, it does more work, and does it well:

#!/bin/bash

#-----user configurable-----
append_to_file=true # continue from previous download
download_path=~/xkcd/
image_path=images
xmlfile=xkcd.xml
download_images=true
#---------------------------
#------configuration--------
i=1
latest=`wget -q -O - http://www.xkcd.com | grep 'link to this comic' | sed 's/.*xkcd.com.\([^\/]*\).*/\1/'`
#---------------------------

# Create the download directory if needed, and stop if we cannot enter it
mkdir -p "$download_path"
cd "$download_path" || exit 1

if $download_images && [ ! -d $image_path ]
then
    mkdir $image_path
fi

if $append_to_file && [ -f $xmlfile ]
then
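    # Remove only the closing root tag, so lines that merely contain "/xkcd" survive
    sed -i '/^<\/xkcd>$/d' "$xmlfile"
    # Resume from the comic after the last <id> already stored
    i=$(grep '<id>' "$xmlfile" | tail -1 | sed 's/^.*>\([0-9]\+\).*/\1/')
    i=$((i + 1))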
else
    # Start a fresh file (truncating any old one) with the XML header
    echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>">$xmlfile
    echo "<?xml-stylesheet type=\"text/xsl\" href=\"xkcd.xsl\"?>">>$xmlfile
    echo "<xkcd>">>$xmlfile
fi

while [ $i -le $latest ]
do
    echo "    <comic>">>$xmlfile
    echo "        <id>$i</id>">>$xmlfile
    wget http://xkcd.com/$i/
    img=$(grep http://imgs.xkcd.com/comics/ index.html | head -1)
    params=$(($(echo $img | tr -dc '"' | wc -c)/2))
    for ((j = 1; j <= $params; j++))
    do
        param=$(echo $img | cut -d\" -f$(($j*2-1)) | sed 's/>*<*[a-z]*\ \([a-z]*\)\=/\1/')
        val=$(echo $img | cut -d\" -f$(($j*2)))
        echo "        <$param>$val</$param>">>$xmlfile
        if [ "$param" = src ]
        then
            filename=$(echo $val | cut -d\/ -f5)
            if $download_images
            then
                wget $val
                mv $filename "$image_path"/"$i"_"$filename"
            fi
            echo "        <filename>$i"_"$filename</filename>">>$xmlfile
        fi
    done
    echo "    </comic>">>$xmlfile
    rm index.html
    i=`expr $i + 1`
done
echo "</xkcd>">>$xmlfile

Rather than explain every regex and all the logic in the script, I’ll leave it at this: if you have any specific questions, please ask.
What distinguishes this script from others (of the rare few that download the meta-data) is that I’m not assuming any attributes exist, but am downloading all of them. This is useful for the irregular comics such as House of Pancakes or Lojban. In fact, I just ran a search for href in the XML file I created today, and found a few nuggets I’ve missed in the past.
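
To illustrate the idea of not assuming which attributes exist, here is a simplified sketch of the quote-splitting approach. It is not the exact code from the script: the sample line is made up, parse_attrs is a hypothetical helper name, and the sed expression is a simplified stand-in for the one used in the script.

```shell
#!/bin/sh
# Hypothetical sample line, shaped like the <img> lines the script greps for.
sample='<img src="http://imgs.xkcd.com/comics/example.png" title="Sample alt-text" alt="Example" />'

# Print one name=value line per double-quoted attribute found in $1.
parse_attrs() {
    line=$1
    # Every attribute contributes exactly two double quotes.
    count=$(( $(printf '%s' "$line" | tr -dc '"' | wc -c) / 2 ))
    j=1
    while [ "$j" -le "$count" ]; do
        # Odd fields end in ' name=', even fields hold the quoted value.
        name=$(printf '%s\n' "$line" | cut -d'"' -f$((j * 2 - 1)) |
               sed 's/.*[[:space:]]\([A-Za-z-]*\)=$/\1/')
        value=$(printf '%s\n' "$line" | cut -d'"' -f$((j * 2)))
        printf '%s=%s\n' "$name" "$value"
        j=$((j + 1))
    done
}

parse_attrs "$sample"
```

Because the loop is driven by the number of quotes rather than by a fixed list of names, a line that also carries an href (or anything else) simply produces more name=value pairs. The sketch assumes values never contain a literal double quote.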

The fact that the data is stored in an XML file means it’s transformable. You could write a script to tweet the alt-text (why you would do that is beyond me), or create a tag cloud of frequently used words. My intention is to create an XSLT file that will display all the comics in a pleasing manner and bring it to print. (Dealing with Randall’s irregular image sizes is something I’m still working on, and I am open to suggestions.)
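
As a rough starting point for the tag cloud idea, the following sketch counts how often each word appears inside one kind of element. It assumes the one-element-per-line layout that the downloader writes; word_counts is a hypothetical helper name, and the default file location simply mirrors the configuration at the top of the script.

```shell
#!/bin/sh
# word_counts FILE TAG
# Print "count word" lines, most frequent first, for the text found in
# <TAG>...</TAG> elements that sit on a single line of FILE.
word_counts() {
    sed -n "s/^[[:space:]]*<$2>\(.*\)<\/$2>[[:space:]]*\$/\1/p" "$1" |
        tr '[:upper:]' '[:lower:]' |     # fold case so "The" and "the" match
        tr -cs '[:alpha:]' '\n' |        # one word per line, punctuation dropped
        grep -v '^$' |
        sort | uniq -c |
        awk '{print $1, $2}' |           # normalise uniq's padded output
        sort -k1,1nr -k2,2
}

# Assumed default location, taken from the script's configuration.
xmlfile="$HOME/xkcd/xkcd.xml"
if [ -f "$xmlfile" ]; then
    word_counts "$xmlfile" title | head -20
fi
```

The tooltip text normally comes from the image's title attribute, which is why title is used above; pass alt instead to count words in the comic names. HTML entities are not decoded, so something like &quot; will show up as the word "quot".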

Let me know if you use this script and what creative ideas you have in mind for your stash of XKCD.
