Adding a cache to your scripts that read HTML files

I'm a big fan of Perl, but these days I seem to use it for just one thing: reading and processing various HTML files out on the web (i.e. writing a spider). The LWP Perl modules make this easy, and can honor robots.txt automatically. I've found, though, especially during development, that I don't want to hit the live web page again and again while I work on the rest of the script. This matters even more for HTML pages that take a long time to return, or when I'm developing the script offline or without intranet access.

My solution? Add some HTML cache code to my object so that it saves a copy of the HTML locally the first time a page is accessed, then automatically uses that file on each subsequent run.

High-level design

I'm using LWP to automatically download web pages and then analyzing them with Perl. Let me rephrase that (I still have a hard time thinking in objects, especially in Perl): I have a Perl object that represents the information I'm interested in from a given web page. I'm suggesting a few extra lines of code to make the object smarter. When created, the object should first check whether it has already stored a local version of the page. If so, it uses the local version. If not, it uses LWP to get the page from the web, then saves it locally for next time!

This code can all be added to the object, so the main program doesn't even have to concern itself with the cache. This approach is certainly not specific to Perl, either.

Perl Object usage:

Suppose we want to analyze monthly sales summaries that are available on our company's intranet at a URL like summary.php?month=12&year=2009. We're interested in just blue-widget sales and red-widget sales.

We'll create a Perl object called SalesSummary and pass it the month and the year that we're interested in:

my $April2009  = SalesSummary->new(4, 2009);

Then we'll be able to ask that object what the two totals are:

my $April2009Blue = $April2009->blueTotal();
my $April2009Red = $April2009->redTotal();

Simple and clean code so far!

Perl Object design

I can't describe in detail how to create and work with Perl objects; there are lots of good tutorials out there (I've added a link or two at the end, please suggest more in the comments!). I'll describe just as much as is needed to show how the cache works.

So you'll have your basic Perl object code:

package SalesSummary;
sub new {
    my $class = shift;

    my $self  = {
        MONTH => shift, # The numeric # of the month, like 5 for May
        YEAR  => shift, # Represents the full year
         ....

And somewhere along the way you'll have your code that reads the web page using LWP:

use LWP::RobotUA;

my $content;    # will hold the page's HTML
my $browser = LWP::RobotUA->new(
    'statsSummarySpider/0.15', 'spider@yourdomain.org');
# Using RobotUA to respect robots.txt
$browser->delay( 7/60 );    # delay is in minutes, so this waits 7 seconds between requests
my $response = $browser->get( $wholeURL );
die "Can't get $wholeURL  -- ",
    $response->status_line unless $response->is_success;
$content = $response->content;

To create the cache mechanism you'll need to calculate a unique, local file name, in this case simply:

my $cacheFileName = "summary" . $self->{MONTH} . $self->{YEAR};
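If the cache files start to pile up next to the script, one option is to keep them in their own directory and give them a sortable name. This is only a sketch: the helper name cache_file_name, the "cache" directory and the summary-YYYY-MM.html pattern are all my assumptions, and the directory has to exist before anything is written to it.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: build a name like cache/summary-2009-04.html.
# The default directory "cache" and the file-name pattern are assumptions.
sub cache_file_name {
    my ($month, $year, $dir) = @_;
    $dir = 'cache' unless defined $dir;

    # Zero-pad the month so the files sort by date in a directory listing
    my $name = sprintf('summary-%04d-%02d.html', $year, $month);

    # catfile joins the parts with the right separator for the platform
    return File::Spec->catfile($dir, $name);
}
```

Inside the object this could be called as cache_file_name($self->{MONTH}, $self->{YEAR}).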

Then wrap the above LWP code in a simple IF statement to see if the file already exists locally:

if (-e $cacheFileName ) {
   # The cache file already exists on disk, just read that copy
   local $/;    # slurp mode: read the whole file in one go
   open my $inp, '<', $cacheFileName
      or die "Can't read $cacheFileName: $!";
   $content = <$inp>;
   close $inp;
} else {
   # The LWP code above to read the file from the web
   # ....
   # Now write the contents to the local disk for the next run
   open my $outp, '>', $cacheFileName
      or die "Can't write $cacheFileName: $!";
   print $outp $content;
   close $outp;
}

The first time you create the object with a given month and year, there won't be a file on disk, so it will read it from the web, but each subsequent run will load the local file - you've got your cache!
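To see the pieces working together, here is a minimal sketch that folds the cache check and the LWP download into one function. The names cached_content and fetch_with_lwp are mine, not part of LWP, and the optional $fetch code ref is an assumption I added so the download step can be swapped out (handy when testing offline).

```perl
use strict;
use warnings;

# Hypothetical helper: return the HTML for $url, using $cacheFileName as an
# on-disk cache. $fetch is an optional code ref that does the download.
sub cached_content {
    my ($url, $cacheFileName, $fetch) = @_;
    $fetch ||= \&fetch_with_lwp;

    if (-e $cacheFileName) {
        # Cache hit: read the saved copy in one go
        open my $in, '<:raw', $cacheFileName
            or die "Can't read $cacheFileName: $!";
        local $/;    # slurp mode
        my $content = <$in>;
        close $in;
        return $content;
    }

    # Cache miss: download the page, then save it for the next run
    my $content = $fetch->($url);
    open my $out, '>:raw', $cacheFileName
        or die "Can't write $cacheFileName: $!";
    print {$out} $content;
    close $out or die "Can't close $cacheFileName: $!";
    return $content;
}

# Default download step, the same LWP::RobotUA calls as earlier in the post
sub fetch_with_lwp {
    my ($url) = @_;
    require LWP::RobotUA;    # only loaded when a download is really needed
    my $browser = LWP::RobotUA->new(
        'statsSummarySpider/0.15', 'spider@yourdomain.org');
    $browser->delay( 7/60 );    # delay is given in minutes
    my $response = $browser->get($url);
    die "Can't get $url -- ", $response->status_line
        unless $response->is_success;
    return $response->content;
}
```

In the object's constructor this would boil down to something like my $content = cached_content($wholeURL, $cacheFileName); and the rest of the object never needs to know where the HTML came from.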

Improvements

The above code is as basic as I could make it; in a production script you'd want more error checking (failed downloads, cache files that can't be written, and so on). I've also found cases where I needed a flag to force re-reading the web page and updating the cached version. A simple workaround is to just delete the local cache files, which forces the script to re-read the content from the web anyway.
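A force flag could be sketched roughly like this. The function name cache_is_usable and both of its extra parameters are hypothetical; the maximum-age check is an extra I'm assuming would be useful, not something the code above does.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether the cached copy may be used.
#   $force        - true to ignore the cache and re-read the web page
#   $max_age_days - optional; cache files older than this count as stale
sub cache_is_usable {
    my ($cacheFileName, $force, $max_age_days) = @_;

    return 0 if $force;                   # caller asked for a fresh download
    return 0 unless -e $cacheFileName;    # nothing cached yet

    if (defined $max_age_days) {
        # -M gives the file's age in days, measured from script start time
        my $age = -M $cacheFileName;
        return 0 if $age > $max_age_days;
    }

    return 1;
}
```

The IF statement from earlier would then read if (cache_is_usable($cacheFileName, $force)) instead of if (-e $cacheFileName), with the else branch left unchanged so that a forced download overwrites the old cache file.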

More information:

The Perl LWP modules

Object Oriented Perl