URLs
libwww-perl (LWP): http://www.linpro.no/lwp/
Today, someone on the IRC #perl channel was asking some confused questions. We finally managed to figure out that he was trying to write a web robot, or "spider", in Perl. Which is a grand idea, except that:
Having said that, I immediately pictured a one-line Perl robot. It wouldn't do much, but it would be amusing. After a few abortive attempts, I ended up with this monster, which requires Perl 5.005. I've split it onto separate lines for easier reading.
    perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '
        $ua = LWP::UserAgent->new;
        while (my $link = shift @ARGV) {
            print STDERR "working on $link";
            HTML::LinkExtor->new(
                sub {
                    my ($t, %a) = @_;
                    my @links = map { url($_, $link)->abs() }
                                grep { defined } @a{qw/href img/};
                    print STDERR "+ $_" foreach @links;
                    push @ARGV, @links;
                }
            ) -> parse(
                do {
                    my $r = $ua->simple_request
                        (HTTP::Request->new("GET", $link));
                    $r->content_type eq "text/html" ? $r->content : "";
                }
            )
        }' http://slinky.scrye.com/~tkil/
I actually edited this on a single line; I use shell-mode inside of Emacs, so it wasn't that much of a terror. Here's the one-line version.
perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '$ua = LWP::UserAgent->new; while (my $link = shift @ARGV) { print STDERR "working on $link";HTML::LinkExtor->new( sub { my ($t, %a) = @_; my @links = map { url($_, $link)->abs() } grep { defined } @a{qw/href img/}; print STDERR "+ $_" foreach @links; push @ARGV, @links} )->parse(do { my $r = $ua->simple_request (HTTP::Request->new("GET", $link)); $r->content_type eq "text/html" ? $r-> content : ""; } ) }' http://slinky.scrye.com/~tkil/
After getting an ego-raising chorus of groans from the hapless onlookers in #perl, I thought I'd try to identify some cute things I did with this code that might actually be instructive to TPJ readers.
This is where "callbacks" come in. They're well-known in GUI circles, since interfaces need to know what to do when one presses a button or selects a menu item. Here, HTML::LinkExtor needs to know what to do with links (all tags, actually) when it finds them.
My callback is an anonymous subroutine reference:
    sub {
        my ($t, %a) = @_;
        my @links = map { url($_, $link)->abs() }
                    grep { defined } @a{qw/href img/};
        print STDERR "+ $_" foreach @links;
        push @ARGV, @links;
    }
I didn't notice until later that $link is actually scoped just outside of this subroutine (in the while loop), making this subroutine look almost like a closure. It's not a classical closure - it doesn't define its own storage - but it does use a lexical value far away from where it is defined. (Enough justification for a section title!)
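To see the same mechanism with the modules stripped away, here is a minimal sketch. The each_word routine is a made-up stand-in for a parser (it is not part of HTML::LinkExtor), and the strings are arbitrary; the point is only that the anonymous sub is handed over as a callback and still sees $base, a lexical declared in the enclosing while loop.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parser: calls $callback once per word found.
sub each_word {
    my ($callback, $text) = @_;
    $callback->($_) for split ' ', $text;
    return;
}

my @todo = ("a/", "b/");
my @found;

while (my $base = shift @todo) {
    # The anonymous sub uses $base, which is scoped to the while loop,
    # much as the spider's callback uses $link.
    each_word(sub { push @found, $base . $_[0] }, "x y");
}

print "$_\n" for @found;
```

Each pass through the loop builds a fresh callback bound to that pass's $base, which is why the spider can resolve relative links against whichever page it is currently working on.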
my $button = $main->Button( ... )->pack();
We use a similar approach, except we don't keep a copy of the created reference (which is stored in $button above):
HTML::LinkExtor->new(...)->parse(...);
This is a nice shortcut to use whenever you want to create an object for a single use.
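As a rough illustration of the shortcut, assuming a toy class of my own invention (WordParser is hypothetical and only mimics the new-then-parse shape of HTML::LinkExtor), the whole create-and-use cycle fits in one statement with no variable holding the object:

```perl
use strict;
use warnings;

package WordParser;   # hypothetical class, only for illustration

sub new {
    my ($class, $callback) = @_;
    return bless { callback => $callback }, $class;
}

sub parse {
    my ($self, $text) = @_;
    $self->{callback}->($_) for split ' ', $text;
    return $self;
}

package main;

my @seen;

# Construct the object and call a method on it in a single expression;
# nothing keeps a reference to the object afterwards.
WordParser->new(sub { push @seen, uc $_[0] })->parse("one two");

print "@seen\n";
```

The trade-off is that the object cannot be reused or inspected later, which is fine here because everything interesting happens inside the callback.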
HTML::LinkExtor->new(...)->parse( get $link );
Where get() is a function provided by LWP::Simple; it returns the contents of a given URL.
Unfortunately, I needed to check the Content-Type of the returned data. The first version merrily tried to parse .tar.gz files and got confused:
    working on ./dist/irchat/irchat-3.03.tar.gz
    Use of uninitialized value at /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 104.
    Use of uninitialized value at /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 107.
    Use of uninitialized value at /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 82.
Ooops.
Switching to the "industrial strength" LWP::UserAgent module allowed me to check the Content-Type of the fetched page. Its simple_request method hands back an HTTP::Response object; using that object's content_type method and a quick ?: construct, I could give the parser either the HTML content or an empty string.
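The guard itself is small enough to pull out and try offline. This is a sketch under assumptions: html_or_empty is a name I made up, and FakeResponse is a hypothetical stand-in that only imitates the two methods the one-liner calls on the real response object, so that nothing here touches the network.

```perl
use strict;
use warnings;

# Return the body only when the response is HTML; otherwise return an
# empty string, which gives the parser nothing to work on.
sub html_or_empty {
    my ($r) = @_;
    return $r->content_type eq "text/html" ? $r->content : "";
}

# Hypothetical stand-in for a response object so the sketch runs offline;
# in the spider, $ua->simple_request(...) supplies the real thing.
package FakeResponse;
sub new          { my ($class, %f) = @_; return bless {%f}, $class }
sub content_type { $_[0]{type} }
sub content      { $_[0]{body} }

package main;

my $page    = FakeResponse->new(type => "text/html",
                                body => "<a href='x'>x</a>");
my $tarball = FakeResponse->new(type => "application/x-gzip",
                                body => "\x1f\x8b");

print "page:    ", length(html_or_empty($page)),    " bytes to parse\n";
print "tarball: ", length(html_or_empty($tarball)), " bytes to parse\n";
```

Passing an empty string for anything that is not HTML keeps the chained new-and-parse expression intact, since parse always receives a defined value.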
Obviously, this spider does nothing more than visit HTML pages and try to grab all the links off of each one. It could be more polite (but see the LWP::RobotUA module for some of that) and it could be smarter about which links to visit. In particular, there's no sense of which pages have already been visited; a tied DBM of visited pages would solve that nicely.
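Here is one way the visited-pages idea might look, as a sketch rather than a finished fix. It assumes SDBM_File (a DBM flavor bundled with Perl) and a throwaway temporary directory; the example.com URLs are placeholders, and the fetch-and-parse step is reduced to a push so the loop can run without a network.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;                  # a DBM implementation bundled with Perl
use File::Temp qw(tempdir);

# Assumption: the DBM files live in a throwaway directory; a real spider
# would pick a permanent path so the record survives between runs.
my $dir = tempdir(CLEANUP => 1);
tie my %visited, 'SDBM_File', "$dir/visited", O_RDWR | O_CREAT, 0644
    or die "cannot tie DBM file: $!";

# Placeholder queue with one deliberate duplicate.
my @todo = qw(
    http://example.com/
    http://example.com/a.html
    http://example.com/
);
my @fetched;

while (my $link = shift @todo) {
    next if $visited{$link}++;   # skip anything already recorded
    push @fetched, $link;        # the spider would fetch and parse here
}

print "$_\n" for @fetched;
untie %visited;
```

Because the hash is tied to a file, the same test would keep working across runs if the path were permanent. SDBM limits how large a single record can be, so a different DBM module may suit very long URLs better.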
Even with these limitations, I'm impressed at the power expressed by that "one" line. Kudos for that go to Gisle Aas (the author of LWP) and to Larry Wall, for making a language that does all the boring stuff for us. Thanks Gisle and Larry!
__END__