Modules:
  Finance::QuoteHist ..... CPAN
  HTML::TableExtract ..... CPAN
  LWP::UserAgent ......... CPAN
  Date::Manip ............ CPAN
  HTML::Parser ........... CPAN
"Sugar is no good
Once it's cast among the white sand"
-Ted Hawkins
Sometimes tools of enormous utility arise as a natural consequence of solving what at first seems like a simple problem. Such was the case with HTML::TableExtract, spawned in the quest for automatically retrieving historical stock quotes with Finance::QuoteHist.
For a human, retrieving information from the web is straightforward. Start with your favorite site. If it is down, move to another site until you find one that is available. Then submit your query and read the result. Automating this task should be simple. And much of the process is, except for one step: reliably extracting the embedded information, free of its surrounding context. Thus the quest for historical stock quotes in the Finance::QuoteHist module begat another module: HTML::TableExtract. Describing how these modules work will hopefully seed more ideas for dependably recovering pearls of information cast among the sands of HTML.
Why are historical stock quotes desirable? After all, charts depicting the past stock performance of companies are readily available on the web. In my particular case, the itch that needed scratching was a desire to implement a Value Per Share (VPS) portfolio tracking system. The VPS system is a method of tracking a personal portfolio similar to how a mutual fund tracks its performance. The VPS index, a measure of your portfolio performance, is consistent regardless of your composite holdings on any given day or any movement of money into and out of your portfolio. For any window of time in your portfolio history, you can reliably track its performance against other funds and indices; after all, the name of the game isn't whether you are up or down on any given day, it's whether you beat the indices. If you can't beat the averages, just dump your money into index funds.
For VPS tracking, an arbitrary number of fictitious "portfolio shares" are assigned to your portfolio, starting from the date it was created. The initial number of shares is not relevant, nor does it have any relation to actual shares you might hold in a particular company. It's just an arbitrary starting point to serve as a reference for comparison to future values. For each day, the VPS is calculated by dividing the total value of your portfolio by the number of shares you own on that day.
The number of shares in the portfolio is affected only by moving cash into or out of the portfolio account, which is equivalent to buying or selling a number of these shares at the VPS for that day. For example, if you add cash to your portfolio, the number of VPS shares you own increases, but the VPS itself remains unchanged - the value of your fictitious shares stays constant, since adding money doesn't mean you're a better stock picker. With an accurate record of your portfolio activities and the daily closing prices of the stocks you owned, the entire VPS record of the portfolio can be retroactively calculated.
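As a concrete sketch of the arithmetic (the numbers and variable names here are invented for illustration; they are not part of any module):

    # Day 0: portfolio worth $10,000, arbitrarily assigned 1,000 shares.
    my $shares = 1_000;
    my $value  = 10_000;
    my $vps    = $value / $shares;     # VPS = 10.00

    # The holdings appreciate to $11,000: VPS rises, share count does not.
    $value = 11_000;
    $vps   = $value / $shares;         # VPS = 11.00

    # Now $2,200 of new cash arrives: it "buys" shares at the current
    # VPS, so the share count grows but the VPS itself is unchanged.
    my $cash_in = 2_200;
    $shares += $cash_in / $vps;        # 1,200 shares
    $value  += $cash_in;               # $13,200
    $vps     = $value / $shares;       # still 11.00

    print "shares = $shares, VPS = $vps\n";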
VPS, incidentally, is the method many portfolio tracking tools use to track stocks, such as PortTrak from the Motley Fool. They typically don't have access to historical stock quotes, however, so the VPS last calculated is simply carried forward each day until you update your stock quotes again. Retroactive analysis of your portfolio is not possible unless you used the tracking tool on the first day you started your portfolio. And even if you had been using the tool from the beginning, their method of stock updates still introduces bugs in the VPS calculation. Practically speaking, this is acceptable for a broad view of performance, but not for fine grained analysis. Nothing is more irritating to Perl programmers than incomplete data; why settle for a pretty picture when you can make it gorgeous?
The interface for Finance::QuoteHist is simple when we gloss over where and how the data is obtained. Here's how we'd print out the quotes, splits, and dividends for three stocks from the beginning of 2000 until today:
use Finance::QuoteHist;

$q = Finance::QuoteHist->new(
    symbols    => [qw(LNUX MSFT IBM)],
    start_date => '01/01/2000',
    end_date   => 'today',
);

print "Quotes:\n";
foreach $row ($q->quotes()) {
    ($date, $open, $high, $low, $close, $volume) = @$row;
    print "$date $close\n";
}

@splits = $q->splits();
if (@splits) {
    print "\nSplits\n";
    foreach $row ($q->splits()) {
        ($date, $new_shares, $old_shares) = @$row;
        print "$date $new_shares:$old_shares\n";
    }
}

@dividends = $q->dividends();
if (@dividends) {
    print "\nDividends\n";
    foreach $row ($q->dividends()) {
        ($date, $amount) = @$row;
        print "$date $amount\n";
    }
}
Unless otherwise specified, all values returned by Finance::QuoteHist are adjusted for stock splits. Finance::QuoteHist uses Date::Manip to parse dates, so any date format accepted by that module can be used with Finance::QuoteHist, including such useful phrasings as "2 years ago". Finance::QuoteHist uses LWP::UserAgent, and provides an accessor method for configuring that workhorse class for things like proxy access.
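For instance, here's a minimal Date::Manip sketch (independent of Finance::QuoteHist) showing how such a phrase resolves to a concrete date:

    use Date::Manip;

    # Date::Manip turns relative phrasings into concrete dates.
    my $date = ParseDate('2 years ago');
    print UnixDate($date, '%Y/%m/%d'), "\n";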
Just as the DBI module provides an abstraction layer over different database engines, Finance::QuoteHist separates the particulars of how information is obtained from how that information is manipulated. In this sense, every web site is a datastore with a unique API. With historical stock quotes, this API is typically affected by the following factors:
URL base for queries
Query limits
Result presentation
The first factor is rather obvious; each site will have its own URL and style of accepting query parameters.
The second is easily discovered through trial and error. Some sites return up to a month of data for a single query, some up to a year, and so on. Some sites, such as Yahoo!, have different query limits depending on how you request the information. When there are limits involved, the full query might be composed of several sub-queries. Tuning for query limits is important for two reasons: so that you can be sure you're covering the full range of your query, and for making your retrievals as efficient as possible by using the largest ranges possible per sub-query.
The third factor - result presentation - is potentially one of the largest impediments to reliable web mining, especially if the results are presented in HTML. It is easy to grab a web page and extract a particular nugget of information from the HTML. It is more difficult to reliably and adaptively extract that nugget despite changes in the layout of the surrounding HTML. That's where HTML::TableExtract comes in, and we'll talk about it soon.
Finance::QuoteHist can be viewed as an aggregate for several site-specific modules. If the first site fails, then the next site in line is attempted, and so on. A site might fail for several reasons: perhaps it's temporarily out of service, or perhaps it doesn't carry information about a defunct ticker symbol. Some of the more reliable sites, such as Yahoo! or the Motley Fool, do not provide information on companies that are no longer publicly traded under an old symbol. Under these circumstances, it is wise to include a site which has this information but might be less reliable for "first tier" use. A third way for a site to fail is when the site URL or content has been significantly altered.
This sequence of sites is specified with the lineup parameter, whose values are site-specific class names. The following is identical to the example above, except that the default lineup is explicitly provided:
$q = Finance::QuoteHist->new(
    lineup => [qw(
        Finance::QuoteHist::Yahoo
        Finance::QuoteHist::MotleyFool
        Finance::QuoteHist::FinancialWeb
    )],
    symbols    => [qw(LNUX MSFT IBM)],
    start_date => '01/01/2000',
    end_date   => 'today',
);
Finance::QuoteHist::new() returns a reference to an object blessed into the first site-specific class specified in the lineup. The remaining site-specific classes are passed along as the next lineup in the event that the first site has problems.
The site-specific classes are all sub-classes of Finance::QuoteHist::Generic, which is the engine that does most of the work. For simple HTML quote extraction, all these sub-classes need to provide is a single method, quote_urls(), which returns a list of URLs for quote retrieval given a ticker symbol and date range. You can override the new() constructor to set default parameters that apply to the site, such as reverse (which inverts the order of the quotes from how they appear on the page) or column_labels (which affects how tables are extracted from the HTML).
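A skeletal sketch of such a subclass might look like the following. The class name is invented and the default values are assumptions about an imaginary site, but reverse and column_labels are the parameters just described:

    package Finance::QuoteHist::Example;   # hypothetical site-specific class

    use strict;
    use vars qw(@ISA);
    use Finance::QuoteHist::Generic;
    @ISA = qw(Finance::QuoteHist::Generic);

    sub new {
        my ($class, %parms) = @_;
        # Assumed site quirks: quotes arrive newest-first, under
        # slightly different column headings.
        $parms{reverse}       = 1 unless defined $parms{reverse};
        $parms{column_labels} ||= [qw(Date Open High Low Close Vol)];
        my $self = Finance::QuoteHist::Generic->new(%parms);
        bless $self, $class;
    }

    # A quote_urls() method would go here; see the Motley Fool
    # implementation later in this article.

    1;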
There are also provisions for sites that provide more than just HTML quotes. Yahoo!, for example, also provides information on stock dividends and splits. If it's quotes or dividends you are interested in, the Yahoo! data can be quickly downloaded in CSV (Comma-Separated Values) format rather than HTML. Sometimes, such as with split information and non-adjusted quotes, the data is only available in HTML. For splits, matters are complicated further because the information is embedded in the regular quote results - there's no special query dedicated to splits. Finance::QuoteHist can also automatically adjust non-adjusted data as well as report which site class retrieved the data for a particular ticker symbol. Eventually, Finance::QuoteHist will also be capable of currency conversion.
Finance::QuoteHist is currently equipped to parse both HTML and CSV data. CSV is quick to download, and is the format in which most Finance::QuoteHist queries will be fulfilled (courtesy of Yahoo!). CSV is fast and boring. All other sites, and even Yahoo! in some circumstances, provide the data in HTML tables. HTML is big, messy, and interesting.
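To see why CSV is the boring case, here's a minimal sketch of pulling closing prices out of rows shaped like "Date,Open,High,Low,Close,Volume" (that layout, the file name, and the parsing loop are illustrative assumptions, not part of Finance::QuoteHist):

    open(CSV, 'quotes.csv') or die "can't open quotes.csv: $!";
    while (my $line = <CSV>) {
        chomp $line;
        next if $line =~ /^Date/;    # skip the header row
        my ($date, $open, $high, $low, $close, $volume) = split /,/, $line;
        print "$date $close\n";
    }
    close(CSV);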
Markup languages serve a purpose: providing a parsable substrate for information. In the case of HTML, the point of this substrate is to give computers a clue about how to present the information in a more human-readable fashion. Often, when information is massaged to be more easily digested by humans, visual elements are introduced, obscuring the data for computers. Luckily, HTML structures are sometimes lackadaisical in their obfuscation. They slip up, and let computer-friendly contextual clues slip into the presentation. Such is the case with HTML tables.
The real hero of Finance::QuoteHist is the HTML::TableExtract module. This module tries very hard to decouple all HTML dependencies from the information contained in tables, without resorting to horrendous regular expressions. Complete severance from HTML might sound draconian, but it releases the developer from having to adjust extraction code whenever a web site is tweaked slightly. In the case of tabular information, it really isn't important that the information is cloaked in HTML. What is important is that the information is organized in a grid.
As anyone who has played with HTML parsers can tell you, markup languages can be modeled as tree structures (see the HTML::Element pod for commentary on this topic, and Sean Burke's Scanning HTML article elsewhere in this issue). Mapping a tree structure to a grid, as with HTML tables, is a lot of work for layout managers, especially when tables contain other tables. If the primary goal is extracting information rather than rendering it, most of these layout headaches can be avoided by doing two things: first, treating all tables separately, regardless of whether they are a subtable; and second, mapping the cells of each table to a grid by ignoring cell spans and using empty placeholder cells.
Grid contexts are easy for computers to understand, and that's the job of HTML::TableExtract. It recasts tree-like HTML tables into grids, by carefully keeping count of rows and cells in each table.
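A small sketch of that recasting, using only methods that appear later in this article (the HTML snippet itself is purely illustrative):

    use HTML::TableExtract;

    my $html = <<'HTML';
    <table>
      <tr><td>outer A</td><td colspan=2>outer B</td></tr>
      <tr><td colspan=3>
        <table><tr><td>inner A</td><td>inner B</td></tr></table>
      </td></tr>
    </table>
    HTML

    # With no constraints, every table -- including the subtable --
    # is captured and mapped onto its own grid.
    my $te = HTML::TableExtract->new;
    $te->parse($html);
    foreach my $ts ($te->table_states) {
        printf "table found at (depth, count) = (%d, %d)\n", $ts->coords;
    }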
That alone is an enormous step in decoupling the tabular nature of data from the HTML used to present it. The only remaining question is which table on a page holds the information. HTML::TableExtract offers three ways of distinguishing multiple tables on a web page: depth, layer count, and header information. The depth merely represents how deeply nested tables are: a top-level table (that is, most of the tables that you see when surfing the web) is depth 0, a table within that table is depth 1, and so on.
Within a given depth, each table has a count representing its order of appearance, starting with 0. Finally, tables can be targeted with headers, which are nothing more than pattern matches against the text above the columns of interest in a table. Headers are the most context-free option of them all, since they don't rely on spatial relationships between tables.
HTML::TableExtract also offers more elaborate techniques for specifying tables relative to one another, but these haven't yet been necessary for capturing historical stock quotes: the Finance::QuoteHist modules only need the column headers. The text provided for headers is eventually treated as a non-anchored, case-insensitive regular expression. The Finance::QuoteHist modules currently use an HTML::TableExtract object like this:
$te = HTML::TableExtract->new(
    headers => [qw(Date Open High Low Close Volume)]
);
$te->parse($html_string);
foreach ($te->rows) {
    ($date, $open, $high, $low, $close, $volume) = @$_;
    ...
}
This code is all you need to extract the quote data from every historical quote site I've found so far. Historical stock quotes are amenable to parsing, since the column headers are fairly uniform; the six words shown above are used everywhere. Column order is handled by the HTML::TableExtract module, which automatically returns columns in the same order as the headers you provide, regardless of their ordering on the web site.
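As a quick sketch of that behavior (the trimmed header list is just for illustration):

    # Only the columns under "Close" and "Date" are returned, and in
    # this order, regardless of how the site lays out its table.
    $te = HTML::TableExtract->new( headers => [qw(Close Date)] );
    $te->parse($html_string);
    foreach ($te->rows) {
        ($close, $date) = @$_;
        print "$date $close\n";
    }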
Header-based extraction is the most stable way to pull tables out of HTML using HTML::TableExtract, but depths and counts are also available. Consider the following:
$te1 = new HTML::TableExtract( depth => 0 );
$te1->parse($html_string);
foreach $ts ($te1->table_states) {
    print "Table found at ", join(',', $ts->coords), "\n";
}

$te2 = new HTML::TableExtract( count => 1 );
$te2->parse_file(\*HTML_FILE);
foreach $ts ($te2->table_states) {
    print "Table found at ", join(',', $ts->coords), "\n";
}

$te3 = new HTML::TableExtract( depth => 0, count => 2 );
$te3->parse($html_string);
print "Unique table found at ",
      join(',', $te3->first_table_state_found->coords), "\n";
In the first example, all tables at depth 0 are extracted. In the second example, the second table at each depth is extracted. In the last example, the third table at depth 0 is extracted. Providing both a depth and a count uniquely specifies a particular table in an HTML document.
Depths and counts work well for targeting tables, but they do retain context that could foul up future extractions. If the maintainer of the web site adds or removes tables, it may well affect the depth and count of the page's tables, which is why using the headers is likely to be more robust.
If you find a new source of historical quotes not covered by the latest release of Finance::QuoteHist, you can create your own Finance::QuoteHist module. Although the current distribution is very adaptive, there's always a possibility of a quote site changing dramatically enough to break a module. Thanks to HTML::TableExtract, these failures are less likely to involve the specter of HTML layout than changes to the URL: either of the site itself, or within the syntax for calls to a CGI program. Looking under the hood, consider the quote_urls() implementation used for the Motley Fool:
sub quote_urls {
    my ($self, $ticker, $start_date, $end_date) = @_;
    $ticker or croak "Ticker symbol required\n";
    $start_date = $self->{start_date} unless $start_date;
    $end_date   = $self->{end_date}   unless $end_date;

    # For splitting dates of the form 'YYYYMMDD'
    my $date_pat = qr/^\s*(\d{4})(\d{2})(\d{2})/;

    # Make sure date boundaries are in the correct order.
    if ($start_date gt $end_date) {
        ($start_date, $end_date) = ($end_date, $start_date);
    }

    # Break date range into 30 day blocks (last block will likely
    # end up being less than 30 days)
    my (%date_pairs, $low_date, $high_date);
    $low_date = $start_date;
    while (1) {
        $high_date = DateCalc($low_date, '+ 30 days');
        last if $high_date gt $end_date;
        $date_pairs{$low_date} = $high_date;
        $low_date = DateCalc($high_date, '+ 1 day');
    }
    # Last query block always ends with end_date
    $date_pairs{$low_date} = $end_date;

    my @urls;
    foreach (sort keys %date_pairs) {
        # Break dates into year, month, day.
        my ($sy, $sm, $sd) = /$date_pat/;
        my ($ey, $em, $ed) = $date_pairs{$_} =~ /$date_pat/;

        # Construct and store URL
        my $base  = 'http://quote.fool.com/historical/historicalquotes.asp?';
        my $query = join('&',
                         "startmo=$sm", "startday=$sd", "startyr=$sy",
                         "endmo=$em",   "endday=$ed",   "endyr=$ey",
                         "symbols=$ticker", "currticker=$ticker",
                         'period=daily');
        push(@urls, $base . $query);
    }
    @urls;
}
The Fool accepts starting and ending dates, but only returns the last month of data in the range specified. This implementation of quote_urls() divides the queries across thirty-day increments for a single symbol, relying heavily on the DateCalc routine from the lovely and talented Date::Manip module.
The site-specific Finance::QuoteHist modules are all subclasses of Finance::QuoteHist::Generic, and you should inherit from this module as well if you're creating your own site-specific module. The Finance::QuoteHist::Generic module invokes the quote_urls() method across the supplied list of ticker symbols; if that site fails to produce results, the generic module invokes the next module in the lineup. Adding new sites to the lineup, or replacing old ones, is as simple as creating another quote_urls() method in a new site-specific module. Simple observation of how the site handles manual queries is usually all that is necessary to figure out the URL syntax and query limits. Additionally, since Finance::QuoteHist is based on LWP::UserAgent and HTTP::Request, the entire arsenal of networking tricks is available for use in cases where more than a simple HTTP GET is necessary.
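Once your module exists (the class name Finance::QuoteHist::MySite below is hypothetical), wiring it in is just a matter of naming it in the lineup, with the stock classes as fallbacks:

    $q = Finance::QuoteHist->new(
        lineup => [qw(
            Finance::QuoteHist::MySite
            Finance::QuoteHist::Yahoo
            Finance::QuoteHist::MotleyFool
        )],
        symbols    => [qw(IBM)],
        start_date => '01/01/2000',
        end_date   => 'today',
    );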
The Finance::QuoteHist engine also provides services for handling more varied information, such as what Yahoo! Finance provides. The price of the stock is one example of what the module calls a target. Other targets might be dividends or splits, and Finance::QuoteHist comes with handlers for those targets. The top-level methods (such as quotes(), splits(), and dividends()) are all wrappers around their respective target handlers. The information for a target is either retrieved directly, via a mode-specific TARGET_get() method, or indirectly by filtering the output of a different target mode with a TARGET_extract() method. Labels for the target columns of direct queries are provided by TARGET_labels(); these are used both for HTML column extraction as well as for determining column order.
For instance, Finance::QuoteHist::Yahoo provides the following methods:
quote_urls()
dividend_urls()
dividend_get()
dividend_extract()
split_get()
split_extract()
Note that there's no need to provide quote_labels(), dividend_labels(), or split_labels() methods, since those labels are set by default in all Finance::QuoteHist classes, although you can override them if you wish. There is no split_urls() method, because split information can only be found embedded in regular quote results on Yahoo!; the split_get() method will end up performing an HTML quote query and relying on the split_extract() method to get the information. The dividend_get() method, on the other hand, performs a direct query for dividends since a dividend_urls() method exists - unless the work has already been done by dividend_extract() during a regular quote query.
The documentation included with the modules includes plenty of detailed information for developers interested in writing their own site-specific modules.
Historical stock quotes are just one application of HTML table extraction. The web is bursting with useful information, trapped in tables and the noise of HTML. Liberating this information for automation is made easier with HTML::TableExtract.
Go forth and browse without fear of tangled and encumbered HTML regular expressions, for there are still countless pearls scattered throughout the sands of HTML and the wide world of the web.
__END__

Matthew Sisk (sisk@mojotoad.com) is the author of several CPAN modules, including HTML::TableExtract and Finance::QuoteHist. Professionally, he is a Houston-based consultant and developer for Bluware, Inc. In real life he is a mountain biker, home brewer, and shameless mixer of metaphors.