URLs
Perl-XML FAQ: http://www.perlxml.com/faq/perl-xml-faq.html
XML spec: http://www.w3.org/TR/REC-xml
XSL: http://www.w3.org/Style/XSL
XML::Parser: http://www.perl.com/CPAN/modules/by-author/id/C/CO/COOPERCL/
Have you heard? XML is going to change the very nature of computing. No more data conversion headaches or legacy code nightmares. And it will be everywhere: in databases, Web browsers, toaster ovens, and your favorite breakfast cereal. Oh, and it's Y2K compliant.
I may be exaggerating a bit, but many people smarter than me say that it's going to make a significant impact, so they must be right. Right?
Once upon a time, these same smart people decided that a common document format would be good. So they invented a format that did everything. They called it SGML, and ISO, the international standards organization, said it was good. Unfortunately, it was so complete that few SGML users knew how it all worked. Then came HTML, a markup language that is itself a derivative of SGML. HTML found wide acceptance thanks to its simplicity and ease of use.
HTML's popularity led to people wanting more ambitious layouts. Developers designed ever more convoluted mutations for complex Web-based presentations, and the limitations of HTML began to show. Furthermore, the growing volume of HTML content on the Internet and HTML's inability to define new tags made it difficult, if not impossible, to search for and manage data. So the World Wide Web Consortium, the standards body for the World Wide Web, formed an SGML working group to fix the problem. The result was XML.
The tenets of XML are simplicity and utility, the same features that made HTML so popular. At the same time, XML was designed to preserve the extensibility of SGML. This led to a language lacking many of the non-essential features that made SGML so complex. Now, many analysts and programmers are hailing XML as the next step in the evolution of the World Wide Web.
In this article we will learn the basics of XML and how to manipulate XML in Perl. Then, we will design a simple XML format for creating a FAQ, and write a program to convert it into HTML.
A valid XML file (more on what validity means later) begins with three things: an XML declaration, a Document Type Definition (DTD), and a root element. (In XML parlance, "element" means the same as "tag".) In the example that follows, we'll look at an address database containing contact information for people; our root element will be called contact.
Here's the first line of our XML file, the XML declaration:
<?xml version="1.0"?>
Note that xml must be in lower case. The DTD comes next:
<!DOCTYPE contact SYSTEM "contact.dtd">
This DTD declaration identifies the XML document as being of type contact. It also indicates that the rules for the contact document type are stored in the file contact.dtd. DTDs are described in more detail later in this article.
Finally, the root element, like the <html> tag in HTML, envelops the rest of the XML document. Unlike HTML, you're free to choose whatever name you like.
Every XML document contains data within nested elements. A simple XML record containing my contact information might look like this:
<contact>
  <name>Jonathan Eisenzopf</name>
  <address>555 Foobar Lane</address>
  <phone number="555-5555"/>
  <email address="eisen@pobox.com"/>
</contact>
You can find the complete XML specification at http://www.w3.org/TR/REC-xml.
Like SGML and HTML, XML is a structured format that uses tags to define elements within a document. You're free to define any tag you like. Here, I've made up a name tag:
<name>Jonathan Eisenzopf</name>
This next tag contains a phone number. Notice the / character before the end of the tag. Unlike HTML, every element must contain an end tag--but the start and end tags can be one and the same. Here, the phone element is self-contained, so I include the slash to make it an end tag.
<phone number="555-5555"/>
This tag also contains one attribute named number, whose value is 555-5555. In XML, attribute values must be contained in quotes.
Document Type Definitions are the heart and soul of XML. Each DTD is a formal description of a document's structure; there are DTDs for domains like real estate, mathematics, news, and VRML. DTDs allow programs to validate the organization of a document and extract its meaning programmatically--typically using Perl.
Here, we create a DTD file named contact.dtd, which declares the element types for our contact database:
<!ELEMENT contact (name, address, phone, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT phone EMPTY>
<!ELEMENT email EMPTY>
<!ATTLIST phone number CDATA #IMPLIED>
<!ATTLIST email address CDATA #IMPLIED>
This DTD declares the contact element as containing name, address, phone, and email elements. The name and address elements contain character data (#PCDATA), while the phone and email elements contain no data at all (EMPTY). The last two lines of the DTD declare the valid element attributes for the phone and email elements: number and address respectively.
A well-formed XML document needs a root element and properly-nested tags. That's all. It doesn't even need a DTD. A valid XML document, on the other hand, is not only well-formed, but also contains a structure conforming to the associated DTD. Thus, XML that needs to be validated must include a document type declaration at the beginning of the XML document.
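The well-formedness half of that distinction is easy to check from Perl. What follows is a minimal sketch, assuming the XML::Parser module (introduced later in this article) is installed. The is_well_formed() helper and the two sample strings are made up for illustration. It relies on the fact that XML::Parser dies as soon as it meets a well-formedness error.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if $xml is well-formed, 0 otherwise.
# XML::Parser dies on the first well-formedness error, so we trap it with eval.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

# Properly nested tags versus crossed end tags
my $good = '<contact><name>Jonathan Eisenzopf</name></contact>';
my $bad  = '<contact><name>Jonathan Eisenzopf</contact></name>';

for my $doc ($good, $bad) {
    print is_well_formed($doc) ? "well-formed\n" : "not well-formed\n";
}
```

Because the parser is non-validating, a document that passes this check may still violate its DTD.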
Unlike HTML, XML does not contain any information regarding how data is to be displayed on the user's screen. The two primary methods of displaying XML are stylesheets, and simply converting XML into another format.
The dominant proposal for XML stylesheets is the Extensible Style Language (XSL). XSL is based on the Document Style Semantics and Specification Language (DSSSL), an ISO standard designed to be used with SGML. It also borrows features from the older Cascading Style Sheets (CSS), which you can use in HTML. The difference between XSL and DSSSL is that XSL is itself XML. XSL is not yet an official standard, but it probably will be soon. Currently, tools exist to convert XML using an XSL style sheet into formats like PostScript, RTF, SGML, TeX, and HTML. http://www.w3.org/Style/XSL/ contains more information on XSL including tools, libraries, and tutorials.
The second technique for displaying XML documents is to directly convert XML to another format using an XML parser and Perl. When the conversion is simple, there's no need for a stylesheet mechanism like XSL. As we'll see, converting XML to HTML is easy as long as the XML document is fairly simple.
As of developer release 5.005_52, Perl includes a pragma that handles UTF-8 encoded strings. This is a good thing, since the XML standard requires that all parsers handle both UTF-8 and UTF-16 encoded text. UTF-8 enables the basic Unicode support required for multi-byte character languages like Japanese. This is a recent advance, but a very important one for Perl and especially XML. Thanks go out to Larry Wall and contributors.
To enable Unicode support, simply add the following line to the beginning of your program:
use utf8;
Larry Wall wrote the first XML::Parser module, now being maintained and updated by Clark Cooper. XML::Parser is built on top of James Clark's Expat, a non-validating XML parser written in C. A non-validating parser checks the XML document for well-formedness, but does not validate it against a DTD. The latest version of XML::Parser can be found at http://www.perl.com/CPAN-local/authors/id/C/CO/COOPERCL/.
Because XML::Parser is an event-based parser, we must register event handlers to process the incoming XML. As the XML data is parsed, the handlers are called when the corresponding event is triggered.
For processing to occur, we must set event handlers when we create a new instance of the XML::Parser class:
use XML::Parser;
my $parser = new XML::Parser(Handlers => {
                 Start => \&handle_start,
                 End   => \&handle_end});
This creates two event handlers, Start and End. Each time the parser encounters a new element (or tag), it triggers the Start event, which in turn executes the associated subroutine, handle_start(). Likewise, when the parser finds the end of an element, it triggers the End event and executes handle_end().
Once you've set your event handlers, it's time to parse the XML. Below is a script that parses an XML file from the command line ($ARGV[0]) and prints each element as it's found. It also prints the depth of each element in the XML structure:
 1  #!/usr/bin/perl -w
 2
 3  use XML::Parser;
 4  my $deep = 0;
 5  my $parser = new XML::Parser(Handlers => {
 6                   Start => \&handle_start, End => \&handle_end});
 7  $parser->parsefile($ARGV[0]);
 8
 9  sub handle_start {
10      my $p  = shift;
11      my $el = shift;
12      $deep++;
13      print "$el - $deep\n";
14  }
15
16  sub handle_end {
17      $deep--;
18  }
Lines 5 and 6 create a new instance of XML::Parser and assign the subroutines &handle_start and &handle_end to the Start and End events. Each time the XML parser detects a new tag, it calls the handler assigned to the Start event, &handle_start. Line 7 then parses the file named on the command line, calling the handlers as each start and end tag is encountered. Nothing is stored; the $deep counter simply tracks how deeply nested the current element is.
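The Start handler receives more than the element name: XML::Parser also passes the element's attributes as a list of name and value pairs after it. Here is a minimal sketch of reading them, assuming the contact record shown earlier is supplied as a string. The @seen array is a made-up name used only to collect results for this example.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;   # hypothetical collector of "element attribute=value" strings

sub handle_start {
    # Arguments: the parser object, the element name, then attribute pairs
    my ($p, $el, %attr) = @_;
    for my $name (sort keys %attr) {
        push @seen, "$el $name=$attr{$name}";
    }
}

my $parser = XML::Parser->new(Handlers => { Start => \&handle_start });

# parse() takes the document as a string; parsefile() takes a filename
$parser->parse(<<'XML');
<contact>
  <name>Jonathan Eisenzopf</name>
  <address>555 Foobar Lane</address>
  <phone number="555-5555"/>
  <email address="eisen@pobox.com"/>
</contact>
XML

print "$_\n" for @seen;
```

Elements without attributes, such as name and address, simply produce an empty %attr hash.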
The XML::Parser module includes built-in styles that contain their own event handlers and implement built-in structures. The available styles are: Debug, Subs, Tree, Objects, and Stream. The Debug style prints information about an XML structure:
#!/usr/bin/perl -w

use XML::Parser;
my $parser = new XML::Parser(Style => 'Debug');
$parser->parsefile($ARGV[0]);
When provided with our earlier contact database, this program prints the following:
 \\ ()
contact || #10;
contact \\ ()
contact name || Jonathan Eisenzopf
contact //
contact || #10;
contact \\ ()
contact address || 555 Foobar Lane
contact //
contact || #10;
contact \\ (number 555-5555)
contact //
contact || #10;
contact \\ (address eisen@pobox.com)
contact //
contact || #10;
 //
This is simple stuff, but remember that XML is used for everything from databases to books. Visualizing the structure helps correct organizational problems.
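Of the other built-in styles, Tree is handy when you would rather walk a data structure than write handlers. The sketch below assumes the layout described in the XML::Parser documentation: parse() returns a tag followed by a content array, whose first item is the attribute hash and whose remaining items are tag and content pairs, with the pseudo-tag 0 marking character data. The outline() subroutine is a made-up helper for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: collect an indented outline of element names.
sub outline {
    my ($tag, $content, $depth, $out) = @_;
    push @$out, ('  ' x $depth) . $tag;
    my @pairs = @{$content}[1 .. $#$content];   # skip the attribute hash
    while (@pairs) {
        my ($child, $child_content) = splice(@pairs, 0, 2);
        next if $child eq '0';                  # pseudo-tag 0 is character data
        outline($child, $child_content, $depth + 1, $out);
    }
    return $out;
}

# With Style => 'Tree', parse() returns [ tag, content ] for the root element
my $tree = XML::Parser->new(Style => 'Tree')->parse(
    '<contact><name>Jonathan Eisenzopf</name><phone number="555-5555"/></contact>'
);

print "$_\n" for @{ outline(@$tree, 0, []) };
```

The whole document is held in memory at once, so this style suits small files better than large ones.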
Nearly every topic has its own FAQ, usually in HTML. Keeping a FAQ fresh can be difficult, especially if it requires frequent updates. You might also want to generate ASCII for sending mail, or a PostScript version for printing.
XML and Perl can address these problems with minimal effort. First, we define a DTD for our XML FAQ:
<!ELEMENT faq (header, abstract, section+)>
<!ELEMENT header (title, author, version)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT section (question, answer)>
<!ELEMENT question (#PCDATA)>
<!ELEMENT answer (#PCDATA)>
<!ATTLIST version number CDATA #IMPLIED>
Because the XML::Parser module is non-validating, we can't validate our XML FAQ using this DTD--but we can use it as a roadmap to define how our XML file is to be constructed. In this example, we will be maintaining a Music Trivia FAQ. Here's our XML file:
<faq>
  <header>
    <title>Music Trivia FAQ</title>
    <author>Jonathan Eisenzopf</author>
    <version>1.0</version>
  </header>
  <abstract>This FAQ contains musical trivia that includes bands, compositions, and writers.
  </abstract>
  <section>
    <question>When was Johann Sebastian Bach born?
    </question>
    <answer>Johann Sebastian Bach was born in 1685.
    </answer>
  </section>
  <section>
    <question>What band defied the laws of tradition in their album entitled "Frizzle Fry"?
    </question>
    <answer>Primus</answer>
  </section>
</faq>
Now we write a script to parse the XML file, assign handler subroutines, and convert the XML to HTML. The source code is shown below. First, we load the XML::Parser module and create a new XML::Parser object. Notice that we have assigned three handlers: the familiar Start and End, and Char for character data.
#!/usr/bin/perl -w
# Parsing an XML-based FAQ

use strict;
use XML::Parser;

die "Usage: xmlfaq.pl <file>\n" unless @ARGV == 1;

my $parser = new XML::Parser(Handlers => {
                 Start => \&handle_start,
                 End   => \&handle_end,
                 Char  => \&handle_char});
$parser->parsefile($ARGV[0]);

# Print the opening HTML for each element
sub handle_start {
    my ($p, $el) = @_;
    if    ($el =~ /\bheader\b/i)   {print "<CENTER><HR>";}
    elsif ($el =~ /\btitle\b/i)    {print "<H1>";}
    elsif ($el =~ /\bauthor\b/i)   {print "Author: <B>";}
    elsif ($el =~ /\bversion\b/i)  {print "Version: <B>";}
    elsif ($el =~ /\babstract\b/i) {print '<FONT SIZE=+1><B>Abstract</B></FONT><P>';}
    elsif ($el =~ /\bsection\b/i)  {print "<UL>";}
    elsif ($el =~ /\bquestion\b/i) {print "<li><B>Q:</B>";}
    elsif ($el =~ /\banswer\b/i)   {print "<dl><B>A:</B>";}
}

# Pass character data through unchanged
sub handle_char {
    my ($p, $data) = @_;
    print $data;
}

# Print the closing HTML for each element
sub handle_end {
    my ($p, $el) = @_;
    if    ($el =~ /\bheader\b/i)   {print "<HR></CENTER>";}
    elsif ($el =~ /\btitle\b/i)    {print "</H1>";}
    elsif ($el =~ /\bauthor\b/i)   {print "</B><BR>";}
    elsif ($el =~ /\bversion\b/i)  {print "</B><BR>";}
    elsif ($el =~ /\babstract\b/i) {print "<P>\n";}
    elsif ($el =~ /\bsection\b/i)  {print "</UL>";}
    elsif ($el =~ /\bquestion\b/i) {print "</li>";}
    elsif ($el =~ /\banswer\b/i)   {print "</dl>";}
}
The handle_start() subroutine grabs the element name that the parser found, and the chained if/elsif tests print the appropriate HTML for each of the eight meaningful tags in our DTD. The handle_char() subroutine dumbly prints simple character data, and the handle_end() subroutine prints the appropriate HTML end tags, finishing what handle_start() started.
As you can see, we were able to create a simple XML to HTML converter in fewer than forty lines of code. Because the FAQ is in XML form, we can change the HTML output at whim, or convert it to an entirely different format almost as easily.
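To give a feel for how little changes when the target format does, here is a minimal sketch of a plain-text converter suitable for mail. It is not part of the original program. The %label table, the @lines array, and the layout they produce are assumptions chosen for illustration, and the document is supplied as a string rather than a file to keep the example self-contained.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical layout: a text label for each element we want to print
my %label = (title   => '',          author   => 'Author: ',
             version => 'Version: ', abstract => '',
             question => 'Q: ',      answer   => 'A: ');

my @lines;          # finished lines of plain-text output
my $buffer = '';    # character data collected for the current element

sub handle_start { $buffer = '' }                       # start a fresh buffer
sub handle_char  { my ($p, $data) = @_; $buffer .= $data }
sub handle_end {
    my ($p, $el) = @_;
    return unless exists $label{$el};                   # containers print nothing
    (my $text = $buffer) =~ s/^\s+|\s+$//g;             # trim the ends
    $text =~ s/\s+/ /g;                                 # collapse inner whitespace
    push @lines, $label{$el} . $text;
}

my $parser = XML::Parser->new(Handlers => {
    Start => \&handle_start, End => \&handle_end, Char => \&handle_char });

$parser->parse(<<'XML');
<faq>
  <section>
    <question>When was Johann Sebastian Bach born?
    </question>
    <answer>Johann Sebastian Bach was born in 1685.
    </answer>
  </section>
</faq>
XML

print "$_\n" for @lines;
```

Buffering the character data until the end tag arrives matters because the parser may deliver one run of text in several Char events.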
Based on its popularity, the Perl and XML combination has a bright future. Programmers have begun to release modules built on top of XML::Parser for DBI, CGI, and RPC, to name a few acronyms. For more information on XML in Perl, read http://www.perlxml.com/faq/perl-xml-faq.html.
If you are interested in learning more about the future of Perl and XML, join the Perl-XML mailing list by sending email to Lyris@ActiveState.com. In the body of the message, write SUBSCRIBE Perl-XML.
__END__
Jonathan Eisenzopf is a Senior Software Engineer for Whirlwind Interactive (http://www.wwind.com), a premier Web design and applications development firm located in the Washington D.C. area. He can be reached at eisen@pobox.com.