PREVIOUS  TABLE OF CONTENTS  NEXT 

Spreadsheet::WriteExcel

John McNamara

Resources:
Spreadsheet::WriteExcel ................................................................... CPAN
OLE::Storage ................................................................................... CPAN
Win32::OLE ..................................................................................... CPAN
XML spces for Excel ................. http://msdn.microsoft.com/library/officedev/
Gnumeric ............................................................... http://www.gnumeric.org
HTML::TableExtract ........................................................................ CPAN
Excel SDK newsgroup .............................. news://microsoft.public.excel.sdk
OLE Compound File ..... http://user.cs.tu-berlin.de/~schwartz/pmh/guide.html
Herbert ........................................ http://user.cs.tu-berlin.de/~schwartz/pmh/
Filters .................................................. http://arturo.directmail.org/filtersweb/
xlHtml ....................................................................... http://www.xlhtml.org/

One of Perl's great strengths is the ability to filter data from one format into another. Data goes in one end of a Perl program and miraculously comes out the other end as something more useful. Your Sybase file goes into Perl counselling and after a few short sessions comes out feeling like a brand new Oracle file.

However, not all file formats are readily accessible. Certain proprietary file formats, and in particular binary files, can be difficult to handle. One such format is the Microsoft Excel spreadsheet file.

Excel is the spreadsheet application at the heart of the Microsoft Office suite. It is a popular tool for data analysis and reporting, and even though it is only available on Windows and Macintosh platforms there is often a requirement to produce Excel compatible files on Unix platforms. (Several rumors and some evidence of a Linux port of Microsoft Office have recently come to light on Slashdot.)

This article describes Spreadsheet::WriteExcel, a cross-platform Perl module designed to write data in the Microsoft Excel binary format. It highlights the fact that although Perl is most often associated with text files, it can readily handle binary files as well. This article also looks at alternative methods for producing Excel files and suggests some methods for reading them.

Using Spreadsheet::WriteExcel

A single Excel file is generally referred to as a workbook. A workbook is composed of one or more worksheets, which are pages of data in rows and columns. Each row and column position within a workbook is referred to as a cell.

Spreadsheet::WriteExcel creates a new workbook to which you can add new worksheets. You can then write text and numbers to the cells of these worksheets. The following Perl program is a simple example:

#!/usr/bin/perl -w

use strict;
use Spreadsheet::WriteExcel;

# Create a new Excel workbook called perl.xls

my $workbook = Spreadsheet::WriteExcel->new("perl.xls");

my $worksheet = $workbook->addworksheet();

# Write some text and some numbers
# Row and column are zero indexed
$worksheet->write(0, 0, "The Perl Journal");
$worksheet->write(1, 0, "One"            );
$worksheet->write(2, 0, "Two"            );
$worksheet->write(3, 0,  3               );
$worksheet->write(4, 0,  4.0000001        );

Figure 1: Example file Written with Spreadsheet::WriteExcel

What is happening here is that we are using the Spreadsheet::WriteExcel module to create a variable that acts like an Excel workbook. We add a single worksheet to this workbook and then write some text and numbers. Figure 1 shows how the resulting file looks when opened in Excel.

The Spreadsheet::WriteExcel module provides an object-oriented interface to a new Excel workbook. This workbook is an object (a variable) that acts as a container for worksheet objects (more variables), which themselves provide methods (functions) for writing to their cells.

The primary method of the module is the new() constructor, which takes a filename as its argument and creates a new Excel workbook:

$workbook = Spreadsheet::WriteExcel->new($filename);

The workbook is then used to create new worksheets using the addworksheet() method:

$worksheet = $workbook->addworksheet($sheetname);

If no $sheetname is specified, the general Excel convention for worksheet naming will be followed: Sheet1, Sheet2, and so on. The worksheets are stored in an array called @worksheets which can be accessed through the workbook object.

In a multi-sheet workbook you can select which worksheet is initially visible with the activate() method.

The worksheet objects provide the following methods for writing to cells:

        write($row, $column, $token)
        write_number($row, $column, $number)
        write_string($row, $column, $string)

The write() method is an alias for one of the other two write methods. It calls write_number() if $token looks like a number according to the following regex:

$token =~ /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/

Otherwise it calls write_string(). If you know in advance what type of data needs to be written, you can call the specific method, and otherwise you can just use write().

Here is another example that demonstrates some of these features:

#!/usr/bin/perl -w

use strict;
use Spreadsheet::WriteExcel;

# Create a new Excel workbook
my $workbook = Spreadsheet::WriteExcel->new("regions.xls");

# Add some worksheets
my $north = $workbook->addworksheet("North");
my $south = $workbook->addworksheet("South");
my $east  = $workbook->addworksheet("East");
my $west  = $workbook->addworksheet("West");

# Add a caption to each worksheet
foreach my $worksheet (@{$workbook->{worksheets}}) {
   $worksheet->write(0, 0, "Sales");
}

# Write some data
$north->write(0, 1, 200000);
$south->write(0, 1, 100000);
$east->write (0, 1, 150000);
$west->write (0, 1, 100000);

# Set the active worksheet
$south->activate();

The output from this program is shown in Figure 2.

Figure 2: A multiple worksheet example

You can also create a new Excel file using the special Perl filehandle -, which redirects the output to STDOUT. This is useful for CGI programs generating data with a content-type of application/vnd.ms-excel.

   #!/usr/bin/perl -w

   use strict;
   use Spreadsheet::WriteExcel;

   # Send the content type
   print "Content-type: application/vnd.ms-excel\n\n";

   # Redirect the output to STDOUT
   my $workbook  = Spreadsheet::WriteExcel->new("-");

   my $worksheet = $workbook->addworksheet();
   $worksheet->write(0, 0, "The Perl Journal");

The Spreadsheet::WriteExcel module also provides a close() method which can be used to close the Excel file explicitly. As usual, the file will be closed automatically when the object reference goes out of scope or when the program ends.

Finally, the following is a slightly more useful example - a Perl program that converts a tab-delimited file into an Excel file:

   #!/usr/bin/perl -w

   use strict;
   use Spreadsheet::WriteExcel;

   # Check for valid number of arguments
   if (($#ARGV < 1) || ($#ARGV > 2)) {
       die("Usage: tab2xls tabfile.txt newfile.xls\n");
   };

   # Open the tab-delimited file
   open (TABFILE, $ARGV[0]) or die "$ARGV[0]: $!";

   # Create a new Excel workbook
   my $workbook  = Spreadsheet::WriteExcel->new($ARGV[1]);
   my $worksheet = $workbook->addworksheet();
   # Row and column are zero indexed
   my $row = 0;

   while () {
       chomp;
       # Split on single tab
       my @Fld = split('\t', $_);

       my $col = 0;
       foreach my $token (@Fld) {
           $worksheet->write($row, $col, $token);
           $col++;
       }
       $row++;
   }

How the Spreadsheet::WriteExcel Module Works

We'll now turn to the structure of the module, discussing the Excel format, a history of the module's development, and the glory of Perl's pack builtin.

The Excel Binary Interchange File Format

Excel data is stored in the Binary Interchange File Format, also known as BIFF. Details of this format are given in the Excel SDK, the "Excel Developer's Kit" from Microsoft Press. It is also included in the MSDN CD library, but is no longer available on the MSDN web site. Issues relating to the Excel SDK are discussed, occasionally, the newsgroup microsoft.public.excel.sdk.

The BIFF portion of the Excel file is composed of contiguous binary records that have different functions and hold different types of data. Each BIFF record is composed of the following three parts:

Record name A hexadecimal identifier (2 bytes)
Record length The length of following data (2 bytes)
Record data The data, which can be of variable length

The BIFF data is stored along with other data in an OLE Compound File. This is a structured storage format that acts like a filesystem within a file. A Compound File is composed of storages and streams which, to follow the file system analogy, are like directories and files. This is shown schematically in Figure 3.

Figure 3: The compound File system used to store Excel data.

One effect of the file system structure is that the BIFF data within the Compound Files is often fragmented, and the files occasionally contain lost blocks of data. The location of the data within a Compound File is controlled by a file allocation table (FAT).

The documentation for the OLE::Storage module contains one of the few descriptions of the OLE Compound File in the public domain, at http://user.cs.tu-berlin.de/~schwartz/pmh/guide.html. The source code for the Gnumeric spreadsheet Excel plugin also contains information relevant to the Excel BIFF format and the OLE container at http://www.gnumeric.org/.

A Brief History of Time Wasted

Spreadsheet::WriteExcel started life as a C program written to convert the numerical output of a Fortran Finite Element Analysis program into an Excel spreadsheet. The prototype version produced a tab-delimited file that Excel digested without problem. However, I thought a native binary file would be nicer. Therein lies a true tale of human vanity.

The SDK documentation for Excel 5 lists 127 binary records that can be included in a file, but never says which records are required. By a painful process of trial and error, I removed binary records from a standard Excel file until I reached a minimum configuration that would load without crashing the application. This process has a nice name: reverse engineering. The memory of this drudgery came back to me two years later when I came across the following article in MSDN: "Records needed to make a BIFF5 file Microsoft Excel can use, Q147732", at http://support.microsoft.com/support/kb/articles/Q147/7/32.asp.

Everything went fine until Excel 97.

Excel 4 files are pure BIFF files. Excel 5 files aren't. They're Compound Files with BIFF files embedded inside. However, Excel 5 will also accept a pure BIFF file for backward compatibility. Excel 97 will not. The Fortran to Excel filter that had worked so successfully with Excel 5 caused Excel 97 to choke and die.

The solution was to open a Compound Document stream using a C++ interface and write the BIFF records into it. In C++ on Windows this is relatively easy; for a brief example, see "How to Create a BIFF5 File, Q150447", http://support.microsoft.com/support/kb/articles/Q150/4/47.ASP.

In Perl, or in C for that matter, this approach is more difficult since the OLE interface is closely tied to C++. The only cross-platform resource available for use with Perl is the OLE::Storage module, which is an interface to OLE documents. However, it doesn't provide any facility for writing into a document stream.

The first version of the Spreadsheet::WriteExcel exploited a backward compatibility feature in Excel to avoid using the OLE container. However, this limited the file to a single worksheet, and features like formatting weren't possible.

So it was back to the hex editor, although this time I was also armed with the OLE::Storage documentation and the Gnumeric source code. The current version of Spreadsheet::WriteExcel supports the OLE container, paving the way for the addition of other Excel features. At the moment the main thrust of the work is toward adding formatting for cells, rows, and columns.

The addition of the OLE container meant that the files produced by this module are also compatible with the Linux/Unix spreadsheet applications Star Office, Gnumeric, and XESS.

The pack Programming Language

Perl contains several mini-languages each with its own syntax: format, pod, regexen, sprintf, and pack. The pack function is described in perlfunc as follows: "pack(template, list) takes an array or list of values and packs it into a binary structure, returning the string containing the structure." This function is ideal for writing the BIFF records contained in an Excel file. For example, consider how pack is used to write the BOF binary record in the following subroutine from Spreadsheet::WriteExcel:

sub _store_bof {

    my $self    = shift;
    my $name    = 0x0809;       # Record identifier
    my $length  = 0x0008;       # Number of bytes to follow

    my $version = $BIFF_version; # 0x0500 for Excel 5
    my $type    = $_[0];        # 0x05 = workbook, 0x10 = worksheet

    my $build   = 0x096C;
    my $year    = 0x07C9;

    my $header  = pack("vv",   $name, $length);
    my $data    = pack("vvvv", $version, $type, $build, $year);

    $self->_prepend($header, $data);
}

The string written to the Excel file looks like this in hexadecimal:

        09 08 08 00 00 00 10 00 00 00 00 00

The v template produces a two-byte integer in little-endian order regardless of the native byte order of the underlying hardware. Since the majority of the BIFF and OLE data in an Excel file is composed of little-endian integers, it's possible to write a cross-platform binary file with very little effort. The complementary function for reading fixed format structures is unpack. Perl is most often associated with text processing, but has features that handle binary data in a relatively straightforward manner.

One problem I encountered was with the binary representation of a floating-point number, since Excel requires a 64-bit IEEE float. pack provides the d template for a double precision float, but its format depends on the native hardware. If Spreadsheet::WriteExcel cannot generate the required number format, it will croak() with an error message. During installation, make test will also catch this. Nobody has reported a problem yet, probably because the owners of PDPs or Crays are involved in real computing and aren't interested in such fripperies as Microsoft Excel.

There is one feature of writing binary files that traps everyone at least once. Consider the following example, which writes the Excel end-of-file record identifier, 0x000A. What file size is printed out?

   #!/usr/bin/perl -w

   use strict;

   open (TMP, "+> testfile.tmp") or die "testfile.tmp: $!";
   print TMP pack("v", 0x000A);
   seek (TMP,0,1);
   my $filesize = -s TMP;

   print  "Filesize is $filesize bytes.\n";

The answer depends on your operating system. On Unix the answer is 2, and on Windows the answer is 3. This is because 0x0A is the newline character, \n, which your Windows's I/O libraries will translate to 0x0D 0x0A or \r\n. This is a "feature" of Windows, not Perl. To write a binary file with exactly the data you want and nothing else, you need to use the binmode() function on the filehandle.

The Structure of the Module

Spreadsheet::WriteExcel was designed with a object-oriented interface so that it most closely represents Excel's own interface. The fact that Excel relies heavily on an object-oriented model can be seen from the Excel object hierarchy, and from its interaction with Visual Basic for Applications (VBA). The Excel object hierarchy is shown in the help file that comes with Excel VBA under the entry "Microsoft Excel Objects". The main strand of the hierarchy of interest is:

        Application->Workbook->Worksheet

For us, "Application" means Excel. In other contexts it might mean Word or PowerPoint.

Spreadsheet::WriteExcel mimics this hierarchy with five classes, each split into its own packages. For ease of development, each package is contained in its own module.

        WriteExcel  - The main module
        Workbook    - A container for worksheets
        Worksheet   - Provides the write methods
        BIFFwriter  - Writes data in BIFF format
        OLEwriter   - Write data into an OLE storage

From the user's point of view, these are seen as follows:

        WriteExcel->Workbook->Worksheet

The interaction of these packages is shown as low-tech UML in Figure 4. Only the documented public methods are included.

Figure 4: The structure of the Spreadsheet::WriteExcel module.

The relationships can be described as follows: WriteExcel is a Workbook. Workbook is a container for Worksheets, and it uses the OLEwriter class. Workbook and Worksheet are both derived from the abstract base class BIFFwriter.

Alternative Ways of Writing to Excel

Depending on your requirements, background, and general sensibilities, you may prefer one of the following methods for storing data in Excel.

• CSV (comma separated variables) or text. If the file extension is csv, Excel will open and convert this format automatically.
• HTML tables. This is an easy way to add formatting.
• DBI or ODBC. Connect to an Excel file as a database.
• Win32::OLE module and office automation. This is discussed in more detail in a later section.
• XML and HTML. There are XML and HTML specifications available for Excel Workbooks. The HTML specification goes beyond single tables and allows you access to all of Excel's features. However, there are no modules currently available to write Excel files in these formats. Interested parties should look at http://msdn.microsoft.com/library/officedev/ofxml2k/ofxml2k.htm.

Other sources of information: the Gnumeric source code (http://www.gnumeric.org/) and the soon to be GPL'ed Star Office.

Reading from Excel

Despite the title of the Spreadsheet::WriteExcel module, the most commonly asked questions that I receive are about reading Excel files. The following are some suggestions:

• HTML tables. If the files are saved from Excel as a HTML table the data can be accessed using HTML::TableExtract, http://search.cpan.org/search?dist=HTML-TableExtract.
• DBI or ODBC.
• OLE::Storage, formerly known as LAOLA. This is a Perl interface to OLE file formats. In particular, the distribution contains an Excel to HTML converter called Herbert, http://user.cs.tu-berlin.de/~schwartz/pmh/. There is also an open source C/C++ project based on the LAOLA work. Try the Filters Project at http://arturo.directmail.org/filtersweb/ and the xlHtml Project at http://www.xlhtml.org/. The xlHtml filter is more complete than Herbert.
• Win32::OLE module and office automation.

Win32::OLE

As is often said, only perl can parse Perl. Similarly, only Excel can grok and spew Excel. Tackling the binary file head on is fine up to a certain point. After that it's best to leave the dirty work to Excel.

By far the most powerful method of accessing an Excel file for either reading or writing is through OLE and OLE Automation. Automation is the process by which OLE objects, such as Excel, act as servers and allow other applications to control their functionality. When applied to the Microsoft Office suite of applications, this process is known as Office Automation.

The following is a textual description of how you might use Automation with Excel:

• Request Excel to start
• Request Excel to write some cells
• Request Excel to save the file
• Request Excel to close

To do this in Perl requires a Windows platform, the Win32::OLE module, and an installed copy of Excel. Here is an example:

   #!/usr/bin/perl -w

   use strict;
   use Cwd;
   use Win32::OLE;

   my $application = Win32::OLE->new("Excel.Application");
   my $workbook    = $application->Workbooks->Add;
   my $worksheet   = $workbook->Worksheets(1);

   $worksheet->Cells(1,1)->{Value} = "The Perl Journal";
   $worksheet->Cells(2,1)->{Value} = "One";
   $worksheet->Cells(3,1)->{Value} = "Two";
   $worksheet->Cells(4,1)->{Value} =  3;
   $worksheet->Cells(5,1)->{Value} =  4.0000001;

   # Add some formatting
   $worksheet->Cells(1,1)->Font->{Bold}       = "True";
   $worksheet->Cells(1,1)->Font->{Size}       = 16;
   $worksheet->Cells(1,1)->Font->{ColorIndex} = 3;
   $worksheet->Columns("A:A")->{ColumnWidth}  = 25;

   # Get current directory using Cwd.pm
   my $dir = cwd();

   $workbook->SaveAs($dir . '/perl_ole.xls');
   $workbook->Close;

Figure 5: An example file written with Win32::OLE and Excel.

The result is shown in Figure 5. Without the formatting code, this program produces an Excel file which is almost identical to the one shown in Figure 1.

There are some issues that we've skirted here, particularly in relation to starting and stopping an OLE server. A more detailed introduction to the Win32::OLE module is given by Jan Dubois in TPJ #10 at http://www.itknowledge.com../vol3_2/tpj0302-0008.html.

For additional examples see http://www.activestate.com/Products/ActivePerl/docs/faq/Windows/ActivePerl-Winfaq12.html and http://www.activestate.com/Products/ActivePerl/docs/site/lib/Win32/OLE.html.

As a brief diversion, the following program uses Win32::OLE to expose the flight simulator Easter Egg in Excel 97 SR2.

   #!/usr/bin/perl -w

   use strict;
   use Win32::OLE;

   my $application = Win32::OLE->new("Excel.Application");
   my $workbook    = $application->Workbooks->Add;
   my $worksheet   = $workbook->Worksheets(1);

   $application->{Visible} = 1;

   $worksheet->Range("L97:X97")->Select;
   $worksheet->Range("M97")->Activate;

   my $message =   "Hold down Shift and Ctrl and click the ".
                   "Chart Wizard icon on the toolbar.\n\n".
                   "Use the mouse motion and buttons to control ".
                   "movement. Try to find the monolith. ".
                   "Close this dialog first.";

   $application->InputBox($message);

Obtaining Spreadsheet::WriteExcel

The latest version of the module will always be available at CPAN, at http://search.cpan.org/search?dist=Spreadsheet-WriteExcel.

ActivePerl users can download and install the module using PPM as follows:

  C:\> ppm
  PPM> set repository tmp http://homepage.eircom.net/~jmcnamara/perl
  PPM> install Spreadsheet-WriteExcel
  PPM> quit
  C:\>
_ _END_ _


John McNamara (jmcnamara@cpan.org) works as a software developer for Tecnomen Ltd. Apart from the usual things that engage us all, his main interest in life is the Saab 900 series. He lives in Limerick, Ireland.
PREVIOUS  TABLE OF CONTENTS  NEXT