Data Munging with Perl: Review by Thing with a hook

No-nonsense resource for meat and potatoes Perl scripting

The quintessential Perl activity is data processing, particularly in a Unix environment, where output is piped into a script from some other program, transformed, and spat out again. Many people's first encounter with Perl will probably be in this task. David Cross's book shows how to do this with the minimum of fuss and the maximum of flexibility. It's not a Perl tutorial, however, so you will need some basic knowledge of Perl; having read the Llama book (Learning Perl) is enough. There is an appendix of 'essential Perl' to refresh your memory if you're a bit rusty.

The book begins by revising some of those basic Perl practices that come in handy for scripting, e.g. command line options, regular expressions and sorting. The second part of the book deals with parsing fairly simple data: traditional fixed-width record data (e.g. the column-based stuff that you often find as the output of old Fortran and C programs), unstructured data (e.g. doing word counts on text files), and formats such as CSV, PNG and MP3. This is the strongest section of the book, and contains lots of useful hands-on information.

The third part of the book deals with more modern forms of data files, in the shape of XML. Parsing HTML also gets a chapter to itself, after the author usefully demonstrates the limitations of any simple solution (e.g. using regexes), which provides pretty strong evidence in favour of the standard 'don't try it yourself, use a CPAN module' argument. The XML chapter itself covers the XML::Parser module in reasonable detail. However, there are now many more XML parsers available for Perl, and XML::Parser is probably no longer the best solution (Grant McLean's Perl XML FAQ on the net has a good overview of the options). Excluding the seemingly obligatory 'here's a bunch of books and websites to learn more' chapter, the last proper chapter covers parsing with the Parse::RecDescent module, and it's a very good, gentle introduction.

If you're not working in a command line environment, there's not a whole lot here you're going to need. Equally, if you've been doing this sort of thing for a while, there's not much here that will be new to you, and not all the subjects are explored in great depth. Some of it (particularly the XML chapter) is a bit outdated and superficial, so I would knock a star off my rating if you're more interested in the XML/HTML chapters.

But for the simpler tasks, e.g. parsing column-based data, this is recommended. You're shown all the handy tricks you need, such as piping, taking input from standard input as well as files, slurping paragraphs, etc. My 4-star rating applies if this sounds like what you need: it's a clear, short and to-the-point book, which is definitely worth taking with you on your first journey into data munging.