Sunday, April 3, 2011

Book review: "Python 2.6 Text Processing: Beginner’s Guide" by Jeff McNeil

Jeff McNeil’s “Python 2.6 Text Processing: Beginner’s Guide” is a practical introduction to a wide range of methods for reading, processing and writing textual data in a variety of structured and unstructured formats. Aimed primarily at novice Python programmers who know the language basics but have no prior experience in text processing, the book offers hands-on examples for each of the techniques it discusses – ranging from Python’s built-in libraries for handling strings, regular expressions, and formats such as JSON, XML and HTML, through to more advanced topics such as parsing custom grammars and efficiently searching large text archives. In addition, it contains a great deal of general supporting material on working with Python, including installing packages and third-party libraries, and working with Python 3.

The first three chapters lay the foundations, covering a number of Python basics including a crash course in file and URL I/O, and the essentials of Python’s built-in string handling functions. Useful background topics – such as installing packages with easy_install, and using virtualenv – are also introduced here. (A sample of the first chapter can be freely downloaded from the book’s website at https://www.packtpub.com/python-2-6-text-processing-beginners-guide/book.) The next three cover: using the standard library to work with simple structured data formats (delimited “CSV” data, “ini”-style configuration files, and JSON-formatted data); working with Python regular expressions (a standout chapter for me); and handling structured markup (specifically, XML and HTML). Subsequent chapters on using the Mako templating package (the default system for the Pylons web framework) to generate emails and web pages, and on writing more advanced data formats (PDF, Excel and OpenDocument), are separated by an excellent overview of working with Unicode, encodings and application internationalization (“i18n”).

The remaining two chapters cover more advanced topics, with some good background theory supplementing the practical examples: using the PyParsing package to create parsers for custom grammars (with a brief nod to the basics of natural language processing using the Natural Language Toolkit, NLTK); and the Nucular package for indexing large quantities of textual data (not necessarily just plain text) to enable highly efficient searching. Finally, an appendix offers a grab-bag of general Python resources, references to some more advanced text processing tools (such as Apache’s Lucene/Solr), and an excellent overview of the differences between Python 2 and 3 (including a hands-on example of migrating code from 2 to 3).

The book covers a lot of ground and moves fairly quickly; however, it adopts a largely successful hands-on approach, engaging the reader with working examples at each stage to illustrate the key points, and this certainly helped me keep up. I was also impressed by the clear and concise code in the examples, and the very natural way that general Python concepts and principles – generators, duck typing, packaging and so on – were introduced as asides. (One very minor criticism is that the layout of the example code could have been improved, as the indentation levels weren’t always immediately obvious to me.) Aside from a surprisingly unsatisfying chapter on structured markup (reluctantly, I would recommend looking elsewhere for an introduction to XML processing with Python) and a few niggling typos, there’s a lot of excellent material in this book, and the author has a knack for presenting some tricky concepts in a deceptively easy-to-understand manner. I think that the chapter on regular expressions is possibly one of the best introductions to the subject that I’ve ever seen; the chapters on encodings and internationalization, advanced parsing, and indexing and searching were also highlights for me (as was the section on Python 3 in the appendix).

Overall, I really enjoyed working through the book and felt I learned a lot. I think it’s fair to say that, given the rather ambitious range of techniques presented, the chapters are in many cases (particularly for the more advanced or specialised topics) inevitably more introductory than definitive in nature: the reader is given enough information to grasp the background concepts and get started, with pointers to external resources to learn more. In conclusion, I think this is a great introduction to a wide range of text processing techniques in Python, both for novice Pythonistas (who will undoubtedly also benefit from the more general Python tips and tricks presented in the book) and for more experienced programmers who are looking for a place to start learning about text processing.

Disclosure: a free e-copy of this book was received from the publisher for review purposes; this review has also been submitted to Amazon.
