Tuesday, 26 April 2011

Accepted to GSoC

Yesterday the projects accepted to GSoC were announced. Among them are several interesting projects under the PSF umbrella, including mine.

During GSoC I will create a benchmark suite (based on existing ones) with "real world" benchmarks that can easily be used with every Python interpreter. Up until now each interpreter has more or less rolled its own suite of benchmarks of varying quality. This makes comparisons unnecessarily hard and ties up resources better used elsewhere.

Furthermore I will create an application which is able to download and build interpreters and execute the benchmarks with them using a simple configuration. Up until now no such application exists; http://speed.pypy.org, for example, compares released and current (from trunk) PyPy versions with released CPython versions. As nice as that is, being able to compare the most current versions of various implementations is clearly preferable.

Once that work is completed I will port the benchmark suite to Python 3.x. As several benchmarks have dependencies that do not support 3.x yet, I will not be able to port the entire suite; however, it will at least be a start when it comes to benchmarks for 3.x.

I'm currently compiling a list with information on the available benchmarks (what each one tests and how) so that people unfamiliar with them can get an easy overview. Once that is finished I will send e-mails to the CPython, PyPy, IronPython, Jython and Cython mailing lists with the benchmarks I propose, asking for other benchmarks or changes to my proposed list.

Further information on my project will be published here and on Twitter as soon as possible.

Monday, 4 April 2011

Writing CLI Applications in Python: A Rant

A couple of weeks ago I searched for a way to better organize my music. I have several GBs of music, all more or less properly tagged and organized; however, I wanted to be able to reorganize it, change metadata, search for and add covers, and add new music easily. The existing applications are either horribly confusing or simply don't provide the features I want, so like every programmer I decided I could do better and started a project.

As I usually write web applications and libraries, or do small researchy projects to learn stuff, I researched the tools I would need: something for configuration, and something to deal with input from and output to the CLI.

The first thing I noticed is that there is absolutely no solution to handle configuration. I want something that handles multiple hierarchical configuration files, preserves comments in them even when the configuration changes and at best supports more formats than just INI, choosing the proper parser based on the file extension.
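To make the wish list a bit more concrete, here is a minimal sketch of the "multiple hierarchical files, parser chosen by file extension" part, using only `configparser` and `json` from the standard library. The names `load_ini`, `load_json`, `LOADERS` and `load_config` are made up for illustration, and the sketch does not even attempt the hard part, preserving comments when the configuration is written back.

```python
import configparser
import json
import os


def load_ini(path):
    # Read an INI file into a plain {section: {key: value}} dict.
    parser = configparser.ConfigParser()
    with open(path) as f:
        parser.read_file(f)
    return {section: dict(parser[section]) for section in parser.sections()}


def load_json(path):
    # Assumes the JSON file has the same {section: {key: value}} shape.
    with open(path) as f:
        return json.load(f)


# Hypothetical registry mapping file extensions to loader functions.
LOADERS = {".ini": load_ini, ".json": load_json}


def load_config(paths):
    # Later files override earlier ones, key by key within each section.
    merged = {}
    for path in paths:
        extension = os.path.splitext(path)[1].lower()
        loader = LOADERS[extension]
        for section, values in loader(path).items():
            merged.setdefault(section, {}).update(values)
    return merged
```

Calling `load_config` with a system-wide file first and a per-user file second gives the per-user values precedence, which is roughly the hierarchy I have in mind.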

A search for that on PyPI turns up several packages. Several of them don't have a description; those that do have one don't necessarily have documentation; and those that have documentation tend to have too little of it and provide no way to contribute to the project or to report bugs. For all intents and purposes those projects don't exist.

Trying to figure out how to handle configuration, I took a look at the Mercurial source code (another project which needs a more obvious link to a source code browser). I learned that I never want to do that again, at least when it comes to that part of the source. Oh, I nearly forgot: apparently configuration is best handled on your own, which is what Mercurial does.

The next thing I considered was handling CLI arguments, for which there are two widely used solutions: optparse and argparse. optparse is the older one and is probably used by more people than argparse, so I decided to look at that first. It has no way to handle commands or arguments, so I deemed it unusable for my purposes.

At first glance argparse seems to be almost identical to optparse. That is because the developers wanted to preserve "backwards-compatibility"; at some point they recognized that this doesn't work and changed the API, making it a merge of optparse and "something else". argparse handles arguments and provides commands; however, the latter are rather awkward to use.

You can't just create a command and add it to the parser. No, you have to call `.add_subparsers` on the parser, which does not add multiple subparsers as one might think; it returns a special object with a single `.add_parser` method, which adds a subparser to the parser `.add_subparsers` was called on. I have no idea why you have to do that, and as I value my sanity I probably really don't want to know, but something tells me that nobody sane involved ever gave the API design any consideration.

As it is really just designed as a parser, `argparse.ArgumentParser.parse_args` always returns a flat data structure which does not provide information about the command invoked, which would certainly be helpful for calling the appropriate function implementing that command. The documented solution for this problem is to add a default function `func` to every subparser (and yes, you can specify "defaults" independent of options or arguments on a subparser, which are actually not defaults at all because they are never changed) and to call the function, which ends up as `func` in the result, with the result.
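For reference, the dance described above looks roughly like this. It is only a sketch: the `music` program, its `tag` and `search` commands and the `cmd_*` functions are hypothetical, while `add_subparsers`, `add_parser` and `set_defaults` are the actual argparse calls.

```python
import argparse


def cmd_tag(args):
    # Hypothetical command: pretend to set a tag on a file.
    return "tag {} artist={}".format(args.path, args.artist)


def cmd_search(args):
    # Hypothetical command: pretend to search the library.
    return "search {}".format(args.query)


def build_parser():
    parser = argparse.ArgumentParser(prog="music")
    # add_subparsers() returns a special action object, not a parser.
    subparsers = parser.add_subparsers()

    tag = subparsers.add_parser("tag")
    tag.add_argument("path")
    tag.add_argument("--artist", default="unknown")
    # The "default" that is never overridden: it records which command ran.
    tag.set_defaults(func=cmd_tag)

    search = subparsers.add_parser("search")
    search.add_argument("query")
    search.set_defaults(func=cmd_search)
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    # The flat result carries the function of the chosen subparser.
    return args.func(args)
```

Note that the sketch assumes a command is always given; without one, `args` has no `func` attribute and the last line fails.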

I realize things are almost always more difficult than they appear to be and there are probably good reasons for the decisions which have been made but surely there have to be better solutions to this problem.

User input is a somewhat ugly thing; all that parsing and validating, dealing with those idiots calling themselves users, is not really pleasant, so I was hopeful that at least output on a terminal could be considered a solved problem. It is not.

If you want to write a paragraph of text, wrapped to the width of the terminal, to stdout, you have to get the width of the terminal in platform-dependent ways, via ioctl and fcntl on Linux (I guess you have to wrap the Windows API with ctypes); luckily at least textwrap is already in the stdlib.
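A rough sketch of the Unix side of this, assuming stdout is the terminal you care about. The helper names `terminal_width` and `wrap_paragraph` and the fallback of 80 columns are my own choices, and Windows is not handled beyond falling back to that default.

```python
import struct
import sys
import textwrap


def terminal_width(default=80):
    # Ask the terminal driver for its window size (Unix only).
    try:
        import fcntl
        import termios
        packed = fcntl.ioctl(sys.stdout.fileno(), termios.TIOCGWINSZ, b"\0" * 8)
        # struct winsize: rows, columns, x pixels, y pixels.
        rows, columns = struct.unpack("HHHH", packed)[:2]
    except (ImportError, OSError, ValueError, AttributeError):
        # No fcntl/termios (e.g. Windows) or stdout is not a terminal.
        return default
    return columns or default


def wrap_paragraph(text, width=None):
    # Wrap to the given width, or to the detected terminal width.
    if width is None:
        width = terminal_width()
    return textwrap.fill(text, width=width)
```

When output is piped into a file the `ioctl` call fails, which is why the function quietly falls back to the default.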

You have to implement progress bars and coloring yourself unless you want to have dependencies for all of these things.
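Rolling your own is not much code, which makes the absence of a standard solution all the more annoying. A minimal sketch follows; the function names are made up, and the colors are plain ANSI escape sequences, which assumes a terminal that understands them.

```python
# A few ANSI color codes; assumes an ANSI-capable terminal.
COLORS = {"red": 31, "green": 32, "yellow": 33, "blue": 34}


def render_bar(done, total, width=20):
    # Build a textual progress bar such as "[#####-----]  50%".
    fraction = min(max(done / total, 0.0), 1.0) if total else 1.0
    filled = int(width * fraction)
    bar = "#" * filled + "-" * (width - filled)
    return "[{}] {:3d}%".format(bar, int(100 * fraction))


def colorize(text, color):
    # Wrap text in the escape sequences for the given color, then reset.
    return "\033[{}m{}\033[0m".format(COLORS[color], text)
```

To animate the bar you would write `"\r" + render_bar(done, total)` to stdout and flush after each step.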

Also don't forget that a simple print statement may cause problems as soon as stdout does not use an ASCII encoding. Even if you encode to the proper stdout encoding where possible, user input might not be encodable (umlauts to ASCII), so you may have to transliterate unless you are willing to just replace or ignore these errors. But I'm sure every one of you does this carefully everywhere.
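The two options mentioned above can be sketched like this. Both helpers are hypothetical names of mine, and the "transliteration" is a crude one that merely strips accents via Unicode decomposition, so "ü" becomes "u" rather than the "ue" a German reader would expect.

```python
import sys
import unicodedata


def encode_for_output(text, encoding=None):
    # Use the stdout encoding if known; fall back to ASCII otherwise.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Unencodable characters become "?" instead of raising an error.
    return text.encode(encoding, errors="replace")


def transliterate(text):
    # Decompose characters (ü -> u + combining diaeresis), then drop
    # everything that still is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Characters without a decomposition, such as "ß", simply vanish with this approach, which is exactly the kind of detail that is easy to get wrong.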

The fact that there are no solutions to these problems is a really big WTF and makes writing CLI applications a pain in the ass which really shouldn't be the case.