Thursday, 19 August 2010

A summary of the Google Summer of Code

The Google Summer of Code is nearly over and it's time to present some results. First of all, every project we worked on can be considered successful. The team was great, and I hope to be able to work with the others in the future to achieve new great things. But let's get to the point: as some of you already know, Sphinx now has Python 3.x support in trunk, which means that with the next non-bugfix release you will be able to use Sphinx with Python 3.x.

In separate branches, which will hopefully be merged into trunk soon, we have i18n support. It allows you to build gettext message catalogs that contain ids as comments, which you can use to identify messages that have changed in the documentation. Another great achievement is websupport, which allows you to create web applications using Sphinx with server-side search, comments on paragraphs and code blocks, and user proposals for changes to the documentation.

My contribution to i18n and websupport has been an AST-based merging algorithm which allows you to easily track changes across multiple builds of the documentation. This makes it possible to identify what changed in the documentation, so that we don't have to delete all the comments with every rebuild.

Saturday, 14 August 2010

Making Sphinx faster

I've recently spent a lot of time thinking about programming languages. It's something I'm very interested in, and creating my own is an item on my todo list. One topic that comes up when you think about that is parallelization, so to get my mind off sphinx-web-support for a while I looked at Sphinx to see how easily it could be parallelized. Before you get too excited: these are just a couple of thoughts on my part, and there could be something I'm still missing. Now read on.

Sphinx' Design

In order to implement this feature we have to look at the design of Sphinx. The design is more or less simple. We have an Application which sets everything up and can be used to run the build process. The build process is handled by the Environment: it parses every document in the source directory, creates a doctree (AST), transforms it as necessary and uses the information from the doctree to populate the index. After that the doctree is stored in build/doctrees/document.doctree. Once every document has been processed, the environment invokes a Builder. The builder loads each doctree, modifies it if necessary and passes it on to the Writer, which generates the output code for each doctree. That output is stored in the build directory under the name of the builder.

The Problem

Currently the environment does a lot more than it should, in my opinion. The index is kept globally in the environment, as is everything we need to know about the current document. This makes it impossible to simply parallelize parsing and building, because there is too much shared state.

The Solution

The obvious solution, and the better design, is to keep the data associated with a specific document in an object I call DocumentContext. This context stores the information needed to process a document as well as the information we get from the document that is relevant for the Environment. After parsing, transforming and processing each document, we take the context and put the relevant information into the environment.

This way the Environment is immutable from the parser's perspective, and we can easily use parallelization to make the entire build process a lot faster than it currently is.
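To make the idea a bit more concrete, here is a minimal sketch of how it could look. Everything in it is an assumption for illustration: `DocumentContext`, `process_document`, `Environment.merge` and `build` are made-up names and not existing Sphinx APIs, and the "parsing" is a trivial stand-in for the real reST parser. The point is only that workers fill their own context and the environment is modified in exactly one place.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class DocumentContext:
    """Hypothetical per-document state; not an existing Sphinx class."""
    docname: str
    titles: list = field(default_factory=list)
    index_entries: list = field(default_factory=list)


def process_document(docname, source):
    """Process one document without touching any shared state."""
    ctx = DocumentContext(docname)
    for line in source.splitlines():
        if line.startswith("# "):       # stand-in for real title detection
            ctx.titles.append(line[2:])
        for word in line.split():
            if word.startswith("!"):    # stand-in for an index directive
                ctx.index_entries.append(word[1:])
    return ctx


class Environment:
    """Toy environment; merge() is the only method that mutates it."""

    def __init__(self):
        self.titles = {}
        self.index = {}

    def merge(self, ctx):
        self.titles[ctx.docname] = ctx.titles
        for entry in ctx.index_entries:
            self.index.setdefault(entry, []).append(ctx.docname)


def build(sources):
    """sources maps docname -> source text; returns the populated environment."""
    env = Environment()
    with ThreadPoolExecutor() as pool:
        # Workers only produce contexts; results come back in input order.
        contexts = pool.map(process_document, sources.keys(), sources.values())
        for ctx in contexts:
            env.merge(ctx)              # merging happens in the calling thread
    return env
```

A thread pool keeps the sketch short. For CPU-bound parsing in CPython a process pool would probably be the better fit, and since the contexts are plain data objects that switch should not change the structure.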

Another Problem and another Solution: Backwards compatibility

Changing Sphinx in the way I propose will probably break some extensions, and it will definitely break the existing domains. Personally I don't really care about this issue, because I think software has to evolve and constantly change in order to survive in the long run.

However, I know that a lot of people do care, so I propose something many people know from web applications: context locals. Basically, they are proxies which point to the objects in the current context, which is either a process, a thread or an even simpler concept based on coroutines. Using those, the current API could be kept at least partially, and we could deprecate it first before removing it at some point in the future.
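A context local can be sketched in a few lines with `threading.local` from the standard library. This is only an illustration of the concept under my own assumptions: `ContextProxy`, `current_document`, `bind_document` and the tiny `DocumentContext` class are hypothetical names, and a real implementation would have to forward a lot more than plain attribute access.

```python
import threading

_local = threading.local()


class ContextProxy:
    """Forwards attribute access to the object bound in the current thread."""

    def __init__(self, name):
        # Bypass our own __setattr__, which forwards to the target object.
        object.__setattr__(self, "_name", name)

    def _target(self):
        name = object.__getattribute__(self, "_name")
        try:
            return getattr(_local, name)
        except AttributeError:
            raise RuntimeError("no context is bound in this thread") from None

    def __getattr__(self, attr):
        # Only called for attributes the proxy itself does not have.
        return getattr(self._target(), attr)

    def __setattr__(self, attr, value):
        setattr(self._target(), attr, value)


class DocumentContext:
    """Hypothetical per-document state."""

    def __init__(self, docname):
        self.docname = docname


# Looks like a global to old code, but resolves per thread.
current_document = ContextProxy("document")


def bind_document(ctx):
    _local.document = ctx


def worker(docname, results):
    bind_document(DocumentContext(docname))
    # Code written against the old "global" API keeps working here.
    results[docname] = current_document.docname
```

Each thread that calls `bind_document` sees its own document through `current_document`, so existing code that expects one global "current document" would not have to change right away.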

Tuesday, 10 August 2010

Hey, what are you doing?

Those of you who follow the discussions on the IRC channel and Twitter already know: we now have Python 3.x support in trunk, so sphinx-py3k can be considered a success. But what else is going on?

One of the problems with both i18n and websupport is that we need a way to identify parts of the documentation across multiple builds. A simple example is a document with multiple paragraphs for which we want to store comments: we need to keep track of each paragraph even if the document or the paragraph itself changes. If we don't, we have to throw away all the comments on every build, as we don't know where to put them or whether they still apply in case a paragraph has been removed entirely.

Identifying a changed paragraph in particular is a bit complicated and required a bit of research on my side. However, I have a solution which should mostly work. It doesn't pass all the tests I came up with yet, but I hope to finish my work soon so I can talk to birkenfeld about merging my branch into trunk, so that it can be used in web-support and i18n.
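To illustrate the problem itself (this is not the code from my branch), here is a rough sketch that carries ids over from one build to the next by text similarity, using `difflib` from the standard library. The function name `carry_over_ids`, the 0.6 threshold and the flat list of paragraphs are all assumptions made for the example; the real thing works on doctrees.

```python
from difflib import SequenceMatcher
from uuid import uuid4


def carry_over_ids(old, new, threshold=0.6):
    """Assign ids to the paragraphs of a new build.

    old: dict mapping id -> paragraph text from the previous build
    new: list of paragraph texts from the current build
    Returns a list of (id, text) pairs, one per new paragraph.
    """
    unused = dict(old)
    result = []
    for text in new:
        best_id, best_ratio = None, threshold
        for uid, old_text in unused.items():
            ratio = SequenceMatcher(None, old_text, text).ratio()
            if ratio >= best_ratio:
                best_id, best_ratio = uid, ratio
        if best_id is None:
            best_id = uuid4().hex       # no good match: treat as a new paragraph
        else:
            del unused[best_id]         # each old id is reused at most once
        result.append((best_id, text))
    return result
```

Ids that remain unmatched afterwards belong to paragraphs that were removed or changed beyond recognition, which is exactly the case where the attached comments no longer apply.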

You can take a look at the code in the bitbucket repo. If you want to stay up to date on the most recent developments, I suggest visiting #pocoo on freenode and/or following me on Twitter. These are also the right places to ask me questions about the project or simply about its current status.