Saturday, August 14, 2010

Making Sphinx faster

I've recently spent a lot of time thinking about programming languages. It's something I'm very interested in, and creating my own is an item on my todo list. One topic that comes up when you think about languages is parallelization, so to get my mind off sphinx-web-support for a while I looked at Sphinx to see how easily it could be parallelized. Before you get too excited: these are just a couple of thoughts on my part, and there could be something I'm still missing. Now read on.

Sphinx's Design

In order to implement this feature we have to look at the design of Sphinx, which is more or less simple. We have an Application which sets everything up and can be used to run the build process. The build process is handled by the Environment: it parses every document in the source directory, creates a doctree (an AST), transforms it as necessary and uses the information from the doctree to populate the index. After that the doctree is stored in build/doctrees/document.doctree. Once every document has been processed the environment invokes a Builder. The builder loads each doctree, modifies it if necessary and passes it on to the Writer, which creates the output code for each doctree. We store that output in the build directory under the name of the builder.

The Problem

In my opinion the environment currently does a lot more than it should: the index is kept globally in the environment, as is everything we need to know about the current document. This makes it impossible to simply parallelize parsing and building, because there is too much shared state.

The Solution

The obvious solution, and the better design, is to keep the data associated with a specific document in an object I call DocumentContext. This context stores the information needed to process a document as well as the information we extract from the document that is relevant for the Environment. After parsing, transforming and processing each document we take its context and put the relevant information into the environment.

This way the Environment is immutable from a parser perspective and we can easily use parallelization to make the entire build process a lot faster than it is currently.
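To make the idea a bit more concrete, here is a minimal sketch in Python. None of this is existing Sphinx code: DocumentContext is only the name proposed above, and Environment, process_document and build are toy stand-ins I made up for illustration. "Parsing" is faked by treating words that start with "!" as index terms. A thread pool keeps the sketch short; real parsing is CPU-bound, so a process pool would probably be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class DocumentContext:
    """Hypothetical per-document state; not an existing Sphinx class."""
    docname: str
    index_entries: list = field(default_factory=list)


@dataclass
class Environment:
    """Toy stand-in for the shared environment, reduced to a global index."""
    index: dict = field(default_factory=dict)

    def merge(self, context):
        # The only place where shared state is mutated; it runs serially.
        for term in context.index_entries:
            self.index.setdefault(term, []).append(context.docname)


def process_document(docname, source):
    """Process one document, touching nothing but its own context."""
    context = DocumentContext(docname)
    # Stand-in for real parsing: words starting with '!' are index terms.
    for word in source.split():
        if word.startswith("!"):
            context.index_entries.append(word[1:])
    return context


def build(sources):
    """Process all documents in parallel, then merge the contexts in order."""
    env = Environment()
    names = sorted(sources)
    with ThreadPoolExecutor() as pool:
        contexts = pool.map(process_document, names, (sources[n] for n in names))
        for context in contexts:
            env.merge(context)
    return env
```

Calling build({"intro": "about !sphinx", "api": "the !sphinx api"}) returns an environment whose index maps "sphinx" to both documents. The workers never see the environment at all, which is the whole point of the design.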

Another Problem and another Solution: Backwards compatibility

Changing Sphinx in the way I propose will probably break some extensions, and it will definitely break the existing domains. Personally I don't really care about this issue, because I think software has to evolve and constantly change in order to make it in the long run.

However, I know that a lot of people do care, so I propose something many people know from web applications: context locals. Basically they are proxies which point to the objects in the current context, which is either a process, a thread or an even simpler concept based on coroutines. Using those, the current API could be kept at least partially, and we could deprecate it first before removing it at some point in the future.
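A rough sketch of such a proxy, using only threading.local from the standard library, could look like this. ContextProxy, Document, current_document, describe and worker are hypothetical names for illustration, not part of Sphinx or of any web framework, and the context here is simply the current thread.

```python
import threading

# One slot per thread; each worker stores its own current objects here.
_state = threading.local()


class ContextProxy:
    """Forwards attribute access to the object current in this thread."""

    def __init__(self, name):
        self._name = name

    def _current(self):
        try:
            return getattr(_state, self._name)
        except AttributeError:
            raise RuntimeError("no %s is active in this thread" % self._name) from None

    def __getattr__(self, attr):
        # Only called for attributes not found on the proxy itself.
        return getattr(self._current(), attr)


class Document:
    """Toy stand-in for the per-document state."""

    def __init__(self, docname):
        self.docname = docname


# Looks like a global, but resolves to a different object in every thread.
current_document = ContextProxy("document")


def describe():
    # Old-style code that reads "global" state keeps working unchanged.
    return "processing " + current_document.docname


def worker(docname, results):
    _state.document = Document(docname)
    results[docname] = describe()
```

Each thread that runs worker sees its own document through current_document, while code like describe does not have to change. Accessing the proxy outside of any context raises a RuntimeError in this sketch instead of silently returning stale data.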

1 comment:

  1. You're right about the reading step -- it's not easily parallelizable. However, the writing step -- at least for builders akin to HTML -- should be very easily parallelizable, and if you want to start somewhere, you can try this first, which also shouldn't hurt backwards compatibility.
