Into the C: Making PyGeoHash Decode Twice as Fast

This is the third post in a small thread about PyGeoHash performance. First I benchmarked it against the field, then I trimmed the Python wrapper that sat in front of the C extension. That wrapper work got encode within a hair of the C++ python-geohash library. But decode was still about four times slower, and I had already established that the slow part lived in the C extension itself. So this time I opened up the C.

I expected to be staring at floating point math and bit twiddling, looking for a clever trick. What I actually found was a function building objects it threw away and importing the same Python module twice on every single call. The fix made decode 67% faster and did not touch the geohash algorithm at all.

Reading the C with fresh eyes

Here is the shape of what decode was doing, lightly paraphrased from the C:

static PyObject* geohash_decode(PyObject *self, PyObject *args) {
    // ... parse the geohash string ...

    // Call decode_exactly, which builds an ExactLatLong named tuple
    PyObject *exact_result = geohash_decode_exactly(self, args);

    // Pull two fields back out of that named tuple
    PyObject *lat_obj = PyObject_GetAttrString(exact_result, "latitude");
    PyObject *lon_obj = PyObject_GetAttrString(exact_result, "longitude");
    double lat = PyFloat_AsDouble(lat_obj);
    double lon = PyFloat_AsDouble(lon_obj);
    // ... drop exact_result entirely ...

    // Import pygeohash.geohash_types, look up LatLong, build one, return it
}

Three things are wrong here, and none of them are math.

First, decode calls decode_exactly, which does the bit walking and then packages the answer into an ExactLatLong named tuple with four fields. Then decode immediately rips that named tuple apart with two attribute lookups, pulls the floats out, and discards the object. It builds a four field named tuple just to read two of its fields and throw it on the floor. Then it builds a second named tuple, a LatLong, to actually return. Two object constructions per call, one of them pure waste.

Second, both decode and decode_exactly ran this on every call:

PyObject *module = PyImport_ImportModule("pygeohash.geohash_types");
PyObject *cls = PyObject_GetAttrString(module, "LatLong");

Importing a module that is already loaded is not free. Python checks sys.modules, hands back the cached module, and then you do a dictionary lookup for the class. Doing that once is fine. Doing it millions of times in a decode loop is a tax you pay forever.

Third, and this one is subtle, building the named tuple goes through its Python level __new__. A typing.NamedTuple is a tuple subclass, and the constructor Python generates for it is roughly def __new__(cls, lat, lon): return tuple.__new__(cls, (lat, lon)). Calling the class from C invokes that Python function on every decode.

Three fixes, no new math

The bit walking was fine. I left it exactly as it was. What I changed was everything around it.

I pulled the decode math into a plain C helper that writes into four double pointers and returns a status code. No Python objects involved:

static int decode_to_doubles(const char *geohash, double *lat, double *lon,
                             double *lat_err, double *lon_err);

Now decode_exactly calls that helper and builds one ExactLatLong. And decode calls the same helper and builds one LatLong, with no ExactLatLong in sight. The discarded object is gone, and so are the two attribute lookups.

I cached the named tuple classes. The very first decode imports pygeohash.geohash_types and stashes the LatLong and ExactLatLong type objects in static pointers that live for the rest of the process. Every call after the first just uses them.

static PyObject *LatLong_type = NULL;

static int ensure_types(void) {
    if (LatLong_type != NULL) return 0;   // already cached
    PyObject *module = PyImport_ImportModule("pygeohash.geohash_types");
    LatLong_type = PyObject_GetAttrString(module, "LatLong");
    // ... and ExactLatLong ...
}

And I built the result tuple through tuple’s own tp_new instead of calling the class:

PyTuple_Type.tp_new((PyTypeObject *)LatLong_type, call_args, NULL);

That looks like a hack, and I went back and forth on it, but it is precisely the operation the generated __new__ performs under the hood. It produces a real LatLong, fields and all, and just skips the Python function call wrapping it. I measured the named tuple construction three ways to make sure I was not fooling myself:

LatLong(lat, lon)              99 ns
tuple.__new__(LatLong, ...)    63 ns
plain tuple (lat, lon)         27 ns

So the bypass buys about 36 nanoseconds a call. The plain tuple at the bottom is what python-geohash returns, and it is a reminder that a named tuple will never be free. That is a deliberate API choice on our end. The named fields are worth a few nanoseconds to me.

The bug I was not looking for

While I was in there, I noticed the character lookup did this:

int cd = base32_decode_map[(int)geohash[i]];

base32_decode_map has 128 entries. geohash[i] is a char, which is signed on most platforms. Feed decode a string with a byte at or above 128, which is exactly what any multibyte UTF-8 character looks like, and (int)geohash[i] is negative, and you are reading memory before the start of the array. It probably never crashed in practice because the garbage value almost always failed the validity check a line later, but it was an out of bounds read sitting in the hot path. I cast through unsigned char and added an explicit range check so those bytes are cleanly rejected as invalid. Performance work has a way of making you read code carefully enough to find the bugs that were always there.

Where it landed

Same machine, same benchmark suite, mean time per operation:

OperationBeforeAfterFaster by
decode959 ns318 ns67%
bounding box758 ns484 ns36%

Bounding box came along for the ride because it decodes internally. Decode went from roughly four times slower than python-geohash to about 1.2 to 1.4 times, which is as close as I can get while still handing back a named tuple.

Back to the starting line

That 67% is just this one change. The honest question is where the whole thread left PyGeoHash, so I reran the exact benchmark from the first post, all seven libraries, against today’s code. Here is PyGeoHash at the start of the series versus now, after the wrapper cleanup and this C work stacked together:

OperationStart of seriesNowFaster by
encode415 ns204 ns51%
decode1,560 ns325 ns79%
bounding box1,614 ns496 ns69%

Decode nearly quadrupled in speed across the two rounds of work, and I never touched the geohash algorithm once.

So did we move up the leaderboard? No. PyGeoHash is still in fourth place on encode and decode, behind the two Rust libraries (geohashr and pygeohash-fast) and the C++ python-geohash, and third on bounding box where fewer libraries compete. The podium did not change.

What changed is the distance to it. At the start of the series the gap looked like this on decode:

geohashr (Rust)          91 ns
pygeohash-fast (Rust)   170 ns
python-geohash (C++)    221 ns
pygeohash               325 ns   <- was 1,560 ns
... then a cliff ...
libgeohash            3,675 ns
geohash-tools         4,006 ns
geolib               62,593 ns

In the first post PyGeoHash decoded about seven times slower than python-geohash. Now it is about one and a half times slower, and on encode the two are a coin flip (204 ns versus 201). We did not pass anyone, but we went from being lapped to running on their shoulder, and we are still ten to two hundred times faster than every pure Python implementation in the field. For a library whose entire reason to exist is installing cleanly everywhere without a compiler, sitting just behind the native-code libraries and miles ahead of the Python ones is exactly the place I want to be.

The lesson, again

The theme of this whole thread keeps repeating. The slow part is almost never the part you assume. I went into the C expecting to optimize arithmetic and instead deleted an object construction, cached two lookups, and skipped a Python function call. When you profile a compiled extension, look at what it does with Python objects before you look at the math. Object creation, attribute access, and module imports are where the time hides, and they are usually easier to fix than the algorithm. As always, the change is up as a pull request, and it is mostly deletions.