Optimizing PyGeoHash Without Touching the C

A few days ago I benchmarked PyGeoHash against six other geohash libraries. The summary: we sit comfortably ahead of every pure Python implementation and a step behind the ones that drop into Rust or C++. The number that bugged me was decode. PyGeoHash decodes a geohash in about 1,500 nanoseconds. python-geohash, the C++ library, does it in about 220. We have a C extension too, so why were we seven times slower?

It turns out the C extension was not the problem. The Python wrapper sitting in front of it was. I spent an afternoon on it and got encode 43% faster, decode 37% faster, and bounding box 53% faster, without changing a single line of C. Here is how I found the slack and how you can do the same thing to your own library.

Step one: measure the floor

Before optimizing anything, find out how fast the thing could possibly go. Our decode wrapper does some validation, calls into the C extension, and wraps the result in a named tuple. The C call is the part I cannot easily make faster, so that is the floor. I timed it directly:

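import timeit
from pygeohash.cgeohash.geohash_module import decode as c_decode

# timeit returns total seconds for all runs; convert to nanoseconds per call
runs = 500_000
print(timeit.timeit(lambda: c_decode("ezs42e44y"), number=runs) / runs * 1e9)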

The bare C call came in around 760 nanoseconds. The full Python wrapper around it was 1,500. So roughly half the time spent in decode, about 740 nanoseconds, was happening in Python, before and after the C call ever ran. That is the gap worth chasing. If the wrapper had been at 780 against a 760 floor, I would have stopped here and gone to rewrite the C instead.

This is the single most useful habit in performance work. Separate the part you can change cheaply from the part you cannot, and measure each one. A profiler tells you where time goes. Timing the floor tells you how much of that time is even recoverable.
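Here is a minimal sketch of that comparison that runs without PyGeoHash installed. The names core, wrapper, and ns_per_call are hypothetical stand-ins I made up for illustration: core plays the compiled call, wrapper plays the Python function in front of it. The absolute numbers will not resemble the ones in this post; the point is the shape of the measurement.

```python
import timeit


def core(s):
    # Stand-in for the compiled call (hypothetical, not PyGeoHash's C extension).
    return (len(s) * 1.0, len(s) * 2.0)


def wrapper(s):
    # Stand-in for the Python wrapper: validate, then call the core.
    if not isinstance(s, str):
        raise ValueError("expected a string")
    if not all(c.isalnum() for c in s):
        raise ValueError("invalid character")
    return core(s)


def ns_per_call(func, arg, number=50_000, repeat=5):
    # timeit reports total seconds per batch; take the best batch
    # and convert it to nanoseconds per call.
    best = min(timeit.repeat(lambda: func(arg), number=number, repeat=repeat))
    return best / number * 1e9


floor = ns_per_call(core, "ezs42e44y")
full = ns_per_call(wrapper, "ezs42e44y")
print(f"floor {floor:.0f} ns, wrapper {full:.0f} ns, "
      f"recoverable at most {full - floor:.0f} ns")
```

Both measurements go through the same lambda, so its overhead lands on both sides and mostly cancels out of the difference.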

Step two: read the hot path like it costs money

Here is what the decode wrapper looked like:

def decode(geohash):
    if not isinstance(geohash, str):
        raise ValueError(...)
    if not geohash:
        raise ValueError(...)
    if not all(c in __base32 for c in geohash):
        raise ValueError(f"Invalid character in geohash: ...")

    logger.debug("Decoding geohash: %s", geohash)
    lat, lon = c_decode(geohash)
    logger.debug("Decoded to coordinates: lat=%f, lon=%f", lat, lon)
    return LatLong(latitude=lat, longitude=lon)

Two things jump out once you are looking for wasted work.

First, that all(c in __base32 for c in geohash) check. It builds a generator and walks every character of the input, in Python, checking membership against the base32 alphabet. For a nine character geohash that is nine membership tests and a generator setup, every single call. The question to ask is not “is this correct” but “is this necessary.” I dropped into the C extension and handed it garbage:

>>> c_decode("aio")
ValueError: Invalid character in geohash

The C code already validates characters and raises. The Python scan was doing the exact same work a second time, slower, in front of code that was going to reject the bad input anyway. Pure waste. I deleted it and let the C extension be the source of truth. The error message even still matches, so the existing tests passed unchanged.

Second, the two logger.debug calls. Logging is not free even when nobody is listening. Each call is an attribute lookup, a function call, and a level check, and you pay all of that on every decode whether or not a debug handler exists. Logging the inputs and outputs of a function that runs millions of times in a tight loop is a performance footgun. I pulled the debug logging out of the hot paths entirely and kept it only on the error branch of encode_strictly, where it runs once and only when something is actually wrong.
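If you want to see the cost of disabled logging for yourself, a sketch along these lines should do it. The logger name and both functions are made up for the example, and the level is set to WARNING so that no debug record is ever emitted. How large the difference is depends on your Python version and machine, so I am not quoting numbers for it.

```python
import logging
import timeit

logger = logging.getLogger("hotpath_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)            # debug records are dropped


def with_debug(x):
    # Same work as without_debug, plus two debug calls that emit nothing.
    logger.debug("input: %s", x)
    y = x * 2
    logger.debug("output: %s", y)
    return y


def without_debug(x):
    return x * 2


number = 50_000
t_with = min(timeit.repeat(lambda: with_debug(21), number=number, repeat=5))
t_without = min(timeit.repeat(lambda: without_debug(21), number=number, repeat=5))
print(f"with debug calls:    {t_with / number * 1e9:.0f} ns per call")
print(f"without debug calls: {t_without / number * 1e9:.0f} ns per call")
```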

What was left:

def decode(geohash):
    if not isinstance(geohash, str):
        raise ValueError(...)
    if not geohash:
        raise ValueError(...)
    return LatLong(*c_decode(geohash))

The type check and the empty check stay, because the C extension does not cover them. It happily returns (0, 0) for an empty string, so I cannot delegate that one. Knowing exactly which guarantees the lower layer gives you, and which it does not, is the whole game.
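One way to make that knowledge explicit is to probe the lower layer with bad inputs and record which ones it rejects. The sketch below uses a hypothetical pure Python stand-in, core_decode, that mimics the two behaviors described above (it rejects invalid characters and accepts the empty string). It is not the real extension, and rejects and probes are names I invented for the example. Against a real library you would point the probe at the actual compiled function.

```python
def core_decode(geohash):
    # Hypothetical stand-in for a compiled core: rejects characters outside
    # the geohash alphabet, but accepts the empty string without complaint.
    alphabet = "0123456789bcdefghjkmnpqrstuvwxyz"
    for c in geohash:
        if c not in alphabet:
            raise ValueError("Invalid character in geohash")
    return (0.0, 0.0)  # placeholder result; real decoding is omitted


def rejects(func, value):
    # True if func raises ValueError for this input, False if it returns.
    try:
        func(value)
    except ValueError:
        return True
    return False


probes = {"invalid character": "aio", "empty string": ""}
covered = {name: rejects(core_decode, value) for name, value in probes.items()}

for name, ok in covered.items():
    verdict = "core rejects it, wrapper check is redundant" if ok else "wrapper must check"
    print(f"{name}: {verdict}")
```

Kept as a test, a probe like this also fails loudly if a later version of the core stops validating something the wrapper has come to rely on.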

Step three: prove it, in the same session

The trap with before and after numbers is that they get measured days apart on a machine in a different state, and then you are comparing a warm laptop to a cold one. Run both in one sitting. Git stash makes this clean:

# optimized code is in the working tree
git stash          # back to baseline
pytest ...bench... # record before
git stash pop      # optimized again
pytest ...bench... # record after

Here is what fell out, same machine, same run, mean time per operation:

Operation      Before     After    Faster by
encode         410 ns     235 ns   43%
decode         1,514 ns   958 ns   37%
bounding box   1,614 ns   764 ns   53%

Bounding box won the most because it calls the decode path internally and then had three debug log calls of its own stacked on top. Overhead compounds. When you clean up a function that other functions lean on, everything downstream gets faster for free.
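For anyone checking the arithmetic, the "faster by" figures in the table are the fraction of time removed, (before - after) / before, not a speedup ratio. A short sketch that recomputes both from the table's mean times (the variable names are mine):

```python
# Mean times in nanoseconds, copied from the table above.
results = {
    "encode": (410, 235),
    "decode": (1514, 958),
    "bounding box": (1614, 764),
}

# Fraction of the original time that was removed.
faster_by = {name: (before - after) / before
             for name, (before, after) in results.items()}

# The same change expressed as a speedup ratio.
speedup = {name: before / after
           for name, (before, after) in results.items()}

for name in results:
    print(f"{name}: {faster_by[name]:.0%} less time, {speedup[name]:.2f}x speedup")
```

The two framings diverge quickly: cutting 53% of the time is a bit more than a 2x speedup, which is why it helps to say which one a number means.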

Against the field, encode went from roughly twice as slow as the C++ library to about 1.2 times as slow. Decode and bounding box roughly halved their distance to it. The remaining gap is real and it lives in the C extension, which is a bigger project for another day. But closing this much by deleting code is the best kind of win.

The general recipe

None of this is specific to geohashing. If you maintain a Python library with a compiled core, the pattern is the same every time.

  1. Time the compiled call by itself. That is your floor and it tells you whether the wrapper is even worth optimizing.
  2. Find work the wrapper does that the lower layer already does. Redundant validation is the classic offender, because people add belt-and-suspenders checks at every layer and never measure the cost.
  3. Get logging and other observability out of the hot path. Useful at the edges, expensive in the middle.
  4. Construct your return objects with positional arguments and avoid building throwaway collections per call.
  5. Measure before and after in the same session, and keep the benchmark in the repo so the next person does not have to rediscover any of this.
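Item 4 is the one the walkthrough shows only in passing, in the switch from LatLong(latitude=lat, longitude=lon) to LatLong(*c_decode(geohash)). Here is a small sketch for comparing the two construction styles on your own machine. The LatLong below is a plain collections.namedtuple stand-in and core is a made-up placeholder for the compiled call, so treat the result as an indication rather than a measurement of PyGeoHash.

```python
import timeit
from collections import namedtuple

# Hypothetical stand-in for the library's result type.
LatLong = namedtuple("LatLong", ["latitude", "longitude"])


def core():
    # Stand-in for a compiled call that returns a plain tuple.
    return (42.605, -5.603)


def by_keyword():
    # Unpack into locals, then pass them back by name.
    lat, lon = core()
    return LatLong(latitude=lat, longitude=lon)


def by_position():
    # Hand the tuple straight to the constructor.
    return LatLong(*core())


number = 50_000
for func in (by_keyword, by_position):
    best = min(timeit.repeat(func, number=number, repeat=5))
    print(f"{func.__name__}: {best / number * 1e9:.0f} ns per call")
```

Both functions return equal objects, so the choice affects only per-call cost and readability. The keyword form is easier to read, which is a fair reason to keep it anywhere that is not a hot path.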

The benchmark suite from the last post made all of this honest. I could see exactly where I stood, make a change, and watch the number move. That is the real argument for writing the benchmark before you start tuning. You cannot optimize what you are not measuring, and you definitely cannot prove you helped.

The change is up as a pull request if you want to see the whole diff. It is small, which is the point.