From Weekend Hack to Core Tool: The category_encoders Journey

Every year during the quiet week between Christmas and New Year’s, I find myself checking in on category_encoders. What started as a weekend programming experiment has evolved into something I never could have imagined.

The Humble Beginnings

The project began as a simple playground for experimenting with different categorical encoding methods. It was just me, tinkering with various ways to handle categorical variables in machine learning pipelines. The initial version was rough around the edges, more a proof of concept than a production-ready library.

Once I realized others might benefit from these implementations, that playground quickly grew into a more structured package.
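
For readers who haven't run into the problem: most models want numbers, and categorical columns are strings. Below is a minimal sketch of three common encoding ideas (ordinal, one-hot, and a bare-bones target encoding) in plain Python. To be clear, this is not how category_encoders implements anything; the function names and the toy data are made up for illustration, and the target encoder here skips the smoothing and leakage safeguards a real implementation would need.

```python
from collections import defaultdict


def ordinal_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 indicator column per category (categories sorted by name)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def target_encode(values, targets):
    """Replace each category with the mean target seen for it (no smoothing)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]


# Hypothetical toy data: a color column and a binary "bought" target.
colors = ["red", "green", "red", "blue"]
bought = [1, 0, 0, 1]

ordinal, mapping = ordinal_encode(colors)
one_hot, categories = one_hot_encode(colors)
target = target_encode(colors, bought)
```

Each approach trades something off: ordinal codes are compact but imply an order that may not exist, one-hot columns avoid that at the cost of width, and target encoding stays narrow but can leak the label if applied carelessly. Wanting to compare those trade-offs side by side was exactly the kind of tinkering the playground was for.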

The scikit-learn-contrib Era

A significant turning point came when scikit-learn launched its scikit-learn-contrib project, an initiative that provided a framework for high-quality extensions to the scikit-learn ecosystem. Submitting category_encoders forced us to raise our standards. The process brought in new contributors and helped establish better engineering practices.

The package continued to grow. It became available on conda-forge and was eventually published in the Journal of Open Source Software, which gave it academic credibility and made it easier for researchers to cite.

Passing the Torch

As often happens in open source, life eventually pulled me in different directions. This is where the beauty of community-driven development really shone. Jan Motl stepped up to maintain the project, transforming it from a popular but experimental library into a well-engineered piece of software. Later, Paul Westenthanner took the reins, ensuring the project’s continued growth and stability.

The Impact Today

The numbers tell a remarkable story:

Over 4,000 GitHub projects depend on category_encoders
25+ million monthly downloads from PyPI
24 academic citations and counting
40+ contributors who’ve shaped the project

Notable projects using the library include:

PyCaret
Microsoft Recommenders
Deepchecks
And many more

Lessons in Open Source

This journey has taught me several valuable lessons about open source development:

Start Small: You don’t need a grand vision to create something valuable. Sometimes the best projects start as simple experiments.
Community Matters: The project truly flourished when it became bigger than any single maintainer.
Succession Planning: Having capable maintainers take over ensured the project’s longevity beyond my active involvement.
Standards Help: Joining scikit-learn-contrib forced us to improve our engineering practices, benefiting everyone.

Looking Forward

As I reflect on this journey, I’m filled with gratitude for everyone who has contributed to making category_encoders what it is today. From Jan and Paul’s leadership to every contributor who submitted a pull request or reported an issue: this is what makes open source special.

The project continues to evolve and improve, serving as a reminder that open source software is one of the most powerful ways we can collaborate to build something lasting and impactful.

Here’s to all the contributors, users, and future maintainers who will continue to shape this tool. Open source is indeed cool.

Note: If you’re interested in how modern AI tools can help maintain open source projects like this one, check out my post on Using Cursor for Open Source Library Maintenance, where I discuss how AI-powered coding tools are changing the way we approach library maintenance.
