From Weekend Hack to Core Tool: The category_encoders Journey
Every year during the quiet week between Christmas and New Year’s, I find myself checking in on category_encoders. What started as a weekend programming experiment has evolved into something I never could have imagined.
The Humble Beginnings
The project began as a simple playground for experimenting with different categorical encoding methods. It was just me, tinkering with various approaches to handle categorical variables in machine learning pipelines. The initial version was rough around the edges - more of a proof of concept than a production-ready library.
That exploration quickly grew into a more structured package once I realized others might benefit from these implementations.
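For readers new to the topic, here is a rough sketch of what a categorical encoder does. It is plain Python, and it is not code from category_encoders: the class name MeanTargetEncoder and the toy data are made up for illustration. It shows one common idea, mean target encoding, where each category is replaced by the average target value seen for it during fitting, with the overall mean as a fallback for unseen categories.

```python
from collections import defaultdict


class MeanTargetEncoder:
    """Hypothetical, illustrative encoder; NOT part of category_encoders."""

    def fit(self, categories, targets):
        # Accumulate the sum and count of the target for each category.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for category, target in zip(categories, targets):
            sums[category] += target
            counts[category] += 1
        # Fallback value for categories never seen during fit.
        self.global_mean_ = sum(targets) / len(targets)
        self.means_ = {c: sums[c] / counts[c] for c in sums}
        return self

    def transform(self, categories):
        # Replace each category with its learned mean target.
        return [self.means_.get(c, self.global_mean_) for c in categories]


# Toy data, invented for this example.
colors = ["red", "blue", "red", "green"]
clicked = [1, 0, 0, 1]

encoder = MeanTargetEncoder().fit(colors, clicked)
encoded = encoder.transform(["red", "blue", "purple"])
print(encoded)
```

The fit/transform split mirrors the scikit-learn convention, which is what lets encoders like this slot into a machine learning pipeline. A real implementation would also need smoothing and protection against target leakage, which this sketch leaves out.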
The scikit-learn-contrib Era
A significant turning point came when the scikit-learn team launched the scikit-learn-contrib project, an initiative that provided a framework for high-quality extensions to the scikit-learn ecosystem. Submitting category_encoders forced us to raise our standards. The process brought in new contributors and helped establish better engineering practices.
The package continued to grow, becoming available on conda-forge and eventually being published in the Journal of Open Source Software, which provided academic credibility and made it easier for researchers to cite.
Passing the Torch
As often happens in open source, life eventually pulled me in different directions. This is where the beauty of community-driven development really shone. Jan Motl stepped up to maintain the project, transforming it from a popular but experimental library into a well-engineered piece of software. Later, Paul Westenthanner took the reins, ensuring the project’s continued growth and stability.
The Impact Today
The numbers tell a remarkable story:
- Over 4,000 GitHub projects depend on category_encoders
- 25+ million monthly downloads from PyPI
- 24 academic citations and counting
- 40+ contributors who’ve shaped the project
Notable projects using the library include:
- PyCaret
- Microsoft Recommenders
- Deepchecks
- And many more
Lessons in Open Source
This journey has taught me several valuable lessons about open source development:
Start Small: You don’t need a grand vision to create something valuable. Sometimes the best projects start as simple experiments.
Community Matters: The project truly flourished when it became bigger than any single maintainer.
Succession Planning: Having capable maintainers take over ensured the project’s longevity beyond my active involvement.
Standards Help: Joining scikit-learn-contrib forced us to improve our engineering practices, benefiting everyone.
Looking Forward
As I reflect on this journey, I’m filled with gratitude for everyone who has contributed to making category_encoders what it is today. From Jan and Paul’s leadership to every contributor who submitted a pull request or reported an issue - this is what makes open source special.
The project continues to evolve and improve, serving as a reminder that open source software is one of the most powerful ways we can collaborate to build something lasting and impactful.
Here’s to all the contributors, users, and future maintainers who will continue to shape this tool. Open source is indeed cool.
Note: If you’re interested in how modern AI tools can help maintain open source projects like this one, check out my post on Using Cursor for Open Source Library Maintenance, where I discuss how AI-powered coding tools are changing the way we approach library maintenance.