Category Encoders accepted into scikit-learn-contrib

In the past I’ve posted a few times about a library I’m working on called category encoders. The idea is to provide a complete toolbox of scikit-learn-compatible transformers for encoding categorical variables in different ways.
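
Each encoder follows the standard scikit-learn fit/transform interface, so it can be dropped into an existing workflow or Pipeline. Below is a minimal sketch of what that looks like, assuming the BinaryEncoder transformer and its cols parameter; the example data and column names are made up for illustration.

    # Minimal sketch: encoding categorical columns with category_encoders.
    # Assumes BinaryEncoder and its cols parameter; data is illustrative only.
    import pandas as pd
    import category_encoders as ce

    X = pd.DataFrame({
        'color': ['red', 'green', 'blue', 'green'],
        'size': ['S', 'M', 'L', 'M'],
    })

    # Like any scikit-learn transformer, the encoder is fit on the data
    # and then used to transform it (or placed inside a Pipeline).
    encoder = ce.BinaryEncoder(cols=['color', 'size'])
    X_encoded = encoder.fit_transform(X)
    print(X_encoded.head())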

Scikit-learn is an extremely popular Python package that extends NumPy and SciPy to provide rich machine learning functionality. It’s one of the most active Python open-source projects and generally has a reputation for being extremely high quality.

In the past year or so, some of the core scikit-learn developers started a project called scikit-learn-contrib, which focuses on providing a collection of scikit-learn compatible libraries that are both easy to use and easy to install. Unlike scikit-learn itself, algorithms implemented in contrib libraries may be experimental or not as mature.

The projects currently in scikit-learn-contrib are:

lightning

Large-scale linear classification, regression and ranking. Maintained by Mathieu Blondel and Fabian Pedregosa.

py-earth

A Python implementation of Jerome Friedman’s Multivariate Adaptive Regression Splines. Maintained by Jason Rudy and Mehdi.

imbalanced-learn

Python module to perform under-sampling and over-sampling with various techniques. Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

polylearn

Factorization machines and polynomial networks for classification and regression in Python. Maintained by Vlad Niculae.

forest-confidence-interval

Confidence intervals for scikit-learn forest algorithms. Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

hdbscan

A high performance implementation of HDBSCAN clustering. Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

And now Category Encoders joins this excellent collection of tools! Check it out in its new home, take a look at the other great projects, and if you want to help push it forward, let me know.

The Journey So Far

This project began as an exploration of different methods for encoding categorical variables in machine learning models. What started as a simple experiment has grown into a robust library that implements a variety of encoding techniques in a scikit-learn-compatible way.

The acceptance into scikit-learn-contrib marks an important milestone for the project, but it’s just one step in its evolution. The library continues to grow, with more encoders being added and improvements being made to the existing ones.

What’s Next

With this new home come new responsibilities. We’re working on improving documentation, test coverage, and overall code quality. We’re also planning to add more encoders and make the existing ones more robust.

If you’re interested in using category encoders in your research, you’ll be pleased to know that it was later published in the Journal of Open Source Software, making it easy to cite in academic work. And if you’re curious about how a weekend project grew into a widely used tool, check out my reflection on the category_encoders journey.