category_encoders in the Wild: A Tour of the Research

When you publish an open source library, you have some idea of who might use it. Data scientists, ML engineers, maybe some Kaggle competitors. What you don’t expect is that one day you’ll find your little categorical encoding library cited in a paper about cleaning toxic chemicals out of water. Or predicting how pills dissolve. Or catching financial criminals. But here we are.

Ages ago I did a series of blog posts here on different categorical encoding techniques and threw the code up on GitHub. Gradually I worked on it until it was installable, and eventually got it into scikit-learn-contrib and the Journal of Open Source Software. Years ago I handed maintenance of the project over to a few incredible maintainers who have kept it humming and growing ever since.

The category_encoders JOSS paper has picked up around 40 citations over the years, and I recently spent an afternoon going through them. The sheer variety of domains is the part that gets me. These aren’t all ML methodology papers; most come from practitioners in other fields reaching for machine learning as a tool to solve problems in their own domain, and category_encoders happened to be part of their toolkit.

Proud dad moment for sure.

The Breadth

Here’s a partial list of what people have used category_encoders for, just to give you a sense of the range:

  • Environmental science: Designing better biochar catalysts for water treatment, predicting free radical content in carbon materials
  • Fintech: Detecting fraud across real-world financial transaction datasets
  • Transportation: Generating synthetic populations for travel demand modeling with GANs
  • Pharmaceuticals: Predicting particle sizes in spray-dried drug dispersions
  • Remote sensing: Mapping vegetation canopy height across the entire contiguous United States
  • Public safety: Analyzing NYPD stop-and-frisk data, detecting phantom 911 calls
  • Government: Building fraud detection systems for government claims
  • Education: Predicting student academic performance in online learning
  • Semiconductors: Modeling logic device performance from manufacturing process conditions
  • Real estate: Evaluating the impact of encoding on property mass valuation
  • Privacy: Growing synthetic datasets with differential privacy guarantees
  • Water resources: Estimating public water supply withdrawals across the U.S.
  • Cybersecurity: Detecting username enumeration attacks on SSH
  • Automotive: Real-time driver distraction recognition

Some of these are exactly what you’d expect from a categorical encoding library. Others, like the biochar water treatment paper, make you do a double take. But that’s the thing about building infrastructure-level tools: they end up in places you never imagined.

A Closer Look at Five Papers

1. Is There Yet Anything “Hotter” Than One-Hot Encoding?

Poslavskaya & Korolev, 2023, arXiv (34 citations)

This one is the most directly relevant to category_encoders itself, and the title alone made me laugh. Researchers from Huawei’s Novosibirsk Research Center ran a comprehensive study of categorical encoding effects across a large sample of classification problems from the OpenML repository. Their finding? For multiclass problems, good old one-hot encoding and Helmert contrast coding actually outperform the fancier target-based encoders.

I appreciate the honesty of this result. There’s a natural tendency in ML to assume that more sophisticated methods must be better. Sometimes the boring approach wins. That said, the picture gets more nuanced when you factor in high-cardinality features and dataset size, which is exactly why a library with many encoding options is useful: there’s no single right answer.
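To make the comparison concrete, here’s a minimal sketch of the two encoding families the paper weighs against each other: one-hot expansion versus a target-based (mean) encoding. It uses plain pandas and a toy dataset with hypothetical column names, not the paper’s actual setup or the library’s API.

```python
import pandas as pd

# Toy dataset: one categorical feature and a binary target.
# Column names and values are illustrative, not from the paper.
df = pd.DataFrame({
    "color":  ["red", "blue", "red", "green", "blue", "red"],
    "target": [1, 0, 1, 0, 1, 0],
})

# One-hot encoding: one indicator column per level. Simple and
# leakage-free, but the width grows with cardinality.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target (mean) encoding: replace each level with the mean of the
# target over rows sharing that level. Compact at any cardinality,
# but it leaks target information unless fit on held-out folds.
means = df.groupby("color")["target"].mean()
target_encoded = df["color"].map(means)

print(one_hot.shape)          # (6, 3): one column per level
print(target_encoded.tolist())
```

The trade-off the paper probes is visible even here: one-hot keeps the feature space honest at the cost of width, while the target-based column compresses everything into a single number whose quality depends on how many rows back each level.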

2. Enhancing Biochar for Water Treatment with Machine Learning

Wang et al., 2023, Environmental Science & Technology (117 citations)

This is the most-cited paper in the bunch by a wide margin, and it’s about as far from my original use case as you can get. The researchers used ML to guide the design of biochar (essentially a charcoal-like material made from biomass) for breaking down pollutants in water. They needed to encode categorical variables describing different biomass precursors and preparation conditions, which is where category_encoders came in.

Their ML models identified that high specific surface area and oxygen content significantly enhance the nonradical degradation pathways. They then actually synthesized new biochar materials based on the ML predictions and confirmed the results experimentally. It’s a genuinely cool loop: ML predicts what material properties matter, scientists build the materials, and the predictions hold up. The fact that categorical encoding of feedstock types was part of that pipeline is the kind of thing that makes maintaining open source feel worthwhile.

3. Follow the Trail: Fraud Detection in Fintech

Stojanovic et al., 2021, Sensors (74 citations)

Financial fraud detection is one of those problems where the data is inherently messy and categorical. Transaction types, merchant categories, account types: it’s categories all the way down. This paper evaluated multiple anomaly detection methods across both real-world and synthetic financial datasets, with a particular focus on the false positive problem. A fraud detection system that flags everything as suspicious isn’t useful, and this paper digs into that tradeoff in practical terms.

The category_encoders library shows up in their preprocessing pipeline, which makes sense. When you’re trying to feed merchant category codes and transaction type strings into a gradient boosted tree, you need to make some encoding decisions. The paper’s contribution is in showing how different ML methods handle the class imbalance inherent in fraud data (most transactions are legitimate), but the categorical encoding step is a necessary precondition for all of it.
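One wrinkle in fraud data is that rare merchant categories can have wildly unreliable fraud rates. A common remedy, sketched here with plain pandas on made-up transactions (the merchant codes, smoothing constant, and column names are all hypothetical, not from the paper), is to smooth each category’s rate toward the global prior:

```python
import pandas as pd

# Hypothetical transactions: merchant category code and fraud label.
tx = pd.DataFrame({
    "mcc":   ["5411", "5411", "5411", "5411", "7995", "7995", "6051", "5411"],
    "fraud": [0,      0,      0,      1,      1,      1,      0,      0],
})

global_rate = tx["fraud"].mean()   # prior: overall fraud rate
k = 5                              # smoothing strength (pseudo-counts)

stats = tx.groupby("mcc")["fraud"].agg(["sum", "count"])
# Smoothed target encoding: blend each category's observed fraud rate
# with the global rate, weighted by how many rows we actually saw.
stats["encoded"] = (stats["sum"] + k * global_rate) / (stats["count"] + k)

tx["mcc_encoded"] = tx["mcc"].map(stats["encoded"])
```

Note what the smoothing does: "7995" is 100% fraud in this toy sample, but with only two observations its encoded value is pulled well back toward the prior instead of being pinned at 1.0, which is exactly the kind of regularization that keeps rare categories from dominating a downstream tree model.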

4. Synthetic Populations with GANs for Transportation Modeling

Badu-Marfo, Farooq & Patterson, 2020, IEEE Transactions on Intelligent Transportation Systems (36 citations)

Transportation modeling needs synthetic populations: realistic but fake people with demographics, travel patterns, and mobility sequences. This paper presents CTGAN, a generative adversarial network that creates synthetic agents with both tabular attributes (age, sex, income) and sequential mobility data (trip trajectories). The challenge is that tabular demographic data is heavily categorical, and you need to encode it properly before a GAN can work with it.

What I find interesting here is that this is a generative task, not a predictive one. Most uses of category_encoders are about preparing features for classification or regression. Using categorical encoders as part of a pipeline to generate new synthetic data is a creative application I hadn’t really considered when building the library.
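A generative pipeline adds a requirement that predictive ones don’t have: the encoding must be invertible, because whatever the model emits has to be mapped back to real category labels. Here’s a minimal round-trip sketch with plain pandas and toy demographic columns (the data and column names are invented for illustration; this is not the paper’s architecture):

```python
import pandas as pd

# Toy demographic table; columns are illustrative, not from the paper.
people = pd.DataFrame({
    "sex":    ["F", "M", "F", "F", "M"],
    "income": ["low", "high", "mid", "low", "mid"],
})

# Forward pass: map each category to an integer code a generative
# model can consume, remembering the category order for the inverse.
encoders = {col: pd.Categorical(people[col]).categories
            for col in people.columns}
codes = people.apply(
    lambda s: pd.Categorical(s, categories=encoders[s.name]).codes)

# A generative model would emit new code vectors here; we reuse the
# originals just to demonstrate the round trip.
decoded = pd.DataFrame({
    col: encoders[col][codes[col].to_numpy()] for col in codes.columns
})

assert decoded.equals(people)   # the encoding is lossless
```

For synthesis the inverse step is the whole point: one-way encodings like hashing would leave you with synthetic agents you can’t translate back into human-readable attributes.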

5. Predicting Drug Particle Size with Machine Learning

Schmitt, Baumann & Morgen, 2022, Pharmaceutical Research (25 citations)

Spray drying is a common manufacturing technique in pharmaceuticals, and the particle size of the resulting powder is a critical quality attribute. Too big and the drug doesn’t dissolve properly. Too small and it doesn’t flow through manufacturing equipment. This paper built an ensemble ML model to predict particle size from formulation and process parameters, with prediction errors between -7.7% and 18.6%.

The categorical variables here include things like drug compound identity and equipment configurations. SHAP analysis revealed which formulation and process parameters drove particle size variations, and the explanations were consistent with what pharmaceutical scientists already understood mechanistically. That’s the dream outcome for ML in the sciences: not just accurate predictions, but interpretable ones that align with domain expertise.

What This Tells Me

Looking at all 40-ish citing papers together, a few things stand out:

Infrastructure matters more than algorithms. Most of these papers aren’t about encoding itself. They’re about fraud, or water treatment, or drug manufacturing, and encoding is just one step in a longer pipeline. But if that step is painful or error-prone, the whole pipeline breaks down. Making categorical encoding boring and reliable is the whole point.

Domain diversity is the real validation. When a tool gets used across environmental chemistry, fintech, pharmaceutical manufacturing, and transportation modeling, it suggests the API design generalizes well. These are very different types of users with different data shapes and different workflows, and they all found the library useful enough to mention.

The boring stuff gets cited. The JOSS paper is not a groundbreaking contribution to encoding theory. It’s a description of a software package. But practical, well-documented tools accumulate citations steadily over years because they keep being useful to people doing real work.

I still find it a little surreal that a library I started building to deal with fault codes at a startup ended up in a paper about cleaning polluted water. Open source is weird like that, and I mean that in the best possible way.