Artificial Intelligence & Machine Learning
This section contains my writings on artificial intelligence and machine learning, drawing from years of experience leading AI initiatives across various organizations. Topics range from practical implementation advice to theoretical discussions and industry trends.
McCabe Complexity: The Python Metric You Should Care About
April 24, 2025
Tags:python, code-quality, best-practices, development, tools
A practical guide to understanding, measuring, and managing code complexity in Python
Python Logging Best Practices for Library Developers
April 20, 2025
Tags:python, logging, libraries, best-practices, development
A comprehensive guide to implementing logging in Python libraries - from basic setup to advanced patterns and common pitfalls to avoid
Introducing 'stargazers': A Tool to Understand Your GitHub Audience
April 16, 2025
Tags:github, opensource, community, cli, python, data
Announcing a new CLI tool to fetch, analyze, and summarize the stargazers and forkers of any public GitHub repository, inspired by Cockroach Labs' analysis.
HashingEncoder: Tackling Extreme Cardinality with the Hashing Trick
April 15, 2025
Tags:python, machine-learning, category-encoders, feature-engineering, hashing-trick
Exploring the HashingEncoder - how it works, when to use it, and implementation with category_encoders
BinaryEncoder: The Space-Efficient Alternative to One-Hot Encoding
April 13, 2025
Tags:python, machine-learning, category-encoders, feature-engineering, binary-encoding
Exploring the BinaryEncoder - how it works, when to use it, and implementation with category_encoders
OrdinalEncoder: When Order Matters in Categorical Data
April 10, 2025
Tags:python, machine-learning, category-encoders, feature-engineering, ordinal-encoding
Exploring the OrdinalEncoder - how it works, when to use it, and implementation with category_encoders
Makefiles: The Unsung Hero of Python Development
April 8, 2025
Tags:python, development, tools, automation
I’ve been building Python libraries for years, and there’s one tool that consistently makes my life easier: the humble Makefile. Yes, that decades-old build automation tool from the C world. It might seem old school, but it’s become an essential part of my Python development workflow, and today I want to show you why.
What’s a Makefile Anyway?
If you’re coming from a pure Python background, you might be wondering what Make is and why we’re talking about it. Make is a build automation tool that’s been around since 1976. Think of it as a simple way to define shortcuts for common commands in your project. Instead of remembering and typing out long commands, you can define them once in a Makefile and run them with a simple make command.
Modern Python Package Publishing: PyGeoHash's New CI/CD Pipeline
April 6, 2025
Tags:python, pygeohash, github-actions, cicd, pypi, development
How we're using GitHub Actions and PyPI's Trusted Publisher to automate PyGeoHash releases
PyGeoHash Gets Type Hints: A Journey into Modern Python
April 3, 2025
Tags:python, pygeohash, type-hints, development, open-source
Adding comprehensive type hints and a new types module to PyGeoHash for better developer experience and code quality
Optimal Bankroll Management with Keeks: The Kelly Criterion
April 1, 2025
Tags:python, open-source, keeks, finance, betting, kelly-criterion
In this first post of our series on bankroll management strategies in Keeks, we’ll dive into the Kelly Criterion - the mathematical foundation of optimal betting and the inspiration behind the library’s name.
Documenting Your Library's API: Best Practices
March 30, 2025
Tags:python, documentation, api reference, sphinx, autodoc, docstrings, library, development, best practices
Create a clear, comprehensive, and easy-to-navigate API reference using Sphinx and autodoc. Learn best practices for structure, content, and cross-referencing.
OneHotEncoder: The Workhorse of Categorical Encoding
March 27, 2025
Tags:python, machine-learning, category-encoders, feature-engineering, one-hot-encoding
A deep dive into OneHotEncoder - how it works, when to use it, and implementation with category_encoders
Understanding the Elo Rating System: The Grandfather of Competitive Rankings
March 25, 2025
Tags:python, open-source, elote, rating-systems, elo, algorithms
A deep dive into the Elo rating system - how it works, its history, and implementing it with Elote
Automating Documentation Builds and Deployment with GitHub Actions and GitHub Pages
March 23, 2025
Tags:python, documentation, sphinx, github actions, ci/cd, github pages, automation, library, development
Keep your documentation up-to-date automatically. Learn how to set up a GitHub Actions workflow to build your Sphinx docs and deploy them to GitHub Pages.
Crafting Useful Code Examples: From Basic Snippets to Real-World Scenarios
March 22, 2025
Tags:python, documentation, code examples, sphinx, doctest, writing, library, development
Good code examples are crucial for documentation. Learn how to write examples that are clear, runnable, and effectively demonstrate your library's features.
Keeks 0.1.0 Release: Optimal Bankroll Management Made Simple
March 18, 2025
Tags:python, open-source, keeks, finance, betting, kelly-criterion
I’m excited to announce the release of Keeks 0.1.0, my Python library for optimal bankroll allocation and betting strategies.
If you’ve ever wondered how to mathematically optimize your betting or investment strategy, Keeks might be just what you’ve been looking for.
Getting Started with Sphinx for Python Project Documentation
March 15, 2025
Tags:python, documentation, sphinx, restructuredtext, docstrings, library, development
Generate professional, cross-referenced documentation for your Python library using Sphinx, the de facto standard tool.
Elote 1.0.0 Release: Rating Systems Made Simple
March 13, 2025
Tags:python, open-source, elote, rating-systems, elo
After what feels like forever (and honestly, it kind of has been), I’m thrilled to announce that Elote 1.0.0 has been released.
For those who haven’t been following along, Elote is my little Python library for implementing and comparing rating systems like the Elo system used in chess.
PyGeoHash v3.0.0: Faster, Freer, and More Pythonic
March 11, 2025
Tags:python, open-source, geospatial, pygeohash, performance, licensing
A deep dive into the latest major release of PyGeoHash, featuring a complete rewrite in pure CPython, MIT relicensing, and dramatic performance improvements
Using Cursor for Open Source Library Maintenance
March 9, 2025
Tags:cursor, open-source, python, ai, library-maintenance, pygeohash, keeks, elote
How AI-powered coding tools like Cursor can simplify the maintenance burden of open source projects, featuring real-world examples from PyGeoHash, Keeks, and Elote
Writing Effective Docstrings: Google vs. NumPy vs. reStructuredText Styles
March 6, 2025
Tags:python, documentation, docstrings, sphinx, autodoc, google style, numpy style, restructuredtext, library, development
Learn how to write clear, informative docstrings that Sphinx can understand, comparing the popular Google, NumPy, and native reStructuredText formats.
PyGeoHash 2.1.0: Modernizing a Geospatial Python Library
March 4, 2025
Tags:python, geospatial, open-source, geohash, cursor, claude
A look at the latest updates to PyGeoHash, a lightweight Python library for working with geohashes, and how modern AI tools helped revitalize the project.
Geohash: When Clever Isn't Always Smart
March 2, 2025
Tags:geohash, gis, python, algorithms, data-engineering
Exploring the common pitfalls and limitations of the Geohash algorithm, and when you might want to rethink using it.
Where Did All the RAM Go? Memory Profiling with Memray
March 1, 2025
Tags:python, testing, profiling, performance, memory, memray, optimization, library, development
High CPU usage isn't the only performance problem. Learn how Memray helps you track down memory leaks and excessive allocation in your Python library.
Claude 3.7 and new Cursor: first impressions
February 26, 2025
Tags:data-science, ai, tools
Anthropic finally released Claude 3.7, and Cursor has a new version with some interesting features. Let’s take a look at what’s new and some first impressions.
Finding the Slowdown: Profiling Python Code with Pyinstrument
February 25, 2025
Tags:python, testing, profiling, performance, pyinstrument, optimization, library, development
Your benchmark says a function is slow, but why? Profilers like Pyinstrument help you pinpoint exactly where your Python code is spending its time.
How Fast Is It? Benchmarking Your Code with Pytest-Benchmark
February 22, 2025
Tags:python, testing, pytest, benchmark, performance, pytest-benchmark, library, development
Functionality is crucial, but performance matters too. Learn how to easily measure the speed of your Python library code using the pytest-benchmark plugin.
From Silos to Shared Libraries: A Practical Guide to Inner Source Adoption
February 18, 2025
Tags:python, inner-source, library-development, best-practices, governance, security, development-practices
A step-by-step guide for transitioning from team-specific code to shared libraries, including governance models, security considerations, and standardized development practices.
Mastering Mocking in Python with pytest-mock
February 16, 2025
Tags:python, testing, pytest, mocking
A practical guide to mocking in Python testing - from basic concepts to advanced techniques with pytest-mock and other helpful libraries
Building Your Internal Library Developer Community
February 15, 2025
Tags:python, inner-source, community-building, corporate-culture, development-practices, library-development, best-practices
Learn how to build and nurture a thriving community of library developers within your organization through effective incentives, recognition, and collaboration practices.
Will It Blend? Testing Across Environments with Tox
February 13, 2025
Tags:python, testing, tox, pytest, library, development, compatibility, ci-cd, virtualenv
Your library works on your machine with Python 3.11. Great! But what about Python 3.9? Or 3.12? Tox ensures compatibility across different Python versions and dependency sets.
Inner Source: Bringing Open Source Culture Inside Your Organization
February 11, 2025
Tags:python, inner-source, open-source, corporate-culture, development-practices, library-development, best-practices
Learn how to harness the power of open source development practices within your organization through inner source principles and practices.
Data Science Things Roundup #13
February 10, 2025
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Are Your Tests Enough? Measuring Coverage with Coverage.py
February 9, 2025
Tags:python, testing, pytest, coverage, pytest-cov, library, development, code-quality
Writing tests is step one. Step two is knowing what parts of your library code those tests actually exercise. Enter Coverage.py.
Designing for Developer Joy: Python Library Ergonomics
February 6, 2025
Tags:python, api-design, library-development, best-practices, developer-experience, programming, ergonomics
Explore the principles and practices that make Python libraries a joy to use, from naming conventions to error messages that guide users to solutions.
Why Your Library Needs Pytest (And How to Get Started)
February 4, 2025
Tags:python, testing, pytest, library, development, code-quality
Testing is non-negotiable for Python libraries. Let's explore why, and how the popular Pytest framework makes writing tests less painful and more powerful.
The Art of API Design: Making the Right Things Easy
February 3, 2025
Tags:python, api-design, library-development, best-practices, developer-experience, programming
Learn the principles of intuitive API design in Python, focusing on making common operations simple while keeping advanced functionality possible.
Secure Coding Practices for Python Library Developers
February 2, 2025
Tags:python, security, secure coding, best practices, library, development, input validation, least privilege, error handling
Beyond specific tools, what general principles should guide secure library development in Python? An overview of essential secure coding practices.
Taming the Python Chaos: Linting & Formatting with Ruff
January 30, 2025
Tags:python, linting, formatting, ruff, code-quality, development, ci-cd, github-actions
What linting and formatting actually are, why they matter (a lot!), and how the speedy tool Ruff can save your Python project (and your sanity).
Handling Sensitive Data Securely Within Your Python Library
January 29, 2025
Tags:python, security, sensitive data, secrets management, pii, library, development, secure coding
Does your library handle API keys, passwords, or personal information? Learn best practices for securely managing sensitive data within your Python code.
Decoding Library Updates: Understanding Semantic Versioning (SemVer)
January 28, 2025
Tags:python, packaging, versioning, semver, library, development, dependencies, pip
What does v1.2.3 actually mean? A guide for Python library authors on using Semantic Versioning to communicate changes and manage dependencies effectively.
Dependency Security: Managing Vulnerabilities with pip-audit
January 27, 2025
Tags:python, security, dependencies, vulnerabilities, pip-audit, supply chain security, library, development, safety
Your library relies on other packages. Learn how to use pip-audit to scan your dependencies for known security vulnerabilities and keep your users safe.
The Center of Your Python Project: Understanding pyproject.toml
January 26, 2025
Tags:python, packaging, pyproject.toml, pep517, pep518, pep621, setuptools, hatch, ruff, pytest, library, development
From setup.py chaos to TOML clarity. Learn why pyproject.toml exists, how it standardizes Python packaging and tool configuration, and what goes inside.
Introduction to Bandit Security Rules with Ruff: Finding Common Security Issues in Python Code
January 25, 2025
Tags:python, security, static analysis, ruff, bandit, sast, vulnerabilities, secure coding, library, development
Learn how to use Ruff's Bandit integration to automatically scan your Python code for common security pitfalls through static analysis.
Don't Forget the Fine Print: Licensing Your Python Library
January 24, 2025
Tags:python, licensing, open-source, legal, library, development, dependencies, compliance, mit, apache, gpl
Choosing an open-source license is crucial. Understand common licenses (MIT, Apache, GPL), why compatibility matters, and how to comply with obligations.
Building and Engaging a Community Around Your Open Source Library
January 22, 2025
Tags:python, open source, community, engagement, contribution, maintenance, github, library, development
An active community is vital for an open-source project's success. Learn practical steps to attract users, encourage contributions, and foster a welcoming environment.
The Library Author's Dilemma: Managing Python Dependencies
January 21, 2025
Tags:python, packaging, dependencies, library, development, pip, versions, best-practices
Choosing dependencies for your Python library is tricky. How do you decide what to include, what versions to support, and avoid creating headaches for your users?
Data Science Things Roundup #12
January 20, 2025
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Avoiding Common Pitfalls: Injection Flaws in Python Libraries
January 18, 2025
Tags:python, security, injection, sql injection, command injection, input validation, secure coding, library, development
Injection attacks (SQL injection, command injection) aren't just for web apps. Learn how improper input handling in your Python library can create vulnerabilities.
The Art of Saying No: Defining Your Python Library's Scope
January 17, 2025
Tags:python, library, design, scope, development, software-engineering
Why keeping your Python library focused is harder than it looks, and how saying 'no' can be your most powerful design tool.
SDLC in the Age of AI
January 12, 2025
Tags:ai, software-development, programming, prompt-engineering, data science
Exploring how AI is reshaping software development practices and the emerging role of prompt engineering
From Weekend Hack to Core Tool: The category_encoders Journey
December 27, 2022
Tags:open-source, python, data science, machine-learning
Reflecting on how a simple Python experiment grew into a fundamental data science library with millions of downloads
Investment Review: Seer.ai
October 18, 2022
Tags:investing, startups, angel-investing, deal-review, ai, machine-learning, analytics
A review of my angel investment in Seer.ai, exploring how they align with my investment thesis and their unique value proposition in AI-powered analytics.
Category Encoders v1.2.8 Release
June 4, 2018
Tags:python, data science, category-encoders, open-source
Release announcement for Category Encoders v1.2.8 with bugfixes and new features
TechEmergence Podcast and Atlanta AI Article
June 3, 2018
Tags:artificial-intelligence, podcast, atlanta, predictive-maintenance, data science
Discussing predictive maintenance and the Atlanta AI ecosystem on TechEmergence
Data Engineering Podcast
February 20, 2018
Tags:data-engineering, podcast, predictive-maintenance, industrial-iot, data science
Discussing data engineering, predictive maintenance, and industrial IoT on the Data Engineering Podcast
Category Encoders published in JOSS
January 26, 2018
Tags:python, data-science, category-encoders, open-source, academic, publication
Category Encoders package gets published in the Journal of Open Source Software (JOSS)
The Problem with Industrial IoT
January 16, 2018
Tags:iot, industry, technology, data science
A discussion of the challenges facing industrial IoT adoption and implementation
Revisiting Python support in Apache Flink
January 11, 2018
Tags:python, apache-flink, big-data, streaming, data science
A follow-up on the state of Python support in Apache Flink after a couple of years
Tendencies of Data Engineers and Scientists
January 9, 2018
Tags:data-engineering, data-science, team-organization, engineering-management, data science
Exploring the relationship dynamics and organizational challenges between data engineering and data science teams
I Made a Model, Now What?
January 4, 2018
Tags:data-science, machine-learning, production, pydata, data science, atlanta
Insights from a PyData Atlanta talk about successfully deploying and maintaining machine learning models in production
On taking things too seriously: holiday edition
December 9, 2017
Tags:data science, sports, python, rating systems
A deep dive into developing rating systems for college football bowl game predictions, including the development of elote, keeks, and keeks-elote packages
Elote: a python package of rating systems
December 6, 2017
Tags:python, data-science, machine-learning, rating-systems
Introducing Elote, a Python package for implementing various rating systems
Ripyr: sampled metrics on datasets using python's asyncio
November 28, 2017
Tags:data science, python, asyncio, type hinting, data processing
An introduction to ripyr, a Python library for streaming through large datasets and parsing basic metrics using asyncio and type hinting
Category Encoders v1.2.5 Release
November 22, 2017
Tags:python, category-encoders, open-source, data-science, machine-learning
Release notes for Category Encoders v1.2.5, highlighting community contributions and improvements
Data Science Things Roundup #11
September 23, 2017
Tags:data-science, roundup, visualization, bayesian, finance
A collection of interesting data science articles and projects, including SEC keynotes, Bayesian inference, and visualization tools
git-pandas Caching: Faster Analysis
July 25, 2017
Tags:python, git, pandas, data-analysis, performance
Improving performance in git-pandas with caching
Category Encoders v1.2.4 Release
July 12, 2017
Tags:python, data-science, machine-learning, category-encoders, release
New release of category_encoders with improved functionality and bug fixes
Data Science Things Roundup #10
April 19, 2017
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Data Science Things Roundup #9
March 12, 2017
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Data Science Things Roundup #8
January 25, 2017
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
BaseN Encoding Grid Search in Category Encoders
December 18, 2016
Tags:python, data-science, machine-learning, category-encoders
Exploring the BaseN encoder and grid search capabilities in category_encoders
Category Encoders accepted into scikit-learn-contrib
November 20, 2016
Tags:python, data-science, open-source, scikit-learn, category-encoders
Category Encoders joins the scikit-learn-contrib ecosystem
Data Science Things Roundup #7
November 10, 2016
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Category Encoders now on conda-forge
September 17, 2016
Tags:python, data-science, open-source, conda, category-encoders
Announcing the availability of category_encoders on conda-forge
Data Science Things Roundup #6
July 20, 2016
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Introducing unified glob-syntax in git-pandas
June 15, 2016
Tags:git-pandas, python, data science, glob
A new, more flexible way to specify file patterns in git-pandas using glob syntax
Parallelizing cumulative blame in git-pandas with joblib
June 12, 2016
Tags:git-pandas, python, data science, performance, joblib
Improving performance of git-pandas cumulative blame analysis using parallel processing with joblib
When do I work on what?
April 30, 2016
Tags:data science, git-pandas, dataviz, oss, python
Analyzing work patterns between open source and closed source projects using git-pandas
Estimating the time spent on a project with git-pandas
April 16, 2016
Tags:git-pandas, git, github, oss, python, time tracking
Using git-pandas to estimate development time from git commit history
Data Science Things Roundup #5
March 15, 2016
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Automating documentation workflow with sphinx and github pages
February 29, 2016
Tags:python, documentation, github, oss, sphinx
A guide to automating Sphinx documentation deployment to GitHub Pages
Pypi-publisher: a simple cli for publishing python libraries
February 24, 2016
Tags:python, packaging, cli, deployment, oss
Introducing pypi-publisher (ppp), a command-line tool for streamlining Python package publishing
Using survival analysis and git-pandas to estimate code quality
February 21, 2016
Tags:git-pandas, data analysis, data science, dataviz, git, github, oss, python
Using statistical survival analysis techniques with git-pandas to analyze code quality and ownership patterns
Git-pandas v1.0.0, or how to check for a stable release
February 2, 2016
Tags:git-pandas, datascience, dataviz, git, github, oss, python
Examining interface consistency and parameter naming in git-pandas for the v1.0.0 release
Github.com cumulative blame in 5 lines of python
January 31, 2016
Tags:git-pandas, data science, dataviz, git, github, pandas, python
Using git-pandas to visualize GitHub repository growth over time in just a few lines of code
Data-driven engineering team management with gitnoc and git-pandas
January 19, 2016
Tags:git-pandas, data visualization, dataviz, git, github, oss, python, software management
Using git-pandas and gitnoc to make data-driven engineering management decisions
Create organization-wide punchcards with git-pandas
January 17, 2016
Tags:git-pandas, data analysis, git, github, oss, python
Using git-pandas to create punchcard visualizations across multiple repositories
How to Write Comprehensions and Alienate People
January 8, 2016
Tags:python, programming, humor, best-practices, data science
A tongue-in-cheek guide to writing Python comprehensions that will make your colleagues question their life choices
Gitpandas v0.0.6: python 2.7, fileowners, file-wise blame and examples
January 7, 2016
Tags:git-pandas, projects, data analysis, data science, git, github, oss, python
New features and improvements in git-pandas v0.0.6 including Python 2.7 support and enhanced blame analysis
Git-Pandas v0.0.5: coverage.py, risk, and more
December 25, 2015
Tags:git-pandas, projects, data analysis, git, github, oss, pandas, python
New features in git-pandas including coverage.py support and risk analysis metrics
Common Data Pitfalls for Recurring Machine Learning Systems
December 20, 2015
Tags:data science, machine learning, analytics, data engineering
A reference guide to common data problems and limitations encountered when building recurring machine learning systems, supplementing the comprehensive guide to bad data with problems specific to recurring systems.
Visualize all of your git repositories with gitnoc and git-pandas
December 13, 2015
Tags:git-pandas, d3, data visualization, dataviz, flask, git, pandas, python, redis, rq
Using gitnoc and git-pandas to analyze and visualize git repositories at scale
CyberLaunch: An Accelerator for Machine Learning Companies
December 8, 2015
Tags:startups, atlanta, machine learning, accelerators, data science
A look at CyberLaunch, Atlanta's accelerator focused on machine learning and information security companies, and its role in the startup ecosystem.
Data Science Things Roundup #4
December 5, 2015
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Beyond One-Hot: An Exploration of Categorical Variables
November 29, 2015
Tags:machine learning, data science, categorical variables, feature engineering
A deep dive into different methods for encoding categorical variables in machine learning, exploring their benefits and trade-offs
Analyzing GitPython and Pandas With GitPandas
November 19, 2015
Tags:git-pandas, projects, data analysis, git, github, oss, pandas, python
Using git-pandas to analyze the repositories of GitPython and Pandas themselves
Create a pip-installable python package in 2 minutes
November 12, 2015
Tags:cookiecutter-pipproject, python, packaging, pip, oss
A quick guide to creating and publishing a Python package using cookiecutter-pipproject
Blame the world with git-pandas
November 10, 2015
Tags:git-pandas, dataviz, git, github, pandas, python
Introducing git-pandas: a pandas interface for git blame and repository analysis
Data Science vs. Data Engineering
October 31, 2015
Tags:data science, data engineering, career, technology, big data
Understanding the fundamental differences between data science and data engineering through the lens of methodology rather than tools
Data Science Things Roundup #3
September 10, 2015
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Data Science Things Roundup #2
May 20, 2015
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources
Data Science Things Roundup #1
February 15, 2015
Tags:data-science, machine-learning, roundup, resources
A collection of interesting data science articles and resources