Exploring Ideas: A Blog on Technology, Startups, Food, and More

Welcome to my blog where I share thoughts and insights on technology, startups, and life in Atlanta. Browse through the articles below or explore by topic.

Data Science Things Roundup #11

September 23, 2017

Note: This post has been migrated from my previous blog. Some links to previous roundups or external resources may no longer be available. Once again it’s time for the data science things roundup: a few links to articles or projects I’ve stumbled across and found interesting. This is the 11th one in the extremely irregular series, so if you think it’s cool, check out some of the others. This time we’v...

Read more →

git-pandas Caching: Faster Analysis

July 25, 2017

The git-pandas library has been around for a while now, providing tools to analyze git repositories using pandas DataFrames. One of the common pieces of feedback has been about performance: analyzing large repositories can be slow, especially when running multiple analyses on the same data. To address this, I’ve added caching to the repository objects. This means that when you run an analysis, th...
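To illustrate the general idea of caching on a repository object, here is a minimal sketch in plain Python. It is not the git-pandas API: the `CachedRepo` class, its `analyze` method, and the `slow_history` loader are all hypothetical names invented for this example, and the real library's cache may work differently.

```python
from typing import Callable, Dict, Hashable, Tuple


class CachedRepo:
    """Hypothetical stand-in for a repository object (not the git-pandas API)."""

    def __init__(self, loader: Callable[[str], list]):
        self._loader = loader  # expensive function, e.g. one that walks git history
        self._cache: Dict[Tuple[str, Hashable], list] = {}
        self.loads = 0  # counts how often the expensive path actually runs

    def analyze(self, name: str, branch: str = "master") -> list:
        key = (name, branch)
        if key not in self._cache:  # compute only on a cache miss
            self.loads += 1
            self._cache[key] = self._loader(branch)
        return self._cache[key]


def slow_history(branch: str) -> list:
    # Placeholder for an expensive repository scan (assumption for illustration).
    return [f"{branch}-commit-{i}" for i in range(3)]


repo = CachedRepo(slow_history)
first = repo.analyze("commit_history")
second = repo.analyze("commit_history")  # served from the cache, no second scan
```

The trade-off in any scheme like this is staleness: a cached result stays the same even if the underlying repository changes, so a real implementation needs some way to expire or invalidate entries.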

Read more →

Category Encoders v1.2.4 Release

July 12, 2017

I’m happy to announce the release of category_encoders v1.2.4! This release includes several improvements and bug fixes to make the library more robust and easier to use. Key updates in this version: added support for pandas categorical types; improved handling of missing values across all encoders; better error messages when input validation fails; fixed several edge cases in the BaseN encoder; Updat...

Read more →

Data Science Things Roundup #10

April 19, 2017

Hey all, I haven’t done one of these in quite a while, but thought I’d share a few more articles I’ve found interesting recently. An analysis of Twitter influencers in the field of data science & big data: This is a pretty in-depth Medium article that goes through some of the concepts in network analysis, through the lens of Twitter data. It’s not an area I know a ton about, but I found it approach...

Read more →

Data Science Things Roundup #9

March 12, 2017

Things got a bit busy and I fell off the wagon on posting, but here we are, back for the ninth edition of the data science things roundup. If you haven’t seen previous editions, it’s basically just 3 data science or Python-related articles or packages that I’ve stumbled across recently and thought were interesting. This time we have a great paper and 2 Python packages, so dig in. A few useful things t...

Read more →

Data Science Things Roundup #8

January 25, 2017

Time again for the Tuesday-regular data science things roundup. I’ve definitely fallen into a repeatable routine here (3 new links every Tuesday at 10 EST), but may play with the format of this some in the future, so if you have any feedback (good, bad, or indifferent), leave a comment below. In previous editions we talked about some lower-level tooling and dataviz, but this week we swing back towa...

Read more →

BaseN Encoding Grid Search in Category Encoders

December 18, 2016

One of the more interesting encoders in the category_encoders library is the BaseN encoder. The idea behind it is to take a categorical variable and convert it into a small set of columns, similar to one-hot encoding, but with each column holding one digit of the category’s index in a chosen base. For example, if we have a categorical variable with 8 unique values, we could encode it in base 2 with 3 binary variables (2³ = 8). The advantage of ...
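The arithmetic in the excerpt above (8 categories fit in 3 base-2 columns) can be sketched in plain Python. This is only an illustration of the base-N idea, not the category_encoders implementation: the helper names `to_base_n` and `base_n_encode` are made up here, and numbering categories from 0 in order of first appearance is an assumption that the real encoder may not share.

```python
from typing import Dict, Hashable, List, Sequence


def to_base_n(index: int, base: int, width: int) -> List[int]:
    """Digits of `index` in `base`, most significant first, zero-padded to `width`."""
    digits = []
    for _ in range(width):
        digits.append(index % base)
        index //= base
    return digits[::-1]


def base_n_encode(values: Sequence[Hashable], base: int = 2) -> List[List[int]]:
    """Encode each value as the base-N digits of its category index (illustrative only)."""
    if base < 2:
        raise ValueError("this sketch only handles base >= 2")
    # Assumption: categories are numbered from 0 in order of first appearance.
    mapping: Dict[Hashable, int] = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    # Smallest number of digit columns that can represent every category index.
    width = 1
    while base ** width < len(mapping):
        width += 1
    return [to_base_n(mapping[value], base, width) for value in values]


categories = ["a", "b", "c", "d", "e", "f", "g", "h"]  # 8 unique values
encoded = base_n_encode(categories, base=2)    # 3 columns, since 2**3 == 8
encoded_b3 = base_n_encode(categories, base=3)  # 2 columns, since 3**2 >= 8
```

A higher base needs fewer columns for the same number of categories, at the cost of each column taking more than two distinct values.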

Read more →

Category Encoders accepted into scikit-learn-contrib

November 20, 2016

In the past I’ve posted a few times about a library I’m working on called category encoders. The idea is to provide a complete toolbox of scikit-learn compatible transformers for encoding categorical variables in different ways. Scikit-learn is an extremely popular Python package that extends NumPy and SciPy to provide rich machine learning functionality. It’s one of the most active p...

Read more →

Data Science Things Roundup #7

November 10, 2016

This week’s edition of the Data Science Things Roundup is pretty Python-heavy, as opposed to previous editions that were a bit more machine learning and dataviz heavy. At the end of the day, some kind of software is backing most of data science, so getting a bit lower level can be useful sometimes. This week we look at a couple of ways to increase performance in Python codebases and one way to gene...

Read more →

Category Encoders now on conda-forge

September 17, 2016

My scikit-learn compatible library of categorical data encoders (category_encoders) is now published on conda-forge! Conda, if you didn’t know, is an open source package manager for Python (and other things) developed primarily by Continuum Analytics. Thanks to Continuum developer @bollwyvl for doing pretty much all of the work to get it working. Check out the category_encoders feedstock here: htt...

Read more →
