Analyzing GitPython and Pandas With GitPandas

A couple of weeks ago I posted about a new open source Python library I started called git-pandas. The GitHub page for it is here:

https://github.com/wdm0006/git-pandas

The basic idea is to provide an interface to a git repository or collection of git repositories via pandas DataFrames. With this, we can do some interesting analysis. In this example we will analyze the two projects that make git-pandas possible: GitPython and pandas. To get started, make a new directory to put everything in, and clone the 3 repositories (we will use the bleeding edge version of git-pandas):

mkdir gitpandas_example
cd gitpandas_example
git clone https://github.com/gitpython-developers/GitPython.git
git clone https://github.com/pydata/pandas.git
git clone https://github.com/wdm0006/git-pandas.git

The examples folder of git-pandas contains a script called bus_analysis.py:

import os
from pandas import merge
from gitpandas import ProjectDirectory, Repository

__author__ = 'willmcginnis'

def get_interfaces():
    project_path = str(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
    proj = ProjectDirectory(working_dir=project_path)

    pandas_repo = Repository(working_dir=project_path + os.sep + 'pandas')
    gitpython_repo = Repository(working_dir=project_path + os.sep + 'GitPython')

    return proj, pandas_repo, gitpython_repo

if __name__ == '__main__':
    project, pandas_repo, gitpython_repo = get_interfaces()

    # do some blaming
    shared_blame = project.blame(extensions=['py'])
    pandas_blame = pandas_repo.blame(extensions=['py'])
    gitpython_blame = gitpython_repo.blame(extensions=['py'])

    # figure out who is common between projects
    common = merge(pandas_blame, gitpython_blame, how='inner', left_index=True, right_index=True)
    common = common.rename(columns={'loc_x': 'pandas_loc', 'loc_y': 'gitpython_loc'})

    # figure out committer count from each
    pandas_ch = pandas_repo.commit_history('master', limit=None, extensions=['py'])
    gitpython_ch = gitpython_repo.commit_history('master', limit=None, extensions=['py'])

    # now print out some things
    print('Total Python LOC for 3 Projects Combined')
    print('\t%d' % (int(shared_blame['loc'].sum()), ))

    print('\nNumber of contributors per project')
    print('\tPandas: %d' % (len(set(pandas_ch['committer'].values))))
    print('\tGitPython: %d' % (len(set(gitpython_ch['committer'].values))))

    print('\nTop 10 Contributors Between Each')
    print(shared_blame.head(10))

    print('\nCommitters that committed to Both')
    print(common)

    print('\nTruck Count of Each')
    print('\tPandas: %d' % (pandas_repo.bus_factor(extensions=['py'])))
    print('\tGitPython: %d' % (gitpython_repo.bus_factor(extensions=['py'])))

This script does a few things for us. First, it pulls the commit history and the blame for pandas and GitPython, along with the blame for the directory as a whole (which includes all three projects: GitPython, git-pandas, and pandas).
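To make the "who committed to both" step more concrete, here is a small sketch of the same merge on toy data. It assumes the blame frames are indexed by committer name with a single loc column, which is what the script's use of left_index/right_index and the output further down suggest. The names and numbers here are made up for illustration.

```python
import pandas as pd

# Hypothetical blame-style frames: index is the committer, 'loc' is lines owned.
repo_a = pd.DataFrame(
    {"loc": [120, 41]},
    index=pd.Index(["alice", "carol"], name="committer"),
)
repo_b = pd.DataFrame(
    {"loc": [18, 300]},
    index=pd.Index(["carol", "dave"], name="committer"),
)

# An inner join on the index keeps only committers present in both frames.
# Both frames have a 'loc' column, so pandas suffixes them as loc_x and loc_y.
common = pd.merge(repo_a, repo_b, how="inner", left_index=True, right_index=True)
common = common.rename(columns={"loc_x": "repo_a_loc", "loc_y": "repo_b_loc"})

print(common)
```

Only carol survives the join, with one column per repository. That is the same shape as the "Committers that committed to Both" table in the script's output.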

Then we compute some interesting things using those datasets. At the end, we estimate the bus factor of each repository: the smallest number of contributors who together account for 50% of all of the code. This is an extremely rough estimate of how many people would have to disappear (i.e. get hit by a bus) for the project to die.
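The 50% rule can be sketched in a few lines of plain pandas. This is a rough illustration, not necessarily how git-pandas implements bus_factor internally. The function below and its threshold parameter are my own names, and the input is assumed to be a blame-style frame with a loc column.

```python
import pandas as pd


def bus_factor(blame: pd.DataFrame, threshold: float = 0.5) -> int:
    """Smallest number of top contributors whose lines reach `threshold` of the total.

    Illustrative helper only; assumes `blame` has a numeric 'loc' column.
    """
    loc = blame["loc"].sort_values(ascending=False)
    total = loc.sum()
    if total == 0:
        return 0
    # Running share of the codebase owned by the top-k contributors.
    cumulative_share = loc.cumsum() / total
    # Contributors still below the threshold, plus the one who crosses it.
    return int((cumulative_share < threshold).sum()) + 1


# Made-up data: one person owns 60% of the lines, so the bus factor is 1.
example_blame = pd.DataFrame(
    {"loc": [60, 25, 15]},
    index=pd.Index(["alice", "bob", "carol"], name="committer"),
)
print(bus_factor(example_blame))  # 1
```

A low number means the code is concentrated in very few hands, which is the situation a bus factor is meant to flag.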

If you run the script, you should see:

Total Python LOC for 3 Projects Combined
    284921

Number of contributors per project
    Pandas: 350
    GitPython: 70

Top 10 Contributors Between Each
name            loc
Wes McKinney    64994
jreback         47357
Jeff Reback     21869
sinhrks         20126
Sebastian Thiel 15236
Phillip Cloud   13282
Chris Whelan    7864
Jeffrey Tratner 6933
y-p             6053
Andy Hayden     5158

Committers that committed to Both
committer          pandas_loc   gitpython_loc
Yaroslav Halchenko   41           18

Truck Count of Each
    Pandas: 3
    GitPython: 1

So there you have it: a nice analysis of project size, organizational support, and distribution of contribution in around 50 lines of painless Python. Development continues on git-pandas, so if you have any suggestions for new features, examples, use cases, or anything else, comment below or on GitHub:

https://github.com/wdm0006/git-pandas

Edit: I’ve since pushed a new release of git-pandas to PyPI (v0.0.3), so you can install the version used in this post with pip by following the instructions in the docs/README.
