Analyzing GitPython and Pandas With GitPandas
A couple of weeks ago I posted about a new open source Python library I started called git-pandas. The GitHub page for it is here:
https://github.com/wdm0006/git-pandas
The basic idea is to provide an interface to a git repository or collection of git repositories via pandas DataFrames. With this, we can do some interesting analysis. In this example we will analyze the two projects that make git-pandas possible: GitPython and pandas. To get started, make a new directory to put everything in, and clone the 3 repositories (we will use the bleeding edge version of git-pandas):
mkdir gitpandas_example
cd gitpandas_example
git clone https://github.com/gitpython-developers/GitPython.git
git clone https://github.com/pydata/pandas.git
git clone https://github.com/wdm0006/git-pandas.git
Now in git-pandas, in the examples folder, there is an example called bus_analysis.py. It contains the following script:
import os
from pandas import merge
from gitpandas import ProjectDirectory, Repository

__author__ = 'willmcginnis'


def get_interfaces():
    # the project directory is three levels up from this file: it holds all of the cloned repos
    project_path = str(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
    proj = ProjectDirectory(working_dir=project_path)
    pandas_repo = Repository(working_dir=project_path + os.sep + 'pandas')
    gitpython_repo = Repository(working_dir=project_path + os.sep + 'GitPython')
    return proj, pandas_repo, gitpython_repo


if __name__ == '__main__':
    project, pandas_repo, gitpython_repo = get_interfaces()

    # do some blaming
    shared_blame = project.blame(extensions=['py'])
    pandas_blame = pandas_repo.blame(extensions=['py'])
    gitpython_blame = gitpython_repo.blame(extensions=['py'])

    # figure out who is common between projects
    common = merge(pandas_blame, gitpython_blame, how='inner', left_index=True, right_index=True)
    common = common.rename(columns={'loc_x': 'pandas_loc', 'loc_y': 'gitpython_loc'})

    # figure out committer count from each
    pandas_ch = pandas_repo.commit_history('master', limit=None, extensions=['py'])
    gitpython_ch = gitpython_repo.commit_history('master', limit=None, extensions=['py'])

    # now print out some things
    print('Total Python LOC for 3 Projects Combined')
    print('\t%d' % (int(shared_blame['loc'].sum()), ))

    print('\nNumber of contributors per project')
    print('\tPandas: %d' % (len(set(pandas_ch['committer'].values))))
    print('\tGitPython: %d' % (len(set(gitpython_ch['committer'].values))))

    print('\nTop 10 Contributors Between Each')
    print(shared_blame.head(10))

    print('\nCommitters that committed to Both')
    print(common)

    print('\nTruck Count of Each')
    print('\tPandas: %d' % (pandas_repo.bus_factor(extensions=['py'])))
    print('\tGitPython: %d' % (gitpython_repo.bus_factor(extensions=['py'])))
This does a few things for us. First, we pull the commit history and the blame for pandas and GitPython, along with the blame for the directory as a whole (which includes all three projects: GitPython, git-pandas, and pandas).
Then we compute some interesting things using those datasets. At the end, we estimate the bus factor of each repository (labeled "Truck Count" in the output) as the number of contributors who account for 50% of all of the code. This is an extremely rough estimate of how many people would have to disappear (i.e. get hit by a bus) for the project to die.
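To make the 50% rule concrete, here is a minimal sketch of that calculation in plain pandas. It is not necessarily how git-pandas implements bus_factor internally: the helper name estimate_bus_factor, the threshold parameter, and the toy data are all made up for illustration. The only thing it assumes is a blame-style DataFrame with one row per contributor and a 'loc' column.

```python
import pandas as pd


def estimate_bus_factor(blame, threshold=0.5):
    """Smallest number of top contributors whose combined LOC reaches
    `threshold` of the total (hypothetical helper, not the git-pandas API)."""
    ranked = blame['loc'].sort_values(ascending=False)
    total = ranked.sum()
    if total == 0:
        return 0
    cumulative = ranked.cumsum()
    # contributors whose running total is still below the cutoff, plus the
    # one contributor who pushes the running total over it
    return int((cumulative < threshold * total).sum()) + 1


# toy blame data (invented): one row per contributor
concentrated = pd.DataFrame({'loc': [500, 300, 150, 50]},
                            index=['alice', 'bob', 'carol', 'dave'])
spread_out = pd.DataFrame({'loc': [400, 300, 200, 100]},
                          index=['alice', 'bob', 'carol', 'dave'])

print(estimate_bus_factor(concentrated))
print(estimate_bus_factor(spread_out))
```

In the first toy frame a single contributor already owns half of the lines, so the estimate is 1; in the second it takes the top two.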
If you run the script, you should see:
Total Python LOC for 3 Projects Combined
    284921

Number of contributors per project
    Pandas: 350
    GitPython: 70

Top 10 Contributors Between Each
name                 loc
Wes McKinney       64994
jreback            47357
Jeff Reback        21869
sinhrks            20126
Sebastian Thiel    15236
Phillip Cloud      13282
Chris Whelan        7864
Jeffrey Tratner     6933
y-p                 6053
Andy Hayden         5158

Committers that committed to Both
committer             pandas_loc    gitpython_loc
Yaroslav Halchenko            41               18

Truck Count of Each
    Pandas: 3
    GitPython: 1
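The "Committers that committed to Both" table comes from the inner join on the blame index in the script. As a small self-contained illustration of that step, the sketch below uses two made-up blame frames (the names and line counts are invented, and the 'committer' index name is an assumption based on the output above). Because both frames have a 'loc' column, pandas adds its default '_x' and '_y' suffixes, which is why the script renames 'loc_x' and 'loc_y'.

```python
import pandas as pd

# toy stand-ins for the two per-repository blame frames (invented data)
pandas_blame = pd.DataFrame(
    {'loc': [120, 41]},
    index=pd.Index(['alice', 'carol'], name='committer'))
gitpython_blame = pd.DataFrame(
    {'loc': [300, 18]},
    index=pd.Index(['bob', 'carol'], name='committer'))

# an inner join on the index keeps only names present in both frames
common = pd.merge(pandas_blame, gitpython_blame,
                  how='inner', left_index=True, right_index=True)

# the overlapping 'loc' columns come back as 'loc_x' (left) and 'loc_y' (right)
common = common.rename(columns={'loc_x': 'pandas_loc', 'loc_y': 'gitpython_loc'})
print(common)
```

Only the name that appears in both indexes survives the join, so the match depends on contributors using exactly the same name in both repositories.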
So there you have it: a nice analysis of project size, organizational support, and distribution of contribution in around 50 lines of painless Python. Development continues on git-pandas, so if you have any suggestions for new features, examples, use-cases or anything else, comment below or on GitHub:
https://github.com/wdm0006/git-pandas
Edit: I’ve since pushed a new release of git-pandas to PyPI (v0.0.3), so you can install the version used in this post with pip using the instructions in the docs/readme.