Who Holds the Keys? Calculating Bus Factor with GitPandas
Ever had that nagging feeling about “what if Alice from accounting gets hit by a bus?” Okay, maybe not exactly that, but the underlying concern is real in software projects too. What happens if a key developer leaves? How much knowledge walks out the door with them? This concept is often (somewhat morbidly) called the Bus Factor.
In essence, the bus factor is the minimum number of team members that have to suddenly disappear from a project before the project stalls due to lack of knowledgeable personnel. A low bus factor (like 1 or 2) indicates a high risk – too much critical knowledge is concentrated in too few people.
While it’s not a perfect science, gitpandas gives us a way to estimate this based on code contributions, specifically using the bus_factor() method.
How GitPandas Calculates Bus Factor
The idea is to look at who has contributed the most lines of code (based on git blame results). gitpandas identifies the top contributors and calculates how many of them you’d need to remove before less than 50% of the codebase is “covered” by the remaining contributors.
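To make that concrete, here is a minimal sketch of the calculation in plain Python. It is not gitpandas’ actual implementation: the function name estimate_bus_factor, the threshold parameter, and the per-author line counts are all made up for illustration, and treating exactly 50% as “covered” is an assumption.

```python
def estimate_bus_factor(loc_by_author, threshold=0.5):
    """Count how many top contributors are needed to cover `threshold` of all blamed lines."""
    total = sum(loc_by_author.values())
    if total == 0:
        return 0  # nothing to attribute, so no contributors are needed

    covered = 0
    # Walk the contributors from largest to smallest share of blamed lines
    ranked = sorted(loc_by_author.values(), reverse=True)
    for count, loc in enumerate(ranked, start=1):
        covered += loc
        if covered >= threshold * total:
            return count
    return len(loc_by_author)


# Hypothetical blame totals: lines currently attributed to each author
blame_totals = {"alice": 4200, "bob": 2500, "carol": 1800, "dave": 1500}
print(estimate_bus_factor(blame_totals))  # prints 2
```

Alice alone accounts for 42% of the lines, which falls short of half; Alice and Bob together account for 67%, so the estimate is 2.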
Let’s see it in action:
from gitpandas import Repository
# Point to your repo
repo_path = '/path/to/your/repo'
repo = Repository(repo_path)
# Calculate the bus factor
# You can ignore or include files using glob patterns.
# For example, to ignore markdown and text files, and the docs/ directory:
bus_factor_df = repo.bus_factor(ignore_globs=['*.md', '*.txt', 'docs/*'])
print("Bus Factor Analysis:")
print(bus_factor_df)
Interpreting the Results
The bus_factor() method now returns a pandas DataFrame with columns:
repository | bus factor |
---|---|
my-cool-repo-name | 2 |
- repository: The name of your repository.
- bus factor: The minimum number of contributors whose combined code contributions account for at least 50% of the codebase’s lines of code.
So, if your DataFrame says the bus factor is 2, that means you’d need to lose your top 2 contributors before less than half the codebase is “covered” by the remaining team. (Or, put another way: if both Alice and Bob get abducted by aliens, you might be in trouble.)
About the Arguments
- by: How to calculate the bus factor. Use ‘repository’ (the default) to calculate for the whole repo. (File-level isn’t implemented yet, so don’t get too fancy.)
- ignore_globs: List of glob patterns for files to ignore (e.g., docs, configs, or anything you don’t want to count).
- include_globs: List of glob patterns for files to include (optional; if set, only these files are considered).
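If you’re wondering how the two glob lists might interact, here is a rough sketch using Python’s standard fnmatch module. The helper name is_counted and the rule it applies (a file must match at least one include pattern and no ignore pattern) are assumptions for illustration, not a statement of how gitpandas filters files internally.

```python
from fnmatch import fnmatch


def is_counted(path, ignore_globs=None, include_globs=None):
    """Return True if `path` would be counted under the assumed filtering rule."""
    # Assumed rule: with no include patterns, every file is a candidate
    include_globs = include_globs or ["*"]
    ignore_globs = ignore_globs or []

    included = any(fnmatch(path, pattern) for pattern in include_globs)
    ignored = any(fnmatch(path, pattern) for pattern in ignore_globs)
    return included and not ignored


# Hypothetical file paths, checked against the patterns used earlier
ignore = ["*.md", "*.txt", "docs/*"]
for path in ["README.md", "src/app.py", "docs/conf.py", "notes.txt"]:
    print(path, is_counted(path, ignore_globs=ignore))
```

Note that fnmatch lets * match across slashes, so a pattern like docs/* also catches files in nested folders under docs/.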
Important Caveats
Okay, let’s be real. This is a proxy metric.
- Lines of Code != Knowledge: Someone might write tons of boilerplate or simple code, while another writes fewer lines of highly complex, critical logic. git blame doesn’t know the difference.
- Ownership vs. Understanding: git blame shows who last touched a line, not necessarily who understands it best or who originally designed it.
- Collaboration: It doesn’t account for knowledge shared through pairing, code reviews, documentation, or design discussions.
- The 50% Threshold: It’s an arbitrary default. You might consider a higher threshold more appropriate for your project’s risk tolerance.
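To get a feel for how much the threshold matters, you can rerun the arithmetic yourself at a few cut-offs. The sketch below works on made-up per-author line counts rather than a real repository, and the helper contributors_needed is hypothetical; the bus_factor() arguments described earlier don’t include a threshold, so this is a side calculation, not a gitpandas feature.

```python
from itertools import accumulate

# Hypothetical blamed-line totals per author (illustrative numbers only)
blame_totals = {"alice": 4200, "bob": 2500, "carol": 1800, "dave": 1500}

ranked = sorted(blame_totals.values(), reverse=True)
total = sum(ranked)
running = list(accumulate(ranked))  # lines covered by the top 1, 2, 3, ... authors


def contributors_needed(percent):
    """Smallest number of top authors covering at least `percent` of all lines."""
    return next(
        n for n, covered in enumerate(running, start=1)
        if covered * 100 >= percent * total
    )


for percent in (50, 75, 90):
    print(f"{percent}% threshold -> {contributors_needed(percent)} contributor(s)")
```

With these numbers the answer moves from 2 at 50% to 3 at 75% and 4 at 90%, which is a reminder that the headline figure depends heavily on where you draw the line.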
So, Is It Useful?
Despite the caveats, calculating the bus factor with gitpandas can be a useful conversation starter. A very low number (like 1) is often a red flag worth investigating. It can prompt questions like:
- Are we relying too heavily on one person for specific modules?
- Is our documentation good enough for someone else to pick up the work?
- Can we encourage more cross-functional work or code reviews to spread knowledge?
Don’t treat the number as absolute truth, but use it as another data point, alongside churn analysis and team discussions, to understand your project’s health and potential risks.