git-pandas Caching: Faster Analysis
The git-pandas library has been around for a while now, providing tools to analyze git repositories using pandas DataFrames. A common piece of feedback has been about performance: analyzing large repositories can be slow, especially when running multiple analyses over the same data.
To address this, I’ve added caching to the repository objects. When you attach a cache backend and run an analysis, the resulting DataFrame is stored in that cache; subsequent analyses that need the same data use the cached version instead of re-querying git, which can be a significant speedup.
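Under the hood the idea is plain memoization: key the cache on the method name and its arguments, and return the stored result on a hit. Here’s a minimal sketch of that pattern (illustrative names only, not git-pandas’s actual internals):

import functools

def memoize_method(func):
    # Key the cache on the method name plus its arguments
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        key = (func.__name__, args, tuple(sorted(kwargs.items())))
        if key not in self._cache:
            self._cache[key] = func(self, *args, **kwargs)  # the expensive call runs once
        return self._cache[key]
    return wrapper

class ToyRepo:
    def __init__(self):
        self._cache = {}

    @memoize_method
    def commit_history(self, branch='master'):
        print('querying git...')  # stands in for the slow git walk
        return [branch, 'pretend-dataframe']

repo = ToyRepo()
repo.commit_history()  # prints 'querying git...'
repo.commit_history()  # silent: served from the cache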
Here’s a quick example of how it works:
from gitpandas import Repository
from gitpandas.cache import EphemeralCache

# Create a repository object with an in-memory cache backend
cache = EphemeralCache()
repo = Repository('path/to/repo', cache_backend=cache)

# First call - queries git and caches the resulting DataFrame
df = repo.commit_history()

# Second call - served from the cache
df = repo.commit_history()  # Much faster!

# If you need fresh data, start over with an empty cache backend
repo = Repository('path/to/repo', cache_backend=EphemeralCache())
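Caching across sessions is possible too: there’s a Redis-backed cache that stores the DataFrames outside the process. A quick sketch, assuming a Redis server running on the default local port:

from gitpandas import Repository
from gitpandas.cache import RedisDFCache

# Redis-backed cache: survives interpreter restarts and can be shared
cache = RedisDFCache(host='localhost', port=6379)
repo = Repository('path/to/repo', cache_backend=cache)
df = repo.commit_history()  # this result now lives in Redis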
The caching is particularly useful when you’re doing interactive analysis or running multiple queries that use the same base data. For example, if you’re analyzing commit patterns and author activity, you can now do this much more efficiently:
# These all reuse the same cached commit history
commits_by_author = repo.commit_history().groupby('author').size()
commits_by_month = repo.commit_history().resample('M').size()  # the frame is indexed by commit date

# File-level changes come from a separate (also cached) query
file_changes = repo.file_change_history().groupby('filename').size()
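If you want to see the difference on your own repository, a quick timing check makes it obvious (absolute numbers will vary with repository size):

import time

# Fresh repository object so the first call is genuinely uncached
repo = Repository('path/to/repo', cache_backend=EphemeralCache())

start = time.perf_counter()
repo.commit_history()  # cold: walks the git log
print(f'cold call: {time.perf_counter() - start:.3f}s')

start = time.perf_counter()
repo.commit_history()  # warm: served from the cache
print(f'warm call: {time.perf_counter() - start:.3f}s')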
This update should make git-pandas much more practical for analyzing larger repositories or doing more complex analyses. Let me know if you run into any issues or have suggestions for further improvements.