
Analyzing a codebase with git before reading a single file

https://piechowski.io/post/git-commands-before-reading-code/

The files that change the most

git log --format=format: --name-only --since="1 year ago" | sort | uniq -c | sort -nr | head -20

Piechowski explains that this one-liner surfaces the files with the most "churn", i.e., the modules everyone touches most often:

I run this from app/ or src/, not the repo root. Lockfiles, changelogs, and generated code will dominate the list otherwise.

The 20 most-changed files in the last year. The file at the top is almost always the one people warn me about. “Oh yeah, that file. Everyone’s afraid to touch it.”
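One wrinkle I'd expect with the one-liner as written: `--format=format:` leaves blank lines between commits, and those get counted like any other line, so the first row of output is likely a large count with no file name next to it. Below is a minimal sketch that drops them. The helper name `count_churn` is my own invention, not something from the post.

```shell
# count_churn: hypothetical helper (the name is an assumption, not from the post).
# Reads file paths on stdin, one per line, drops blank lines,
# and prints the 20 most frequent paths with their counts.
count_churn() {
  grep -v '^$' | sort | uniq -c | sort -nr | head -20
}

# Usage, run from app/ or src/ as the post suggests:
#   git log --format=format: --name-only --since="1 year ago" | count_churn
```

The ranking itself is unchanged; only the empty row disappears.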

Suggested reading from this part of the post: https://pragprog.com/titles/atcrime2/your-code-as-a-crime-scene-second-edition/

The most frequent contributors

git shortlog -sn --no-merges

This gives you a good idea of who is likely to know the most about the codebase, based on how many commits they have made to it. Piechowski adds some nuance to this command:

If the top contributor from the overall shortlog doesn’t appear in a 6-month window (git shortlog -sn --no-merges --since="6 months ago"), I flag that to the client immediately.

I also look at the tail. Thirty contributors but only three active in the last year. The people who built this system aren’t the people maintaining it.

One caveat: squash-merge workflows compress authorship. If the team squashes every PR into a single commit, this output reflects who merged, not who wrote. Worth asking about the merge strategy before drawing conclusions.
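The "top contributor missing from the 6-month window" check can be scripted. This is only a sketch of how I might do it: the helper name `top_contributor_active` and the file names `all.txt` and `recent.txt` are made up for illustration. It assumes the usual `git shortlog -sn` row layout of a count, a tab, then the author name.

```shell
# top_contributor_active: hypothetical helper (not from the post).
# $1 = file holding `git shortlog -sn --no-merges` output for the full history
# $2 = file holding the same output limited to --since="6 months ago"
# Prints the all-time top author and whether they show up in the recent list.
top_contributor_active() {
  # Rows look like "<count><TAB><name>"; take the name from the first row.
  top=$(head -n 1 "$1" | cut -f2-)
  [ -n "$top" ] || return 0   # nothing to report for an empty shortlog
  if cut -f2- "$2" | grep -Fxq -- "$top"; then
    echo "$top: active in the recent window"
  else
    echo "$top: absent from the recent window"
  fi
}

# Usage (file names are placeholders):
#   git shortlog -sn --no-merges HEAD > all.txt
#   git shortlog -sn --no-merges --since="6 months ago" HEAD > recent.txt
#   top_contributor_active all.txt recent.txt
```

I pass `HEAD` explicitly because `git shortlog` reads a log from standard input when no revision is given and stdin is not a terminal, which can produce empty output inside scripts. The squash-merge caveat still applies: the script compares names, and the names are only as trustworthy as the merge strategy.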

Where the problems tend to happen

git log -i -E --grep="fix|bug|broken" --name-only --format='' | sort | uniq -c | sort -nr | head -20

I had a feeling about this as soon as I saw the "grep" part, and Piechowski confirms it: how useful this command is depends entirely on how your team writes commit messages.
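Since the result hinges on commit message habits, a quick sanity check is to see how many commit subjects match the pattern at all before trusting the file ranking. A rough sketch follows; `keyword_coverage` is a name I made up, and it looks at subject lines only, whereas `--grep` searches the whole commit message, so treat the number as an approximation.

```shell
# keyword_coverage: hypothetical helper (not from the post).
# Reads commit subjects on stdin, one per line, and reports how many
# contain fix, bug, or broken (case-insensitive substring match).
keyword_coverage() {
  awk '{ total++; if (tolower($0) ~ /fix|bug|broken/) hits++ }
       END { printf "%d of %d commit subjects match\n", hits, total }'
}

# Usage:
#   git log --format='%s' | keyword_coverage
```

A very low share suggests the team does not label fixes in commit messages, so the hotspot list will be thin. A very high share may mean the pattern is too loose: it is a substring match, so "prefix" and "debug" count as hits too.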

Velocity

git log --format='%ad' --date=format:'%Y-%m' | sort | uniq -c

Commit count by month, for the entire history of the repo. I scan the output looking for shapes. A steady rhythm is healthy. But what does it look like when the count drops by half in a single month? Usually someone left. A declining curve over 6 to 12 months tells you the team is losing momentum. Periodic spikes followed by quiet months means the team batches work into releases instead of shipping continuously.

I once showed a CTO their commit velocity chart and they said “that’s when we lost our second senior engineer.” They hadn’t connected the timeline before. This is team data, not code data.
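The "count drops by half in a single month" shape is easy to miss when scanning a long history by eye, so here is a small sketch that flags it. The helper name `flag_drops` and the halving threshold as coded are my own assumptions. It expects the `uniq -c` output of the velocity command: a count followed by a YYYY-MM label, oldest month first.

```shell
# flag_drops: hypothetical helper (not from the post).
# Reads "<count> <YYYY-MM>" rows on stdin in chronological order and prints
# every month whose count is half or less of the previous listed month.
# Caveat: months with zero commits never appear in git log output,
# so a gap between labels needs a manual look.
flag_drops() {
  awk 'NR > 1 && prev > 0 && $1 * 2 <= prev { print $2 ": " prev " -> " $1 }
       { prev = $1 }'
}

# Usage:
#   git log --format='%ad' --date=format:'%Y-%m' | sort | uniq -c | flag_drops
```

It only points at months worth asking about; the explanation (someone left, a release freeze, a holiday) still has to come from the team.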

How often is the team firefighting

git log --oneline --since="1 year ago" | grep -iE 'revert|hotfix|emergency|rollback'

A handful over a year is normal. Reverts every couple of weeks means the team doesn’t trust its deploy process.

I'm a big fan of the term "crisis patterns" for describing these commits.
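To tell "a handful over a year" from "reverts every couple of weeks", it helps to bucket the matches by month instead of reading a flat list. A sketch under my own assumptions: the helper name `firefights_by_month` is invented, and it expects each input line to start with a YYYY-MM label followed by the commit subject.

```shell
# firefights_by_month: hypothetical helper (not from the post).
# Reads "<YYYY-MM> <subject>" lines on stdin, keeps the ones matching the
# crisis keywords from the post, and prints a count per month.
firefights_by_month() {
  grep -iE 'revert|hotfix|emergency|rollback' | cut -d' ' -f1 | sort | uniq -c
}

# Usage:
#   git log --since="1 year ago" --format='%ad %s' --date=format:'%Y-%m' \
#     | firefights_by_month
```

Months without any matching commit are simply absent from the output, so a short list is the good outcome here.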