How Developers Communicate on GitHub

This is the first of several posts in which I analyze data scraped from GitHub to learn more about developer communication. These posts are based on a research paper I wrote for a natural language processing course at Brigham Young University.

Natural Language in Code Repositories

GitHub is a massively popular online service that helps developers manage and collaborate on their code. But GitHub repositories contain more than just code; they’re also full of natural language.

Why? Large software projects require communication to ensure they don’t fall apart. This communication takes many forms, and two of the most common are code comments and GitHub issues.

Code comments are snippets of natural language embedded into source code files to clarify the purpose of the underlying code.

GitHub issues are discussions between software developers or users in GitHub’s issue tracking tool, typically for the purpose of reporting bugs, asking questions, or requesting features.

This post will answer the question, What are the high-level characteristics of natural language in GitHub repositories?

Methodology

To answer this question, I created a dataset consisting of two types of data, code comments and issue comments, from 19 GitHub repositories. Chosen projects are diverse in purpose, size, and programming language.

The full project list is Sitcom Simulator, Nodejs.org, sled, Auto1111SDK, Devika, System.Linq.Dynamic.Core, HTTPRequest, Flask, React, Turbo, Rails, Vue.js, MoviePy, Astro, htmx, Phoenix, Ethereum, Bootstrap, Django.

There’s a slight bias towards generative AI projects and web frameworks due to personal interests 🙃

I randomly sampled 1,500 code comments and 1,500 issue comments from each repository because I’ve already spent too much money on Colab Pro this year.
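For readers who want to reproduce the sampling step, here is a minimal sketch using only the standard library. The function name `sample_comments` and the fixed seed are my own assumptions for illustration, not the code behind this post.

```python
import random

def sample_comments(comments, k=1500, seed=0):
    """Return up to k comments chosen uniformly at random, without replacement."""
    # A fixed seed (an assumption) keeps the sample reproducible between runs.
    rng = random.Random(seed)
    comments = list(comments)
    if len(comments) <= k:
        # Small repositories may have fewer than k comments; keep them all.
        return comments
    return rng.sample(comments, k)
```

Repositories with fewer than 1,500 comments simply contribute everything they have in this sketch.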

Code Comments

I used the GitHub API, GitPython, and a slew of regular expressions to extract code comments from every commit in each of the 19 repositories.
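To give a rough idea of what regex-based comment extraction can look like, here is a simplified sketch. The patterns, the two supported file extensions, and the function name `extract_comments` are assumptions for illustration; the commit-walking step with GitPython is left out, and a real extractor needs many more rules.

```python
import re

# Naive patterns: they do not understand string literals, so a "#" or "//"
# inside a string (a URL, for example) is mistaken for a comment.
LINE_COMMENT = {
    ".py": re.compile(r"#[ \t]?(.*)$", re.MULTILINE),
    ".js": re.compile(r"//[ \t]?(.*)$", re.MULTILINE),
}
BLOCK_COMMENT = re.compile(r"/\*(.*?)\*/", re.DOTALL)

def extract_comments(source, extension):
    """Return the comment texts found in one source file's contents."""
    comments = []
    pattern = LINE_COMMENT.get(extension)
    if pattern is not None:
        comments += [m.group(1).strip() for m in pattern.finditer(source)]
    if extension == ".js":
        for m in BLOCK_COMMENT.finditer(source):
            # Drop the leading "*" that decorates each line of a block comment.
            lines = [ln.strip().lstrip("*").strip() for ln in m.group(1).splitlines()]
            comments.append(" ".join(ln for ln in lines if ln))
    # Discard empty comments such as a bare "#".
    return [c for c in comments if c]
```

The false positives inside string literals are exactly why "a slew" of expressions ends up being necessary in practice.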

I then used the BART model as a zero-shot classifier to group the comments into the following categories:

| Comment Type | Category | Real Example |
| --- | --- | --- |
| Code | Explanation | we can’t run the select function on the first tab |
| Code | Deprecated | DEPRECATED - Do not use if you can avoid. |
| Code | Future work | TODO - support more request types POST, PUT, DELETE, etc. |
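Zero-shot classification with BART can be sketched with the Hugging Face `transformers` pipeline. The checkpoint name `facebook/bart-large-mnli`, the exact label strings, and both function names are assumptions on my part; treat this as one plausible setup rather than the exact configuration used here.

```python
CODE_COMMENT_LABELS = ["explanation", "deprecated", "future work"]

def top_label(result):
    """Pick the highest-scoring label from one zero-shot pipeline result."""
    score, label = max(zip(result["scores"], result["labels"]))
    return label

def classify_comments(comments, labels=CODE_COMMENT_LABELS,
                      model_name="facebook/bart-large-mnli"):
    """Assign one label to each comment with a zero-shot classifier."""
    # Imported lazily so top_label works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=model_name)
    results = classifier(list(comments), candidate_labels=list(labels))
    # Some library versions return a single dict for a one-item list.
    if isinstance(results, dict):
        results = [results]
    return [top_label(r) for r in results]
```

Calling `classify_comments` downloads the model on first use, so it is best run on a machine with a GPU and a decent connection.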

Issue Comments

I used the GitHub API to extract the full issue comment history from each repository, and classified them into the following categories:

| Comment Type | Category | Real Example |
| --- | --- | --- |
| Issue | Question | @<person’s name> this is the exact issue I am facing. Did you find any solution to this? |
| Issue | Conclusion | This is a duplicate of #919 |
| Issue | Discussion | Sorry for the delay, I’ve approved but would like to give the chance for another reviewer to merge it. |
| Issue | Solution | You can leverage many of the latest models, paid and free through a single API at Openrouter. |
| Issue | Feature request | A low-hanging fruit and huge Feature boost would be adding Langsmith. Thanks! |
| Issue | Bug report | This bug still persists with 4-2-stable. |
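Pulling issue comments from the GitHub REST API can be sketched as follows. It uses the repository-wide issue comments endpoint with simple page-by-page fetching; the function names, the `max_pages` cap, and the choice of fields to keep are assumptions for illustration. Note that this endpoint also returns comments left on pull requests.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"

def comments_url(owner, repo, page=1, per_page=100):
    """Build the URL for one page of a repository's issue comments."""
    return (f"{API_ROOT}/repos/{owner}/{repo}/issues/comments"
            f"?per_page={per_page}&page={page}")

def fetch_issue_comments(owner, repo, token=None, max_pages=5):
    """Download up to max_pages pages of issue comments as simple dicts."""
    rows = []
    for page in range(1, max_pages + 1):
        request = urllib.request.Request(comments_url(owner, repo, page))
        request.add_header("Accept", "application/vnd.github+json")
        if token:
            # A personal access token raises the rate limit considerably.
            request.add_header("Authorization", f"Bearer {token}")
        with urllib.request.urlopen(request) as response:
            batch = json.load(response)
        if not batch:
            break  # an empty page means there is nothing left to fetch
        rows += [{"body": c.get("body") or "", "created_at": c["created_at"]}
                 for c in batch]
    return rows
```

Unauthenticated requests hit the rate limit quickly, so passing a token is strongly recommended for anything beyond a quick test.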

I also did sentiment analysis, classifying each comment as POSITIVE or NEGATIVE using the DistilBERT model.
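A sentiment pass of this kind can be sketched with the same `transformers` pipeline API. The checkpoint `distilbert-base-uncased-finetuned-sst-2-english` is an assumption (it is a common DistilBERT sentiment model that emits POSITIVE and NEGATIVE labels), and the function names are hypothetical.

```python
def label_sentiment(comments,
                    model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Return "POSITIVE" or "NEGATIVE" for each comment."""
    # Imported lazily so positive_ratio works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis", model=model_name)
    # Long issue comments can exceed the model's input limit, so truncate them.
    results = classifier(list(comments), truncation=True)
    return [r["label"] for r in results]

def positive_ratio(labels):
    """Share of comments labelled POSITIVE; 0.0 for an empty list."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "POSITIVE") / len(labels)
```

The ratio helper is the quantity you would plot per year to see how tone changes over time.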

Results

I aggregated the above data into a single dataset and created some visualizations to gain a broad understanding of what technical communication looks like on GitHub.

Code Comment Categories Over Time

Graph of code comment categories over time

The relative frequency of each code comment type has remained mostly static over time.
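The "relative frequency" here is each category's share of that year's comments. A minimal sketch of the aggregation, assuming the data has been reduced to `(year, category)` pairs (a hypothetical layout, as is the function name):

```python
from collections import Counter, defaultdict

def category_share_by_year(rows):
    """Map each year to {category: fraction of that year's comments}."""
    counts = defaultdict(Counter)
    for year, category in rows:
        counts[year][category] += 1

    shares = {}
    for year, counter in sorted(counts.items()):
        total = sum(counter.values())
        # Normalizing per year makes years with different volumes comparable.
        shares[year] = {cat: n / total for cat, n in counter.items()}
    return shares
```

Plotting these shares as stacked areas, one band per category, gives a chart like the one above.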

Explanations are by far the most common purpose for code comments, accounting for over 80% of the total. Deprecation and future work comments are almost equal in frequency, but future work takes a slight lead. This suggests that software projects tend to grow over time, with more features being created than destroyed.

Issue Comment Categories Over Time

Graph of issue comment categories over time

Similarly, issue comment purpose ratios have largely stayed the same over time.

More than half of issue comments are categorized as questions, but fewer than 10% of comments are classified as solutions. Assuming the classification model is reasonably accurate, this indicates that many questions raised in GitHub issues go unanswered.

The instability of the values from 2013 to 2015 is due to the smaller amount of data available for those years.

Sentiment Analysis Over Time

Ratio of positive comments over time

The tone of technical communication is mostly negative. This might be because code errors and bugs are among the most frequently discussed topics. For example, the top twenty unigrams and bigrams include error, problem, breaking change, and doesn’t work.

However, this negative slant may also be due to model limitations. Researchers have found that standard sentiment analysis tools often struggle with technical language.

N-Gram Analysis

To get a feel for what words developers use, I calculated the most common unigrams and bigrams for both code and issue comments.

The results show that code comments are more technical in nature than issue comments. However, hyperlinks to GitHub.com rank near the top of the bigram lists for both.
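Counting unigrams and bigrams can be sketched with the standard library alone. The tokenizer below keeps only words of two or more characters and drops a tiny stop-word list; both choices are assumptions, picked because they would produce terms like "github com" and "don think". The real stop-word list would be much longer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the full list).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it",
              "this", "that", "for", "on", "be", "we", "you"}
# Words of two or more characters, so "don't" becomes "don" and "t" is dropped.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return its non-stop-word tokens."""
    return [t for t in TOKEN.findall(text.lower()) if t not in STOP_WORDS]

def top_ngrams(comments, n=1, k=5):
    """Return the k most common n-grams as (term, count) pairs."""
    counts = Counter()
    for comment in comments:
        tokens = tokenize(comment)
        # N-grams never span two comments because each is tokenized separately.
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)
```

With `n=1` this yields unigrams and with `n=2` bigrams; any URL in a comment contributes "https github" and "github com" as a pair, which helps explain why they show up together in the rankings.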

Top Code Comment Unigrams

| Rank | Term | Count |
| --- | --- | --- |
| 1 | use | 512 |
| 2 | object | 472 |
| 3 | function | 471 |
| 4 | value | 460 |
| 5 | set | 440 |

Interestingly, the unigrams object and function are used almost equally. This suggests that neither nouns nor verbs take precedence in the minds of developers. It may also suggest that neither object-oriented nor functional programming is inherently more “natural” than the other.

Top Code Comment Bigrams

| Rank | Term | Count |
| --- | --- | --- |
| 1 | make sure | 90 |
| 2 | github com | 57 |
| 3 | license bsd | 56 |
| 4 | copyright 2010 | 40 |
| 5 | return value | 39 |

Flask was apparently obsessed with copyrighting itself in 2010.

Top Issue Comment Unigrams

| Rank | Term | Count |
| --- | --- | --- |
| 1 | issue | 1721 |
| 2 | like | 1095 |
| 3 | think | 1076 |
| 4 | use | 1013 |
| 5 | just | 1008 |

Right away you can see that issues are more natural in tone than code comments.

The unigram issue appears more than one and a half times as often as the runner-up, like. This is likely because issue has two meanings in software development: it can refer to a generic problem with the software or it can refer to a post on GitHub Issues.

Top Issue Comment Bigrams

| Rank | Term | Count |
| --- | --- | --- |
| 1 | github com | 487 |
| 2 | https github | 462 |
| 3 | don think | 145 |
| 4 | looks like | 133 |
| 5 | use case | 131 |

Unlike in code comments, where make sure took the top spot, GitHub links are by far the most common phrase here.

And that concludes the high-level portion of my analysis!

Why It Matters

Analyzing how developers communicate on GitHub reveals a lot about project management and community dynamics. Or something. ChatGPT wrote that. I don’t care if this project is useful; I did it on a whim.

In future posts, I’ll explore how these communication styles vary between projects and how much they impact project growth and popularity. Stay tuned!