I found it curious that a log-normal distribution fits both tweets and lines of code. Doing some digging, I found some literature on message lengths on the internet. One study finds that, across languages and mediums, comment lengths follow a log-normal distribution quite closely. The authors propose a mechanism related to the Weber-Fechner law, which holds that perceived intensity scales logarithmically with the magnitude of a stimulus. It seems reasonable that lengths of code lines would respond to the same mechanism.
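To make the log-normal idea concrete, here is a minimal sketch of how one might estimate the two parameters from a set of line lengths. It is not the fitting procedure used for the plots above; the function names `line_lengths` and `fit_lognormal` are my own, and I am assuming that blank lines are skipped and that indentation counts toward a line's length.

```python
import math
import statistics


def line_lengths(source):
    """Return the length of each non-blank line in a source string.

    Leading indentation is counted as part of the line (an assumption).
    """
    return [len(line) for line in source.splitlines() if line.strip()]


def fit_lognormal(lengths):
    """Estimate log-normal parameters (mu, sigma) from positive lengths.

    If x is log-normal then log(x) is normal, so the maximum-likelihood
    estimates are the mean and population standard deviation of log(x).
    """
    logs = [math.log(n) for n in lengths if n > 0]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)
    return mu, sigma


if __name__ == "__main__":
    sample = "import os\n\ndef f(x):\n    return os.path.join(x, 'data')\n"
    mu, sigma = fit_lognormal(line_lengths(sample))
    # exp(mu) is the median of the fitted log-normal distribution
    print(f"mu={mu:.3f}, sigma={sigma:.3f}, median length={math.exp(mu):.1f}")
```

Under this parameterization, `exp(mu)` is the median line length of the fitted distribution, which makes it a convenient single number for comparing packages.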
As for the data, I think it’s quite interesting what this reveals about each project’s commitment to the PEP8 line-length limit. Pandas, Scikit-Learn, and Matplotlib seem to be strongly committed to keeping their lines within 79 characters; by contrast, AstroPy doesn’t seem to mind the occasional long line (though it does still display a “cramming” pattern, to use the parlance of the Twitter team’s analysis).
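One rough way to put numbers on this commitment is to measure what fraction of lines exceed the limit and what fraction sit just under it. The sketch below is only an illustration: the function name `pep8_summary` and the five-character `window` used to define "just under the limit" are arbitrary choices of mine, not something taken from the analysis above.

```python
def pep8_summary(lengths, limit=79, window=5):
    """Summarize how a list of line lengths relates to a length limit.

    over_limit: fraction of lines strictly longer than `limit`.
    near_limit: fraction of lines within `window` characters below the
                limit (inclusive of the limit), a crude "cramming" proxy.
    """
    total = len(lengths)
    if total == 0:
        return {"over_limit": 0.0, "near_limit": 0.0}
    over = sum(1 for n in lengths if n > limit)
    near = sum(1 for n in lengths if limit - window < n <= limit)
    return {"over_limit": over / total, "near_limit": near / total}


if __name__ == "__main__":
    print(pep8_summary([10, 76, 79, 80, 100]))
```

A package that enforces the limit strictly would show an `over_limit` value near zero, while a pronounced cramming effect would show up as a `near_limit` fraction well above what a smooth log-normal curve predicts for that range.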
Comparing the summary statistics between packages, it is interesting to note the relative similarity of the numpy and scipy packages in terms of distribution of line lengths. This makes a lot of sense, because historically the development team has overlapped strongly between these two packages.
By contrast, scikit-learn tends to have around 10 more characters per line, and a much wider distribution of typical line lengths. I suspect the length of the lines is due to the nature of scikit-learn’s code: it has relatively long class names (e.g. RandomForestClassifier) which are used frequently throughout the code. The prevalence of classes also adds a four-space indentation to much of the package.
As for the larger standard deviation, this may be due to the larger pool of contributors: scikit-learn has about twice as many contributors as either numpy or scipy.
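For readers who want to reproduce this kind of package-to-package comparison, the following is a minimal sketch of collecting the mean and standard deviation of line lengths from a source tree. It is an assumption on my part about how such statistics could be gathered, not the exact code behind the numbers discussed here; `package_line_stats` is a hypothetical helper, and it skips blank lines and counts indentation.

```python
import statistics
from pathlib import Path


def package_line_stats(root):
    """Compute line-length statistics over every .py file under `root`.

    Blank lines are skipped; indentation counts toward the length.
    Returns the number of lines, their mean length, and the population
    standard deviation.
    """
    lengths = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        lengths.extend(len(line) for line in text.splitlines() if line.strip())
    if not lengths:
        return {"lines": 0, "mean": 0.0, "std": 0.0}
    return {
        "lines": len(lengths),
        "mean": statistics.fmean(lengths),
        "std": statistics.pstdev(lengths),
    }


if __name__ == "__main__":
    # Point this at any local checkout; "." is just a placeholder path.
    print(package_line_stats("."))
```

Running this against local checkouts of two packages gives a quick side-by-side of the kind described above, though the results will depend on which version of each package is checked out and whether test and vendored files are included.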
Pandas is interesting in that it doesn’t show the pronounced “cramming” effect seen in the other packages. Long lines seem to be broken up in a more distributed way, perhaps through assignment of temporary variables rather than through line breaks.
Each package displays a distinct “fingerprint” regarding the lengths of code lines, and the above visualizations suggest that PEP8’s restrictions really do affect the way people write code, particularly in packages that use more characters per line, such as Pandas and Scikit-Learn.