Benjamin Cabé

How many lines of open source code are hosted at the Eclipse Foundation?

Spoiler alert: 162 million!

That’s right, as of August 1st, there are 330 active open-source projects hosted at the Eclipse Foundation and if you look across the 1120 Git repositories that this represents, you will find over 162 million physical source lines of code. But beyond this number, let’s look at how it was obtained, and what it really means.

I’ve blogged several times about the importance of using metrics to monitor the health (and hopefully, growth!) of an open source project/community, and lines of code are just one. You should always have other metrics on your radar like the number of contributors, diversity, etc.

There are many ways, and many tools available out there, to count source lines of code. Open Hub (previously known as Ohloh) used to be a really good tool, but it doesn’t seem to be actively maintained. For a few years now, I’ve been relying on a home-made script to analyze Eclipse IoT projects, and it’s only recently that I realized I should probably run it against the entire eclipse.org codebase!

In this blog post, I will briefly talk about how the aforementioned script works, why you should make sure to take these metrics with a pinch of salt and finally, go through some noteworthy findings.

Line counting process

The script used to count the number of lines of code is available on GitHub. It takes a list of Eclipse project identifiers (e.g. ‘iot.paho’) and a time range as input, and outputs a consolidated CSV file.

The main script (main.js) uses the Eclipse Project Management Infrastructure (PMI) API to retrieve the list of Git repositories for the requested projects and then proceeds to clone the repos and run the cloc command-line tool against each repo. The script also allows computing the statistics for a given time period, in which case it looks at the state of each repository at the beginning of each month for that period.
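To make the clone-and-count step a bit more concrete, here is a minimal Python sketch of it. It is not the actual main.js: the function names (summarize_cloc, count_repo) are made up for illustration, the PMI lookup is left out so the repository URL has to be passed in directly, and it assumes that git and cloc are both available on the PATH.

```python
import json
import subprocess
import tempfile


def summarize_cloc(cloc_json: str) -> dict:
    """Turn cloc's JSON report into {language: code_lines}.

    cloc's JSON output has one key per language, plus "header" (run
    metadata) and "SUM" (totals), which are skipped here.
    """
    if not cloc_json.strip():
        # cloc prints nothing when it finds no recognized source files.
        return {}
    report = json.loads(cloc_json)
    return {
        language: stats["code"]
        for language, stats in report.items()
        if language not in ("header", "SUM")
    }


def count_repo(git_url: str) -> dict:
    """Shallow-clone a repository and count its lines with cloc."""
    with tempfile.TemporaryDirectory() as workdir:
        # A depth-1 clone is enough when only the current state matters.
        subprocess.run(
            ["git", "clone", "--quiet", "--depth", "1", git_url, workdir],
            check=True,
        )
        result = subprocess.run(
            ["cloc", "--json", "--quiet", workdir],
            check=True, capture_output=True, text=True,
        )
    return summarize_cloc(result.stdout)
```

Calling count_repo with the clone URL of a repository returns a dictionary mapping each detected language to its number of code lines, with blank and comment lines left out.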

Once the main script has completed (and it can obviously take quite some time), the csv-concat.js script can be used to consolidate all the produced metric files into one single CSV file. That file contains the detailed breakout of lines of code per project and per programming language, the affiliation of the project to a particular top-level project, the number of blank or comment lines, etc. It is pretty easy to then feed this CSV into Excel or Google Spreadsheets, and use it as the source for building pivot tables for specific breakouts.
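If you would rather stay out of a spreadsheet, the consolidation and the pivot can be sketched in a few lines of Python. This is only an illustration of the idea, not a port of csv-concat.js: the column names (project, language, code) and the numbers in the sample data are made up, and the sketch assumes that all input files share the same header.

```python
import csv
import io
from collections import defaultdict


def concat_csv(sources, out):
    """Concatenate CSV files that share a header, writing the header once."""
    writer = None
    for source in sources:
        reader = csv.DictReader(source)
        for row in reader:
            if writer is None:
                writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                writer.writeheader()
            writer.writerow(row)


def pivot_by(rows, key, value="code"):
    """Sum a numeric column per distinct value of `key`, like a pivot table."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row[value])
    return dict(totals)


# Two tiny in-memory "metric files" with made-up numbers.
paho = io.StringIO("project,language,code\niot.paho,Java,100\niot.paho,C,40\n")
kura = io.StringIO("project,language,code\niot.kura,Java,250\n")

merged = io.StringIO()
concat_csv([paho, kura], merged)
merged.seek(0)
rows = list(csv.DictReader(merged))

by_language = pivot_by(rows, "language")
by_project = pivot_by(rows, "project")
print(by_language)
print(by_project)
```

The same pivot_by helper works for any column of the consolidated file, so a breakout per top-level project only needs a different key.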

Caveats

Just like virtually any KPI, you want to take the number of lines of code in your project with a grain of salt. Here are a few things to keep in mind:

All lines of code are not created equal

There is an incredible diversity of projects at Eclipse, and while a majority use Java as their main programming language, there’s also a lot of C, C++, Python, JavaScript, and more. 10M lines of Java code probably don’t carry the same value (i.e. how much effort has been needed to produce them) as 10M lines of C code.

Trends are more important than snapshots

It is nice to know that as of today there are 162 million lines of code in the Eclipse repositories, but it is, in my opinion, more important to look at trends over time. Is a particular programming language becoming more popular? Are all the top-level projects equally active?

I didn’t have a chance to run the scripts for a longer time period yet, but I will make sure to share the results when I get a chance!
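For the curious, the monthly snapshots that the script can take could be approximated along these lines. This is a sketch under assumptions rather than the script's actual logic: the helper names (month_starts, commit_before) are hypothetical, and it relies on a full (not shallow) local clone with git on the PATH.

```python
import subprocess
from datetime import date


def month_starts(start: date, end: date) -> list:
    """Return the first day of every month from start's month to end's month."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def commit_before(repo_dir: str, day: date, ref: str = "HEAD") -> str:
    """Return the last commit reachable from `ref` before `day` ("" if none).

    The time is spelled out as midnight (local time) so that git does not
    fill in the current time of day.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "-n", "1",
         f"--before={day.isoformat()} 00:00:00", ref],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

For each date returned by month_starts, one would look up the matching commit, check it out, and run the line counter on that state of the repository. An empty string means the repository had no commits yet at that point.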

Generated code, should it count?

There is a fair amount of generated code in some projects (in the Modeling top-level project in particular, of course), which certainly accounts for a few million lines of code. However, generated code is often customized afterwards, so I think it doesn’t necessarily skew the numbers as much as one would think.

Development does not always happen in a single branch

My script just looks at the code stored in the main (HEAD) branch of the Git repository. Some projects have more than one development stream, for example a “develop” branch that is ahead of the main stable branch. Therefore, there is very likely more code in our repositories than what this quick analysis shows.

Additional findings

As my script outputs pretty detailed statistics, it is interesting to have a quick look at how, for example, the different top-level projects and programming languages compare.

Top 3 top-level projects: Runtime, Technology & Modeling

Top-level project Physical SLOC
rt 54,961,728
technology 28,887,621
modeling 27,140,344
tools 14,214,182
webtools 9,651,900
eclipse 6,401,518
ee4j 5,809,126
ecd 3,114,768
polarsys 3,105,229
iot 2,930,217
birt 2,235,624
science 1,670,051
datatools 939,424
mylyn 767,652
soa 752,774
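As a quick sanity check, the figures in the table above can be added up and turned into shares with a few lines of Python. The numbers are copied from the table; the variable names are arbitrary.

```python
# Physical SLOC per top-level project, copied from the table above.
sloc = {
    "rt": 54_961_728,
    "technology": 28_887_621,
    "modeling": 27_140_344,
    "tools": 14_214_182,
    "webtools": 9_651_900,
    "eclipse": 6_401_518,
    "ee4j": 5_809_126,
    "ecd": 3_114_768,
    "polarsys": 3_105_229,
    "iot": 2_930_217,
    "birt": 2_235_624,
    "science": 1_670_051,
    "datatools": 939_424,
    "mylyn": 767_652,
    "soa": 752_774,
}

total = sum(sloc.values())

# Share of the total for each top-level project, in percent.
shares = {name: 100 * count / total for name, count in sloc.items()}

# The three largest top-level projects and their combined share.
top3 = sorted(sloc, key=sloc.get, reverse=True)[:3]
top3_share = sum(shares[name] for name in top3)

print(f"total: {total:,}")
for name in top3:
    print(f"{name}: {shares[name]:.1f}%")
print(f"top 3 combined: {top3_share:.1f}%")
```

The total comes to 162,582,158 lines, which matches the headline figure, and the three largest top-level projects account for about 68% of it.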

Top programming language: Java

Programming language Physical SLOC
Java 72,349,870
HTML 61,119,106
XML 7,543,689
ANTLR Grammar 3,161,339
JSON 2,313,556
JavaScript 2,251,418
C++ 2,245,759
C 1,446,013
XMI 1,355,914
C/C++ Header 1,019,368
TTCN 923,098
Maven 884,271
CSS 805,073
Assembly 717,771
XSD 688,764
PHP 459,237
Python 316,553
Markdown 304,421
XSLT 256,857
Scala 229,560
Bourne Shell 214,142
Go 184,306
SWIG 152,062
JSP 142,190
Gencat NLS 125,251
Ant 113,133
TypeScript 108,217
AsciiDoc 105,552
Windows Module Definition 64,843
TITAN Project File Information 64,014
Groovy 55,261
Sass 53,915
XQuery 51,432
XHTML 51,166
DTD 51,052
make 48,021
Perl 43,643
DITA 42,526
yacc 39,876
TeX 36,400
m4 34,438
AspectJ 33,717
Ruby 28,355
Scheme 27,484
YAML 26,348
CMake 25,182
Lua 23,646
LESS 18,712
SQL 16,070
Cucumber 15,454
IDL 12,564
INI 12,171
Bourne Again Shell 11,978
Pascal 11,915
lex 11,795
DOS Batch 11,675
Windows Resource File 10,278
Blade 8,295
C# 7,983
Tcl/Tk 7,611
Stylus 7,477
Fortran 90 7,211
ERB 7,048
Vuejs Component 6,281
Visualforce Component 5,047
MSBuild script 4,538
Freemarker Template 4,077
Dockerfile 3,696
Velocity Template Language 3,649
awk 3,068
Rust 2,903
Qt 2,772
CUDA 2,533
Puppet 2,084
diff 1,880
Haml 1,819
Oracle PL/SQL 1,778
ProGuard 1,739
Objective C 1,469
ActionScript 1,459
Visual Basic 1,365
Mathematica 1,247
RobotFramework 1,074
Korn Shell 1,023
D 1,007
Smalltalk 911
R 887
TOML 826
Ada 668
Lisp 618
Objective C++ 589
Fortran 77 588
Arduino Sketch 480
MATLAB 476
sed 461
Protocol Buffers 454
WiX source 446
JavaServer Faces 440
PowerShell 284
Qt Project 176
Windows Message File 139
Expect 120
NAnt script 110
Smarty 109
HCL 78
CoffeeScript 78
Skylark 74
Forth 69
Qt Linguist 61
WiX include 52
XAML 49
QML 48
Handlebars 46
Clojure 38
Prolog 37
Razor 32
PO File 29
Haskell 27
JSX 24
ASP.NET 21
HLSL 15
F# 11
Swift 10
GLSL 8
Kotlin 7
C Shell 7
Mustache 1

If you end up using my script and have any questions, please let me know in the comments or directly on GitHub!