Analysing NLP publication patterns

Recently, I got curious about how much different institutions publish in my area. Does Google publish more than Microsoft? Which university has the strongest publication record in NLP? And are there any interesting trends to be seen in recent years? Quantity does not necessarily equal quality, but the number of publications is still a reasonable indicator of general activity in the field, of how big a research group is, and of how outward-facing its research projects are.

My approach was to crawl papers from the 6 biggest conferences relevant to my research: ACL, EACL, NAACL, EMNLP, NIPS and ICML. The first 4 focus on NLP applications regardless of methods, and the last 2 on machine learning algorithms regardless of tasks. The time window was restricted to 2012-2016, as I’m more interested in current publications.

Luckily, all of these conferences have nice webpages listing the papers published there. The ACL Anthology contains records for ACL, EACL, NAACL and EMNLP, NIPS has a separate webpage for its papers, and the ICML proceedings are on the JMLR website (except for ICML 2012, which is on the conference website). I wrote Python scripts that crawled all the papers from these conferences, extracting author names and organisations. While authors can be crawled directly from the websites, in order to find the organisation names I had to parse the PDFs into text and extract anything that looked like a university or company name in the first 30 lines of each paper. I wrote a bunch of manual patterns to map names to canonical versions (“UCL” to “University College London” and “Google Inc” to “Google”), although it is likely that I still missed some edge cases.
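For illustration, here is a minimal sketch of what this kind of affiliation extraction can look like. The pattern table and the `extract_organisations` helper are simplified stand-ins rather than my actual script, and the sketch assumes the PDFs have already been converted to plain text (e.g. with the pdftotext tool):

```python
import re

# Hypothetical pattern table mapping name variants to canonical versions;
# the real script used many more hand-written patterns.
CANONICAL = {
    r"\bUCL\b|University College London": "University College London",
    r"Google(\s+Inc\.?)?": "Google",
    r"Carnegie[- ]Mellon|\bCMU\b": "Carnegie Mellon University",
    r"Microsoft(\s+Research)?": "Microsoft",
}

# Lines that plausibly contain an affiliation mention a keyword such as
# "University", "Institute" or a corporate suffix.
AFFILIATION_HINT = re.compile(
    r"University|Institute|College|Laborator|Research|Inc\.?|Corp",
    re.IGNORECASE,
)

def extract_organisations(paper_text, max_lines=30):
    """Scan the first max_lines lines of a paper's plain text and
    return the canonical organisation names found there."""
    found = set()
    for line in paper_text.splitlines()[:max_lines]:
        if not AFFILIATION_HINT.search(line):
            continue
        for pattern, canonical in CANONICAL.items():
            if re.search(pattern, line):
                found.add(canonical)
    return sorted(found)

# Toy header standing in for the text extracted from one PDF.
header = "A Paper Title\nJane Doe\nUniversity College London\nJohn Roe\nGoogle Inc."
print(extract_organisations(header))  # ['Google', 'University College London']
```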

Below is a graph of the top 25 organisations and the conferences where they publish.

CMU comes out as the most prolific publisher with 305 papers. A close second is Microsoft with 302 publications, which also leads the industry category. I was somewhat surprised to find that Microsoft publishes so much, almost twice as many papers as Google, especially as Google seems to get much more publicity for its research. Stanford rounds out the top 3 organisations, all of which publish substantially more than the rest. Edinburgh and Cambridge represent the UK camp with 121 and 117 papers respectively.

When we look at the distribution across conferences, Princeton and UCL stand out as having very little NLP-specific research, with nearly all of their papers in ICML and NIPS. Stanford, Berkeley and MIT also seem to focus more on machine learning algorithms. In contrast, Edinburgh, Johns Hopkins and the University of Maryland have most of their publications in NLP conferences. CMU, Microsoft and Columbia are the most balanced among the top publishers, with a roughly 50:50 split between NLP and ML.

We can also plot the number of publications per year, focusing on the top 15 institutions.
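As a side note, such a plot is straightforward to produce once each paper has been reduced to an (organisation, year) pair. Below is a minimal sketch using matplotlib, with a few made-up sample rows standing in for the crawled data:

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical input: one (organisation, year) pair per crawled paper.
# A few made-up rows stand in for the real data here.
records = [
    ("CMU", 2012), ("CMU", 2013), ("CMU", 2014), ("CMU", 2015),
    ("Microsoft", 2012), ("Microsoft", 2013), ("Microsoft", 2015),
    ("Google", 2014), ("Google", 2015),
]

counts = Counter(records)
years = sorted({year for _, year in records})

# One line per institution: publications per year.
for org in sorted({org for org, _ in records}):
    plt.plot(years, [counts[(org, year)] for year in years],
             marker="o", label=org)

plt.xticks(years)
plt.xlabel("Year")
plt.ylabel("Publications")
plt.legend()
plt.show()
```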

Carnegie Mellon has a very good track record, but has only recently overtaken Microsoft as the top publisher. Google, MIT, Berkeley, Cambridge and Princeton have also stepped up their publishing game, showing upward trends in recent years. The sudden drop for 2016 is due to incomplete data – at the time of writing, the ACL, EMNLP and NIPS papers for this year are not yet available.

Now let’s look at the same graphs but for individual authors.

Chris Dyer comes out on top with 50 papers. This is even more impressive given that he started with just 2 papers in 2012, before rocketing to the top by quite a margin in 2015. Almost all of his papers are in NLP conferences, with only 1 paper each in NIPS and ICML. Noah Smith, Chris Manning and Dan Klein rank 2nd-4th with more stable publishing records, but they also focus mainly on NLP conferences. In contrast, Zoubin Ghahramani, Yoshua Bengio and Lawrence Carin are focused mostly on machine learning algorithms.

There seems to be a clear separation between the two research communities, with researchers specialising in publishing either at NLP or at ML venues. This is somewhat unexpected, especially considering the widespread trend of publishing novel neural network architectures for NLP tasks. Both fields would probably benefit from slightly tighter integration in the future.

I hope this little analysis was interesting to fellow researchers. I’m happy to post an update some time in the future to see how things have changed. In the meantime, let me know if you find any bugs in the statistics.

Update: As requested, I’ve also added statistics for the first authors with the highest publication counts. Jiwei Li from Stanford towers above the others with 14 publications. William Yang Wang (CMU), Young-Bum Kim (Microsoft), Manaal Faruqui (CMU), Elad Hazan (Princeton), and Eunho Yang (IBM) have all managed an impressive 9 first-author publications.

Update 2: Added a fix for Jordan Boyd-Graber who publishes under Jordan L. Boyd-Graber in NIPS.

Update 3: Added a fix for Hal Daumé III, mapping together different spellings.

Update 4: By showing only the top N authors on the graphs, some authors with publication counts equal to the N-th author were being excluded. I’ve adjusted the value of N for each graph so this doesn’t happen (see the sketch at the end of this post).

Update 5: Added a fix for Pradeep K. Ravikumar who also publishes under Pradeep Ravikumar.

Update 6: Added fixes to capture name variations for INRIA.
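For anyone curious about the fix in Update 4: instead of tuning N by hand, one option is to extend the cutoff automatically so that anyone tied with the N-th entry is also included. Here is a minimal sketch of that idea (the `top_with_ties` helper is illustrative, not my actual code):

```python
def top_with_ties(counts, n):
    """Return at least the n highest-count entries, extending the cutoff
    so that anyone tied with the n-th entry is also included."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]  # publication count of the n-th entry
    return [kv for kv in ranked if kv[1] >= cutoff]

# With a plain top-2 cutoff, one of the authors tied at 9 papers
# would be dropped; here both are kept.
print(top_with_ties({"A": 14, "B": 9, "C": 9, "D": 5}, 2))
# [('A', 14), ('B', 9), ('C', 9)]
```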