arXiv Harvesting
Inspired by this cool paper, I wanted to analyze the metadata of a bunch of scientific journal articles. By some estimates there are at least 75 million published journal articles, so I figured I’d start small and only use papers published on arXiv. It turns out that someone already published a handy tool called metha for downloading bulk Open Archives Initiative metadata.
After downloading and configuring metha, I ran the bash command:
which downloaded the metadata for every arXiv article in XML format. It took a few hours, but in the end I had more than 3 GB of XML files. These files contained entries of the form:
I wrote a script (available on my GitHub) that loops over the records in all of the files, parses the entries for each field, and stores the data in a SQLite3 database. The fields I extracted were article title, abstract, authors, subjects, unique arXiv identifier, and date. The parsing was done using BeautifulSoup and regular expressions. For example, this function created a semicolon-delimited list of all subject keywords of a given journal article:
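A minimal sketch of such a function might look like the following. It is not the script from the repository: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra dependencies, and it assumes that each record is in Dublin Core (`oai_dc`) format with one `dc:subject` element per keyword. The name `extract_subjects` and the sample record are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of Dublin Core elements (assumes the records use the oai_dc format).
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def extract_subjects(record_xml: str) -> str:
    """Return the subject keywords of one record as a semicolon-delimited string."""
    root = ET.fromstring(record_xml)
    subjects = []
    for element in root.iter(DC_NS + "subject"):
        # Collapse runs of whitespace (including line breaks) into single spaces.
        text = re.sub(r"\s+", " ", element.text or "").strip()
        if text:
            subjects.append(text)
    return ";".join(subjects)


# Hypothetical record, trimmed down to the parts this function reads.
sample_record = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A made-up title</dc:title>
  <dc:subject>Mathematics - Combinatorics</dc:subject>
  <dc:subject>Computer Science -
     Discrete Mathematics</dc:subject>
</record>
"""

print(extract_subjects(sample_record))
# Mathematics - Combinatorics;Computer Science - Discrete Mathematics
```

Storing the subjects as a single delimited string keeps the database to one table, at the cost of having to split the string again at analysis time.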
For convenience (assuming your computer has enough RAM), we can convert the SQLite3 database into a Pandas DataFrame. It’s also useful for analysis purposes to separate the fields and subfields of each article. We can do both of these steps using this script.
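As a rough sketch of both steps, assuming a table named `articles` with `arxiv_id`, `date`, and `subjects` columns (hypothetical names, not necessarily the ones in the repository) and assuming each subject has the form `Field - Subfield`:

```python
import sqlite3

import pandas as pd


def load_articles(conn: sqlite3.Connection) -> pd.DataFrame:
    """Read the whole (hypothetical) `articles` table into memory."""
    return pd.read_sql_query("SELECT * FROM articles", conn)


def split_subjects(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (article, subject) pair with `field` and `subfield` columns."""
    # Turn the semicolon-delimited string into one row per subject.
    out = df.assign(subject=df["subjects"].str.split(";")).explode("subject")
    out = out.reset_index(drop=True)
    # Split "Field - Subfield" at the first " - "; a subject without a
    # separator keeps its full name as the field and gets an empty subfield.
    parts = out["subject"].str.partition(" - ")
    out["field"] = parts[0].str.strip()
    out["subfield"] = parts[2].str.strip()
    return out


# Tiny in-memory database with made-up rows, standing in for the real file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (arxiv_id TEXT, date TEXT, subjects TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        ("id-1", "2007-05-01", "Astrophysics"),
        ("id-2", "2016-03-12", "Computer Science - Machine Learning;Statistics - Machine Learning"),
    ],
)
conn.commit()

tidy = split_subjects(load_articles(conn))
print(tidy[["arxiv_id", "field", "subfield"]])
```

In this sketch a cross-listed article contributes one row per subject, so it is counted once in each of its fields.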
Finally, we can do some data analysis. We use this script to generate plots of field prominence over time for the top 6 fields:
[Figure: field prominence over time, all fields]
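The quantity behind a plot like this is each field's share of the papers in a given year. The sketch below shows one way to compute it from `(year, field)` pairs; the function name and the sample data are made up, and normalizing by the number of article-field pairs per year is an assumption, not necessarily what the plotting script does.

```python
from collections import Counter, defaultdict


def field_share_by_year(rows, top_n=6):
    """Compute yearly shares for the `top_n` most common fields.

    `rows` is an iterable of (year, field) pairs, one per article-field pair.
    Returns {field: {year: share}}.
    """
    per_year = defaultdict(Counter)
    overall = Counter()
    for year, field in rows:
        per_year[year][field] += 1
        overall[field] += 1

    top_fields = [field for field, _ in overall.most_common(top_n)]
    shares = {}
    for field in top_fields:
        shares[field] = {
            # Fraction of that year's article-field pairs belonging to this field.
            year: counts[field] / sum(counts.values())
            for year, counts in sorted(per_year.items())
        }
    return shares


# Made-up data: three entries in 1995 and four in 2015.
rows = [
    (1995, "High Energy Physics"),
    (1995, "High Energy Physics"),
    (1995, "Mathematics"),
    (2015, "Computer Science"),
    (2015, "Mathematics"),
    (2015, "Mathematics"),
    (2015, "High Energy Physics"),
]

shares = field_share_by_year(rows, top_n=2)
print(shares["High Energy Physics"][2015])  # 0.25
```

Each inner dictionary maps years to shares, so it can be handed to any plotting library as one curve per field.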
Interesting! There have been a lot of changes in the popularity of fields over the last 25 or so years. It seems that arXiv was primarily a High Energy Physics preprint server in the 1990s, but that field’s preeminence has gradually declined since then. Pure mathematics peaked at about 1/4 of papers around 2015 and has declined slightly since. The big winner seems to be computer science, which has grown significantly in popularity, especially since around 2005.
Let’s take a look at the most popular subfields for each of the top 6 fields:
[Figures: subfield prominence over time for Mathematics, Condensed Matter, High Energy Physics, Astrophysics, Computer Science, and Physics]
A few preliminary points are worth mentioning. Some fields (Mathematics, Computer Science, and Physics in particular) did not have many papers in the early years, so the data from that period is not necessarily representative. Additionally, it seems that the Astrophysics section changed its classification system around 2008: before that, all papers had the subject “Astrophysics” with no subfield, but after that they started including subfields. Because of the way I extracted the field and subfield names, fields without a subfield have the default subfield name of an empty string, which is why the prominent curve in the Astrophysics plot has no label in the legend.
- For Mathematics, PDE Analysis and Combinatorics are hot, Differential Geometry is not.
- For Condensed Matter, Mesoscale and Nanoscale Physics, and Materials Science are hot, Statistical Mechanics is not.
- For High Energy Physics, Experiment is hot (probably due in large part to the LHC), but most of the other subfields are fairly constant. I’m not sure about the source of the periodic pattern in the Theory and Lattice subfields.
- For Astrophysics, Astrophysics of Galaxies is hot, Cosmology and Nongalactic Astrophysics is not.
- For Computer Science, it’s no surprise that Machine Learning and Computer Vision are hot, and Information Theory is not.
- Finally, for Physics, Optics is hot, and ignoring the outliers in the 1990s, most of the other subfields are fairly constant.
Using the Google Charts Service, I created an interactive Sankey diagram for the 40 most common subjects on arXiv. I found that using more than 40 or so subjects made the chart look terrible due to the large number of subfields.
If you’d like a copy of the database I generated but for some reason can’t run the scripts in the GitHub repo, send me an email and I’ll get it to you.