DOIs in Reddit

Skimming the headlines on Hacker News yesterday morning, I noticed something exciting: a dump of all the submissions to Reddit since 2006. “How many of those are DOIs?”, I thought. Reddit is a very broad community, but it has some very interesting parts, including some great science communication. How much are DOIs used in Reddit?

(There has since been a discussion about this blog post on Hacker News)

We have a whole strategy for DOI Event Tracking, but nothing beats a quick hack or is more irresistible than a data dump.

What is a DOI?

If you know what a DOI is, skip this! The DOI system (Digital Object Identifier) is a link redirection service. When a publisher puts some content online, they could just hand out the URL. But the URL can change, and within a very short space of time link-rot happens. DOIs are designed to fight link rot. When a publisher mints a DOI for an article they have just published, they can change the article’s URL and then update the DOI to point to the new place. DOIs are persistent. They are URLs. They’re also identifiers (kind of like ISBNs), and they’re used in scholarly publishing to make citations.

Crossref is the DOI registration agency for scholarly publishing. That means mostly things like journal articles. There are other registration agencies, for example, DataCite, who do DOIs for research datasets. But at this point in time, most DOIs are Crossref’s.
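To make the redirection mechanics concrete, here is a minimal sketch of resolving a DOI from code. It assumes nothing beyond the public https://doi.org/ resolver; the DOI used is just an illustrative example, and any registered DOI would behave the same way.

    # Sketch: resolve a DOI by following the https://doi.org/ redirect.
    import requests

    doi = '10.1371/journal.pone.0115253'  # example DOI; substitute any registered one
    resp = requests.get('https://doi.org/' + doi, allow_redirects=True)

    print(resp.url)  # wherever the publisher currently hosts the content

If the publisher moves the content, only the registered pointer changes; the DOI, and every citation that uses it, stays the same.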

What does finding DOIs in Reddit mean?

It means someone used a DOI to cite something! DOIs can be used for any kind of content, but because of the sheer volume of scientific publishing, lots of DOIs are for science. Having a DOI doesn’t say anything about quality or content. But it does indicate that the person who created the DOI probably intended it to be cited. We care because it means that every time a DOI is used a tiny bit of link-rot doesn’t have the opportunity to take hold. Every time something is discussed on Reddit and the DOI is used, it means that archaeologists using the data dump in 100 years will have identifiers to find the things being discussed, even if the web and URLs have long since crumbled to dust.

Or, more likely, in five years’ time, when a few URLs will have shuffled around.

The results

DOIs have been used on Reddit since 2008 (the logs start in 2006). After a rocky start, we see hundreds being used per year.

[Figure: DOI submissions per year]

That’s dozens per month.

[Figure: DOI submissions per month]

The best subreddit to find DOIs is /r/Scholar, followed by /r/science. And then a lot of others with one or two per year.

[Figure: DOI submissions per subreddit per year]


It’s great to see DOIs being used in Reddit. But let’s be honest, it’s not a massive amount.

We have a list of domains that our DOIs point to. They mostly belong to publishers, so every time we see a link to a domain on the list, there’s a chance (not a certainty) that the link could have been made using a DOI instead. We found a large number of these: orders of magnitude more than DOIs. We’re still crunching the data.
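As a sketch of that heuristic (the domain list here is a stand-in, not our actual list):

    # Sketch: flag links pointing at domains known to host DOI-registered content.
    from urllib.parse import urlparse

    # Stand-in entries; the real list is much longer.
    PUBLISHER_DOMAINS = {'link.springer.com', 'onlinelibrary.wiley.com', 'journals.plos.org'}

    def might_be_doi_link(url):
        """True if the link's domain suggests it could have been a DOI link instead."""
        return urlparse(url).netloc.lower() in PUBLISHER_DOMAINS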

The data

The data is quite large. It’s a 40 GB download compressed, which comes to about 170 GB once uncompressed. It contains the submissions to Reddit between 2006 and 2015, not the comments, so each DOI we find represents a whole thread of conversation about it.


You can find the source code and reproduce the figures yourself. We use Apache Spark for this kind of thing.
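To give a flavour of the approach, here is a minimal Spark sketch, not our actual job. It assumes the dump is newline-delimited JSON with 'title', 'selftext', 'url' and 'subreddit' fields (check your copy for the exact schema).

    # Sketch: scan the Reddit submissions dump for DOIs with PySpark.
    import json
    import re

    from pyspark import SparkContext

    # Liberal, unanchored DOI pattern for scraping; expect some bycatch.
    DOI_RE = re.compile(r'10\.\d{4,9}/[-._;()/:A-Z0-9]+', re.IGNORECASE)

    def extract_dois(line):
        try:
            submission = json.loads(line)
        except ValueError:
            return []
        text = ' '.join(str(submission.get(k, '')) for k in ('title', 'selftext', 'url'))
        return [(submission.get('subreddit'), doi) for doi in DOI_RE.findall(text)]

    sc = SparkContext(appName='dois-in-reddit')
    matches = sc.textFile('reddit-submissions/*').flatMap(extract_dois)

    # Count DOI-bearing submissions per subreddit, highest first.
    counts = matches.map(lambda pair: (pair[0], 1)).reduceByKey(lambda a, b: a + b)
    for subreddit, n in counts.takeOrdered(20, key=lambda kv: -kv[1]):
        print(subreddit, n)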

More info

Read about our DOI Event Tracking strategy, including our live stream of Wikipedia citations.

Annual Meeting: Join Crossref in Boston this November!

We’d like to invite the scholarly publishing community to get together in Boston this November with the Crossref Annual Meeting as a rally point. This is the event we hold just once a year to get the whole team under one roof, host a lively discussion with the leading voices in scholarly communications, present technical workshops, and offer you the chance to get hands-on with our latest metadata services. Our free two-day event takes place on November 17-18, 2015 in Boston, MA.


  • Tuesday, November 17 – Tech Workshops:

The morning is an opportunity to get into small groups and talk directly with our development and support teams. We will present best practices around using Crossref’s metadata. After lunch, we will feature member case studies with tips on implementation and lessons learned. If you’re on the technical production side of scholarly publishing, you’ll want to be there — and not just for the beer & pretzels afterwards.

  • Wednesday, November 18 – Member Meeting:

A day to hear from thought leaders from the larger scholarly publishing community as well as from inside Crossref. Our keynote speaker will be Dr. Ben Goldacre (Bad Science), and our distinguished speakers include Dr. Scott Chamberlain (rOpenSci), Dr. Juan Pablo Alperin (Public Knowledge Project), and Dr. Martin Eve (Open Library of Humanities). We will share details about the road map for Crossref Labs’ current and future initiatives, hear about the latest organizational developments from new members of our team, and see the debut of our new brand logo and communications strategy. Following the formal discussion, we’ll continue the conversation over cocktails as part of our celebration of Crossref’s milestone 15th Anniversary!

✱ Tickets:

Reserve your free tickets here:

Who Should Attend?

Scholarly publishers, technology providers, librarians, researchers, academic institutions, funders, journalists, and others who are keen to discuss tools and services to advance scholarly publishing are encouraged to attend.

✱ Venue:

About Crossref

Crossref is a not-for-profit membership organization that wants to improve research communication. We organize publisher metadata, run the infrastructure that makes DOI links work, and rally multiple community stakeholders to develop tools and services that enable advancements in scholarly publishing.

DOI Event Tracker (DET): Pilot progresses and is poised for launch

Publishers, researchers, funders, institutions and technology providers are all interested in better understanding how scholarly research is used. Scholarly content has always been discussed by scholars outside the formal literature and by others beyond the academic community. We need a way to monitor and distribute this valuable information.

The Crossref DOI Event Tracker (DET)

To meet this need, Crossref will be introducing a new service that tracks activity surrounding a research work from potentially any web source where an event is associated with a DOI. Following a successful pilot that started in Spring 2014, the service has been approved to move toward production and is expected to launch in 2016. Any party wishing to join this phase is welcome to contact Jennifer Lin. The DOI Event Tracker (DET) registers a wide variety of events, such as bookmarks, comments, social shares, citations, and links to other research entities, from a growing list of online sources. DET aggregates them, and stores and delivers the data in many ways.

Open, portable, and licensed for maximum reuse
Crossref has long served as the citation linking and metadata infrastructure provider for scholarly communication; the new DOI Event Tracker is a natural next step, providing a practical solution as a resource for the whole community. The tracker offers the following features:

  • Data on event activity across a common pool of online channels.
  • Near real-time alerting for select sources with push notifications to the system.
  • Cross-publisher monitoring to enable benchmarking and provide context to the data.
  • Common format for normalizing data results across the diverse set of sources via a modern REST API (a client sketch follows this list).
  • Secure and regularly refreshed backups of critical data for long term data preservation.
  • Transparency of data collection so as to ensure auditable, replicable, and trustworthy results.
  • Query-initiated retrieval or real-time alerts when an event of interest occurs.
  • CC-0 license for open and flexible propagation of data.
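To give a flavour of what query-initiated retrieval could look like for a client, here is a hypothetical sketch; the endpoint, parameters and response shape are illustrative assumptions, not the final DET API.

    # Hypothetical client sketch; endpoint and fields are assumptions, not the real API.
    import requests

    DET_API = 'https://det.example.org/events'  # placeholder URL

    resp = requests.get(DET_API, params={'doi': '10.5555/12345678', 'source': 'wikipedia'})
    resp.raise_for_status()

    for event in resp.json().get('events', []):
        print(event.get('occurred-at'), event.get('source'), event.get('action'))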

A number of platforms are already confirmed and more parties are welcome at any stage. So far we have confirmation to track DOI events on the following platforms:

Confirmed DET platforms (September 2015):

  • Blogs & Reference: Research Blogging, Wordpress.com
  • Social Shares & Discussions: CiteULike, Facebook
  • Links to Research: ORCiD, Europe PMC Database Citations

This set of sources reflects our initial focus on parties willing to allow their data to be redistributed in the common pool. Efforts are underway to expand the source list to include Twitter and MyScienceWork, among others. Publishers can also act as sources by publishing and distributing DOI event data via the DET when an event occurs on their own platforms (for example, when a PDF is downloaded, or when a comment mentions a DOI in a locally hosted discussion forum). This would make local DOI activity globally available to funders, researchers, institutions, etc.

DET provides the benefits of scale and ease of access as a central point for collecting and propagating data to the community. As a single point of access, it overcomes the business and technical hurdles of managing multiple online sources where scholarly activity occurs, in a rapidly changing landscape of online channels. This resource covers content across publishers and serves as a strong foundation for the development of tools and services by any party. DET users will always be able to combine DET data with data they collect individually via negotiated or paid access. DET remains a utility separate from any value-added amenities, such as analytics, presentation, and reporting.

DET Service-Level Agreement

For those who seek the highest level of service and a more flexible range of access options, Crossref will provide a Service-Level Agreement (SLA) service for the DOI Event Tracker. The DET SLA includes the following additional features on top of the common data offering:

  • Access to the complete suite of sources, which includes restricted and/or paid sources in addition to common data, providing the fullest picture of DOI usage activity possible.
  • Guaranteed uptime and response time to the latest raw data on the aggregate activity surrounding a DOI.
  • Guaranteed support response time to questions and issues surrounding data and data delivery.
  • Flexible data access options: on-demand real time data access and scheduled bulk downloads for processing batch analytics.
  • Optimum retrieval rates and accelerated delivery speeds with the dedicated SLA API.
  • Access to a webhook API for events of interest as an alternative to polling DET (a receiver sketch follows this list).
  • Standardized and enhanced linkback service for difficult-to-track grey literature.
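By way of illustration, the webhook alternative to polling could be consumed with something as small as the sketch below; the payload fields are assumptions, since the webhook format is not yet published.

    # Hypothetical webhook receiver; the payload fields are assumptions.
    from flask import Flask, request

    app = Flask(__name__)

    @app.route('/det-webhook', methods=['POST'])
    def det_webhook():
        event = request.get_json(force=True)
        # React to the event of interest instead of polling DET.
        print(event.get('doi'), event.get('source'), event.get('occurred-at'))
        return '', 204

    if __name__ == '__main__':
        app.run(port=8080)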

The DET SLA service has a simple, value-based pricing model based on subscriber size. Register your interest in Crossref’s DOI Event Tracker and the DET SLA service if you would like to stay informed of the upcoming launch. Please contact Jennifer Lin for more information.

Image modified from “Radar” icon by Karsten Barnett from the Noun Project.

DOIs and matching regular expressions

We regularly see developers using regular expressions to validate or scrape for DOIs. For modern CrossRef DOIs the regular expression is short:

    /^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i

For the 74.9M DOIs we have seen, this matches 74.4M of them. If you can use only one pattern, use this one.
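In Python, for instance, the pattern translates directly; a minimal sketch, anchored for validating a whole candidate string:

    import re

    # The recommended modern-DOI pattern, with the leading '10.' dot escaped.
    MODERN_DOI = re.compile(r'^10\.\d{4,9}/[-._;()/:A-Z0-9]+$', re.IGNORECASE)

    print(bool(MODERN_DOI.match('10.1080/10509585.2015.1092083')))  # True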

The other 500K are mostly from CrossRef’s early days, when the battle between “human-readable” identifiers and “opaque” identifiers was still being fought, the web was still new, and it was expected that “doi” would become as well supported a URI scheme name as “gopher”, “wais”, …. OK, that didn’t go so well.

An early CrossRef member was John Wiley & Sons. They faced the need to design DOIs without much prior work to lean on. Many of those early DOIs are not regular-expression friendly. Nevertheless, they are still valid and valuable permanent links to the work’s version of record. You can catch 300K more DOIs with:

    /^10.1002/[^\s]+$/i

While the DOI caught is likely to be the DOI within the text, it may also contain trailing characters that, due to the lack of a space, are caught up with the DOI. Even the recommended expression catches DOIs ending with periods, colons, semicolons, hyphens, and underscores. Most DOIs found in the wild are presented within some visual design program. While pleasant to look at, the visual design can misdirect machines. Is the period at the end of the line part of the DOI or part of the design? Is that en-dash actually a hyphen? These issues lead to DOI bycatch.
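One pragmatic mitigation is to scrape liberally and then trim trailing punctuation, accepting that a rare legitimate DOI ending in such a character will be clipped; a sketch:

    import re

    # Unanchored pattern for scraping free text.
    SCRAPE = re.compile(r'10\.\d{4,9}/[-._;()/:A-Z0-9]+', re.IGNORECASE)

    def scrape_dois(text):
        # Trailing periods, semicolons, etc. are usually sentence design, not identifier.
        return [m.rstrip('.,;:-_') for m in SCRAPE.findall(text)]

    print(scrape_dois('See doi:10.1080/10509585.2015.1092083.'))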

Adding the following 3 expressions to the previous 2 leaves only 72K DOIs uncaught. To catch those 72K would require a dozen or more additional patterns. Each additional pattern, unfortunately, weakens the overall precision of the catch. More bycatch.

    /^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
    /^10.1021/\w\w\d+$/i
    /^10.1207/[\w\d]+\&\d+_\d+$/i

CrossRef is not the only DOI Registration Agency, and our members account for only 65-75% of all registered DOIs, which means there are tens of millions of DOIs that we have not seen. Luckily, the newer RAs and their publishers can copy our successes and avoid our mistakes.

Rehashing PIDs without stabbing myself in the eyeball

Anybody who knows me or reads this blog is probably aware that I don’t exactly hold back when discussing problems with the DOI system. But just occasionally I find myself actually defending the thing…

About once a year somebody suggests that we could replace existing persistent citation identifiers (e.g. DOIs) with some new technology that would fix some of the weaknesses of the current systems. Usually said person is unhappy that current systems like DOI, Handle, ARK, etc. depend largely on a social element to update the pointers between the identifier and the current location of the resource being identified. It just seems manifestly old-fashioned and ridiculous that we should still depend on bags of meat to keep our digital linking infrastructure from falling apart.

In the past, I’ve threatened to stab myself in the eyeball if I was forced to have the discussion again. But the dirty little secret is that I play this game myself sometimes. After all, the best thing a mission-driven membership organisation could do for its members would be to fulfil its mission and put itself out of business. If we could come up with a technical fix that didn’t require the social component, it would save our members a lot of money and effort.

When one of these ideas is posed, there is a brief flurry of activity as another generation goes through the same thought processes and (so far) comes to the same conclusions.

The proposals I’ve seen generally fall into one of the following groups:

  • Replace persistent identifiers (PIDs) with hashes, checksums, etc.
  • Just use search (often, but not always coupled with 1 above)
  • Automagically create PIDs out of metadata.
  • Automagically redirect broken citations to archived versions of the content identified.
  • And more recently… use the blockchain

I thought it might help advance the discussion and avoid a bunch of dead ends if I summarised (rehashed?) some of the issues that should be considered when exploring these options.

Warning: Refers to FRBR terminology. Those of a sensitive disposition might want to turn away now.

  • DOIs, PMIDs, etc. and other persistent identifiers are primarily used by our community as “citation identifiers”. We generally cite at the “expression” level.
  • Consider the difference between how a “citation identifier”, a “work identifier”, and a “content verification identifier” might function.
  • How do you deal with “equivalent manifestations” of the same expression? For example, the ePub, PDF and HTML representations of the same article are intellectually equivalent and interchangeable when citing. The same applies to CSV and TSV representations of the same dataset. So, for example, how would hashes work here as a citation identifier?
  • Content can be changed in ways that typically don’t affect the interpretation or crediting of the work, for example by reformatting or correcting spelling. In these cases the copies should share the same citation identifier, but the hashes will be different (see the sketch after this list).
  • Content that is virtually identical (and shares the same hash) might be republished in different venues (e.g. a normal issue and a thematic issue). Context in citation is important. How do you point somebody at the copy in the correct context?
  • Some copies of an article or dataset are stewarded by publishers. That is, if there is an update, errata, corrigenda, retraction/withdrawal, they can reflect that on the stewarded copy, not on copies they don’t host or control. Location is, in fact, important here.
  • Some copies of content will be nearly identical, but will differ in ways that do affect the interpretation and/or crediting of the work: a corrected number in a table, for example. How would you create a citation from a search that would differentiate the correct version from the incorrect one?
  • Some content might be restricted, private or under embargo. For example private patient data, sensitive data about archaeological finds or the migratory patterns of endangered animals.
  • Some content is behind paywalls (cue jeremiads)
  • Content is increasingly composed of static and dynamic elements. How do you identify the parts that can be hashed?
  • How do you create an identifier out of metadata and not have it look like this?
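To make the hashing point concrete: a human reads the two strings below as the same sentence, but no content hash will agree. A trivial illustration:

    import hashlib

    # Same words; the second merely has a doubled space.
    a = 'Persistent identifiers are a social contract.'
    b = 'Persistent  identifiers are a social contract.'

    print(hashlib.sha256(a.encode()).hexdigest())
    print(hashlib.sha256(b.encode()).hexdigest())  # a completely different digest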

This list is a starting point that should allow people to avoid a lot of blind alleys.

In the meantime, good luck to those seeking alternatives to the current crop of persistent citation identifier systems. I’m not convinced it is possible to replace them, but if it is, I hope I beat you to it. :-) And I hope I can avoid stabbing myself in the eye.

Coming to you Live from Wikipedia

We’ve been collecting citation events from Wikipedia for some time. We’re now pleased to announce a live stream of citations, as they happen, when they happen. Project this on your wall and watch live DOI citations as people edit Wikipedia, round the world.

View live stream »

In the hours since this feature launched, there have been events from the Indonesian, Portuguese, Ukrainian, Serbian and English Wikipedias (in that order).

[Figure: Live event stream]
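For a flavour of how such a collector can work, here is a minimal sketch against Wikimedia’s public recent-changes stream. It illustrates the idea only; it is not how our collector is actually built, and a real one would fetch and diff each revision rather than inspect titles.

    # Sketch: watch Wikimedia's recent-changes stream and flag DOI-shaped strings.
    import json
    import re

    import requests

    DOI_RE = re.compile(r'10\.\d{4,9}/[-._;()/:A-Z0-9]+', re.IGNORECASE)
    STREAM = 'https://stream.wikimedia.org/v2/stream/recentchange'

    with requests.get(STREAM, stream=True) as resp:
        for line in resp.iter_lines():
            if not line.startswith(b'data:'):
                continue  # skip SSE comments and event-type lines
            change = json.loads(line[len(b'data:'):])
            for doi in DOI_RE.findall(change.get('title', '')):
                print(change['wiki'], doi)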

The usual weasel words apply. This is a labs project and so may not be 100% stable. If you experience any problems, please email us.