January 2015 DOI Outage: Followup Report


On January 20th, 2015 the main DOI HTTP proxy at doi.org experienced a partial, rolling global outage. The system was never completely down, but for at least part of the subsequent 48 hours, up to 50% of DOI resolution traffic was effectively broken. This was true for almost all DOI registration agencies, including CrossRef, DataCite and mEDRA.

At the time we kept people updated on what we knew via Twitter, mailing lists and our technical blog at CrossTech. We also promised that, once we’d done a thorough investigation, we’d report back. Well, we haven’t finished investigating all implications of the outage. There are both substantial technical and governance issues to investigate. But last week we provided a preliminary report to the CrossRef board on the basic technical issues, and we thought we’d share that publicly now.

The Gory Details

First, the outage of January 20th was not caused by a software or hardware failure, but was instead due to an administrative error at the Corporation for National Research Initiatives (CNRI). The domain name “doi.org” is managed by CNRI on behalf of the International DOI Foundation (IDF). The domain name was not on “auto-renew” and CNRI staff simply forgot to manually renew the domain. Once the domain name was renewed, it took about 48 hours for the fix to propagate through the DNS system and for the DOI resolution service to return to normal. Working with CNRI we analysed traffic through the Handle HTTP proxy and here’s the graph:

Chart of Handle HTTP proxy traffic during outage

The above graph shows traffic over a 24 hour period on each day from January 12, 2015 through February 10th, 2015. The heavy blue line for January 20th and the heavy red line for January 21st show how referrals declined as the doi.org domain was first deleted, and then added back to DNS.

It could have been much worse. The domain registrar (GoDaddy) at least had a “renewal grace and registry redemption period” which meant that even though CNRI forgot to pay its bill to renew the domain, the domain was simply “parked” and could easily be renewed by them. This is the standard setting for GoDaddy. Cheaper domain registrars might not include this kind of protection by default. Had there been no grace period, then it would have been possible for somebody other than CNRI to quickly buy the domain name as soon as it expired. There are many automated processes which search for and register recently expired domain names. Had this happened, at the very least it would have been expensive for CNRI to buy the domain back. The interruption to DOI resolutions during this period would have also been almost complete.

So we got off relatively easy. The domain name is now on auto-renew. The outage was not as bad as it could have been. It was addressed quickly and we can be reasonably confident that the same administrative error will not happen again. CrossRef even managed to garner some public praise for the way in which we handled the outage. It is tempting to heave a sigh of relief and move on.

We also know that everybody involved at CNRI, the IDF and CrossRef has felt truly dreadful about what happened. So it is also tempting to not re-open old wounds.

But it would be a mistake if we did not examine a fundamental strategic issue that this partial outage has raised: How can CrossRef claim that its DOIs are ‘persistent’ if CrossRef does not control some of the key infrastructure on which it depends? What can we do to address these dependencies?

What do we mean by “persistent?”

@kaythaney tweets on definition of “persistent”

To start with, we should probably explore what we mean by ‘persistent’. We use the word “persistent” or “persistence” about 470 times on the CrossRef web site. The word “persistent” appears central to our image of ourselves and of the services that we provide. We describe our core, mandatory service as the “CrossRef Persistent Citation Infrastructure.”

The primary sense of the word “persistent” in the New Oxford American Dictionary is:

Continuing firmly or obstinately in a course of action in spite of difficulty or opposition.

We play on this sense of the word as a synonym for “stubborn” when we half-jokingly say that, “CrossRef DOIs are as persistent as CrossRef staff.” Underlying this joke is a truth, which is that persistence is primarily a social issue, not a technical issue.

Yet presumably we once chose to use the word “persistent” instead of “perpetual” or “permanent” for other reasons. “Persistence” implies longevity, without committing to “forever.” Scholarly publishers, perhaps more than most industries, understand the long term. After all, the scholarly record dates back to at least 1665 and we know that the scholarly community values even our oldest journal backfiles. By using the word “persistent” as opposed to the more emphatic “permanent” we are essentially acknowledging that we, as an industry, understand the complexity and expense of stewarding the content for even a few hundred years to say nothing of “forever.” Only the chronologically naïve would recklessly coin terms like “permalink” for standard HTTP links which have a documented half-life of well under a decade.

So “persistent” implies longevity- without committing to forever- but this still begs questions. What time span is long enough to qualify as “persistent?” What, in particular, do we mean by “persistent” when we talk about CrossRef’s “Persistent Citation Infrastructure?” or of CrossRef DOIs being “persistent identifiers?”

What do we mean by “persistent identifiers?”

@violetailik tweets on outage and implication for term “persistent identifier”

First, we often make the mistake of talking about “persistent identifiers” as if there is some technical magic that makes them continue working when things like HTTP URIs break. The very term “persistent identifier” encourages this kind of magical thinking and, ideally, we would instead talk about “persist-able” identifiers. That is, those that have some form of indirection built into them. There are many technologies that do this- Handles, DOIs, Purls, ARKs and every URL shortener in existence. Each of them simply introduces a pointer mapping between an identifier and location where a resource or content resides. This mapping can be updated when the content moves, thus preserving the link. Of course, just because an identifier is persist-able doesn’t mean it is persistent. If Purls or DOIs are not updated when content moves, then they are no more persistent than normal URLs.
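The indirection described above can be sketched in a few lines of Python. This is a toy illustration, not how Handle, PURL or ARK resolvers are actually implemented; the `Resolver` class and the example.org URLs are hypothetical.

```python
class Resolver:
    """Toy identifier-to-location lookup table (hypothetical, for illustration)."""

    def __init__(self):
        self._table = {}

    def register(self, identifier, location):
        # Create or update the pointer from an identifier to a location.
        self._table[identifier] = location

    def resolve(self, identifier):
        # Follow the pointer; raises KeyError if the identifier is unknown.
        return self._table[identifier]


resolver = Resolver()
resolver.register("10.5555/12345678", "http://old.example.org/article/1")

# The content moves. The identifier stays the same; only the mapping changes.
resolver.register("10.5555/12345678", "http://new.example.org/article/1")
print(resolver.resolve("10.5555/12345678"))
```

Note that nothing in the table updates itself. If nobody calls `register` when the content moves, the identifier is exactly as broken as a plain URL would be.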

Andrew Treloar points out that when we talk about “persistent identifiers,” we tend to conflate several things:

  1. The persistence of the identifier- that is the token or string itself.
  2. The persistence of the thing being pointed at by the identifier. For example, the content.
  3. The persistence of the mapping of the identifier to the thing being identified.
  4. The persistence of the resolver that allows one to follow the mapping of the identifier to the thing being identified.
  5. The persistence of a mechanism for updating the mapping of the identifier to the thing being identified.

If any of the above fails, then “persistence” fails. This is probably why we tend to conflate them in the first place.

Each of these aspects of “persistence” is worthy of much closer scrutiny. However, in the most recent case of the January outage of “doi.org,” the problem specifically occurred with item 4- the persistence of the resolver. When CNRI failed to renew the domain name for “doi.org” on time, the DOI resolver was rendered unavailable to a large percentage of people over a period of about 48 hours as global DNS servers first removed, and then added back the “doi.org” domain.

Turtles all the way down*

The initial public reaction to the outage was almost unanimous in one respect- people assumed that the problem originated with CrossRef.

@iainh_z tweets to CrossRef enquiring about failed DOI resolution

@LibSkrat tweets at CrossRef about DOI outage

This is both surprising and unsurprising. It is surprising because we have fairly recent data indicating that lots of people recognise the DOI brand, but not the CrossRef brand. Chances are that this relatively superficial “brand” awareness does not correlate with understanding how the system works or how it relates to persistence. It is likely plenty of people clicked on DOIs at the time of the outage and, when they didn’t work, simply shrugged or cursed under their breath. They were aware of the term ‘DOI’ but not of the promise of “persistence”. Hence, they did not take to Twitter to complain about it, and if they had, they probably wouldn’t have known who to complain to or even how to complain to them (neither CNRI nor the IDF has a Twitter account).

But the focus on CrossRef is also unsurprising. CrossRef is by far the largest and most visible DOI Registration Agency. Many otherwise knowledgeable people in the industry simply don’t know that there are even other RAs.

They also generally didn’t know of the strategic dependencies that exist in the CrossRef system. By “strategic dependencies” we are not talking about the vendors, equipment and services that virtually every online enterprise depends on. These kinds of services are largely fungible. Their failures may be inconvenient and even dramatic, but they are rarely existential.

Instead we are talking about dependencies that underpin CrossRef’s ability to deliver on its mission. Dependencies that not only affect CrossRef’s operations, but also its ability to self-govern and meet the needs of its membership. In this case there are three major dependencies: Two of which are specific to CrossRef and other DOI registration agencies and one which is shared by virtually all online enterprises today. The organizations are: The International DOI Foundation (IDF), Corporation for National Research Initiatives (CNRI) and the Internet Corporation for Assigned Names and Numbers (ICANN).

Dependency of RAs on IDF, CNRI and ICANN. Turtles all the way down.

Each of these agencies has technology, governance and policy impacts on CrossRef and the other DOI registration agencies, but here we will focus on the technological dependencies.

At the top of the diagram are a subset of the various DOI Registration Agencies. Each RA uses the DOI for a particular constituency (e.g. scholarly publishers) and application (e.g. citation). Sometimes these constituencies/applications overlap (as with mEDRA, CrossRef and DataCite), but sometimes they are orthogonal to the other RAs, as is the case with EIDR. All, however, are members of the IDF.

The IDF sets technical policies and development agendas for the DOI infrastructure. This includes recommendations about how RAs should display and link DOIs. Of course all of these decisions have an impact on the RAs. However, the IDF provides little technical infrastructure of its own as it has no full-time staff. Instead it outsources the operation of the system to CNRI. This includes the management of the doi.org domain, which the IDF owns.

The actual DOI infrastructure is hosted on a platform called the Handle System which was developed by and is currently run by CNRI. The Handle System is part of a quite complex and sophisticated platform for managing digital objects that was originally developed for DARPA. A subset of the Handle system is designated for use by DOIs and is identified by the “10” prefix (e.g. 10.5555/12345678). The Handle system itself is not based on HTTP (the web protocol). Indeed, one of the much touted features of the Handle System is that it isn’t based on any specific resolution technology. This was seen as a great virtue in the late 1990s when the DOI system was developed and the internet had just witnessed an explosion of seemingly transient, competing protocols (e.g. Gopher, WAIS, Archie, HyperWave/Hyper-G, HTTP, etc.). But what looked like a wild-west of protocols quickly settled into an HTTP hegemony. In practice, virtually all DOI interactions with the Handle system are via HTTP and so, in order to interact with the web, the Handle System employs a “Handle proxy” which translates back and forth between HTTP, and the native Handle system. This all may sound complicated, and the backend of the Handle system is really very sophisticated, but it turns out that the DOI really uses only a fraction of the Handle system’s features. In fact, the vast majority of DOI interactions merely use the Handle system as a giant lookup table which allows one to translate an identifier into a web location. For example, it will take a DOI Handle like this:


and redirect it to (as of this writing) the following URL:


This whole transformation is normally never seen by a user. It is handled transparently by the web browser, which does the lookup and redirection in the background using HTTP and talking to the Handle Proxy. In the late 1990s, even doing this simple translation quickly, at scale with a robust distributed infrastructure, was not easy. These days however we see dozens if not hundreds of “URL Shorteners” doing exactly the same thing at far greater scale than the Handle System.
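The “10” prefix convention mentioned above is easy to illustrate. The following is a minimal sketch, assuming a handle is simply a prefix and a suffix separated by the first slash; the function names are our own, and the non-DOI handle in the example is made up.

```python
def split_handle(handle):
    """Split a handle such as '10.5555/12345678' into (prefix, suffix)."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a handle: {handle!r}")
    return prefix, suffix


def is_doi(handle):
    # DOIs are the subset of handles whose prefix begins with '10.'
    prefix, _ = split_handle(handle)
    return prefix.startswith("10.")


print(split_handle("10.5555/12345678"))
print(is_doi("10.5555/12345678"))
print(is_doi("1234/abc"))  # hypothetical non-DOI handle
```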

It may seem a shame that more of the Handle System’s features are not used, but the truth is the much touted platform independence of the Handle System rapidly became more of a liability and impediment to persistence than an aid. To be blunt, if in X years a new technology comes out that supersedes the web, what do we think the societal priority is going to be?

  • To provide a robust and transparent transition from the squillions of existing HTTP URI identifiers that the entire world depends on?
  • To provide a robust and transparent transition from the tiny subset of Handle-based identifiers that are used by about a hundred million specialist resources?

Quite simply, the more the Handle/DOI systems diverge from common web protocols and practice, the more we will jeopardise the longevity of our so-called persistent identifiers.

So, in the end, DOI registration agencies really only use the Handle system for translating web addresses. All of the other services and features one might associate with DOIs (reference resolution, metadata lookup, content negotiation, OAI-PMH, REST APIs, CrossMark, CrossCheck, TDM Services, FundRef etc) are all provided at the RA level.

But this address resolution is still critical. And it is exactly what failed for many users on January 20th 2015. And to be clear, it wasn’t the robust and scalable Handle System that failed. It wasn’t the Handle Proxy that failed. And it certainly wasn’t any RA-controlled technology that failed. These systems were all up and running. What happened was that the standard handle proxy that the IDF recommends RAs use, “dx.doi.org”, was effectively rendered invisible to wide portions of the internet because the “doi.org” domain was not renewed. This underscores two important points.

The first is that it doesn’t much matter what precisely caused the outage. In this case it was an administrative error. But the effect would have been similar if the Handle proxies had failed or if the Handle system itself had somehow collapsed. In the end, CrossRef and all DOI registration agencies are existentially dependent on the Handle system running and being accessible.

The second is that the entire chain of dependencies from the RAs down through CNRI is also dependent on the DNS system which, in turn, is governed by ICANN. We should really not be making too much of the purported technology independence of the DOI and Handle systems. To be fair, this limitation is inherent to all persistent identifier schemes that aim to work with the web. It really is “turtles all the way down.”

What didn’t fail on January 19th/20th and why?

You may have noticed a lot of hedging in our description of the outage of January 19th/20th. For one thing, we use the term “rolling outage.” Access to the Handle Proxy via “dx.doi.org” was never completely unavailable during the period. As we’ve explained, this is because the error was discovered very quickly and the domain was renewed hours after it expired. The nature of DNS propagation meant that even as some DNS servers were deleting the “doi.org” entry, others were adding it back to their tables. In some ways this was really confusing because it meant it was difficult to predict where the system was working and where it wasn’t. Ultimately it all stabilised after the standard 48-hour DNS propagation cycle.

But there were also some Handle-based services that simply were not affected at all by the outage. During the outage, a few people asked us if there was an alternative way to resolve DOIs. The answer was “yes,” there were several. It turns out that “doi.org” is not the only DNS name that points to the Handle Proxy. People could easily substitute “dx.doi.org” with “dx.crossref.org” or “dx.medra.org” or “hdl.handle.net” and “resolve” any DOI. Many of CrossRef’s internal services use these alternative names and so the services continued to work. This is partly why we only discovered that “doi.org” was down from people reporting it on Twitter.
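The hostname substitution described above is mechanical enough to script. Here is a minimal sketch, assuming the three alternative hostnames named in this post still point at the Handle Proxy (that was true during the outage, but we make no promise about the future); the function name is our own.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostnames named in the text as also pointing at the Handle Proxy.
ALTERNATIVE_PROXIES = ["dx.crossref.org", "dx.medra.org", "hdl.handle.net"]


def alternative_urls(doi_url):
    """Return the same DOI link rewritten for each alternative proxy host."""
    parts = urlsplit(doi_url)
    return [urlunsplit(parts._replace(netloc=host)) for host in ALTERNATIVE_PROXIES]


for url in alternative_urls("http://dx.doi.org/10.5555/12345678"):
    print(url)
```

Only the hostname changes; the DOI itself, and therefore the path, is untouched.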

And, of course, there were other services that were not affected by the outage. CrossMark, the REST API, and CrossRef Metadata Search all continued to work during the outage.

Protecting ourselves

So what can we do to reduce our dependencies and/or the risks intrinsic to those dependencies?

Obviously, the simplest way to have avoided the outage would have been to ensure that the “doi.org” domain was set to automatically renew. That’s been done. Is there anything else we should do? A few ideas have been floated that might allow us to provide even more resilience. They range greatly in complexity and involvement.

  1. Provide well-publicised public status dashboards that show what systems are up and which clearly map dependencies so that people could, for instance, see that the doi.org server was not visible to systems that depended on it. Of course, if such a dashboard had been hosted at doi.org, nobody would have been able to connect to it. Stoopid turtles.
  2. Encourage DOI RAs to have their members point to Handle proxies using domain names under the RA’s control. Simply put, if CrossRef members had been using “dx.crossref.org” instead of “dx.doi.org”, then CrossRef DOIs would have continued to work throughout the outage of “doi.org”. The same with mEDRA, and the other RAs. This way each RA would have control over another critical piece of their infrastructure. It would also mean that if any single RA made a similar domain name renewal mistake, the impact would be isolated to a particular constituency. Finally, using RA-specific domains for resolving DOIs might also make it clear that different DOIs are managed by different RAs and might have different services associated with them. Perhaps CrossRef would spend less time supporting non-CrossRef DOIs?
  3. Provide a parallel, backup resolution technology that could be pointed to in the event of a catastrophic Handle System failure. For example we could run a parallel system based on PURLs, ARKs or another persist-able identifier infrastructure.
  4. Explore working with ICANN to get the handle resolvers moved under the special “.arpa” top level domain (TLD). This TLD (RFC 3172) is reserved for services that are considered to be “critical to the operation of the internet.” This is an option that was first discussed at a meeting of persistent identifier providers in 2011.

These are all tactical approaches to addressing the specific technical problem of the Handle System becoming unavailable, but they do not address deeper issues relating to our strategic dependence on several third parties. Even though the IDF and CNRI provide us with pretty simple and limited functionality, that functionality is critical to our operations and our claim to be providing persistent identifiers. Yet these technologies are not in our direct control. We had to scramble to get hold of people to fix the problem. For a while, we were not able to tell our users or members what was happening because we did not know ourselves.

The irony is that CrossRef was held to account, and we were in the firing line the entire time. Again, this was almost unavoidable. In addition to being the largest DOI RA, we are also the only RA that has any significant social media presence and support resources. Still, it meant that we were the public face of the outage while the IDF and CNRI remained in the background.

And this is partly why our board has encouraged us to investigate another option:

  5. Explore what it would take to remove CrossRef dependencies on the IDF and CNRI.

CrossRef is just part of a chain of dependencies that goes from our publisher members down through the IDF, CNRI and, ultimately, ICANN. Our claim to providing a persistent identifier structure depends entirely on the IDF and CNRI. Here we have explored some of the technical dependencies. But there are also complex governance and policy implications of these dependencies. Each organization has membership rules, guidelines and governance structures which can impact CrossRef members. Indeed, the IDF and CNRI are themselves members of groups (ISO and DONA, respectively) which might ultimately have policy or governance impact for DOI registration agencies. We will need to understand the strategic implications of these non-technical dependencies as well.

Note that the CrossRef board has merely asked us to “explore” what it would take to remove dependencies. They have not asked us to actually take any action. CrossRef has been massively supportive of the IDF and CNRI, and they have been massively supportive of us. Still, over the years we have all grown and our respective circumstances have changed. It is important that occasionally we question what we might have once considered to be axioms. As we discussed above, we use the term “persistent” which, in turn, is a synonym for “stubborn.” At the very least we need to document the inter-dependencies that we have so that we can understand just how stubborn we can reasonably expect our identifiers to be.

The outage of January 20th was a humbling experience. But in a way we were lucky: Forgetting to renew the domain name was a silly and prosaic way to partially bring down a persistent identifier infrastructure, but it was also relatively easy to fix. Inevitably, there was a little snark and some pointed barbs directed at us during the outage, but we were truly overwhelmed by the support and constructive criticism we received as well. We have also been left with a clear message that, in order for this good-will to continue, we need to follow-up with a public, detailed and candid analysis of our infrastructure and its dependencies. Consider this to be the first section of a multi-part report.

@kevingashley tweets asking for followup analysis

@WilliamKilbride tweets asking for followup and lessons learned

Image Credits

Turtle image CC-BY “Unrecognised MJ” from the Noun Project

Real-time Stream of DOIs being cited in Wikipedia


Watch a real-time stream of DOIs being cited (and “un-cited!”) in Wikipedia articles across the world: http://goo.gl/0AknMJ


For years we’ve known that the Wikipedia was a major referrer of CrossRef DOIs and about a year ago we confirmed that, in fact, the Wikipedia is the 8th largest referrer of CrossRef DOIs. We know that people follow the DOIs, too. This is despite only a fraction of Wikipedia citations to the scholarly literature using DOIs. So back in August we decided to create a Wikimedia Ambassador programme. The goal of the programme was to promote the use of persistent identifiers in citation and attribution in Wikipedia articles. We would do this through outreach and through the development of better citation-related tools.

Remember when we originally wrote about our experiments with the PLOS ALM code and how that has transitioned into the DOI Event Tracking Pilot? In those posts we mentioned that one of the hurdles in gathering information about DOI events is the actual process of polling third party APIs for activity related to millions of DOIs. Most parties simply wouldn’t be willing to handle the load of 100K API calls an hour. Besides, polling is a tremendously inefficient process: only a fraction of DOIs are ever going to generate events, but we’d have to poll for each of them, repeatedly, forever, to get an accurate picture of DOI activity. We needed a better way. We needed to see if we could reverse this process and convince some parties to instead “push” us information whenever they saw DOI related events (e.g. citations, downloads, shares, etc). If only we could convince somebody to try this…
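To get a feel for why polling doesn’t scale, here is a back-of-the-envelope sketch. The 100K calls an hour comes from the paragraph above; the figure of 10 million DOIs is an assumption chosen purely for illustration, and the function name is our own.

```python
def hours_per_polling_pass(doi_count, calls_per_hour):
    """Hours needed to ask one third-party API about every DOI once."""
    return doi_count / calls_per_hour


# Assumed figure for illustration: 10 million DOIs, one call per DOI.
hours = hours_per_polling_pass(10_000_000, 100_000)
print(hours)  # 100.0 hours, a little over four days per pass
```

And that is a single pass against a single source. Any event that happens just after a DOI has been polled waits a full pass before it is noticed.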

Wikipedia DOI Events

In December 2014 we took the opportunity of the 2014 PLOS/CrossRef ALM Workshop in San Francisco to meet with Max Klein and Anthony Di Franco where we kicked off a very exciting project.

There’s always someone editing a Wikipedia somewhere in the world. In fact, you can see a dizzying live stream of edits. We thought that given that there are so many DOIs in Wikipedia, that live stream may contain some diamonds (DOIs are made of diamond, that’s how they can be persistent). Max and Anthony went away and came back with a demo that contains a surprising amount of DOI activity.

That demo is evolving into a concrete service, called Cocytus. It is running at Wikimedia Labs monitoring live edits as you read this.

For now we’re feeding that data into the DOI Events Collection app (which is an off-shoot of the Chronograph project). We are in the process of modifying the Lagotto code so that we can instead push those events into the DOI Event Tracking Instance.

The first DOI event we noticed was delightfully prosaic: The DOI for “The polymath project” is cited by the Wikipedia page for “Polymath Project”. Prosaic perhaps, but the authors of that paper probably want to know. Maybe they can help edit the page.

Or how about this. Someone wrote a paper about why people edit Wikipedia and then it was cited by Wikipedia. And then the citation was removed. The plot thickens…

We’re interested in seeing how DOIs are used outside of the formal scholarly literature. What does that mean? We don’t fully know, that’s the point. We have retractions in scholarly literature (and our CrossMark metadata and service allow publishers to record that), but it’s a bit different on Wikipedia. Edit wars are fought over … well you can see for yourself.

Citations can slip in and out of articles. We saw the DOI 10.1001/archpediatrics.2011.832 deleted from “Bipolar disorder in children”. If we’d not been monitoring the live feed (we had considered analysing snapshots of the Wikipedia in bulk) we might never have seen that. This is part of what non-traditional citations means, and it wasn’t obvious until we’d seen it.
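To make the idea of “cited” and “un-cited” events concrete, here is a minimal sketch that compares two revisions of an article. It is not the Cocytus code; the regular expression is a loose approximation of DOI syntax, and the function name and sample text are our own.

```python
import re

# Loose pattern for DOIs: '10.', a registrant code, '/', then a suffix.
# An approximation for illustration; it stops at whitespace and wiki markup.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s|}\]<>"]+')


def doi_events(old_text, new_text):
    """Compare two revisions and report which DOIs were cited or un-cited."""
    before = set(DOI_PATTERN.findall(old_text))
    after = set(DOI_PATTERN.findall(new_text))
    return {"cited": sorted(after - before), "uncited": sorted(before - after)}


old = "Some claim.<ref>{{cite journal |doi=10.1001/archpediatrics.2011.832 }}</ref>"
new = "Some claim."
print(doi_events(old, new))
```

A bulk snapshot only ever shows the `after` set, which is why a citation that is added and then removed between snapshots would leave no trace.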

You can see this activity on the Chronograph’s stream. Or check your favourite DOI. Please be aware that we’re only collecting newly added citations as of today. We do intend to go back and back-fill, but that may take some time- as it * cough * requires polling again.

Some Technical Things

A few interesting things that happened as a result of all this:

Secure URLs

SSL and HTTPS were invented so you could do things like banking on the web without fear of interception or tampering. As the web becomes a more important part of life, many sites are upgrading from HTTP to HTTPS, the secure version. This is not only because your confidential details may be tampered with, but because certain governments might not like you reading certain materials.

Because of this, Wikipedia decided last year to embark on an upgrade to HTTPS, and they are a certain way along the path. The IDF, who are responsible for running the DOI system, upgraded to HTTPS this summer, although most DOIs are still referred to by HTTP.

We met with Dario Taraborelli at the ALM workshop and discussed the DOI referral data that is fed into the Chronograph. We put two and two together and realised that Wikipedia was linking to DOIs (which are mostly HTTP) from pages which might be served over HTTPS. New policies in HTML5 specify that referrer URL headers shouldn’t be sent from HTTPS to HTTP (in case there was something secret in them). The upshot of this is, if someone’s browsing Wikipedia via HTTPS and clicks on a normal DOI, we won’t know that the user came from Wikipedia. Not a huge problem today, but as Wikipedia switches over to being entirely secure, we’re going to miss out on very useful information.

Fortunately, the HTML5 specification includes a way to fix this (without leaking sensitive information). We discussed this with Dario, and he did some research, and came up with a suggestion, which got discussed. It’s fascinating to watch a democratic process like this take place and take part in it.
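The referrer behaviour described above can be modelled in a few lines. This is a simplified sketch, not browser code. The policy names are assumed to follow the W3C Referrer Policy work (“no-referrer-when-downgrade” for the default behaviour, and “origin” as one way of revealing only the site rather than the page); this post doesn’t name the specific mechanism, so treat the choice of “origin” as our assumption. Real browsers also strip fragments and credentials, which is ignored here.

```python
from urllib.parse import urlsplit


def referrer_sent(source_url, target_url, policy="no-referrer-when-downgrade"):
    """Return the Referer value a browser would send, or None if it is withheld.

    Models only two policies: the default downgrade rule and 'origin'.
    """
    source = urlsplit(source_url)
    target = urlsplit(target_url)
    if policy == "origin":
        # Only scheme and host are revealed, never the page path.
        return f"{source.scheme}://{source.netloc}/"
    if policy == "no-referrer-when-downgrade":
        downgrade = source.scheme == "https" and target.scheme == "http"
        return None if downgrade else source_url
    raise ValueError(f"policy not modelled: {policy}")


page = "https://en.wikipedia.org/wiki/Polymath_Project"
print(referrer_sent(page, "http://dx.doi.org/10.5555/12345678"))
print(referrer_sent(page, "https://dx.doi.org/10.5555/12345678"))
print(referrer_sent(page, "http://dx.doi.org/10.5555/12345678", policy="origin"))
```

The first call shows the problem (nothing is sent), and the other two show the two ways around it: link to the HTTPS form of the DOI, or have the linking site declare a policy that shares only its origin.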

We’re waiting to see how the discussion turns out, and hope that it all works out so we can continue to report on how amazing Wikipedia is at sending people to scholarly literature.

How shall I cite thee?

Another discussion grew out of that process, and we started talking to a Wikipedian called Nemo (note to Latin scholars: we weren’t just talking to ourselves). Nemo (real name Federico Leva) had a few suggestions of his own. Another way to solve the referrer problem is by using HTTPS URLs (HTML5 allows browsers to send the referrer domain when going from HTTPS to HTTPS).

This means going back to all the articles that use DOIs and changing them from HTTP to HTTPS. Not as simple as it sounds, and it doesn’t sound simple. We started looking into how DOIs were cited on Wikipedia.

After some research we found that there are more ways than we expected to cite DOIs.

First, there’s the URL. You can see it in action in this article. URLs can take various forms.

  • http://dx.doi.org/10.5555/12345678
  • http://doi.org/10.5555/12345678
  • https://dx.doi.org/10.5555/12345678
  • https://doi.org/10.5555/12345678
  • http://doi.org/hvx
  • https://doi.org/hvx
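For the URL forms listed above, the HTTP to HTTPS change itself is a small rewrite. A minimal sketch, assuming only the two hostnames in the list need handling; the function name is our own, and this says nothing about how the change would actually be rolled out across Wikipedia.

```python
from urllib.parse import urlsplit, urlunsplit

# The two resolver hostnames that appear in the list above.
DOI_HOSTS = {"dx.doi.org", "doi.org"}


def to_https(url):
    """Rewrite an HTTP DOI link to HTTPS, leaving other URLs untouched."""
    parts = urlsplit(url)
    if parts.netloc.lower() in DOI_HOSTS and parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


print(to_https("http://dx.doi.org/10.5555/12345678"))
print(to_https("http://doi.org/hvx"))
```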

Second there’s the official template tag, seen in action here:

<ref name="SCI-20140731">{{cite journal |title=Sustained miniaturization and anatomical innovation in the dinosaurian ancestors of birds |url=http://www.sciencemag.org/content/345/6196/562 |date=1 August 2014 |journal=[[Science (journal)|Science]] |volume=345 |issue=6196 |pages=562–566 |doi=10.1126/science.1252243 |accessdate=2 August 2014 |last1=Lee |first1=Michael S. Y. |first2=Andrea|last2=Cau |first3=Darren|last3=Naish|first4=Gareth J.|last4=Dyke}}</ref>

There’s a DOI in there somewhere. This is the best way to cite DOIs, firstly as it’s actually a proper traditional citation and there’s nothing magic about DOIs, secondly because it’s a template tag and can be re-rendered to look slightly different if needed.

Third there’s the old official DOI template tag that’s now discouraged:

<ref name="Example2006">{{Cite doi|10.1146/annurev.earth.33.092203.122621}}</ref> 

And then there’s another one.


Knowing all this helps us find DOIs. But if we want to convert DOI links in Wikipedia to use HTTPS, it means that there are more template tags to modify and more pages to re-render.
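Here is a rough sketch of what finding DOIs across these citation styles might look like. It covers the URL forms, the `doi=` parameter of the citation template and the older `Cite doi` template shown above. The patterns are deliberately loose approximations of our own, not the code we actually run, and they will miss edge cases.

```python
import re

# One pattern per citation style; each captures the DOI (or short DOI) itself.
URL_FORM = re.compile(r'https?://(?:dx\.)?doi\.org/([^\s|}\]<>"]+)')
CITE_PARAM = re.compile(r'\|\s*doi\s*=\s*([^\s|}]+)')
CITE_DOI = re.compile(r'\{\{\s*[Cc]ite doi\s*\|\s*([^\s|}]+)')


def find_dois(wikitext):
    """Return every DOI found in the wikitext, in whichever style it was cited."""
    found = []
    for pattern in (URL_FORM, CITE_PARAM, CITE_DOI):
        found.extend(pattern.findall(wikitext))
    return found


sample = (
    "See http://dx.doi.org/10.5555/12345678 and "
    "{{cite journal |title=Example |doi=10.1126/science.1252243 |accessdate=2 August 2014}} "
    "and {{Cite doi|10.1146/annurev.earth.33.092203.122621}}"
)
print(find_dois(sample))
```

Short DOIs such as `hvx` come back as they are written, so they would still need to be expanded before they can be compared with full DOIs.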

Nemo also put DOIs on the Interwiki Map which should make automatically changing some of the URLs a lot easier.

We’re very grateful to Nemo for his suggestions and work on this. We’ll report back!

The elephant in the room

Those of you who know how DOIs work will have spotted an unsecured elephant in the room. When you visit a DOI, you visit the URL, which hits the DOI resolver proxy server, which returns a message to your browser to redirect to the landing page on the publisher’s site.

Securely talking to the DOI resolver by using HTTPS instead of HTTP means that no-one can eavesdrop and see which DOI you are visiting, or tamper with the result and send you off to a different page. But the page you are sent to will be, in nearly all cases, still HTTP. Upgrading infrastructure isn’t trivial, and, with over 4000 members (mostly publishers), most CrossRef DOIs will still redirect to standard HTTP pages for the foreseeable future.
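
The downgrade described above is easy to detect once you have the resolver URL and the `Location` header it returned. A minimal sketch, with a made-up function name and a made-up publisher URL:

```python
from urllib.parse import urlsplit


def is_downgrade(request_url: str, location: str) -> bool:
    """True when an HTTPS request is redirected to a plain-HTTP target."""
    requested = urlsplit(request_url).scheme.lower()
    target = urlsplit(location).scheme.lower()
    return requested == "https" and target == "http"


# The landing-page URL below is a placeholder, not a real publisher page.
print(is_downgrade("https://doi.org/10.5555/12345678",
                   "http://publisher.example/articles/12345678"))
```

The check only looks at the first hop; a landing page may itself redirect again.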

You can keep as secure as possible by using HTTPS Everywhere.


There’s lots going on, watch this space to see developments. Thanks for reading this, and all the links. We’d love to know what you think.


Not long after this blog post was published we saw something very interesting.

Interesting DOI

That’s no DOI. We like interesting things, but they can panic us. This turned out to be a great example of why this kind of thing can be useful. A minute’s digging and we found the article edit:

Wikipedia typo

It turns out that this was a typo: someone put a title when they should have put in a DOI. And, as the event shows, this was removed from the Wikipedia article.

CrossRef’s DOI Event Tracker Pilot


CrossRef’s “DOI Event Tracker Pilot”- 11 million+ DOIs & 64 million+ events. You can play with it at: http://goo.gl/OxImJa

Tracking DOI Events

So have you been wondering what we’ve been doing since we posted about the experiments we were conducting using PLOS’s open source ALM code? A lot, it turns out. About a week after our post, we were contacted by a group of our members from OASPA who expressed an interest in working with the system. Apparently they were all about to conduct similar experiments using the ALM code, and they thought that it might be more efficient and interesting if they did so together using our installation. Yippee. Publishers working together. That’s what we’re all about.

So we convened the interested parties and had a meeting to discuss what problems they were trying to solve and how CrossRef might be able to help them. That early meeting came to a consensus on a number of issues:

  • The group was interested in exploring the role CrossRef could play in providing an open, common infrastructure to track activities around DOIs. They were not interested in having CrossRef play a role in the value-add services of reporting on and interpreting the meaning of said activities.
  • The working group needed representatives from multiple stakeholders in the industry. Not just open access publishers from OASPA, but from subscription based publishers, funders, researchers and third party service providers as well.
  • That it was desirable to conduct a pilot to see if the proposed approach was both technically feasible and financially sustainable.

And so after that meeting, the “experiment” graduated to becoming a “pilot.” This CrossRef pilot is based on the premise that the infrastructure involved in tracking common information about “DOI events” can be usefully separated from the value-added services of analysing and presenting these events in the form of qualitative indicators. There are many forms of events and interactions which may be of interest. Service providers will wish to analyse, aggregate and present those in a range of different ways depending on the customer and their problem. The capture of the underlying events can be kept separate from those services.

In order to ensure that the CrossRef pilot is not mistaken for some sub rosa attempt to establish new metrics for evaluating scholarly output, we also decided to eschew any moniker that includes the word “metrics” or synonyms. So the “ALM Experiment” is dead. Long live the “DOI Event Tracker” (DET) pilot. Similarly PLOS’s open source “ALM software” has been resurrected under the name “Lagotto.”

The Technical Issues

CrossRef members are interested in knowing about “events” relating to the DOIs that identify their content. But our members face a now-classic problem. There are a large number of sources for scholarly publications (3k+ CrossRef members) and that list is still growing. Similarly, there are an unbounded number of potential sources for usage information. For example:

  • Supplemental and grey literature (e.g. data, software, working papers)
  • Orthogonal professional literature (e.g. patents, legal documents, governmental/NGO/IGO reports, consultation reports, professional trade literature).
  • Scholarly tools (e.g. citation management systems, text and data mining applications).
  • Secondary outlets for scholarly literature (institutional and disciplinary repositories, A&I services).
  • Mainstream media (e.g. BBC, New York Times).
  • Social media (e.g. Wikipedia, Twitter, Facebook, Blogs, Yo).

Finally, there is a broad and growing audience of stakeholders who are interested in seeing how the literature is being used. The audience includes publishers themselves as well as funders, researchers, institutions, policy makers and citizens.

Publishers (or other stakeholders) could conceivably each choose to run their own system to collect this information and redistribute it to interested parties. Or they can work with a vendor to do the same. But in either case, they would face the following problems:

  • The N sources will change. New ones will emerge. Old ones will vanish.
  • The N audiences will change. New ones will emerge. Old ones will vanish.
  • Each publisher/vendor will need to deal with N sources’ different APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the publishers/vendors and for the sources.
  • Each audience will need to deal with N publisher/vendor APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the audiences and for the publishers.
  • If publishers/vendors use different systems which in turn look at different sources, it will be difficult to compare or audit results across publishers/vendors.
  • If a journal moves from one publisher to another, then how are the metrics for that journal’s articles going to follow the journal?

And then there is the simple issue of scale. Most parties will be interested in comparing the data that they collect for their own content with data about their competitors. Hence, if they all run their own system, they will each be querying much more than their own data. If, for example, just the commercial third-party providers were interested in collecting data covering the formal scholarly literature, they would each find themselves querying the same sources for the same 80 million DOIs. To put this into perspective, refreshing the data for 10 million DOIs once a month would require sources to support ~14K API calls an hour. 60 million DOIs would require on the order of 100K API calls an hour. Current standard API caps for many of the sources that people are interested in querying hover around 2K per hour. We may see these sources lift that cap for exceptional cases, but they are unlikely to do so for many different clients, all of whom are querying essentially the same thing.
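
The arithmetic behind these figures is simple enough to sketch. The assumptions here are ours: one API call per DOI per refresh, a single source, and a 30-day month.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def calls_per_hour(doi_count: int, sources: int = 1) -> float:
    """API calls per hour needed to refresh every DOI once a month."""
    return doi_count * sources / HOURS_PER_MONTH


def dois_within_cap(cap_per_hour: int) -> int:
    """How many DOIs a client could refresh monthly under an hourly cap."""
    return cap_per_hour * HOURS_PER_MONTH


for n in (10_000_000, 60_000_000, 80_000_000):
    print(f"{n:>11,} DOIs -> {calls_per_hour(n):>9,.0f} calls/hour")

print(f"{dois_within_cap(2_000):,} DOIs fit under a 2K/hour cap")
```

Under these assumptions 10 million DOIs need roughly 14K calls an hour and 60 million need roughly 83K, the same order of magnitude as the figures quoted above. A 2K-per-hour cap covers about 1.4 million DOIs a month, and every additional source multiplies the load again.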

These issues typify the “multiple bilateral relationships” problem that CrossRef was founded to try and ameliorate. When we have many organizations trying to access the exact same APIs to process the exact same data (albeit to different ends), then it seems likely that CrossRef could help make the process more efficient.

Piloting A Proposed Solution

The CrossRef DET pilot aims to show the feasibility of providing a hub for the collection, storage and propagation of DOI events from multiple sources to multiple audiences.

Data Collection

  • Pull: DET will collect DOI event data from sources that are of common interest to the membership, but which are unlikely to make special efforts to accommodate the scholarly communications industry. Examples of this class of source include large, broadly popular services like Facebook, Twitter, VK, Sina Weibo, etc.
  • Push: DET will allow sources to send DOI event data directly to CrossRef in one of three ways:
    • Standard Linkback: Using standards that are widely used on the web. This will automatically enable linkback-aware systems like WordPress, Movable Type, etc. to alert DET to DOI events.
    • Scholarly Linkback: A to-be-defined augmented linkback-style API which will be optimized to work with scholarly resources and which will allow for more sophisticated payloads including other identifiers (e.g. ORCIDs, FundRefs), metadata, provenance information and authorization information. This system could be used by tools designed for scholarly communications. So, for example, it could be used by publisher platforms to distribute events related to downloads or comments within their discussion forums. It could also be used by third party scholarly apps like Zotero, Mendeley, Papers, Authorea, IRUS-UK, etc. in order to alert interested parties to events related to specific DOIs.
    • Redirect: DET will also be able to serve as a service discovery layer that will allow sources to push DOI event data directly to an appropriate publisher-controlled endpoint using the above scholarly linkback mechanism. This can be used by sources like repositories in order to send sensitive usage data directly to the relevant publishers.

Data Propagation

Parties may want to use the DET in order to propagate information about DOI events. The system will support two broad data propagation patterns:

  • one-to-many: DOI events that are commonly harvested (pulled) by the DET system from a single source will be distributed freely to anybody who queries the DET API. Similarly, sources that push DOI events via the standard or scholarly linkback mechanisms, will also propagate their DOI events openly to anybody who queries the DET API. DOI events that are propagated in either of these cases will be kept and logged by the DET system along with appropriate provenance information. This will be the most common, default propagation model for the DET system.
  • one-to-one: Sources of DOI events can also report (push) DOI event data directly to the owner of the relevant DOI if the DOI owner provides & registers a suitable end-point with the DET system. In these cases, data sources seeking to report information relating to a DOI will be redirected (with a suitable 30X HTTP status and relevant headers) to the end-point specified by the DOI owner. The DET system will not keep the request or provenance information. The one-to-one propagation model is designed to handle use cases where the source of the DOI event has put restrictions on the data and will only share the DOI events with the owner (registrant) of the DOI. This use case may be used, for example, by aggregators or A&I services that want to report confidential data directly back to a publisher. The advantage of the redirect mechanism is that CrossRef is not put into the position of having to secure sensitive data as said data will never reside on CrossRef systems.
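
The two propagation patterns can be sketched as a single decision made when an event is pushed. Everything in this sketch is hypothetical: the class and field names, the `restricted` flag a source would use to ask for one-to-one handling, the per-prefix endpoint registry, and the 202 and 404 responses. We picked 307 as one instance of the "30X" status mentioned above because it tells the client to repeat the same request, body included, at the new location.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class EventHub:
    # DOI prefix -> end-point registered by the DOI owner (hypothetical registry).
    endpoints: Dict[str, str] = field(default_factory=dict)
    # Events kept for open, one-to-many distribution.
    log: List[dict] = field(default_factory=list)

    def receive(self, doi: str, event: dict,
                restricted: bool = False) -> Tuple[int, Optional[str]]:
        """Return (HTTP status, Location header or None) for a pushed event."""
        prefix = doi.split("/", 1)[0]
        if restricted:
            endpoint = self.endpoints.get(prefix)
            if endpoint is None:
                # No owner end-point registered; the payload is not stored.
                return 404, None
            # one-to-one: redirect the source; nothing is kept on the hub.
            return 307, endpoint
        # one-to-many: keep the event so anybody can query it later.
        self.log.append({"doi": doi, **event})
        return 202, None


hub = EventHub(endpoints={"10.5555": "https://publisher.example/det"})
print(hub.receive("10.5555/12345678", {"source": "aggregator"}, restricted=True))
print(hub.receive("10.5555/12345678", {"source": "wikipedia"}))
```

The point of the design is visible in the restricted branch: the hub never touches the payload, so there is nothing sensitive for it to secure.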

Note that the two patterns can be combined. So, for example, a publisher might want to have public social media events reported to the DET and propagated accordingly, but also to have private third parties report confidential information directly to the publisher.

So Where Are We?

So to start with, the DET Working Group has grown substantially since the early days and we have representatives from a wide variety of stakeholders. The group includes:

  • Andi Rutherford, Mendeley
  • Cameron Neylon, PLOS
  • Chris Shillum, Elsevier
  • Dom Mitchell, Co-action Publishing
  • Euan Adie, Altmetric
  • Jennifer Lin, PLOS
  • Juan Pablo Alperin, PKP
  • Kevin Dolby, Wellcome Trust
  • Liz Ferguson, Wiley
  • Mark Patterson, eLife
  • Martin Fenner, PLOS
  • Mike Thelwall, U Wolverhampton
  • Rachel Craven, BMC
  • Richard O’Beirne, OUP
  • Ruth Ivimey-Cook, eLife
  • Victoria Rao, Elsevier

As well as the usual contingent of CrossRef cat-herders including: Geoffrey Bilder, Rachael Lammey & Joe Wass.

When we announced the then-DET experiment, we said that one of the biggest challenges would be to create something that scaled to industry levels. At launch, we only loaded in just over 317,500 CrossRef DOIs representing publications from 2014 and we could see the system was going to struggle. Since then Martin Fenner and Jennifer Lin at PLOS have been focusing on making sure that the Lagotto code scales appropriately and it is now humming along with just over 11.5 million DOIs for which we’ve gathered over 64 million “events.” We aren’t worried about scalability on that front any more.

We’ve also shown that third parties should be able to access the API to provide value added reporting and metrics. As a demonstration of this, PLOS configured a copy of its reporting software “Parascope” to point at the CrossRef DET instance. The next step we’re taking is to start testing the “push” API mechanism and the “point-to-point redirect” API mechanism. For the push API, we should have a really exciting demo available to show within the next few days. And on the point-to-point redirect, we have a sub-group exploring how the point-to-point redirect mechanism could potentially be used for reporting COUNTER stats as a complement to the SUSHI initiative.

The other major outstanding task we have before us is to calculate what the costs will be of running the DET system as a production service. In this case we expect to have some pretty accurate data to go on as we will have had close to half a year of running the pilot with a non-trivial number of DOIs and sources. Note that the work group is concerned to ensure that the underlying data from the system remains open to all. Keeping this raw data open is seen as critical to establishing trust in the metrics and reporting systems that third parties build on the data. The group has also committed to leaving the creation of value-add services to third parties. As such we have been focusing on exploring business models based around service-level-agreement backed versions of the API to complement the free version of the same API. The free API will come with no guarantees of uptime, performance characteristics or support. For those users that depend on the API in order to deliver their services, we will offer paid-for SLA-backed versions of the free APIs. We can then configure our systems so that we can independently scale these SLA-backed APIs in order to meet those SLAs.

Our goal is to have these calculations complete in time for the working group to make a recommendation to the CrossRef board meeting in July 2015.
Until then, we’ll use CrossTech as a venue for notifying people when we’ve hit new milestones or added new capabilities to the DET Pilot system.

Problems with dx.doi.org on January 20th 2015- what we know.

Hell’s teeth.

So today (January 20th, 2015) the DOI HTTP resolver at dx.doi.org started to fail intermittently around the world. The doi.org domain is managed by CNRI on behalf of the International DOI Foundation. This means that the problem affected all DOI registration agencies including CrossRef, DataCite, mEDRA etc. This also means that more popularly known end-user services like FigShare and Zenodo were affected. The problem has been fixed, but the fix will take some time to propagate throughout the DNS system. You can monitor the progress here:


Now for the embarrassing stuff…

At first lots of people were speculating that the problem had to do with somebody forgetting to renew the dx.doi.org domain name. Our information from CNRI was that the problem had to do with a mistaken change to a DNS record and that the domain name wasn’t the issue. We corrected people who were reporting domain name renewal as the cause, but eventually we learned that it was actually true. We have had it confirmed that the problem originated with CNRI manually renewing the domain name at the last minute. Ugh. CNRI will issue a statement soon. We’ll link to it as soon as they do. UPDATE (Jan 21st): CNRI has sent CrossRef a statement. They do not have it on their site yet, so we have included it below.

In the meantime, if you are having trouble resolving DOIs, a neat trick to know is that you can do so using the Handle system directly. For example:
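
A minimal sketch of that trick, assuming the Handle System’s public proxy at hdl.handle.net, which does not depend on the doi.org domain name. The helper name is ours, and the DOI is the 10.5555 test DOI that appears in the URL examples earlier.

```python
import re

# Matches the doi.org / dx.doi.org resolver prefix on a DOI link.
_DOI_URL_PREFIX = re.compile(r"^https?://(?:dx\.)?doi\.org/", re.IGNORECASE)


def handle_proxy_url(doi_or_url: str) -> str:
    """Rewrite a full DOI, or a DOI resolver link, as a Handle proxy URL."""
    doi = _DOI_URL_PREFIX.sub("", doi_or_url.strip())
    return "http://hdl.handle.net/" + doi


print(handle_proxy_url("http://dx.doi.org/10.5555/12345678"))
print(handle_proxy_url("10.5555/12345678"))
```

This works because DOIs are handles. The sketch is only meant for full DOIs of the form `10.xxxx/suffix`, not for the shortened forms.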


CrossRef will, of course, also analyse what occurred, and issue a public report as well. Obviously, this report will include an analysis of how the outage affected DOI referrals to our members.

The amazingly cool thing is that everybody online has been very supportive and has helped us to diagnose the problem. Some have even said that the event underscores a point we often make about so-called “persistent-identifiers”- which is that they are not magic technology; the “persistence” is the result of a social contract. We like to say that CrossRef DOIs are as persistent as CrossRef staff. Well, to that phrase we have to add “and IDF staff” and “CNRI staff” and “ICANN staff”. It is turtles all the way down.

We don’t want to dismiss this event as an inevitable consequence of interdependent systems. And we don’t want to pass the buck. We need to learn something practical from this. How can we guard against this type of problem in the future? Again, people following this issue on Twitter have already been helping with suggestions and ideas. Can we crowd-source the monitoring of persistent identifier SLAs? Could we leverage Wikipedia, Wikidata or something similar to monitor critical identifiers and other infrastructure like purls, DOIs, handles, PMIDs, perma.cc, etc? Should we be looking at designating special exceptions to the normal rules governing DNS names? Do we need to distribute the risk more? Or is it enough (cough) to simply ensure that somebody, somewhere in the dependency chain has enabled DNS protection and auto-renewal for critical infrastructure DNS names?

Truly, we are humbled. For all the redundancy built into our systems (multiple servers, multiple hosting sites, RAID drives, redundant power), we were undone by a simple administrative task. CrossRef, IDF and CNRI- we all feel a bit crap. But we’ll get back. We’ll fix things. And we’ll let you know how we do it.

We will update this space as we know more. We will also keep people updated on twitter on @CrossRefNews. And we will report back in detail as soon as we can.

CNRI Statement

"The doi.org domain name was inadvertently allowed to expire for a brief period this morning (Jan 20). It was reinstated shortly after 9am this morning as soon as the relevant CNRI employee learned of it. A reminder email sent earlier this month to renew the registration was apparently missed. We sincerely apologize for any difficulties this may have caused. The domain name has since been placed on automatic renewal, which should prevent any repeat of this event."

Linking data and publications

Do you want to see if a CrossRef DOI (typically assigned to publications) refers to DataCite DOIs (typically assigned to data)? Here you go:


Conversely, do you want to see if a DataCite DOI refers to CrossRef DOIs? Voilà:



“How can we effectively integrate data into the scholarly record?” This is the question that has, for the past few years, generated an unprecedented amount of handwringing on the part of researchers, librarians, funders and publishers. Indeed, this week I am in Amsterdam to attend the 4th RDA plenary in which this topic will no doubt again garner a lot of deserved attention.

We hope that the small example above will help push the RDA’s agenda a little further. Like the recent ODIN project, it illustrates how we can simply combine two existing scholarly infrastructure systems to build important new functionality for integrating research objects into the scholarly literature.

Does it solve all of the problems associated with citing and referring to data? Can the various workgroups at RDA just cancel their data citation sessions and spend the week riding bikes and gorging on croquettes? Of course not. But my guess is that by simply integrating DataCite and CrossRef in this way, we can make a giant push in the right direction.

There are certainly going to be differences between traditional citation and data citation. Some even claim that citing data isn’t “as simple as citing traditional literature.” But this is a caricature of traditional citation. If you believe this, go off and peruse the MLA, Chicago, Harvard, NLM and APA citation guides. Then read Anthony Grafton’s The Footnote. Are you back yet? Good, so let’s continue…

Citation of any sort is a complex issue- full of subtleties, edge-cases, exceptions, disciplinary variations and kludges. Historically, the way to deal with these edge-cases has been social, not technical. For traditional literature we have simply evolved and documented citation practices which generally make contextually-appropriate use of the same technical infrastructure (footnotes, endnotes, metadata, etc.). I suspect the same will be true in citing data. The solutions will not be technical, they will mostly be social. Researchers and publishers will evolve new, contextually appropriate mechanisms to use existing infrastructure to deal with the peculiarities of data citation.

Does this mean that we will never have to develop new systems to handle data citation? Possibly. But I don’t think we’ll know what those systems are or how they should work until we’ve actually had researchers attempting to use and adapt the tools we have.

Technical background

About five years ago, CrossRef and DataCite explored the possibility of exposing linkages between DataCite and CrossRef DOIs. Accordingly, we spent some time trying to assemble an example corpus that would illustrate the power of interlinking these identifiers. We encountered a slight problem. We could hardly find any examples. At that time, virtually nobody cited data with DataCite DOIs and, if they did, the CrossRef system did not handle them properly. We had to sit back and wait a while.

And now the situation has changed.

This demonstrator harvests DataCite DOIs using their OAI-PMH API and links them in a graph database with CrossRef DOIs. We have exposed this functionality on the “labs” (i.e. experimental) version of our REST API as a graph resource. So…

You can get a list of CrossRef DOIs that refer to DataCite DOIs as follows:


And the converse:


Caveats and Weasel Words

  • We have not finished indexing all the links.
  • The API is currently a very early labs project. It is about as reliable as a devolution promise from Westminster.
  • The API is run on a pair of raspberry-pi’s connected to the internet via bluetooth.
  • It is not fast.
  • The representation and the API are under active development.

    Things will change. Watch the CrossRef Labs site for updates on this collaboration with DataCite.
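
To give a feel for what the graph resource does, here is a toy in-memory version of the idea: every link is stored once and indexed in both directions, so "which DOIs does this one refer to?" and "which DOIs refer to this one?" are equally cheap to answer. This is an illustration only, not the graph database or the API behind the labs service, and the DOIs are made-up placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Set


class LinkGraph:
    """Directed DOI-to-DOI links, indexed in both directions."""

    def __init__(self) -> None:
        self._refers_to: Dict[str, Set[str]] = defaultdict(set)
        self._referred_by: Dict[str, Set[str]] = defaultdict(set)

    def add_link(self, source_doi: str, target_doi: str) -> None:
        # DOIs are case-insensitive, so normalise before storing.
        source, target = source_doi.lower(), target_doi.lower()
        self._refers_to[source].add(target)
        self._referred_by[target].add(source)

    def refers_to(self, doi: str) -> List[str]:
        return sorted(self._refers_to.get(doi.lower(), ()))

    def referred_by(self, doi: str) -> List[str]:
        return sorted(self._referred_by.get(doi.lower(), ()))


graph = LinkGraph()
graph.add_link("10.5555/article-1", "10.5072/dataset-1")  # article cites dataset
graph.add_link("10.5072/dataset-1", "10.5555/article-2")  # dataset cites article
print(graph.refers_to("10.5555/article-1"))
print(graph.referred_by("10.5555/article-2"))
```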

Citation needed

Remember when I said that the Wikipedia was the 8th largest referrer of DOI links to published research? This despite only a fraction of eligible references in the free encyclopaedia using DOIs.

We aim to fix that. CrossRef and Wikimedia are launching a new initiative to better integrate scholarly literature in the world’s largest public knowledge space, Wikipedia.

This work will help promote standard links to scholarly references within Wikipedia, which persist over time by ensuring consistent use of DOIs and other citation identifiers in Wikipedia references. CrossRef will support the development and maintenance of Wikipedia’s citation tools. This work will include bug fixes and performance improvements for existing tools, extending the tools to enable Wikipedia contributors to more easily look up and insert DOIs, and providing a “linkback” mechanism that alerts relevant parties when a persistent identifier is used in a Wikipedia reference.

In addition, CrossRef is creating the role of Wikimedia Ambassador (modeled after Wikimedian-in-Residence) to act as liaison with the Wikimedia community, promote use of scholarly references on Wikipedia, and educate about DOIs and other scholarly identifiers (ORCIDs, PubMed IDs, DataCite DOIs, etc) across Wikimedia projects.

Starting today, CrossRef will be working with Daniel Mietchen to coordinate CrossRef’s Wikimedia-related activities. Daniel’s team will be composed of Max Klein and Matt Senate, who will work to enhance Wikimedia citation tools, and will share the role of Wikipedia ambassador with Dorothy Howard.

Since the beginnings of Wikipedia, Daniel Mietchen has worked to integrate scholarly content into Wikimedia projects. He is part of an impressive community of active Wikipedians and developers who have worked extensively on linking Wikipedia articles to the formal literature and other scholarly resources. We’ve been talking to him about this project for nearly a year, and are happy to finally get it off the ground.


Matt, Max and Daniel at #wikimania2014. Photo by Dorothy.



Many Metrics. Such Data. Wow.

CrossRef Labs loves to be the last to jump on an internet trend, so what better than to combine the Doge meme with altmetrics? Want to know how many times a CrossRef DOI is cited by the Wikipedia? http://det.labs.crossref.org/works/doi/10.1371/journal.pone.0086859

Or how many times one has been mentioned in Europe PubMed Central?


Or DataCite?



Back in 2011 PLOS released its awesome ALM system as open source software (OSS). At CrossRef Labs, we thought it might be interesting to see what would happen if we ran our own instance of the system and loaded it up with a few CrossRef DOIs. So we did. And the code fell over. Oops. Somehow it didn’t like dealing with 10 million DOIs. Funny that.

But the beauty of OSS is that we were able to work with PLOS to scale the code to handle our volume of data. CrossRef contracted with Cottage Labs and we both worked with PLOS to make changes to the system. These eventually got fed back into the main ALM source on GitHub. Now everybody benefits from our work. Yay for OSS.

So if you want to know technical details, skip to Details for Propellerheads. But if you want to know why we did this, and what we plan to do with it, read on.


There are (cough) some problems in our industry that we can best solve with shared infrastructure. When publishers first put scholarly content online, they used to make bilateral reference linking agreements. These agreements allowed them to link citations using each other’s proprietary reference linking APIs. But this system didn’t scale. It was too time-consuming to negotiate all the agreements needed to link to other publishers. And linking through many proprietary citation APIs was too complex and too fragile. So the industry founded CrossRef to create a common, cross-publisher citation linking API. CrossRef has since obviated the need for bilateral linking arrangements.

So-called altmetrics look like they might have similar characteristics. You have ~4000 CrossRef member publishers and N sources (e.g. Twitter, Mendeley, Facebook, CiteULike, etc.) where people use (e.g. discuss, bookmark, annotate, etc.) scholarly publications. Publishers could conceivably each choose to run their own system to collect this information. But if they did, they would face the following problems:

  • The N sources will be volatile. New ones will emerge. Old ones will vanish.
  • Each publisher will need to deal with each source’s different APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the publishers and for the sources.
  • If publishers use different systems which in turn look at different sources, it will be difficult to compare results across publishers.
  • If a journal moves from one publisher to another, then how are the metrics for that journal’s articles going to follow the journal?

This isn’t a complete list, but it shows that there might be some virtue in publishers sharing an infrastructure for collecting this data. But what about commercial providers? Couldn’t they provide these ALM services? Of course – and some of them currently do. But normally they look on the actual collection of this data as a means to an end. The real value they provide is in the analysis, reporting and tools that they build on top of the data. CrossRef has no interest in building front-ends to this data. If there is a role for us to play here, it is simply in the collection and distribution of the data.

No, really, WHY?

Aren’t these altmetrics an ill-conceived and meretricious idea? By providing this kind of information, isn’t CrossRef just encouraging feckless, neoliberal university administrators to hasten academia’s slide into a Stakhanovite dystopia? Can’t these systems be gamed?


takes deep breath. wipes spittle from beard

These are all serious concerns. Goodhart’s Law and all that… If a university’s appointments and promotion committee is largely swayed by Impact Factor, it won’t improve a thing if they substitute or supplement Impact Factor with altmetrics. As Amy Brand has repeatedly pointed out, the best institutions simply don’t use metrics this way at all (PowerPoint presentation). They know better.

But yes, it is still likely that some powerful people will come to lazy conclusions based on altmetrics. And following that, other lazy, unscrupulous and opportunistic people will attempt to game said metrics. We may even see an industry emerge to exploit this mess and provide the scholarly equivalent of SEO. Feh. Now I’m depressed and I need a drink.

So again, why is CrossRef doing this? Though we have our doubts about how effective altmetrics will be in evaluating the quality of content, we do believe that they are a useful tool for understanding how scholarly content is used and interpreted. The most eloquent arguments against altmetrics for measuring quality inadvertently make the case for altmetrics as a tool for monitoring attention.

Critics of altmetrics point out that much of the attention that research receives outside of formal scholarly communications channels can be ascribed to:

  • Puffery. Researchers and/or university/publisher “PR wonks” over-promoting research results.
  • Innocent misinterpretation. A lay audience simply doesn’t understand the research results.
  • Deliberate misinterpretation. Ideologues misrepresent research results to support their agendas.
  • Salaciousness. The research appears to be about sex, drugs, crime, video games or other popular bogeymen.
  • Neurobollocks. A category unto itself these days.

In short, scholarly research might be misinterpreted. Shock horror. Ban all metrics. Whew. That won’t happen again.

Scholarly research has always been discussed outside of formal scholarly venues. Both by scholars themselves and by interested laity. Sometimes these discussions advance the scientific cause. Sometimes they undermine it. The University of Utah didn’t depend on widespread Internet access or social networks to promote yet-to-be peer-reviewed claims about cold fusion. That was just old-fashioned analogue puffery. And the Internet played no role in the Laetrile or DMSO crazes of the 1980s. You see, there were once these things called “newspapers.” And another thing called “television.” And a sophisticated meatspace-based social network called a “town square.”

But there are critical differences between then and now. As citizens get more access to the scholarly literature, it is far more likely that research is going to be discussed outside of formal scholarly venues. Now we can build tools to help researchers track these discussions. Now researchers can, if they need to, engage in the conversations as well. One would think that conscientious researchers would see it as their responsibility to remain engaged, to know how their research is being used. And especially to know when it is being misused.

That isn’t to say that we expect researchers will welcome this task. We are no Pollyannas. Researchers are already famously overstretched. They barely have time to keep up with the formally published literature. It seems cruel to expect them to keep up with the firehose of the Internet as well.

Which gets us back to the value of altmetrics tools. Our hope is that, as altmetrics tools evolve, they will provide publishers and researchers with an efficient mechanism for monitoring the use of their content in non-traditional venues. Just in the way that citations were used before they were distorted into proxies for credit and kudos.

We don’t think altmetrics are there yet. Partly because some parties are still tantalized by the prospect of replacing one metric with another. But mostly because the entire field is still nascent. People don’t yet know how the information can be combined and used effectively. So we still make naive assumptions such as “link=like” and “more=better.” Surely it will eventually occur to somebody that, instead, there may be a connection between repeated headline-grabbing research and academic fraud. A neuroscientist might be interested in a tool that alerts them if the MRI scans in their research paper are being misinterpreted on the web to promote neurobollocks. An immunologist may want to know if their research is being misused by the anti-vaccination movement. Perhaps the real value in gathering this data will be seen when somebody builds tools to help researchers DETECT puffery, social-citation cabals, and misinterpretation of research results?

But CrossRef won’t be building those tools. What we might be able to do is help others overcome another hurdle that blocks the development of more sophisticated tools: getting hold of the needed data in the first place. This is why we are dabbling in altmetrics.

Wikipedia is already the 8th largest referrer of CrossRef DOIs. Note that this doesn’t just mean that the Wikipedia cites lots of CrossRef DOIs, it means that people actually click on and follow those DOIs to the scholarly literature. As scholarly communication transcends traditional outlets and as the audience for scholarly research broadens, we think that it will be more important for publishers and researchers to be aware of how their research is being discussed and used. They may even need to engage more with non-scholarly audiences. In order to do this, they need to be aware of the conversations. CrossRef is providing this experimental data source in the hope that we can spur the development of more sophisticated tools for detecting and analyzing these conversations. Thankfully, this is an inexpensive experiment to conduct – largely thanks to the decision on the part of PLOS to open source its ALM code.

What Now?

CrossRef’s instance of PLOS’s ALM code is an experiment. We mentioned that we had encountered scalability problems and that we had resolved some of them. But there are still big scalability issues to address. For example, assuming a response time of 1 second, if we wanted to poll the English-language version of the Wikipedia to see what had cited each of the 65 million DOIs held in CrossRef, the process would take years to complete. But this is how the system is designed to work at the moment. It polls various source APIs to see if a particular DOI is “mentioned”. Parallelizing the queries might reduce the amount of time it takes to poll the Wikipedia, but it doesn’t reduce the work. Another obvious way in which we could improve the scalability of the system is to add a push mechanism to supplement the pull mechanism. Instead of going out and polling the Wikipedia 65 million times, we could establish a “scholarly linkback” mechanism that would allow third parties to alert us when DOIs and other scholarly identifiers are referenced (e.g. cited, bookmarked, shared). If the Wikipedia used this, then even in an extreme case scenario (i.e. everything in Wikipedia cites at least one CrossRef DOI), this would mean that we would only need to process ~ 4 million trackbacks.
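To put rough numbers on that, here is a back-of-the-envelope sketch. The 65 million DOIs, the 1 second response time and the ~4 million figure come from the paragraph above. The worker counts are made-up illustrations, and the function is ours, not a description of how the ALM code actually schedules its requests.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polling_days(requests: int, seconds_per_request: float = 1.0, workers: int = 1) -> float:
    """Wall-clock days needed to issue `requests` calls split evenly over `workers`."""
    return requests * seconds_per_request / workers / SECONDS_PER_DAY


DOIS_IN_CROSSREF = 65_000_000   # figure quoted in the text
WIKIPEDIA_PUSHES = 4_000_000    # extreme-case number of linkback notifications

for workers in (1, 10, 100):    # hypothetical levels of parallelism
    days = polling_days(DOIS_IN_CROSSREF, workers=workers)
    print(f"{workers:>3} worker(s): {days:6.1f} days to poll every DOI once")

# Parallelism shortens the wait, but the number of requests stays the same.
# A push model changes the number of messages itself.
share = WIKIPEDIA_PUSHES / DOIS_IN_CROSSREF
print(f"push would involve {share:.1%} as many messages as polling")
```

A single pass at one request per second comes to roughly 752 days, which is where “years” comes from once you want to repeat the pass to pick up new citations.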

The other significant advantage of adding a push API is that it would take the burden off of CrossRef to know what sources we want to poll. At the moment, if a new source comes online, we’d need to know about it and build a custom plugin to poll their data. This needlessly disadvantages new tools and services as it means that their data will not be gathered until they are big enough for us to pay attention to. If the service in question addresses a niche of the scholarly ecosystem, they may never become big enough. But if we allow sources to push data to us using a common infrastructure, then new sources do not need to wait for us to take notice before they can participate in the system.

Supporting (potentially) many new sources will raise another technical issue- tracking and maintaining the provenance of the data that we gather. The current ALM system does a pretty good job of keeping data, but if we ever want third parties to be able to rely on the system, we probably need to extend the provenance information so that the data is cheaply and easily auditable.
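As a thought experiment, a pushed “scholarly linkback” could be a small record that carries its own provenance. Everything in this sketch is an assumption on our part: the field names, the source name, the placeholder DOI and the idea of fingerprinting the record are illustrative only, and none of it is part of the current ALM code.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Linkback:
    """One pushed notification that a source has referenced a DOI (hypothetical schema)."""
    doi: str          # the identifier that was referenced
    source: str       # who is telling us
    source_url: str   # where the reference appears
    relation: str     # e.g. "cited", "bookmarked", "shared"
    observed_at: str  # ISO 8601 timestamp supplied by the source

    def fingerprint(self) -> str:
        """Stable hash of the record, so an auditor can check it was not altered."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


event = Linkback(
    doi="10.5555/12345678",                          # placeholder DOI
    source="example-wiki",                           # made-up source name
    source_url="http://wiki.example.org/Some_Page",  # made-up URL
    relation="cited",
    observed_at="2014-02-24T12:00:00Z",
)
print(event.fingerprint())
```

Storing the fingerprint next to the record is one cheap way to make the gathered data auditable later. A production service would presumably also want to record who submitted each record and how it was verified.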

Perhaps the most important thing we want to learn from running this experimental ALM instance is: what would it take to run the system as a production service? What technical resources would it require? How could they be supported? And from this we hope to gain enough information to decide whether the service is worth running and, if so, by whom. CrossRef is just one of several organizations that could run such a service, but it is not clear if it would be the best one. We hope that as we work with PLOS, our members and the rest of the scholarly community, we’ll get a better idea of how such a service should be governed and sustained.

Details for Propellerheads

Warning, Caveats and Weasel Words

The CrossRef ALM instance is a CrossRef Labs project. It is running on R&D equipment in a non-production environment administered by an orangutang on a diet of Redbulls and vodka.

So what is working?

The system has been initially loaded with 317,500+ CrossRef DOIs representing publications from 2014. We will load more DOIs in reverse chronological order until we get bored or until the system falls over again.

We have activated the following sources:

  • PubMed
  • DataCite
  • PubMedCentral Europe Citations and Usage

We have data from the following sources but will need some work to achieve stability:

  • Facebook
  • Wikipedia
  • CiteULike
  • Twitter
  • Reddit

Some of them are faster than others. Some are more temperamental than others. WordPress, for example, seems to go into a sulk and shut itself off after approximately 1,300 API calls.

    In any case, we will be monitoring and tweaking the sources as we gather data. We will also add new sources as we get requested API keys. We will probably even create one or two new sources ourselves. Watch this blog and we’ll update you as we add/tweak sources.

    Dammit, shut up already and tell me how to query stuff.

    You can log in to the CrossRef ALM instance simply using a Mozilla Persona (yes, we’d eventually like to support ORCID too). Once logged in, your account page will list an API key. Using the API key, you can do things like:


    And you will see that (as of this writing), said Nature article has been cited by the Wikipedia article here:

    <a href="http://en.wikipedia.org/wiki/HE0107-5240">http://en.wikipedia.org/wiki/HE0107-5240</a>

    PLOS has provided lovely detailed instructions for using the API. So, please, play with the API and see what you make of it. On our side we will be looking at how we can improve performance and expand coverage. We don’t promise much- the logistics here are formidable. As we said above, once you start working with millions of documents, the polling process starts to hit API walls quickly. But that is all part of the experiment. We appreciate your helping us and would like your feedback. We can be contacted at:



    DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right?

    The South Park movie, “Bigger, Longer & Uncut”, has a DOI:

    a) http://dx.doi.org/10.5240/B1FA-0EEC-C316-3316-3A73-L

    So does the pornographic movie, “Young Sex Crazed Nurses”:

    b) http://dx.doi.org/10.5240/4CF3-57AB-2481-651D-D53D-Q

    And the following DOI points to a fake article on a “Google-Based Alien Detector”:

    c) http://dx.doi.org/10.6084/m9.figshare.93964

    And the following DOI refers to an infamous fake article on literary theory:

    d) http://dx.doi.org/10.2307/466856

    This scholarly article discusses the entirely fictitious Australian “Drop Bear”:

    e) http://dx.doi.org/10.1080/00049182.2012.731307

    The following two DOIs point to the same article- the first DOI points to the final author version, and the second DOI points to the final published version:

    f) http://dx.doi.org/10.6084/m9.figshare.96546

    g) http://dx.doi.org/10.1007/s10827-012-0416-6

    The following two DOIs point to the same article- there is no apparent difference between the two copies:

    h) http://dx.doi.org/10.6084/m9.figshare.91541

    i) http://dx.doi.org/10.1038/npre.2012.7151.1

    Another example where two DOIs point to the same article and there is no apparent difference between the two copies:

    j) http://dx.doi.org/10.1364/AO.39.005477

    k) http://dx.doi.org/10.3929/ethz-a-005707391

    These journals assigned DOIs, but not through CrossRef:

    l) http://dx.doi.org/10.3233/BIR-2008-0496

    m) http://dx.doi.org/10.6084/m9.figshare.95564

    n) http://dx.doi.org/10.3205/cto000081

    These two DOIs are assigned to two different data sets by two different RAs:

    o) http://dx.doi.org/10.1107/S0108767312019034/eo5016sup1.xls

    p) http://dx.doi.org/10.1594/PANGAEA.726855

    This DOI appears to have been published, but was not registered until well after it was published. There were 254 unsuccessful attempts to resolve it in September 2012 alone:

    q) http://dx.doi.org/10.4233/uuid:995dd18a-dc5d-4a9a-b9eb-a16a07bfcc6d

    The owner of the prefix ‘10.4233’, who is responsible for the above DOI, had 378,790 attempted resolutions in September 2012, of which 377,001 were failures. The top 10 DOI failures for this prefix each garnered over 200 attempted resolutions. As of November 2012 the prefix had only registered 349 DOIs.
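To make the scale of that concrete, the arithmetic on the September 2012 figures quoted above is sketched here. The variable names are ours. The numbers are the ones in the text.

```python
attempted = 378_790   # attempted resolutions for the prefix, September 2012
failed = 377_001      # of which failures

succeeded = attempted - failed
failure_rate = failed / attempted

print(f"successful resolutions: {succeeded:,}")
print(f"failure rate: {failure_rate:.1%}")
```

In other words, fewer than 1 in 200 attempts to follow a DOI under this prefix actually worked that month.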

    Of the above 16 example DOIs 11 cannot be used for CrossCheck or CrossMark. 3 cannot be used with content negotiation. To search metadata for the above examples, you need to visit four sites:





    The 14 examples come from just 4 of the 8 existing DOI registration agencies (RAs). It is virtually impossible for somebody without specialized knowledge to tell which DOIs are CrossRef DOIs and which ones are not.


    So DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right? Wrong.

    The examples above are useful because they help elucidate some misconceptions about the DOI itself, the nature of the DOI registration agencies and, in particular, issues being raised by new RAs and new DOI allocation models.
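One way to see why the RA is so hard to spot: the only structural clue in a DOI string is its prefix (the part before the first slash), and the prefix names a registrant, not a registration agency. The sketch below simply pulls the prefixes out of a few of the example links above. The helper names are ours, and the script deliberately makes no network calls.

```python
from collections import defaultdict
from urllib.parse import urlparse

# A handful of the example links listed above, keyed by their letter.
EXAMPLES = {
    "c": "http://dx.doi.org/10.6084/m9.figshare.93964",
    "d": "http://dx.doi.org/10.2307/466856",
    "f": "http://dx.doi.org/10.6084/m9.figshare.96546",
    "g": "http://dx.doi.org/10.1007/s10827-012-0416-6",
    "o": "http://dx.doi.org/10.1107/S0108767312019034/eo5016sup1.xls",
    "p": "http://dx.doi.org/10.1594/PANGAEA.726855",
}


def doi_from_url(url: str) -> str:
    """Strip the resolver host, leaving the bare DOI."""
    return urlparse(url).path.lstrip("/")


def prefix_of(doi: str) -> str:
    """Everything before the first slash; the suffix may itself contain slashes."""
    return doi.split("/", 1)[0]


def group_by_prefix(examples: dict) -> dict:
    groups = defaultdict(list)
    for label, url in examples.items():
        groups[prefix_of(doi_from_url(url))].append(label)
    return dict(groups)


by_prefix = group_by_prefix(EXAMPLES)
for prefix, labels in sorted(by_prefix.items()):
    print(prefix, labels)
```

The output is a list of numbers such as 10.6084 and 10.1007. Nothing in those strings says which RA stands behind them, so you need outside knowledge to find out.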

    DOIs are just identifiers

    CrossRef’s dominance as the primary DOI registration agency makes it easy to assume CrossRef’s *particular* application of the DOI as a scholarly citation identifier is somehow intrinsic to the DOI. The truth is, the DOI has nothing specifically to do with citation or scholarly publishing. It is simply an identifier that can be used for virtually any application. DOIs could be used as serial numbers on car parts, as supply-chain management identifiers for videos and music or as cataloguing numbers for museum artifacts. The first two identifiers listed in the examples (a & b) illustrate this. They both belong to MovieLabs and are part of the EIDR (Entertainment Identifier Registry) effort to create a unique identifier for television and movie assets. At the moment, the DOIs that MovieLabs are assigning are B2B-focused and users are unlikely to see them in the wild. But we should recall that CrossRef’s application of DOIs was also initially considered a B2B identifier- but it has since become widely recognized and depended on by researchers, librarians and third parties. The visibility of EIDR DOIs could change rapidly as they become more popular.

    Multiple DOIs can be assigned to the same object

    There is no International DOI Foundation (IDF) prohibition against assigning multiple DOIs to the same object. At most the IDF suggests that RAs might coordinate to avoid duplicate assignments, but it provides no guidelines on how such cross-RA checks would work.

    CrossRef, in its particular application of the DOI, attempts to ensure that we don’t assign different DOIs to two copies of the same article, but that is designed to stop publishers from mistakenly making duplicate submissions. Even then, there are subtle exceptions to this rule- the same article, if legitimately published in two different issues (e.g. a regular issue and a thematic issue) will be assigned different DOIs. This is because, though the actual article content might be identical, the *context* in which it is cited is also important to record and distinguish. Finally, of course, we assign multiple DOIs to the same “object” when we assign book-level and chapter level DOIs. Or when we assign DOIs to components or reference work entries.

    The likelihood of multiple DOIs being assigned to the same object increases as we have multiple RAs. In the future we might legitimately have a monograph that has different Bowker DOIs for different e-book platforms (Kindle, iPad, Kobo) yet all three might share the same CrossRef DOI for citation purposes.

    Again, the examples show this already happening. The examples f & g are assigned by DataCite (via FigShare) and CrossRef respectively. The first identifies the author version and was presumably assigned by said author. The second identifies the publisher version and was assigned by the publisher.

    Although CrossRef, as a publisher-focused RA, might have historically proscribed the assignment of CrossRef DOIs to archive or author versions, there has never been and could never be any such restriction on other DOI RAs. These are legitimate applications of two citation identifiers to two versions of the same article.

    However, the next set of examples, h, i, j and k show what appears to be a slightly different problem. In these cases articles that appear to be in all aspects *identical* have been assigned two separate DOIs by different RAs. In one respect this is a logistical or technical problem- although CrossRef can check for such potential duplicate assignments within its own system, there is no way for us to do this across different RAs. But this is also a marketing and education problem- how do RAs with similar constituencies (publishers, researchers, librarians) and application of the DOI (scholarly citation) educate and inform their members about best practice in applying DOIs in that particular RA’s context?

    DOI registration agencies are not focused on content types, they are focused on constituencies and applications

    The examples f through k also illustrate another area of fuzzy thinking about RAs- that they are somehow built around particular content types. We routinely hear people mistakenly explain that the difference between CrossRef and DataCite is that “CrossRef assigns DOIs to journal articles” and that “DataCite assigns DOIs to data.” Sometimes this is supplemented with “and Bowker assigns DOIs to books.” This is nonsense. CrossRef assigns DOIs to data (example o) as well as conference proceedings, programs, images, tables, books, chapters, reference entries, etc. And DataCite covers a similar breadth of content types including articles (examples c, h, f, l, m). The difference between CrossRef, DataCite and Bowker is their constituencies and applications- not the content types they apply DOIs to. CrossRef’s constituency is publishers. DataCite’s constituency is data repositories, archives and national libraries. But even though CrossRef and DataCite have different constituencies, they share a similar application of the DOI- that is the use of DOI as citation identifiers. This is in contrast to MovieLabs whose application of the DOI is supply chain management.

    DOI registration agency constituencies and applications can overlap *or* be entirely separate

    Although CrossRef’s constituency is “publishers”, we are catholic in our definition of “publisher” and have several members who run repositories that also “publish” content such as working papers and other grey literature (e.g. Woods Hole Oceanographic Institution, University of Michigan Library, University of Illinois Library). DataCite’s constituency is data repositories, archives and national libraries, but this doesn’t stop DataCite (through CDL/FigShare) from working with the publisher, PLOS, on their “Reproducibility Initiative” which requires the archiving of article-related datasets. PLOS has announced that they will host all supplemental data sets on FigShare but will assign DOIs to those items through CrossRef.

    CrossRef’s constituency of publishers overlaps heavily with Airiti, JaLC, mEDRA, ISTIC and Bowker. In the case of all but Bowker we also overlap in our application of the DOI in the service of citation identification. Bowker, though it shares CrossRef’s constituency, uses DOIs for supply chain management applications.

    Meanwhile, EIDR is an outlier: its constituency does not overlap with CrossRef’s *and* its application of the DOI is different as well.

    The relationship between RA constituency overlap (e.g. scholarly publishers vs television/movie studios) and application overlap (e.g. citation identification vs. supply chain management) can be visualized as such:

    RA Application/Constituency overlap

    The differences (subtle or large) between the various RAs are not evident to anybody without a fairly sophisticated understanding of the identifier space and the constituencies represented by the various RAs. To the ordinary person these are all just DOIs, which in turn are described as simply being “persistent interoperable identifiers.”

    Which of course begs the question, what do we mean by “persistent” and “interoperable?”

    DOIs are only as persistent as the registration agency’s application warrants

    The word “persistent” does not mean “permanent.” Andrew Treloar is known to point out that the primary sense of the word “persistent” in the New Oxford American Dictionary is:

    Continuing firmly or obstinately in a course of action in spite of difficulty or opposition

    Yet presumably the IDF once chose to use the word “persistent” instead of “perpetual” or “permanent” for other reasons. “Persistence” implies longevity, without committing to “forever.”

    It may sound prissy, but it seems reasonable to expect that the useful life-expectancy for the identifier used for managing inventory of the movie “Young Sex Crazed Nurses” might be different than the life expectancy for the identifier used to cite Henry Oldenburg’s “Epistle Dedicatory” in the first issue of the Philosophical Transactions. In other words, some RAs have a mandate to be more “obstinate” than others and so their definitions of “persistence” may vary. Different RAs have different service level agreements.

    The problem is that ordinary users of the “persistent” DOI have no way of distinguishing between those DOIs that are expected to have a useful life of 5 years and those DOIs that are expected to have a useful lifespan of 300+ years. Unfortunately, if one of the more than 6 million non-CrossRef DOIs breaks today, it will likely be blamed on CrossRef.

    Similarly, if a DOI doesn’t work with an existing CrossRef service, like OpenURL lookup, CrossCheck, CrossMark or CrossRef Metadata Search, it will also be laid at the feet of CrossRef. This scenario is likely to become even more complex as different RAs provide different specialized services for their constituencies.

    Ironically, the converse doesn’t always apply. CrossRef oftentimes does not get credit for services that we instigated at the IDF level. For instance, FigShare has been widely praised for implementing content negotiation for DOIs even though this initiative had nothing to do with FigShare. Instead, it was implemented by DataCite with the prodding and active help of CrossRef (DataCite even used CrossRef’s code for a while). To be clear, we don’t begrudge praise for FigShare. We think FigShare is very cool- this just serves as an example of the confusion that is already occurring.


    DOIs are only “interoperable” at a least common denominator level of functionality

    There is no question that use of CrossRef DOIs has enabled the interoperability of citations across scholarly publisher sites. The extra level of indirection built into the DOI means that publishers do not have to worry about negotiating multiple bilateral linking agreements and proprietary APIs. Furthermore, at the mundane technical level of following HTTP links, publishers also don’t have to worry about whether the DOI was registered with mEDRA, DataCite or CrossRef as long as the DOI in question was applied with citation linking in mind.
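The “extra level of indirection” can be illustrated with a toy resolver. This is only a sketch of the idea, not of the Handle System that actually sits behind DOIs: the class, its methods and the example.org URLs are all invented for illustration.

```python
class ToyResolver:
    """Maps a persistent identifier to wherever the content currently lives."""

    def __init__(self):
        self._targets = {}

    def register(self, doi: str, url: str) -> None:
        """Create or update the single record that says where a DOI points."""
        self._targets[doi] = url

    def resolve(self, doi: str) -> str:
        return self._targets[doi]


resolver = ToyResolver()
doi = "10.1007/s10827-012-0416-6"   # example g above

# The publisher registers the article's current location (made-up URL).
resolver.register(doi, "http://publisher.example.org/articles/0416-6")

# A citing article stores only the DOI, never the publisher's URL.
citation = {"title": "Some citing paper", "cites": [doi]}

# The publisher later moves its content and updates one record.
resolver.register(doi, "http://new-platform.example.org/content/0416-6")

# Every existing citation still resolves, with no bilateral agreement needed.
for cited in citation["cites"]:
    print(cited, "->", resolver.resolve(cited))
```

The citing side never has to be touched when content moves, which is the whole point of linking through an identifier rather than a location.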

    However, what happens if somebody wants to use metadata to search for a particular DOI? What happens if they expect that DOI to work with content negotiation or to enable a CrossCheck analysis or show a CrossMark dialog or carry FundRef data? At this level, the purported interoperability of the DOI system falls apart. A publisher issuing DataCite DOIs cannot use CrossCheck. A user with a mEDRA DOI cannot use it with content negotiation. Somebody searching CrossRef Metadata Search or using CrossRef’s OpenURL API will not find DataCite records. Somebody depositing metadata in an RA other than CrossRef or DataCite will not be able to deposit ORCIDs.
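For readers who have not met content negotiation: instead of following a DOI link to a landing page, a client asks the resolver for metadata by sending an `Accept` header. The sketch below only builds such a request, using Python’s standard library, and does not send it. The `application/vnd.citationstyles.csl+json` media type is one that the CrossRef and DataCite implementations support. The function name is ours, and whether the RA behind a given DOI honors the header is exactly the interoperability gap described above.

```python
from urllib.request import Request

CSL_JSON = "application/vnd.citationstyles.csl+json"


def negotiated_request(doi: str, media_type: str = CSL_JSON) -> Request:
    """Build (but do not send) a request asking the DOI resolver for metadata."""
    return Request(f"http://dx.doi.org/{doi}", headers={"Accept": media_type})


req = negotiated_request("10.1007/s10827-012-0416-6")   # example g above
print(req.full_url)
print(req.get_header("Accept"))

# To actually send it: urllib.request.urlopen(req). An RA that does not
# support content negotiation will typically just redirect to the landing page.
```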

    There are no easy or cheap technical solutions to fix this level of incompatibility barring the creation of a superset of all RA functionality at the IDF level. But even if we had a technical solution to this problem- it isn’t clear that such a high level of interoperability is warranted across all RAs. The degree of interoperability that is desirable between RAs is only in proportion to the degree that they serve overlapping constituencies (e.g. publishers) or use the DOI for overlapping applications (e.g. citation).

    DOI Interoperability matters more for some registration agencies than others

    This raises the question of what it even means to be “interoperable” between different RAs that share virtually no overlap in constituencies or applications. In what meaningful sense do you make a DOI used for inventory control “interoperable” with a DOI used for identifying citable scholarly works? Do we want to be able to check “Young Sex Crazed Nurses” for plagiarism? Or let somebody know when the South Park movie has been retracted or updated? Do we need to alert somebody when their inventory of citations falls below a certain threshold? Or let them know how many copies of a PDF are left in the warehouse?

    The opposite, but equally vexing, issue arises for RAs that actually share constituencies and/or applications. CrossRef, DataCite and mEDRA have *all* built separate metadata search capabilities, separate deposit APIs, separate OpenURL APIs, and separate stats packages- *all* geared at handling scholarly citation linking.

    Finally, it seems a shame that a third party, like ORCID, who wants to enable researchers to add *any* DOI and its associated metadata to their ORCID profile, will end up having to interface with 4-5 different RAs.

    Summary and closing thoughts

    CrossRef was founded by publishers who were prescient in understanding that, as scholarly content moved online, there was the potential to add great value to publications by directly linking citations to the documents cited. However, publishers also realized that many of the architectural attributes that made the WWW so successful (decentralization, simple protocols for markup, linking and display, etc.), also made the web a fragile platform for persistent citation.

    The CrossRef solution to this dilemma was to introduce the use of the DOI identifier as a level of citation indirection in order to layer a persist-able citation infrastructure onto the web. The success of this mechanism has been evident at a number of levels. A first-order effect of the system is that it has allowed publishers to create reliable and persistent links between copies of publisher content. Indeed uptake of the CrossRef system by scholarly and professional publishers has been rapid and almost all serious scholarly publishers are now CrossRef members.

    The second order effects of the CrossRef system have also been remarkable. Firstly, just as researchers have long expected that any serious paper-based publication would include citations, now researchers expect that serious online scholarly publications will also support robust online citation linking. Secondly, some have adopted a cargo-cult practice of seeing the mere presence of a DOI on a publication as a putative sign of “citability” or “authority.” Thirdly, interest in use of the DOI as a linking mechanism has started to filter out to researchers themselves, thus potentially extending the use of CrossRef DOIs beyond being primarily a B2B citation convention.

    The irony is that although the DOI system was almost single-handedly popularized and promoted by CrossRef, the DOI brand is better known than CrossRef itself. We now find that new RAs like EIDR, DataCite and new services like FigShare are building on the DOI brand and taking it in new directions. As such the first and second order benefits of CrossRef’s pioneering work with DOIs are likely to be affected by the increasing activity of the new DOI RAs as well as the introduction of new models for assigning and maintaining DOIs.

    How can you trust that a DOI is persistent if different RAs have different conceptions of persistence? How can you expect the presence of a DOI to indicate “authority” or “scholarliness” if DOIs are being assigned to porn movies? How can you expect a DOI to point to the “published” version of an article when authors can upload and assign DOIs to their own copies of articles?

    It is precisely because we think that some of the qualities traditionally (and wrongly) accorded to DOIs (e.g. scholarly, published, stewarded, citable, persistent) are going to be diluted in the long term that we have focused so much of our recent attention on new initiatives that have a more direct and unambiguous connection to assessing the trustworthiness of CrossRef members’ content. CrossCheck and the CrossCheck logos are designed to highlight the role that publishers play in detecting and preventing academic fraud. The CrossMark identification service will serve as a signal to researchers that publishers are committed to maintaining their scholarly content as well as giving scholars the information they need to verify that they are using the most recent and reliable versions of a document. FundRef is designed to make the funding sources for research and articles transparent and easily accessible. And finally we have been both adjusting CrossRef’s branding and display guidelines as well as working with the IDF to refine its branding and display guidelines so as to help clearly differentiate different DOI applications and constituencies.

    Whilst it might be worrying to some that DOIs are being applied in ways that CrossRef has not expected and may not have historically endorsed, we should celebrate that the broader scholarly community is finally recognizing the importance of persist-able citation identifiers.

    These developments also serve to reinforce a strong trend that we have encountered in several guises before. That is, the complete scholarly citation record is made up of more than citations to the formally published literature. Our work on ORCID underscored that researchers, funding agencies, institutions and publishers are interested in developing a more holistic view of the manifold contributions that are integral to research. The “C” in ORCID stands for “contributor” and ORCID profiles are designed to ultimately allow researchers to record “products” which include not only formal publications, but also data sets, patents, software, web pages and other research outputs. Similarly, CrossRef’s analysis of the CitedBy references revealed that one in fifteen references in the scholarly literature published in 2012 included a plain, ordinary HTTP URI- clear evidence that researchers need to be able to cite informally published content on the web. If the trend in CitedBy data continues, then in two to three years one in ten citations will be of informally published literature.

    The developments that we are seeing are a response to the need that users have to persistently identify and cite the full gamut of content types that make up the scholarly literature. If we cannot persistently cite these content types, the scholarly citation record will grow increasingly porous and structurally unsound. We can either stand back and let these gaps be filled by other players under their terms and deal reactively with the confusion that is likely to ensue- or we can start working in these areas too and help to make sure that what gets developed interacts with the existing online scholarly citation record in a responsible way.