Geoffrey Bilder – 2015 March 02
Crossref’s “DOI Event Tracker Pilot”- 11 million+ DOIs & 64 million+ events. You can play with it at: http://goo.gl/OxImJa
So have you been wondering what we’ve been doing since we posted about the experiments we were conducting using PLOS’s open source ALM code? A lot, it turns out. About a week after our post, we were contacted by a group of our members from OASPA who expressed an interest in working with the system. Apparently they were all about to conduct similar experiments using the ALM code, and they thought that it might be more efficient and interesting if they did so together using our installation. Yippee. Publishers working together. That’s what we’re all about.
So we convened the interested parties and had a meeting to discuss what problems they were trying to solve and how Crossref might be able to help them. That early meeting came to a consensus on a number of issues:
And so after that meeting, the “experiment” graduated to becoming a “pilot.” This Crossref pilot is based on the premise that the infrastructure involved in tracking common information about “DOI events” can be usefully separated from the value-added services of analysing and presenting these events in the form of qualitative indicators. There are many forms of events and interactions which may be of interest. Service providers will wish to analyse, aggregate and present those in a range of different ways depending on the customer and their problem. The capture of the underlying events can be kept separate from those services.
In order to ensure that the Crossref pilot is not mistaken for some sub rosa attempt to establish new metrics for evaluating scholarly output, we also decided eschew any moniker that includes the word “metrics” or synonyms. So the “ALM Experiment” is dead. Long live the “”DOI Event Tracker” (DET) pilot. Similarly PLOS’s open source “ALM software” has been resurrected under the name “Lagotto.”
Crossref members are interested in knowing about “events” relating to the DOIs that identify their content. But our members face a now-classic problem. There are a large number of sources for scholarly publications (3k+ Crossref members) and that list is still growing. Similarly, there are an unbounded number of potential sources for usage information. For example:
Finally, there is a broad and growing audience of stakeholders who are interested in seeing how the literature is being used. The audience includes publishers themselves as well as funders, researchers, institutions, policy makers and citizens.
Publishers (or other stakeholders) could conceivably each choose to run their own system to collect this information and redistribute it to interested parties. Or they can work with a vendor to do the same. But either case, they would face the following problems:
And then there is the simple issue of scale. Most parties will be interested in comparing the data that they collect for their own content, with data about their competitors. Hence, if they all run their own system, they will each be querying much more than their own data. If, for example, just the commercial third-party providers were interested in collecting data covering the formal scholarly literature, they would each find themselves querying the same sources for the same 80 million DOIs. To put this into perspective, to refresh the data for 10 million DOIs once a month, would require sources to support ~ 14K API calls an hour. 60 million DOIs would require 100K API calls an hour. Current standard API caps for many of the sources that people are interested in querying hover around 2K per hour. We may see these sources lift that cap for exceptional cases, but they are unlikely to do so for many different clients all of whom are querying essentially the same thing.
These issues typify the “multiple bilateral relationships” problem that Crossref was founded to try and ameliorate. When we have many organizations trying to access the exact same APIs to process the exact same data (albeit to different ends), then it seems likely that Crossref could help make the process more efficient.
The Crossref DET pilot aims to show the feasibility of providing a hub for the collection, storage and propagation of DOI events from multiple sources to multiple audiences.
Parties may want to use the DET in order to propagate information about DOI events. The system will support two broad data propagation patterns:
Note that the two patterns can be combined. So, for example, a publisher might want to have public social media events reported to the DET and propagated accordingly, but to also to private third parties report confidential information directly to the publisher.
So to start with, the DET Working Group has grown substantially since the early days and we have representatives from a wide variety of stakeholders. The group includes:
As well as the usual contingent of Crossref cat-herders including: Geoffrey Bilder, Rachael Lammey & Joe Wass.
When we announced the then-DET experiment, we said that one of the biggest challenges would be to create something that scaled to industry levels. At launch, we only loaded in about 317,500+ Crossref DOIs representing publications from 2014 and we could see the system was going to struggle. Since then Martin Fenner and Jennifer Lin at PLOS have been focusing on making sure that the Lagotto code scales appropriately and now it is currently humming along with just over 11.5 million DOIs for which we’ve gathered over 64 million “events.” We aren’t worried about scalability on that front any more.
We’ve also shown that third parties should be able to access the API to provide value added reporting and metrics. As a demonstration of this, PLOS configured a copy of its reporting software “Parascope” to point at the Crossref DET instance. The next step we’re taking is to start testing the “push” API mechanism and the “point-to-point redirect” API mechanism. For the push API, we should have a really exciting demo available to show within the next few days. And on the point-to-point redirect, we have a sub-group exploring how the point-to-point redirect mechanism could potentially be used for reporting COUNTER stats as a compliment to the Sushi initiative.
The other major outstanding task we have before us is to calculate what the costs will be of running the DET system as a production service. In this case we expect to have some pretty accurate data to go on as we will have had close to half a year of running the pilot with a non-trivial number of DOIs and sources. Note that the work group is concerned to ensure that the underlying data from the system remains open to all. Keeping this raw data open as seen as critical to establishing trust in the metrics and reporting systems that third parties build on the data. The group has also committed to leaving the creation of value-add services to third parties. As such we have been focusing on exploring business models based around service-level-agreement backed versions of the API to complement the free version of the same API. The free API will come with no guarantees of uptime, performance characteristics or support. For those users that depend on the API in order to deliver their services, we will offer paid-for SLA-backed versions of the free APIs. We can then configure our systems so that we can independently scale these SLA-backed APIs in order to meet SLA agreements.
Our goal is to have these calculations complete in time for the working group to make a recommendation to the Crossref board meeting in July 2015.
Until then, we’ll use CrossTech as a venue for notifying people when we’ve hit new milestones or added new capabilities to the DET Pilot system.
2019 October 04
2019 September 13
2019 August 29