Dominika Tkaczyk – 2020 March 10
Detective Matcher stopped abruptly behind the corner of a short building, praying that his pounding heartbeat wouldn’t give away his presence. This missing DOI case was unlike any other before, keeping him awake for many seconds already. It took a great effort and a good amount of help from his clever assistant Fuzzy Comparison to make sense of the sparse clues provided by Miss Unstructured Reference, an elegant young lady with a shy smile, who begged him to take up this case at any cost.
The final confrontation was about to happen, the detective could feel it, and his intuition rarely misled him in the past. He was observing DOI
10.2307/257306, which matched Miss Reference’s description very well. So far, there was no indication that DOI had any idea he was being observed. He was leaning on a wall across the street in a seemingly nonchalant way, just about to put out his cigarette. Empty dark streets and slowly falling snow together created an excellent opportunity to capture the fugitive.
Suddenly, Matcher heard a faint rustling sound. Out of nowhere, another shady figure, looking very much like
10.5465/amr.1982.4285592, appeared in front of the detective, crossed the street and started running away. Matcher couldn’t believe his eyes. These two DOIs had identical authors, year and title. They were even wearing identical volume and issue! He quickly noticed minor differences: a slight alteration in the journal title and a missing second page number in one of the DOIs, but this was likely just a random mutation. How could he have missed the other DOI? And more importantly, which of them was the one worried Miss Reference simply couldn’t live without?
In an ideal world, the relationship between research outputs and DOIs is one-to-one: every research output has exactly one DOI assigned and each DOI points to exactly one research output.
As we all know too well, we do not live in a perfect world, and this one-to-one relationship is also sometimes violated. One way to violate it is to assign more than one DOI to the same object. This can cause problems.
First of all, if there are two DOIs referring to the same object, eventually they both might end up in different systems and datasets. As a result, merging data between data sources becomes an issue, because we can no longer rely on comparing the DOI strings alone.
Reference matching algorithms will also be confused when they encounter more than one DOI matching the input reference. They might end up assigning one DOI from the matching ones at random, or not assigning any DOI at all.
And finally, more than one DOI assigned to one object is hugely problematic for document-level metrics such as citation counts, and eventually affects h-indexes and impact factors. In practice, metrics are typically calculated per DOI, so when there are two DOIs pointing to one document, the citation count might be split between them, effectively lowering the count, and making every academic author’s biggest nightmare come true.
It seems we shouldn’t simply cover our eyes and pretend this problem does not exist. So what are we doing at Crossref to make the situation better?
Despite these efforts, we still see duplicates that are not explained by anything in the metadata. In this blog post, I will try to understand this problem better and assess how big it is. I also define three member-level metrics that can show how much a given member contributes to duplicates in the system and can flag members with unusually high fractions of duplicates.
For each sampled DOI, its metadata was used as the query in `query.bibliographic` in Crossref’s REST API, and the resulting item list was examined for duplicates.
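As a minimal sketch of this querying step, the snippet below builds a `query.bibliographic` request URL for the Crossref REST API `/works` endpoint. The reference string and the `rows` value are illustrative; the real study’s query construction may differ.

```python
from urllib.parse import urlencode

# Base endpoint of the (real) Crossref REST API for works.
BASE = "https://api.crossref.org/works"

def bibliographic_query_url(reference_string, rows=5):
    """Build a query.bibliographic request URL for a reference-like string.

    The top `rows` matching items returned by this query can then be
    examined for DOIs that all match the same reference.
    """
    params = {"query.bibliographic": reference_string, "rows": rows}
    return BASE + "?" + urlencode(params)

# Illustrative reference string, not one from the study.
url = bibliographic_query_url(
    "Smith, J. (1982). A study of organizations. J. Management 7(2), 121-135."
)
print(url)
```

Sending a GET request to the resulting URL (with a polite `mailto` parameter, per Crossref’s API etiquette) returns a JSON list of candidate items whose DOIs can be compared.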
In total, I tested 590 members and 524,496 DOIs. Among them, 4,240 DOIs (0.8%) were flagged as duplicates of other DOIs. This shows the problem exists, but is not huge.
I also analyzed separately two categories of duplicates:

- self-duplicates, where the duplicate DOIs were registered by the same member, and
- other-duplicates, where the duplicate DOIs were registered by different members.
Self-duplicates are more common: 3,603 (85%) of all detected duplicates are self-duplicates, and only 637 (15%) are other-duplicates. This is also good news: self-duplicates involve one member only, so they are easier to handle.
To explore the levels of self-duplicates among members, I used a custom member-level metric called self-duplicate index. Self-duplicate index is the fraction of self-duplicates among the member’s DOIs, in this case calculated over a sample.
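The self-duplicate index can be sketched as a simple fraction over a flagged sample. The data structure below is illustrative (it assumes each sampled DOI has already been flagged as a self-duplicate or not); it is not the study’s actual code or data.

```python
def self_duplicate_index(sample):
    """Fraction of sampled DOIs flagged as self-duplicates."""
    flagged = sum(1 for doi in sample if doi["self_duplicate"])
    return flagged / len(sample)

# Illustrative sample: one of four sampled DOIs is a self-duplicate.
sample = [
    {"doi": "10.2307/257306", "self_duplicate": True},
    {"doi": "10.5465/amr.1982.4285592", "self_duplicate": False},
    {"doi": "10.1234/example-3", "self_duplicate": False},
    {"doi": "10.1234/example-4", "self_duplicate": False},
]
print(self_duplicate_index(sample))  # → 0.25
```

The other-duplicate index discussed later is computed the same way, just with a different flag.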
On average, members have a very small self-duplicate index of 0.67%. In addition, in the samples of 44% of analyzed members no self-duplicates were found. The histogram shows the skewness of the distribution:
As we can see in the distribution, there are only a few members with a high self-duplicate index. The table shows all members with a self-duplicate index higher than 10%:
| Name | Total DOIs | Sample size | Self-duplicate index |
|------|------------|-------------|----------------------|
| University of California Press | 129,741 | 798 | 36% |
| American Society of Hematology | 137,124 | 990 | 24% |
| Pro Reitoria de Pesquisa, Pos Graduacao e Inovacao - UFF | 7,756 | 919 | 19% |
| American Diabetes Association | 49,536 | 946 | 18% |
Other-duplicate index is the fraction of other duplicates among the member’s DOIs, in this case calculated from a sample.
On average, members have a very low other-duplicate index of only 0.13%. What is more, 89% of members have no other-duplicates in the sample, and the distribution is even more skewed than in the case of self-duplicates:
Here is the list of all members with more than 2% of other-duplicates in the sample:
| Name | Total DOIs | Sample size | Other-duplicate index |
|------|------------|-------------|-----------------------|
| American Bryological and Lichenological Society | 5,593 | 844 | 41% |
| American Mathematical Society (AMS) | 83,015 | 844 | 4% |
American Bryological and Lichenological Society is a clear outlier with 41% of their sample flagged as duplicates. Interestingly, all those duplicates come from one other member only (JSTOR) and JSTOR was the first to deposit them.
Similarly, all other-duplicates detected in the American Mathematical Society’s sample are shared with JSTOR, and JSTOR was the first to deposit them.
Maney Publishing’s 51 other-duplicates are all shared with a member not listed in this table: Informa UK Limited.
JSTOR is the only member in this table whose other-duplicates (36) are shared with multiple (8) other members.
Another interesting observation is that the members in this table (apart from JSTOR) are rather small or medium in terms of the total number of DOIs they have registered. It is also worrying that Informa UK Limited, a member that shares 51 other-duplicates flagged in Maney Publishing’s sample, was not flagged by this index. The reason might be differences in the overall number of registered DOIs: two members that deposited the same number of other-duplicates, but have different overall numbers of registered DOIs, will have different other-duplicate indexes.
To address this issue, I looked at a third index called global other-duplicate index. Global other-duplicate index is the fraction of globally detected other-duplicates involving a given member.
Global other-duplicate index has a useful interpretation: it tells us how much the overall number of other-duplicates would drop, if the given member resolved all its other-duplicates (for example by setting appropriate relations or correcting the metadata so that it is no longer so similar).
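The global other-duplicate index can be sketched as follows, under the assumption that each detected other-duplicate is recorded as a pair of member names. The member names and pairs below are illustrative, not the actual study data.

```python
def global_other_duplicate_index(pairs, member):
    """Fraction of all detected other-duplicate pairs involving the member."""
    involving = sum(1 for pair in pairs if member in pair)
    return involving / len(pairs)

# Illustrative set of detected other-duplicates, one pair per duplicate.
pairs = [
    ("Member A", "Member B"),
    ("Member A", "Member C"),
    ("Member B", "Member C"),
    ("Member A", "Member D"),
]
print(global_other_duplicate_index(pairs, "Member A"))  # → 0.75
```

If Member A resolved all of its other-duplicates, the global count of other-duplicates here would drop by 75%, which is exactly the interpretation described above.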
Here is the list of members with a global other-duplicate index higher than 2%:
| Name | Total DOIs | Global other-duplicate index |
|------|------------|------------------------------|
| American Bryological and Lichenological Society | 5,593 | 54% |
| Informa UK Limited | 4,275,507 | 15% |
| American Mathematical Society (AMS) | 83,015 | 6% |
| Liverpool University Press | 31,870 | 3% |
| Cambridge University Press (CUP) | 1,621,713 | 2% |
| Ovid Technologies (Wolters Kluwer Health) | 2,152,723 | 2% |
| University of Toronto Press Inc. (UTPress) | 46,778 | 2% |
Note that the values can add up to more than 100%. This is because every other-duplicate involves two members, so summed over all members the involvement adds up to 200%.
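The 200% total can be checked with a few lines of arithmetic. The pairs below are illustrative: each other-duplicate pair involves exactly two members, so summing per-member involvement counts over all members counts every pair twice.

```python
# Illustrative other-duplicate pairs, not the actual study data.
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
members = {"A", "B", "C"}

# Each pair contributes to exactly two members' counts.
total_involvements = sum(sum(1 for p in pairs if m in p) for m in members)
print(total_involvements, 2 * len(pairs))  # the two values are always equal
```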
As we can see, all the members from the previous table appear in this one as well. Beyond those, however, this index flagged several large members, among them Informa UK Limited, which was missing from the previous table.
All the indexes defined here are useful for identifying members that contribute a lot of duplicates to the Crossref metadata. They can be used to help clean up the metadata, and also to monitor the situation in the future.
It is important to remember that the index values presented here were calculated on a single sample of DOIs drawn for a given member. The values would differ if a different sample were used, so they shouldn’t be treated as exact numbers.
The tables include members with the index exceeding a certain threshold, chosen arbitrarily, for illustrative purposes. Different runs with different samples could result in different members being included in the tables, especially in their lower parts.
To obtain more stable values of indexes, multiple samples could be used. Alternatively, in the case of smaller members, exact values could be calculated from all their DOIs.