Researchers discover a new form of scientific fraud: 'sneaked references' (phys.org)
137 points by toss1 on July 10, 2024 | hide | past | favorite | 44 comments


More info at https://retractionwatch.com/2023/10/09/how-thousands-of-invi... and https://arxiv.org/abs/2310.02192 . As the latter makes clear, this type of fraud is most likely done by journal editors (or their assistants), not by the authors:

> When registering a new publication and its references at Crossref, a publisher may sneak extra undue references in the metadata sent in addition to the ones originally present. Then, digital libraries (e.g., SpringerLink) and bibliometric platforms (e.g., Dimensions) harvest these metadata, undue citations included. These sneaked references are processed and counted even if they are not present in the original publication.
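The mechanism described in the quote above can be sketched in a few lines. This is a hypothetical illustration, not Crossref's actual pipeline: all DOIs and reference lists are made up, and the two counting modes stand in for "trusts publisher metadata" (Crossref, Dimensions) versus "parses the bibliography out of the full text" (Google Scholar).

```python
# Sketch: how a citation counter that trusts publisher-supplied metadata
# can be inflated by references "sneaked" into the metadata only.
from collections import Counter

def count_citations(papers, use_metadata=True):
    """Count how many times each DOI is cited across a set of papers.

    use_metadata=True mimics a bibliometric platform that harvests
    publisher metadata; use_metadata=False mimics parsing the actual
    bibliography from the paper text.
    """
    counts = Counter()
    for paper in papers:
        refs = paper["metadata_refs"] if use_metadata else paper["text_refs"]
        counts.update(refs)
    return counts

papers = [
    {   # "10.1000/target" appears in the metadata but not in the paper
        "doi": "10.1000/a",
        "text_refs": ["10.1000/x"],
        "metadata_refs": ["10.1000/x", "10.1000/target"],
    },
    {
        "doi": "10.1000/b",
        "text_refs": ["10.1000/y"],
        "metadata_refs": ["10.1000/y", "10.1000/target"],
    },
]

print(count_citations(papers)["10.1000/target"])                       # -> 2
print(count_citations(papers, use_metadata=False)["10.1000/target"])   # -> 0
```

The gap between the two counts is exactly what the researchers exploited to detect the fraud.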

The three journals in this particular case are all published by Technoscience Academy, an OA publisher operating out of India (not one of the well-known ones). I would think twice as an author before I submitted to any journal from this publisher, lest my paper be abused for manipulations like this (although I'm not sure it has any journals worth submitting to anyway).

NB (because I got confused first): This is not really about Hindawi. Hindawi published the (trash) article that these fake citations were pumping up, but the pumping-up happened using Technoscience Academy journals.


There are troublesome downstream effects of an automated society that we're only beginning to reckon with. Humans may not be perfect, but there's something to be said for keeping some processes slow and inefficient.


Not to say that Hindawi is an uncontroversial party: https://retractionwatch.com/2024/03/14/up-to-one-in-seven-of...


Hey at least those editors actually do something during the publishing process!


Article doesn't really address why this is happening. My guess would be financial incentives for researchers and professors to publish in international journals, a common practice in some countries. For instance, according to "Analysis of Chinese universities’ financial incentives for academic publications":

> In recent years payments based directly on the number of citations a paper receives have become more popular, but are still much less common than those based on the journal's impact factor.

https://opportunities-insight.britishcouncil.org/insights-bl...


H-index is still a frequently cited measure for an academic researcher's "impact" when comparing individuals across fields -- the idea being that authors with more papers and more citations are "better" (all other things being equal). Papers with higher citation counts also appear more prominently among search results (e.g. Google Scholar).

If you make an analogy with the www: the h-index is PageRank, citations are backlinks, and authors (researchers) are the domain names. Gaming your h-index is akin to SEO hacking for academic authors.
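For the curious, the h-index mentioned above is easy to compute: it's the largest h such that the author has h papers with at least h citations each. A stdlib sketch with made-up citation counts, showing how a few sneaked citations can bump it:

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # -> 4
# A few sneaked citations on the weakest papers lift the tail:
print(h_index([10, 8, 5, 5, 5]))  # -> 5
```

This is why targeted sneaked citations are so effective: they only need to push a handful of borderline papers over the threshold.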


It’s also used for evaluating Visa applications, e.g. the O-1:

“The petitioner provides evidence demonstrating that the total rate of citations to the beneficiary’s body of published work is high relative to others in the field, or the beneficiary has a high h-index[30] for the field. Depending on the field and the comparative data the petitioner provides, such evidence may indicate a beneficiary’s high overall standing for the purpose of demonstrating that the beneficiary is among the small percentage at the top of the field.[31]”


How would a paper have an h-index?


Ah, my original note was ambiguous - fixed. Link for convenience:

https://en.m.wikipedia.org/wiki/H-index


Hilarious to me how this is basically, "researchers discover SEO techniques."


Editors and publishers, not researchers. Researchers aren't the ones handling metadata.


Speedrunning two decades of blackhat SEO...


They mean references cited in the metadata but not in the actual paper. So it’s “invisible” and can be gamed because some citation trackers rely on the metadata rather than having to parse the paper.


That’s weird. With every paper I’ve published (in chemistry) the authors don’t handle the metadata. We upload our article sources (in latex or word) and the publisher handles the rest. I’ve never done anything more.

Is this different in other fields? Or in sketchy journals?


It seems like it's being done entirely on the publisher end, with them – or friends – benefiting:

> For example, a single researcher who was associated with Technoscience Academy benefited from more than 3,000 additional illegitimate citations. Some journals from the same publisher benefited from a couple hundred additional sneaked citations.

Perhaps this publisher or others also offer this as some kind of backroom deal / service.


"The post caught the attention of several sleuths who are now the authors of the JASIST article. We used a scientific search engine to look for articles citing the initial article. Google Scholar found none, but Crossref and Dimensions did find references. The difference? Google Scholar is likely to mostly rely on the article's main text to extract the references appearing in the bibliography section, whereas Crossref and Dimensions use metadata provided by publishers."

So Google Scholar uses the text, which is good. The obvious solution then is to look up where a paper has been cited, which would be easy to do with Google Scholar.

I don't know why anyone would jeopardize their career like this.


This looks like someone on the editorial side snuck the references in. Don't think authors have a way to do it.


> I don't know why anyone would jeopardize their career like this.

Publish or perish.


Great plan. Nobody has _ever_ worked out ways to game Google's search algorithms!


Why did it take the article 7 paragraphs to tell me what the fraud was? Ad revenue driven journalism.


>Some legitimate references were also lost, meaning they were not present in the metadata.

It's possible that some of the inconsistency between metadata and text could just be due to incompetence - it's harder to find a profit motive for dropping legitimate citations. Why wouldn't this sort of metadata be auto-generated from the text (aside from enabling fraud, of course)?


Which is harder to detect: replacing reference 17 with the one you're trying to pump, or adding reference 35 when the bibliography in the original paper clearly stops at 34?


> it's harder to find a profit motive for dropping legitimate citations

Competitiveness for citation points, especially with someone in or adjacent to your niche?

Also, the non-profit: pettiness.


I'm unclear on whether to pin this on the publisher or on the authors.

In the first example shown in the linked pre-print [1] there's a paper with 62 downloads that's been cited 107 times within two months. The pre-print looks deeper into a paper with 7 "real" references whose metadata has an extra 40 references not found in the PDF. This leaves us with three options:

  * the author of a paper with 62 downloads (not an amazing number) was convinced into joining a citation ring along with 40 other authors,
  * the publisher has been sneaking references onto unsuspecting papers, or
  * the publisher has a vulnerability on their metadata system that's being actively exploited by the two scholars identified in the pre-print.
Whatever the case, I'm glad the solution is as simple as "you should parse the references yourself". I do however wonder: is someone checking whether all of the references are actually referenced within the paper?

[1] https://arxiv.org/pdf/2310.02192


> is someone checking whether all of the references are actually referenced within the paper?

How would that work for paperback references? That would be a nightmare. If an author cites 20 different sources, the verifier would need to check out 20 sources at the library (if they are even available).


What I meant was: if the body of your paper references seven other papers then your References section should be exactly seven papers long. Otherwise you could sneak an unrelated paper that way and hardly anyone would notice.
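The check described above can be automated for the common numeric citation style. A minimal sketch, assuming citations appear as `[n]` in the body (a real checker would also need to handle author-year styles); the body text and bibliography entries here are invented:

```python
import re

def uncited_entries(body_text, bibliography):
    """Return the bibliography entry numbers never cited as [n] in the body.

    Assumes numeric citation style ([1], [2], ...). Entries that only
    exist in the References section - never cited in the text - are
    exactly the kind of padding that's easy to sneak past readers.
    """
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", body_text)}
    return [n for n in range(1, len(bibliography) + 1) if n not in cited]

body = "As shown in [1] and later refined in [2], the method converges."
bib = ["Smith 2020", "Jones 2021", "Unrelated paper snuck into the list"]
print(uncited_entries(body, bib))  # -> [3]
```

This only catches padding inside the PDF itself; the sneaked references in the article live one level up, in the publisher metadata, so the same comparison would have to run against the Crossref record instead of the References section.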


So how does the fraud work? A researcher wants to boost their citation count so they can get more funding, respect, etc. They ask their friends to cite their paper in a metadata-only reference in their other papers, even though those papers didn't actually reference anything from the original paper.

They should be able to find citation "rings" then, whole groups which regularly do this, probably associated with specific institutions or journals.

The linked study did part of this: https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.2489...

> An analysis of the 10 sneaked references in Dimensions reveals that they benefit mainly two authors (Initials JNR & BK)

Now, it would be interesting to see if JNR and BK's publications used this trick and in turn benefitted some other group.


> So how does the fraud work? Researcher wants to boost his citation count so they can get more funding, respect, etc. They ask their friends to cite their paper in a metadata-only reference in their other papers, even though the papers didn't really reference anything from the original paper.

This is probably the publisher's doing rather than the author of the paper.


Do the publishers actually _do_ anything? (Apart from invoicing academic institutions and demanding unpaid work from peer reviewers, of course).

Wouldn't surprise me in the slightest to find they demanded authors create/submit the metadata with the references in it, and that it never gets shown to the unpaid peer reviewers, and is never checked by anybody.


Does LaTeX allow this to happen? Maybe a simple typesetting change to exclude references that are not mentioned in the text?

This is a problem with the journal review and editors. Also, typesetting tools that create the final version can and should be setup to protect things like these. I know folks may want to go hunt for sexy genai tooling to solve this - but I think the solution is much simpler.


> Does LaTeX allow this to happen? Maybe a simple typesetting change to exclude references that are not mentioned in the text?

> This is a problem with the journal review and editors. Also, typesetting tools that create the final version can and should be setup to protect things like these. I know folks may want to go hunt for sexy genai tooling to solve this - but I think the solution is much simpler.

The issue in the article isn't a paper being listed in the references but not actually cited elsewhere in the paper; it's not something within the actual paper at all. It's metadata created by the publisher.

So it presumably doesn't have anything to do with what LaTeX allows or doesn't allow.


In most cases the published article is a document that has some additional metadata added during the publishing process. Hence, there's an inherent disconnect between the article and its metadata, which must be examined for accuracy and truth.

Instead, the published article should really be a view of the structured data, metadata, and text (i.e. the true content) that makes up the article. Formatting and such can be a pain, but using this approach would mean the published article is a view of the truth rather than the metadata being created as something of an afterthought.


Extracting the key definition:

  These additional references were only in the
  metadata, distorting citation counts and
  giving certain authors an unfair advantage.
Papers with metadata that doesn't match the contents of the paper. The article notes that Google Scholar is unaffected, as it extracts citations from the paper itself by parsing the text of the printed bibliography.


Why isn't Hanlon's razor applicable here? Maybe for a significant number of cases it is, but maybe not for all.

The problem would be if this turns into a negative index, it can have equally bad repercussions. So attribution to malice and intent is important because there will be people adversely affected.

If this is a publisher/SEO fuckup, that needs to be seen as distinct from "fraud".


Something similar hit the news in Spain recently:

https://cadenaser.com/castillayleon/2024/03/15/el-candidato-...


A different kind of problem.

Sometimes there are just five people working on a particular theme. Entirely avoiding citing your previous work because some people might frown on it is stupid. It hides a fifth of the extant knowledge from readers for no real benefit. As researchers are now required to produce constantly, they can only release small increments of their work. Each article in isolation won't make any difference, but they are part of a slow chain. Results don't necessarily come in a constant, predictable stream. Citing the previous chapter in those science series is reasonable, or even necessary, to understand the current article.

Citing only your own work and not others' work would be the problem.

And sometimes these things just happen. I could talk about "my curriculum" on the official web page of my university: a list of several articles with my name on them. I never wrote them, took part in the design of the page, or supported it in any way.

And it's totally fake. I didn't write a single one of the articles listed on it.

After scratching my head a little, I saw that they made an obvious mistake in the database query (which is not my problem to fix anymore).



So is the implication that those cited are paying the publishers to add them to the metadata of the papers they publish? What is the actual mechanism?


left as an exercise for the reader


It's SEO spam link building for academics.


"I'm In!"


This would only be relevant for shit journals that use authors’ pdfs directly and don’t do their own processing.


Feels like a neural net could detect these by scoring and ranking the relevance of a reference paper's content to the content of the paper citing it. Maybe it's time for a citizen science project to dismantle these academic fraud rings? They form networks that capture academic administrations and have significant downstream effects on policy and education. Just by identifying the worst and most egregious offenders, leaving the merely dodgy alone, it could break the hold they have on institutions.
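Even before reaching for a neural net, a much simpler relevance baseline would flag the most egregious cases. A stdlib-only sketch scoring topical overlap between a citing abstract and a cited one with bag-of-words cosine similarity; the abstracts are invented for illustration:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two texts, in [0, 1]."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

citing = "sneaked references citation metadata crossref fraud detection"
related = "detecting citation fraud in crossref metadata"
unrelated = "synthesis of novel perovskite solar cell materials"

# A sneaked citation to an unrelated paper scores far lower than a
# legitimate one, so ranking by this score surfaces suspects for review.
print(cosine_similarity(citing, related) > cosine_similarity(citing, unrelated))
```

A threshold on this score would throw up false positives for genuinely interdisciplinary citations, which is why it's better as a triage tool feeding human review than as an automatic verdict.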


Wouldn't a text search for each metadata reference in the publication take care of this problem?



