Ethics group

Research in Publication Ethics (VBI Summer Program)

Much of the research in publication ethics done over the past six years has concerned the measurement and characterization of highly similar scientific papers discovered by a similarity search program. The program was created in 2004 and named eTBLAST as a take-off on the BLAST programs that search for regions of similarity between gene sequences. First reports about eTBLAST noted an improvement over keyword literature searches through a hybrid search algorithm consisting of a low-sensitivity, weighted keyword-based first pass followed by a novel sentence-alignment-based second pass (Lewis, 2006).
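
As a rough sketch of that two-pass idea (this is not eTBLAST's actual scoring; the corpus format, weights and helper functions below are hypothetical), a first pass might rank candidates by weighted keyword overlap, and a second pass might re-rank the top candidates by aligning the query's sentences against each candidate:

def keyword_score(query_words, doc_words, idf):
    # First pass: score a document by the summed weights of the keywords it
    # shares with the query (simple IDF weights stand in for eTBLAST's own).
    shared = set(query_words) & set(doc_words)
    return sum(idf.get(w, 0.0) for w in shared)

def sentence_alignment_score(query_sents, doc_sents):
    # Second pass: match each query sentence to its best-overlapping document
    # sentence and sum the scores (a stand-in for the real alignment scoring).
    total = 0.0
    for qs in query_sents:
        q = set(qs.lower().split())
        best = max((len(q & set(ds.lower().split())) / (len(q) or 1)
                    for ds in doc_sents), default=0.0)
        total += best
    return total

def hybrid_search(query, corpus, idf, top_k=100):
    # corpus: {doc_id: {"words": [...], "sentences": [...]}} -- a hypothetical format.
    q_words = query.lower().split()
    q_sents = [s for s in query.split(".") if s.strip()]
    # Pass 1: fast keyword ranking to narrow the candidate list.
    candidates = sorted(corpus,
                        key=lambda d: keyword_score(q_words, corpus[d]["words"], idf),
                        reverse=True)[:top_k]
    # Pass 2: slower sentence-alignment re-ranking of the surviving candidates.
    return sorted(candidates,
                  key=lambda d: sentence_alignment_score(q_sents, corpus[d]["sentences"]),
                  reverse=True)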

Errami et al. (2007) explained that the original goals of the eTBLAST text similarity search program were to help scientists and professionals locate authors, journals and publications related to a specific topic. The user enters a paragraph or abstract on a topic into eTBLAST, and the result is a list of related articles, authors and journals. This ability to identify related articles proved to have an additional use: identifying highly similar articles, which in some cases could be instances of plagiarism and/or duplicate publication.

The possibility of a high level of duplication in the MEDLINE database, which contains approximately 19 million documents, led to the collection of highly similar articles in a database of about 79,000 pairs of documents, called Deja vu. The database made it possible to study plagiarism and duplicate publication and increased awareness of unethical behavior (Errami, 2007). Specific data on possible plagiarism and duplicate publication were gathered from Deja vu (Errami, 2008).

Some of the more interesting cases of duplication led to questions sent directly to the authors of the original articles, the authors of the duplicate articles, and the editors of the earlier and later journals. Responses varied, but journal editors showed the most concern, with over 50% of the cases leading to some form of apology or sanction (Long, 2009).

As the data from eTBLAST and Deja vu became important for studying the ethics of scientific publication, the shortcomings of eTBLAST were also examined and addressed. When comparing only abstracts, eTBLAST has low sensitivity and is somewhat slow. Errami et al. (2010) suggest that searching with shorter statistically improbable phrases (SIPs) drawn from the abstract gives better results, is faster and requires less CPU time than searching with the abstract as a whole.
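
A minimal sketch of the general idea, assuming a hypothetical background frequency table (the actual SIP definition and scoring of Errami et al. (2010) are not reproduced here), is to keep only the n-grams of an abstract that are unexpectedly frequent relative to a background corpus and to use those short phrases as the query:

from collections import Counter

def ngrams(words, n):
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def improbable_phrases(abstract, background_freq, n=3, top=5):
    # background_freq is a hypothetical map of phrase -> frequency in a large
    # reference corpus; phrases common in the abstract but rare in the corpus
    # score highest and make short, highly specific queries.
    counts = Counter(ngrams(abstract.lower().split(), n))
    scored = {p: c / (background_freq.get(p, 0) + 1) for p, c in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top]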

A study in press has looked further into the premise that comparing abstracts predicts full-text similarity. Garner et al. (2010) report that when abstracts are compared in eTBLAST, the specificity for predicting full-text similarity is 20.1%. The study also broke articles down into sections (introduction, methods, results, etc.) and compared the similarity of each section with the similarity of the abstracts. It reported that a similar results section is the best indicator of duplicate publication, while the methods section is the most commonly reused.
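
As an illustration of that section-wise comparison (a sketch only, using a simple word-overlap score rather than the similarity measure actually used in the study; the paper format here is hypothetical):

def jaccard(text_a, text_b):
    # Word-overlap similarity, used here only as a stand-in for eTBLAST's score.
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if (a or b) else 0.0

def section_similarities(paper_a, paper_b):
    # paper_a, paper_b: hypothetical dicts mapping section names ("abstract",
    # "introduction", "methods", "results", ...) to that section's text.
    shared_sections = set(paper_a) & set(paper_b)
    return {s: jaccard(paper_a[s], paper_b[s]) for s in shared_sections}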

A separate study on duplicate publications expanded text similarity research into other fields. While eTBLAST searches only MEDLINE and PubMed, and therefore covers mostly biomedical manuscripts, Lariviere and Gingras (2010) looked at several other fields, such as the social sciences and humanities. Their method for finding duplicate publications was different: it was based on a paper's metadata rather than on text similarity between abstracts. Lariviere and Gingras classified papers as duplicates when they had the exact same title, the same first author and the same number of cited references. The rates of duplication found in this study were lower than those reported in the eTBLAST studies, possibly because even slightly altered metadata between two papers prevents a match (Lariviere and Gingras, 2010). The study also described and expanded upon results related to duplication in different scientific fields.
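
The matching rule itself can be stated as a simple predicate (a sketch assuming a hypothetical record format with title, first-author and reference-count fields):

def is_metadata_duplicate(rec_a, rec_b):
    # A pair counts as a duplicate only when title, first author and number of
    # cited references all match exactly, so even a slightly edited title
    # breaks the match.
    return (rec_a["title"] == rec_b["title"]
            and rec_a["first_author"] == rec_b["first_author"]
            and rec_a["n_references"] == rec_b["n_references"])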

All of the studies discussed above analyzed data through purely computational methods. Concerns about the limitations of those methods prompted a summer of human curation: a group of students manually examined and analyzed hundreds of pairs of papers from the Deja vu database, which led to new and different questions about scientific publication data. Findings in this area are part of the 2010 VBI Ethics in Summer Research Program, presented Aug. 6th.

References --

Errami, M., Hicks, J.M., Fisher, W., Trusty, D., Wren, J.D., Long, T.C. and Garner, H.R. "Deja vu- A Study of Duplicate Citations in Medline." Bioinformatics. 2008; 24(2):243-249

Errami, M., Sun, Z., Long, T.C., George, A.C. and Garner, H.R. "Deja vu- a database of highly similar citations in the scientific literature." Nucleic Acids Research. 2009; 37: D921-D924.

Errami, M., Sun, Z., George, A.C., Long, T.C., Skinner, M.A., Wren, J.D. and Garner, H.R. "Identifying duplicate content using statistically improbable phrases." Bioinformatics. 2010; 26(11):1453-1457

Errami, M., Wren, J.D., Hicks, J.M., and Garner, H.R. "eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications." Nucleic Acids Research. 2007; 35: W12-W15.

Garner, H.R., et al. "Characterizations of the text similarity in full text biomedical citations." In press.

Lariviere, V. and Gingras, Y. "On the prevalence and scientific impact of duplicate publications in different scientific fields (1980-2007)." Journal of Documentation. 2010; 66(2): 179-190.

Lewis, J., Ossowski, S., Hicks, J.M., Errami, M., and Garner, H.R. "Text similarity: an alternative way to search MEDLINE." Bioinformatics. 2006; 22(18):2298-2304.

Long, T.C., Errami, M., George, A.C., Sun, Z. and Garner, H.R. "Scientific Integrity: Responding to Possible Plagiarism." Science. 2009; 323(5919):1293-1294.