SIGCSE 2001 DC Application -- Thomas Lancaster

Introduction

For years Computer Science departments have used computers to find similarities in submissions of program source code. Finding similarities in non-constrained texts, such as a standard student essay, is a much more computationally intensive process which has only recently become feasible. This research aims to show how plagiarism can be detected as part of a Four-Stage Plagiarism Detection Process and how tools can aid manual verification of similarity once it is suspected.

Previous research in the area

Reports before the growth of the Web suggest that over 75% of students cheat and over 50% plagiarise [7]. With the addition of Web plagiarism the figure can be expected to be much higher. Much work has been done over the years on finding plagiarism in student submissions of program source code [6,9]. There is little evidence of any academic work on finding similarities in student submissions apart from the commercial plagiarism detection services, such as plagiarism.org, that have recently appeared.

Goals of the research

The research aims to show that plagiarism detection is both feasible and usable, and to provide a complete plagiarism detection system that supports every stage of the Four-Stage Plagiarism Detection Process (collection, analysis, investigation and confirmation), with the intention of minimising additional work for tutors.

Current status

A review of literature is in progress, mainly focusing on stylistics literature and the small amount of plagiarism literature available from a sociological perspective. Prototype versions of four systems are available: the Student Submission System (SSS), a system by which students can submit RTF or Java files for testing; Text Ranker (TRANK), a system which orders pairs of submissions within a corpus by similarity; the Visualisation and Analysis of Similarity Tool (VAST), a system which presents a graphical representation of the similarity between two text documents to allow a tutor to quickly confirm their similarity; and the Text Analysis Tool (TAT), a system which presents a rolling representation of the stylistic properties of a submission to find areas that are likely to represent extra-corpal plagiarism.
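To make the idea of ordering pairs of submissions by similarity concrete, the following is a minimal sketch only. It does not describe how TRANK itself works: the metric used here (Jaccard overlap of word trigrams) and the names word_trigrams, jaccard and rank_pairs are assumptions chosen for illustration.

```python
from itertools import combinations


def word_trigrams(text):
    """Return the set of lowercase word trigrams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_pairs(corpus):
    """Order every pair of submissions from most to least similar.

    corpus maps a submission name to its text. The trigram/Jaccard
    metric is an illustrative assumption, not the TRANK metric.
    """
    grams = {name: word_trigrams(text) for name, text in corpus.items()}
    scored = [
        (jaccard(grams[x], grams[y]), x, y)
        for x, y in combinations(sorted(grams), 2)
    ]
    # Highest similarity first, so a tutor sees the likeliest pairs at the top.
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Every pair in the corpus is scored, so the work grows quadratically with the number of submissions, which is one reason the computational cost of free-text comparison matters.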

Six papers have been published, are awaiting publication or are in preparation. A plagiarism taxonomy provides many definitions for the work and makes it easier for tutors to discuss plagiarism [1]. A review of the existing Web plagiarism detection systems finds that they can usually find Web plagiarism but most are priced unfeasibly for academic use [3]. Initial results from TRANK and associated issues are presented [2]. The improvements of using VAST over a traditional 'eye only' approach are discussed [8]. Attempts to verify that the use of VAST with TRANK is error free are described [5]. The VAST rankings are compared with those of experienced human markers [4]. Work is being prepared on improving the use of TAT, finding plagiarism from Web sources and verifying TRANK against similarity visualisations produced from VAST. South Bank University is also preparing to be part of a large-scale UK trial of commercial plagiarism detection software.

Interim conclusions

Free-text plagiarism detection is possible, although work needs to be done to reduce its computational intensity and to find a 'best' set of metrics for a particular set of documents, metrics which might or might not be ideal for all possible corpora. More research needs to be directed towards the human-led stages of the plagiarism detection process, as manually verifying plagiarism is currently the biggest bottleneck in the system. Visual aids, such as the similarity visualisations, should be a big help here. It has also been shown that humans have a lot of difficulty in agreeing on the amount of similarity in a pair of student submissions, so it can be argued that the computer ranking scores above humans in consistency. The computer ranking has been shown to avoid false hits and missed pairs but there is still work to be done to verify that machine ranking approximates human ranking well.
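One possible way to quantify how well a machine ranking approximates a human ranking is a rank correlation. The sketch below uses Spearman's coefficient; this is an assumption made for illustration, not the measure used in the work cited above, and the function name spearman_rho is hypothetical. It assumes both rankings cover the same pairs and contain no ties.

```python
def spearman_rho(machine_order, human_order):
    """Spearman rank correlation between two orderings of the same items.

    Each argument lists the same items, most similar first, with no
    ties. Returns 1.0 for identical orderings and -1.0 for exactly
    reversed ones.
    """
    n = len(machine_order)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    if (len(human_order) != n
            or len(set(machine_order)) != n
            or set(machine_order) != set(human_order)):
        raise ValueError("both orderings must contain the same distinct items")

    # Position of each item in the human ordering.
    human_rank = {item: rank for rank, item in enumerate(human_order)}

    # Sum of squared rank differences across all items.
    d_squared = sum(
        (rank - human_rank[item]) ** 2
        for rank, item in enumerate(machine_order)
    )
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

For example, if the machine orders four pairs as p1, p2, p3, p4 and a human marker swaps only the first two, the coefficient is 0.8, indicating close but not perfect agreement.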

Open issues

The breadth of the field means that there are many possible areas of focus. There is an associated need to narrow down the research question.

Current stage in your program of study

Currently nine months into the PhD research programme.

What you hope to gain from participating in the Doctoral Consortium

Discussing the plagiarism research should prove invaluable in deciding which areas of the field are most important to take forwards.

Bibliographic references

[1] Culwin F. & Lancaster T., A Descriptive Taxonomy of Student Plagiarism. Awaiting publication, available from South Bank University, London (2000).

[2] Culwin F. & Lancaster T., Pro-Active Anti Plagiarism in Action, an Initial Report. At Learning Matters: Improving Practice in Higher Education - organised by Institute of Learning and Teaching (2000).

[3] Culwin F. & Lancaster T., A Review of Electronic Services for Plagiarism Detection in Student Submissions. At 8th Annual Conference on the Teaching of Computing - organised by the LTSN Centre for Information and Computer Sciences (2000).

[4] Culwin F. & Lancaster T., Variability of Free-Text Similarity Assessment. In preparation, available from South Bank University, London (2000).

[5] Culwin F. & Lancaster T., Towards an Error Free Plagiarism Detection Process. Awaiting publication, available from South Bank University, London (2000).

[6] Culwin F. & Naylor J., Pragmatic Anti-Plagiarism. Proceedings Third Conference on the Teaching of Computing, DCU Dublin IE (1995).

[7] Franklin-Stokes A. & Newstead S., Undergraduate Cheating: Who Does What & Why? Studies in Higher Education, 20, 2, p159-172 (1995).

[8] Lancaster T. & Culwin F., Visualising Intra-Corpal Plagiarism. Awaiting publication, available from South Bank University, London (2000).

[9] Wise M. J., YAP3: Improved Detection of Similarities in Computer Program and Other Texts. Presented at SIGCSE, Philadelphia, USA, Feb 15-17 1996, pp 130-134 (1996).

Useful Web sites

Centre for Interactive Systems Engineering - South Bank University

Plagiarism.org, Copycatch.com, Findsame.com, Integriguard.com, HowOriginal.com, Essay Verification Engine.

Student plagiarism in an online world.

Issues in plagiarism for the new millennium: an assessment Odyssey.

Plagiarism in Colleges in USA.