Elizabeth Morris

Ph.D. Student

Texas Tech University

Committee Chairperson: Dr. Susan Mengel

A Machine Learning Approach to Automate Classification of Literature in a SAM Research Database

Keywords

Data mining, Machine Learning, Classifier Systems, SAM Analysis, XCS

INTRODUCTION AND THEORETICAL BACKGROUND

In the mid-eighties, researchers at the University of Miami confronted the problem of information overload while investigating worker performance [1]. Their research required literature from several fields, including engineering, business, and psychology, to name only a few. To cope with the large amount of literature, they devised a methodology that partitions literature into matrices in order to reveal patterns or voids. This approach was termed State-of-the-Art Matrix (SAM) Analysis [2].
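The partitioning idea can be illustrated with a small sketch. The source states only that literature is partitioned into matrices to reveal patterns or voids; the choice of axes below (category by publication year), the sample records, and the function names are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical records: (category, publication year) pairs assigned by an assessor.
articles = [
    ("engineering", 1985),
    ("engineering", 1986),
    ("business", 1985),
    ("psychology", 1987),
    ("engineering", 1985),
]

def build_sam_matrix(records):
    """Count the articles that fall into each (category, year) cell."""
    return Counter(records)

def find_voids(matrix, categories, years):
    """Return the (category, year) cells that contain no articles."""
    return [(c, y) for c in categories for y in years if matrix[(c, y)] == 0]

categories = sorted({c for c, _ in articles})
years = sorted({y for _, y in articles})
matrix = build_sam_matrix(articles)
voids = find_voids(matrix, categories, years)

# Print one row per category, one column per year.
for c in categories:
    print(c.ljust(12), [matrix[(c, y)] for y in years])
print("voids:", voids)
```

Cells with high counts suggest well-covered topics, while empty cells mark candidate gaps in the literature.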

The current implementation of SAM Analysis is a manual process, which restricts the amount of information that can be brought to bear on partitioning decisions. This restriction stems from the limited capacity of humans to reduce and analyze large amounts of data. For example, during the first phase of the manual process, researchers construct models or categories that best describe the topic area. These categories are elicited from domain experts or others knowledgeable in the area. In the next phase, articles in the data stores are read and assigned to these pre-defined categories based on the judgement of assessors. This manual approach presents two major challenges to researchers who must identify and utilize the information hidden in a large data corpus. First, it is practical only for a small number of articles. Second, categorization relies on the subjective judgement of assessors. To manage and use the data effectively, a more flexible approach is necessary for classifying information in these large data stores. This problem presents an environment that is appropriate for applying machine learning and data mining techniques to automate the classification of articles in huge volumes of data.

PREVIOUS RESEARCH IN THE AREA

The availability of large data sources has triggered a significant amount of research in the field of Knowledge Discovery in Databases (KDD). The field of KDD is a synthesis of various research fields, such as machine learning, databases, and artificial intelligence [3]. The common goal of this mixture of research fields is to mine large quantities of data in order to discover knowledge. The KDD process consists of a number of activities, including data collection, abstraction, and cleansing, followed by the use of machine learning techniques to find patterns in the data (see [3] for the steps of the KDD process).

Classification modeling is a machine learning technique that maps data instances into one or more pre-defined classes for subsequent use in detecting trends and identifying objects. This modeling technique can be automated using supervised learning methods. For example, supervised categorization is one important area of research where Learning Classifier Systems (LCS) have been applied to model the human categorization process [4].

Learning Classifier Systems are a machine learning paradigm in which an agent learns a task by interacting with its environment. Rewards or other forms of feedback guide the performance of the agent by modifying its rule-based model of the environment [4]. Learning Classifier Systems have existed for more than twenty years, dating back to the mid-1970s with the introduction of John Holland's Cognitive System One (CS-1) [5]. Since that time, research has flourished in the LCS community, leading to a number of modifications to the traditional approach that significantly improved the LCS architecture. These improvements led to a new type of Classifier System, XCS [6,7], developed by Wilson, which evolves accurate, maximally general classifiers.
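The reward-driven adjustment of a rule can be sketched as follows. This is a simplified illustration in the spirit of XCS, not a full implementation of Wilson's system: the Rule class, the function names, and the parameter values (beta, eps0, alpha, nu) are illustrative assumptions, and the genetic search component is omitted entirely.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    condition: str             # ternary string over {'0', '1', '#'}; '#' matches either bit
    action: int                # class label the rule advocates
    prediction: float = 10.0   # running estimate of the reward
    error: float = 0.0         # running estimate of the absolute prediction error

def matches(condition, state):
    """True if every non-'#' position of the condition equals the state bit."""
    return len(condition) == len(state) and all(
        c in ("#", s) for c, s in zip(condition, state)
    )

def update(rule, reward, beta=0.2):
    """Move the error and prediction estimates a fraction beta toward the feedback."""
    rule.error += beta * (abs(reward - rule.prediction) - rule.error)
    rule.prediction += beta * (reward - rule.prediction)

def accuracy(rule, eps0=10.0, alpha=0.1, nu=5.0):
    """Accuracy is 1 below the error threshold eps0 and falls off steeply above it."""
    if rule.error < eps0:
        return 1.0
    return alpha * (rule.error / eps0) ** -nu

rule = Rule(condition="1#0", action=1)
if matches(rule.condition, "110"):
    for _ in range(50):
        update(rule, reward=1000.0)   # the environment consistently rewards this rule
print(round(rule.prediction, 2), accuracy(rule))
```

A rule that receives consistent feedback converges to a low prediction error and is therefore treated as accurate, which is the property that accuracy-based fitness rewards.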

To this end, the two disciplines, data mining and machine learning, will be combined in the context of a Learning Classifier System to automate the classification step of the SAM Analysis.

GOALS OF THE RESEARCH

This research will automate the classification step of the SAM Analysis based on the domain interests of the researcher. To accomplish this goal, the study involves the following subtasks:

Building profiles to capture user interests
Providing data visualization operations allowing the user to build classifiers
Constructing a Learning Classifier System to label publications using a general inductive process

The first task will provide a means to capture changing user objectives. Researchers with similar backgrounds and interests can exploit existing rule sets (classifiers) to discover interests of similar users. This study will investigate user registration as a means for capturing profiles.

The second task will allow the user to construct classifiers by visually building decision trees. This approach provides a means for researchers to encode knowledge regarding the best way to classify publications under a specific set of categories. A data visualization tool by Ware et al. [8] demonstrates an interactive method for constructing decision tree classifiers. This tool enables the user to build a decision tree graphically using two-dimensional polygons. It is part of the Weka workbench (http://www.cs.waikato.ac.nz/ml) and will be considered as a starting point for the development of this task.
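A decision tree whose tests are two-dimensional polygons can be sketched as below. This is not the Weka implementation or its API; the PolygonNode class, the attribute names (ml_terms, prod_terms), the polygon coordinates, and the category labels are all hypothetical, chosen only to show how a user-drawn region could act as a split.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies strictly inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

class PolygonNode:
    """Tree node that splits on whether two attributes fall inside a polygon."""

    def __init__(self, attr_x, attr_y, polygon, inside, outside):
        self.attr_x, self.attr_y = attr_x, attr_y
        self.polygon = polygon
        self.inside, self.outside = inside, outside   # child node or class label

    def classify(self, instance):
        hit = point_in_polygon(instance[self.attr_x], instance[self.attr_y], self.polygon)
        branch = self.inside if hit else self.outside
        return branch.classify(instance) if isinstance(branch, PolygonNode) else branch

# Hypothetical tree over two normalized term-frequency attributes.
tree = PolygonNode(
    "ml_terms", "prod_terms",
    [(0.5, 0.0), (1.0, 0.0), (1.0, 1.0), (0.5, 1.0)],
    inside="machine learning",
    outside=PolygonNode(
        "ml_terms", "prod_terms",
        [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (0.0, 1.0)],
        inside="productivity",
        outside="other",
    ),
)

print(tree.classify({"ml_terms": 0.8, "prod_terms": 0.2}))
```

Each polygon corresponds to a region the user might draw on a scatter plot of two attributes; instances outside the region are passed to the next node.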

The third task is to classify research publications in a SAM database. The Classifier System proposed for automating classification of knowledge in research databases will be based on XCS's architecture. SAM databases present an important platform for investigating the use of automated data management tools and techniques in the research domain and pose challenges in the real-world application of knowledge discovery in research databases.
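One way an XCS-style rule population might label a publication is sketched below. The keyword vocabulary, the rule set, the binary encoding of an article, and the category names are hypothetical and are not taken from the proposed system; the sketch shows only the fitness-weighted prediction for each candidate category, computed over the rules that match the article.

```python
# Hypothetical vocabulary: one bit per term, set when the term appears in the article.
VOCABULARY = ["classifier", "productivity", "mining", "worker"]

# Hypothetical rules: (ternary condition, category, reward prediction, fitness).
RULES = [
    ("1#1#", "machine learning", 900.0, 0.8),
    ("#1#1", "productivity", 850.0, 0.6),
    ("1###", "machine learning", 400.0, 0.2),
    ("####", "productivity", 100.0, 0.1),
]

def encode(keywords):
    """Encode a set of article keywords as a bit string over VOCABULARY."""
    return "".join("1" if term in keywords else "0" for term in VOCABULARY)

def matches(condition, state):
    """True if every non-'#' position of the condition equals the state bit."""
    return all(c in ("#", s) for c, s in zip(condition, state))

def prediction_array(state, rules):
    """Fitness-weighted average prediction per category over the matching rules."""
    sums = {}
    for condition, category, prediction, fitness in rules:
        if matches(condition, state):
            num, den = sums.get(category, (0.0, 0.0))
            sums[category] = (num + prediction * fitness, den + fitness)
    return {cat: num / den for cat, (num, den) in sums.items() if den > 0}

def classify(keywords, rules=RULES):
    """Return the category with the highest predicted reward, or None if no rule matches."""
    array = prediction_array(encode(keywords), rules)
    return max(array, key=array.get) if array else None

print(classify({"classifier", "mining"}))
```

In a learning system the predictions and fitness values would be adjusted from feedback on labeled articles rather than fixed by hand as they are here.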

This study will remove the requirement for researchers to sift manually through volumes of information in their data stores, reducing their workload in terms of time and effort. Allowing researchers to restructure data stores based on their interests will expose them to activity in their area and help them discover new ideas for research endeavors.

INTERIM CONCLUSIONS

The investigation thus far indicates that an LCS is a promising paradigm for building an automated data management tool that mines large research databases. However, additional research is needed to demonstrate the ability of this tool to learn pre-defined categories of articles and to classify new articles in large research databases.

OPEN ISSUES

Although the field of KDD continues to evolve, relatively few research efforts have investigated the use of LCS in KDD. There are no well-known academic or real-world applications that can serve as benchmarks to validate the results of this research.

CURRENT STAGE IN THE PROGRAM STUDY

All work to satisfy the Ph.D. course requirements has been completed. The qualifying examination has also been successfully completed. Admission to Ph.D. candidacy has been received from the graduate school at Texas Tech University. This candidacy is for the Doctorate of Computer Science in the College of Engineering. Work on the Ph.D. dissertation is in progress.

Currently, the literature review for this research effort is being conducted. In addition, data repositories have been obtained from researchers at the Texas Tech Center for Systems Solutions (CSS) and will be used as training and testing data sets.

WHAT CAN BE GAINED BY PARTICIPATING IN THE DOCTORAL CONSORTIUM

By participating in the Doctoral Consortium, I will have an opportunity to share this research effort and obtain input from others involved in similar research endeavors. I hope to gain feedback pertaining to new ideas, feasibility, and scope of the research project.

BIBLIOGRAPHIC REFERENCES

[1] Sumanth, D.; Omachonu, V. K.; Beruvides, M. G. (1990). "A review of the state-of-the-art research on white collar/knowledge-worker productivity". International Journal of Technology Management, Vol. 5, No. 3, pp. 337-355.
[2] Beruvides, M. G. (March 25, 2000). "The State of the Art Matrix Analysis: A Programmatic, Chronological and Statistical Approach to Research Literature Analysis". Texas Tech University, Industrial Engineering Department Working Paper WP2000.01.
[3] Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. (1996). "From Data Mining to Knowledge Discovery in Databases". American Association for Artificial Intelligence, pp. 37-54.
[4] Lanzi, P. L.; Riolo, R. L. (2000). "A Roadmap to the Last Decade of Learning Classifier Systems". In Lanzi, P. L.; Stolzmann, W.; Wilson, S. W. (Eds.), Learning Classifier Systems: From Foundations to Applications, pp. 33-61. Springer-Verlag, Berlin; New York.
[5] Holland, J. H.; Reitman, J. S. (1978). "Cognitive Systems Based on Adaptive Algorithms". Evolutionary Computations: The Fossil Record, Fogel, D. (Ed.), IEEE Press, 1998, pp. 464-480.
[6] Wilson, S. W. (2000). "State of XCS Classifier System Research". In Lanzi, P. L.; Stolzmann, W.; Wilson, S. W. (Eds.), Learning Classifier Systems: From Foundations to Applications, pp. 63-82. Springer-Verlag, Berlin; New York.
[7] Wilson, S. W. (1995). "Classifier Fitness Based on Accuracy". Evolutionary Computation, 3(2), pp. 149-175.
[8] Ware, M.; Frank, E.; Holmes, G.; Hall, M.; Witten, I. H. (2001). "Interactive Machine Learning: Letting Users Build Classifiers". International Journal of Human-Computer Studies, Vol. 55, pp. 281-292.