Elizabeth Morris
Ph.D. Student
Texas Tech University
Committee Chairperson: Dr. Susan Mengel
A Machine Learning Approach to Automate Classification of Literature in a SAM Research Database
Keywords
Data mining, Machine Learning, Classifier Systems, SAM Analysis, XCS
In the mid-eighties, researchers at the University of Miami confronted
the problem of information overload while conducting an investigation on worker
performance [1]. Their research drew on literature from many fields, including
engineering, business, and psychology. To cope with
the large amount of literature, they devised a methodology to partition
literature into matrices in order to find patterns or voids in the literature.
This approach was termed State-of-the-Art Matrix (SAM) Analysis [2].
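The matrix idea can be illustrated with a minimal sketch. The article records, the two dimensions (`field` and `topic`), and the function names below are assumptions made for illustration; they are not taken from the SAM literature. The sketch places each article in a cell and reports the empty cells, which correspond to the "voids" mentioned above.

```python
# Hypothetical article records; the two dimensions ("field", "topic") and
# their values are illustrative and do not come from the SAM literature.
ARTICLES = [
    {"title": "Paper A", "field": "engineering", "topic": "measurement"},
    {"title": "Paper B", "field": "engineering", "topic": "motivation"},
    {"title": "Paper C", "field": "psychology", "topic": "motivation"},
    {"title": "Paper D", "field": "business", "topic": "measurement"},
]
FIELDS = ["engineering", "business", "psychology"]
TOPICS = ["measurement", "motivation"]


def build_matrix(articles, rows, cols, row_key="field", col_key="topic"):
    """Place each article title in the cell for its (row, column) categories."""
    matrix = {r: {c: [] for c in cols} for r in rows}
    for article in articles:
        matrix[article[row_key]][article[col_key]].append(article["title"])
    return matrix


def find_voids(matrix):
    """Return the (row, column) cells that contain no articles."""
    return [(r, c) for r, row in matrix.items() for c, cell in row.items() if not cell]


matrix = build_matrix(ARTICLES, FIELDS, TOPICS)
voids = find_voids(matrix)
```

In this toy data set, the cells for business/motivation and psychology/measurement are empty, which is the kind of gap a researcher would inspect further.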
The current implementation of the SAM Analysis is a manual process, which
restricts the amount of information that can be used to make and convey
partitioning decisions. This restriction stems from the limited capacity of
humans to reduce and analyze large amounts of data. For example, during the
first phase of the manual process, researchers construct models or categories
that best describe the topic area. These categories are derived from domain
experts, that is, workers knowledgeable in the area. In the next phase,
assessors read the articles in the data stores and assign them to these
pre-defined categories based on their own judgement. This manual approach
presents two major challenges to researchers who must identify and use the
information hidden in a large data corpus. First, it is practical only for a
small number of articles. Second, categorization relies on the subjective
judgement of assessors. To manage and use the data effectively, a more flexible
approach is necessary for classifying information in these large data stores.
This problem is well suited to machine learning and data mining techniques
that automate the classification of articles in large volumes of data.
The availability of large data sources has triggered a significant amount of
research in the field of Knowledge Discovery in Databases (KDD). The field of
KDD is the synthesis of various research fields, such as machine learning,
databases, and artificial intelligence [3].
The shared goal of these fields is to mine large quantities of data in order
to discover knowledge. The KDD process consists of a number of activities,
including data collection, abstraction, and cleansing, and the use of machine
learning techniques to find patterns in data (see [3] for the steps of the KDD
process).
Classification modeling is a machine learning technique that maps
data instances into one or more pre-defined classes for subsequent use
in detecting trends and identifying objects. This modeling technique
can be automated using supervised learning methods. For example, supervised
categorization is an important area of research in which Learning Classifier
Systems (LCS) have been applied to model the human categorization process [4].
Learning Classifier Systems are a machine learning paradigm in which an agent
learns a task by interacting with its environment. Rewards or other forms of
feedback guide the performance of the agent by modifying its rule-based
model of the environment [4]. Learning Classifier Systems have been in existence
for more than twenty years, dating back to the mid-1970s and the introduction of
John Holland's Cognitive System One (CS-1) [5]. Since that time, research has
flourished in the LCS community, leading to a number of modifications to the
traditional approach that have significantly improved the LCS architecture.
These improvements led to a new type of Classifier System, XCS [6,7], developed
by Wilson, which evolves accurate, maximally general classifiers.
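A minimal sketch may help make the rule-based model concrete. It assumes the common LCS encoding in which a rule condition is a string over `0`, `1`, and `#` (a wildcard), and it adjusts a rule's reward prediction with a simple Widrow-Hoff style update. The class name, the learning rate `BETA`, and the initial values are assumptions chosen for illustration. The sketch deliberately omits the genetic algorithm, the accuracy-based fitness, and the other components of a full XCS, so it should not be read as Wilson's algorithm.

```python
from dataclasses import dataclass

BETA = 0.2  # learning rate; an assumed value chosen for illustration


@dataclass
class Classifier:
    condition: str            # string over {'0', '1', '#'}; '#' matches either bit
    action: int               # class label the rule advocates
    prediction: float = 10.0  # estimated reward if the rule's action is taken
    error: float = 0.0        # running estimate of the prediction error
    experience: int = 0       # number of updates received so far

    def matches(self, state: str) -> bool:
        """Return True if every non-wildcard position equals the input bit."""
        return len(state) == len(self.condition) and all(
            c == "#" or c == s for c, s in zip(self.condition, state)
        )

    def update(self, reward: float) -> None:
        """Move the error and prediction estimates toward the observed reward."""
        self.experience += 1
        # The error estimate uses the prediction held before this step's correction.
        self.error += BETA * (abs(reward - self.prediction) - self.error)
        self.prediction += BETA * (reward - self.prediction)


def match_set(population, state):
    """Collect the rules whose condition matches the current input."""
    return [cl for cl in population if cl.matches(state)]


population = [Classifier("1#0", 1), Classifier("0##", 0), Classifier("###", 1)]
matched = match_set(population, "110")
```

Here the input `110` is matched by the rules `1#0` and `###`, the latter being the most general rule possible. Repeated calls to `update` pull a rule's prediction toward the rewards it actually receives.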
To this end, this research combines data mining and machine learning
in the context of a Learning Classifier System to automate the classification step of the SAM Analysis.
This research will tailor the automated classification step of the SAM Analysis to the domain interests of the researcher. To accomplish this goal, the study comprises the following subtasks:
The first task will provide a means to capture changing user objectives. Researchers with similar backgrounds and interests
can exploit existing rule sets (classifiers) to discover interests of similar users. This
study will investigate user registration as a means for capturing profiles.
The second task will allow the user to construct classifiers by
visually building decision trees. This approach to building classifiers provides
a means for researchers to encode knowledge regarding the best way to classify
publications under a specific set of categories. A data visualization tool by
Ware et al. [8] demonstrates an interactive method for constructing decision tree
classifiers. This tool enables the user to build a decision tree graphically
using two-dimensional polygons. It is part of the Weka workbench (http://www.cs.waikato.ac.nz/ml)
and will be considered as a starting point for the development of this task.
The third task is to classify research publications in a SAM database.
The Classifier System proposed for automating classification of knowledge
in research databases will be based on the XCS architecture.
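The hand-built decision trees described in the second task could be represented along the following lines. This is only a sketch under stated assumptions: the features are hypothetical keyword counts per article, the category names are invented, and the splits are simple single-attribute thresholds rather than the two-dimensional polygons used by the tool of Ware et al. [8].

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    category: str          # category assigned to articles that reach this leaf


@dataclass
class Split:
    feature: str           # hypothetical feature name, e.g. a keyword count
    threshold: float
    below: "Node"          # branch taken when the feature value < threshold
    at_or_above: "Node"    # branch taken otherwise


Node = Union[Leaf, Split]


def classify(node: Node, features: dict) -> str:
    """Walk the tree from the root until a leaf is reached."""
    while isinstance(node, Split):
        value = features.get(node.feature, 0.0)  # missing features count as 0
        node = node.below if value < node.threshold else node.at_or_above
    return node.category


# A tree a researcher might build by hand (all names are illustrative).
tree = Split(
    "count_productivity", 1,
    Leaf("other"),
    Split("count_measurement", 2,
          Leaf("productivity: general"),
          Leaf("productivity: measurement")),
)

label = classify(tree, {"count_productivity": 3, "count_measurement": 5})
```

A tree of this kind encodes a researcher's categorization knowledge explicitly, so it can be inspected, edited, or reused by researchers with similar interests.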
SAM databases present an important platform for investigating the use of automated
data management tools and techniques in the research domain and pose
challenges in the real-world application of knowledge discovery in research databases.
This study will remove the
requirement for researchers to sift manually through volumes of information in
their data stores, reducing their workload in terms of time and effort.
Allowing researchers to restructure data stores based on their interests
exposes them to activity in their area and helps them discover new ideas for research endeavors.
The investigation thus far indicates that an LCS is a promising paradigm for use
in building an automated data management tool that mines large research
databases. However, additional research is needed to demonstrate the
ability of this tool to learn pre-defined categories of articles and
classify new articles in large research databases.
Although the field of KDD continues to evolve, relatively few research
efforts have investigated the use of LCS in KDD. There are no well-known academic or
real-world applications for use as benchmarks to validate the results of this
research.
All work to satisfy the Ph.D. course requirements has been completed.
Also, the qualifying examination has been successfully completed. Admission to
Ph.D. candidacy has been granted by the graduate school at Texas Tech University.
This candidacy is for the Doctorate of Computer Science in the College of
Engineering. Work on the Ph.D. dissertation is in progress.
Currently, the literature review is being conducted
for this research effort. Also, data repositories have been obtained from
researchers at the Texas Tech Center for Systems Solutions (CSS) and will be
used as training and testing data sets.
By participating in the Doctoral Consortium, I will have an opportunity to share this research effort and obtain input from others involved in similar research endeavors. I hope to gain feedback pertaining to new ideas, feasibility, and scope of the research project.
[1] Sumanth, D.; Omachonu, V. K.; Beruvides, M. G. (1990). "A review of the state-of-the-art research on white collar/knowledge-worker productivity". International Journal of Technology Management, Vol. 5, No. 3, pp. 337-355.