An enhanced text classifier for automatic document classification

Automatic classification has become an important research area due to the exponential growth of digital content. Manual classification of documents is a painstaking and labor-intensive task, and organizing a collection of documents by subject area takes considerable time. This research has developed a computer programme that automatically classifies a given text document, so that the user obtains the classification results immediately after feeding the document to the system. For classification, we use a new algorithm developed by enhancing the basic form of an existing text classifier called tf-idf. The classification accuracy of the new algorithm was measured and compared with that of the basic tf-idf classifier. The research revealed that the newly developed classifier achieves better classification accuracy than the basic tf-idf classifier.


Introduction
Due to the continued growth of information in both printed and electronic formats, it is becoming extremely difficult to organize text materials properly. It is therefore usual practice to follow a standard classification scheme when organizing bibliographic materials. Classification schemes play an important role because they facilitate subject access as well as the locating of bibliographic materials in a library or any other repository. Despite the significance of these classification systems, manual classification is demanding: the classifier has to devote a great deal of time and full attention to the work, so the output may not seem worth the effort invested. In contrast to printed documents, the volume of electronic information is increasing rapidly because of the growing use of the Internet and Web content. As a result, classifying electronic documents has become even more troublesome than classifying printed documents.
1 Assistant Librarian, Sabaragamuwa University of Sri Lanka, Sri Lanka. Email: manju@sab.ac.lk
2 Senior Assistant Librarian, University of Moratuwa, Sri Lanka. Email: ruga@uom.lk
Thus, there is an inevitable need for a tool that can classify electronic text documents.
For this purpose, automatic text classification systems have been introduced to assign electronic documents automatically to appropriate pre-defined categories.
Many classification algorithms, known as text classifiers, have been proposed to accomplish this task. In general, these algorithms determine how closely a given test document matches the pre-defined categories. Different text classifiers use their own methodologies to match the most closely related subject category to the given test document. For example, probabilistic methods, distance learning methods, support vector machines, genetic algorithms, hidden Markov models, decision tree methods, decision rule methods, regression methods, neural network methods (Sebastiani, 2002; Tao, Ling, & Cheng, 2005) and tf-idf (term frequency-inverse document frequency) based methods (Abbas, Smaïli, & Berkani, 2010; Tao et al., 2005) use different concepts to classify the given documents.
In our study, we mainly focus on enhancing the basic form of the tf-idf classifier, as it has some limitations. In general, the tf-idf weight function (Salton, Wong, & Yang, 1975)

Although Lucene was designed as an information retrieval search engine, this study exploits its flexibility to adapt the system as a text classifier as well.
The remainder of this paper is organized as follows.
Section 2 presents an account of related work on text classification.
Section 3 discusses the details of the methodology, and Section 4 discusses the experimental results. Finally, Section 5 concludes the paper.

Related work
Text classification and classification algorithms have been studied extensively for many years. As a result, a wide range of supervised learning methods, in which the machine is given pre-defined example documents and learns to produce the matching output for an input document, has been used in this area.
The use of text classifiers has grown considerably with the development of Internet applications. Dumais and Chen (2000) give an account of one such initiative, a hierarchical approach. Their study explores the use of a hierarchical structure for classifying a large, heterogeneous collection of Web documents.
For this purpose, it uses the Support Vector Machine (SVM) learning model, which had not previously been used for hierarchical problems. A probabilistic description-oriented approach (Gövert, Lalmas, & Fuhr, 1999) has also been reported for categorizing Web content.
Here, probabilistic indexing is used and documents are categorized using the k-nearest neighbor (kNN) classifier, which is given a probabilistic interpretation. This study considered features specific to Web documents as well as standard features of text documents.
Calvo, Lee, & Li (2004) applied an automatic Naïve Bayes classifier to news stories from the Reuters RCV1 corpus and to another corpus of over 41,000 Web sites. They performed flat multilabel classification using two distinct thresholding strategies, score-based and rank-based. Billsus and Pazzani (1999) report another news classification effort, which focused especially on the difference between users' long-term and short-term interests and their dynamic nature.

Methodology
Pre-processing
Two major pre-processing phases were followed in this research. First, stop words are removed to reduce the number of less significant words in the text. Second, stemming is carried out to reduce the number of index terms that share the same root.
Lucene's default stop words list was used to remove the stop words. It covers a wide range of commonly known stop words, and it was further enriched by comparing it with other available stop word lists.
Figure 1 shows part of the stop words list used in this research.

Figure 1: Portion of the stop words list used
Whenever required, the system allows the user to update the stop words list in order to improve the performance of the system.
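The stop word removal step can be sketched in Python as follows. This is only a minimal illustration: the tokenizer, the function names and the tiny stop words set are assumptions made for the example, and the set is not the enriched Lucene list used in this research.

```python
import re

# Illustrative stop words only (an assumption for this sketch); the study's
# actual list is Lucene's default list enriched with other sources.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop words list."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("The mind is a central topic in the philosophy of mind.")
content_terms = remove_stop_words(tokens)
print(content_terms)  # ['mind', 'central', 'topic', 'philosophy', 'mind']

# Updating the list, as a user of the system might, is a matter of passing
# an extended set.
extended_terms = remove_stop_words(tokens, STOP_WORDS | {"topic"})
print(extended_terms)  # ['mind', 'central', 'philosophy', 'mind']
```

Keeping the stop words in a set makes each membership check cheap and lets an updated list be supplied without changing the filtering code.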
Stemming is the next step of pre-processing. For this purpose, Porter's stemming algorithm is used, as it is the most commonly used algorithm for word stemming in the English language (Smirnov, 2008; Zhou, Smalheiser, & Yu, 2006). Figure 2 gives a sample of stemmed words along with their original terms.

Original term    Stemmed term
Administered     administ
Clairvoyance     clairvoy
Define           defin

... the initial stage, now it is easy to recognize and extract such kind of highly related terms using the following process.

Feature Selection
The process of determining the proper number of terms that most appropriately describe the subject area of the test document is known as feature selection. In our study, the size of the feature space (the sample of most critical terms) is determined experimentally as follows.
First, we implemented the new text classifier algorithm (given by equation (4)) using the Lucene search engine library. Then, 47 electronic test documents were used for the classification. These documents fall within the subject range of class numbers 110 to 139 of the DDC (edition 21) scheme. Furthermore, these documents were selected without any detailed knowledge of their specific content.
They were then classified by an experienced subject classifier, who carefully examined their core content. Next, the same set of documents was classified using the new automatic classifier. This time, the 47 test documents were classified based on 1 to 6 keywords; that is, each document was classified six times, with the dimension of the feature space set to 1, 2, 3, 4, 5 and 6 in turn. Each document of the test collection was tested for classification accuracy against 385 training documents. Figure 4 shows the variation of classification accuracy against the number of keywords. Accuracy level 5 was given for an exact match (i.e. the classification result exactly matches the subject of the test document), while the minimum, zero, was given for inaccurate classification results. Accuracy levels 4, 3 and 2 were given for results one step, two steps and three steps super/sub to the exact subject of the test document, respectively. In addition, accuracy level 1 was given for subjects at the same level (or their sub classes).
From figure 4, one can conclude that classification accuracy drops considerably with one and three keywords. Accuracy is somewhat higher with two keywords, but there is a lot of variation around two, so it is not safe to select this number. With four or more keywords, the classification results are good and steady. This characteristic is common to most of the test documents. For example, figure 5 shows that, for two individual documents of the test sample, classification accuracy increases once four keywords are considered. Accordingly, in this study, the feature space has been limited to a fixed size of four elements. This threshold is the minimum value that starts to give the most accurate results for the test documents. Because the smallest suitable value is selected, the system can carry out the classification with the least effort and in less time. Furthermore, the proposed method benefits from using a fixed, pre-determined size for the dimension of the feature space. These two facts further reduce the computational time that would otherwise have to be spent on automatic feature selection.
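A fixed feature space of four elements can be illustrated with a short Python sketch. It assumes that the keywords are simply the most frequent pre-processed terms of the test document; the function name and the sample terms are hypothetical and serve only as an illustration.

```python
from collections import Counter


def select_features(terms, k=4):
    """Return the k most frequent terms of a document with their counts."""
    return Counter(terms).most_common(k)


# Hypothetical pre-processed terms of a test document (stop words removed,
# words stemmed); the counts are chosen without ties for clarity.
terms = (
    ["telepathi"] * 5
    + ["mind"] * 4
    + ["psychic"] * 3
    + ["phenomena"] * 2
    + ["experi"]
)

features = select_features(terms)
print(features)
# [('telepathi', 5), ('mind', 4), ('psychic', 3), ('phenomena', 2)]
```

Because the size k is fixed in advance, no search over candidate feature space sizes is needed at classification time.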

Scoring
In this study, a new text classifier has been developed using an existing term frequency weight function called the tf-idf weight function. This function assigns a value to a document based on four factors: the term frequency, the total number of terms in that document, the number of documents in the collection in which a particular word occurs, and the total number of documents in the corpus. Accordingly, the tf-idf weight function can also be considered a basic type of classifier.
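This basic tf-idf function is given by equation (1).
$$\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_{t} \tag{1}$$
Where, $\mathrm{tf}_{t,d}$: term frequency of the term $t$ in the document $d$.
$\mathrm{idf}_{t}$: inverse document frequency of the term $t$ in the collection.
Here, the $\mathrm{tf}_{t,d}$ and $\mathrm{idf}_{t}$ values can be defined as follows.
$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\sum_{k} f_{k,d}} \tag{2}$$
Where, $f_{t,d}$: number of occurrences of the considered term $t$ in the document $d$.
$\sum_{k} f_{k,d}$: sum of the number of occurrences of all terms in document $d$.

and,
$$\mathrm{idf}_{t} = \log \frac{|N|}{|\{d \in N : t \in d\}|} \tag{3}$$
Where, $|N|$: total number of documents in the collection.
$|\{d \in N : t \in d\}|$: number of documents in the collection in which the term $t$ occurs.
$N$ denotes the entire collection of documents.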
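The tf-idf weight described above can be computed with a few lines of Python. The sketch below is illustrative only: the function names, the use of the natural logarithm, the zero weight for terms that occur in no document, and the toy collection are all assumptions rather than details taken from the study's Lucene implementation.

```python
import math
from collections import Counter


def tf(term, doc_terms):
    """Occurrences of `term` divided by the total number of terms in the document."""
    if not doc_terms:
        return 0.0
    return Counter(doc_terms)[term] / len(doc_terms)


def idf(term, collection):
    """Log of (number of documents / number of documents containing `term`)."""
    containing = sum(1 for doc in collection if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term unseen in the collection gets weight 0
    return math.log(len(collection) / containing)  # natural log (assumption)


def tf_idf(term, doc_terms, collection):
    """tf-idf weight of `term` in one document of the collection."""
    return tf(term, doc_terms) * idf(term, collection)


# Hypothetical collection of three pre-processed documents.
collection = [
    ["mind", "bodi", "mind", "dualism"],
    ["telepathi", "mind", "psychic"],
    ["logic", "argument", "proof"],
]

weight = tf_idf("mind", collection[0], collection)
print(round(weight, 4))  # 0.2027
```

In this toy collection the term "mind" makes up half of the first document (tf = 0.5) and occurs in two of the three documents (idf = ln 1.5), which gives the weight shown.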
Before this tf-idf weight can be used as a basic tf-idf classifier, however, it has to be developed further to address the following flaws.
1. The basic form considers the tf-idf value of only a single keyword. Documents can therefore be classified incorrectly when the term with the highest frequency does not correctly indicate the subject of a document. Moreover, there can be subjects that cannot be determined from a single key term alone.
2. Since the function considers only a single key term, it cannot take into account the importance of other terms with equally high frequencies.
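The first flaw can be illustrated with a minimal Python sketch of a single-keyword classifier. The ranking rule used here (take the most frequent term of the test document and pick the training document with the highest tf-idf weight for that term), the function names and the toy documents are all assumptions made for illustration; they are not the implementation used in this study.

```python
import math
from collections import Counter


def tf_idf(term, doc_terms, collection):
    """tf-idf weight of `term` in `doc_terms` relative to `collection`."""
    containing = sum(1 for doc in collection if term in doc)
    if not doc_terms or containing == 0:
        return 0.0
    term_frequency = Counter(doc_terms)[term] / len(doc_terms)
    inverse_doc_frequency = math.log(len(collection) / containing)
    return term_frequency * inverse_doc_frequency


def classify_single_keyword(test_terms, training_set):
    """Classify using only the most frequent term of the test document."""
    key_term = Counter(test_terms).most_common(1)[0][0]
    docs = [terms for _, terms in training_set]
    best_subject, _ = max(
        training_set, key=lambda item: tf_idf(key_term, item[1], docs)
    )
    return key_term, best_subject


# Hypothetical training documents, given as (subject, pre-processed terms).
training_set = [
    ("Mind in philosophy", ["mind", "mind", "conscious", "bodi"]),
    ("Telepathy", ["telepathi", "psychic", "mind", "thought"]),
    ("Logic", ["logic", "argument", "proof", "valid"]),
]

# Hypothetical test document about telepathy in which "mind" happens to be
# the single most frequent term.
test_terms = ["mind"] * 3 + ["telepathi"] * 2 + ["psychic"] * 2

key_term, subject = classify_single_keyword(test_terms, training_set)
print(key_term, "->", subject)  # mind -> Mind in philosophy
```

In this toy case the terms "telepathi" and "psychic" together outweigh "mind", yet they are ignored, so the document is assigned to the wrong subject.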

To overcome these two difficulties, the study proposes the following resolutions, on which the new text classifier algorithm is built.
1. Instead of considering a single key term, this research considers more than one key term.
2. As the threshold value is extended to four, the text classifier has to be enhanced to consider the importance of up to four key words from the test document and the same number from the training set. It is also necessary to consider the level of importance of those terms in determining the subject stream of the test document. To do this, it is essential to represent numerically how much significance they have within the test document.
To achieve these goals, we have developed the following new algorithm (4).

Training Sample
The training sample, or training set, is the collection in which the training documents are stored. Using them, one can determine the subject of an external document by evaluating the similarity or relevance between the test document and the training documents in the training set. In this study, we have used 385 training documents within the domain of philosophy-related subjects.
Only text documents were selected, since this study focuses only on textual information. Documents for the training set were selected from the Wikipedia online encyclopedia, the Stanford Encyclopedia of Philosophy, Google Directory and the subject gateway of Bulletin Board for Libraries. As the selected documents were not pre-classified, they were further examined and classified by experienced subject classifiers according to the DDC scheme. It is important to note that more than one training document was selected for the same subject, although their contents are not similar to each other. Because there are multiple documents from the same subject stream, it is more likely that the most relevant training document will be selected for a given test document.

Results
For the evaluation, 58 test documents belonging to 32 distinct subjects from the selected domain (DDC subject classes 110 to 139) were chosen. After selection, they were classified by an experienced subject classifier. These test documents were then classified automatically using both the new classifier and the basic tf-idf classifier. As the training set, 385 pre-classified text documents were used. Finally, the results were evaluated based on the precision, recall and $F_1$ measures.
In text classification, precision can be defined as the fraction of retrieved categories that are relevant.
$$\text{Precision} = \frac{|\{\text{relevant categories}\} \cap \{\text{retrieved categories}\}|}{|\{\text{retrieved categories}\}|} \tag{6}$$
Recall is the fraction of relevant categories that are retrieved.
$$\text{Recall} = \frac{|\{\text{relevant categories}\} \cap \{\text{retrieved categories}\}|}{|\{\text{relevant categories}\}|} \tag{7}$$
A single measure that combines precision and recall is the $F_1$ measure.
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
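The three measures can be computed from a list of retrieved categories and a list of relevant categories, as in the following Python sketch. The function name and the category labels are hypothetical, and returning zero when a denominator is empty is an assumption made so that the sketch never divides by zero.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for retrieved vs. relevant categories."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant categories that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four categories retrieved, three actually relevant,
# two of them in common.
p, r, f1 = precision_recall_f1(
    retrieved=["telepathy", "mind", "spells", "ontology"],
    relevant=["telepathy", "spells", "apparitions"],
)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Here two of the four retrieved categories are relevant (precision 1/2) and two of the three relevant categories are retrieved (recall 2/3), so the combined measure is 4/7.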

Table 1 gives the results obtained for each subject. From the values obtained for precision and recall, it is possible to draw the precision-recall curve given in figure 6. This graph was drawn from the precision and recall values obtained for the DDC class Spells-Curses-Charms using the new classification algorithm. One can notice that it has the characteristic sawtooth shape of a typical precision-recall graph.

Conclusions
This study has developed a new form of text classifier using an existing weight algorithm.
The best classification results were obtained when the four topmost keywords of the test documents were considered. Therefore, four keywords can be considered the most suitable feature space size for the new algorithm. Considering these factors, we can first conclude that the new text classifier classifies documents much more accurately than the basic tf-idf classifier does. Second, it is evident that the highest-frequency term is not the only factor that correctly determines the subject of a document; several other high-frequency terms matter as well. Moreover, this depends on the nature of the text classifier.
Figure A1 and Figure A2 show the classification results obtained for a test document that belongs to the subject Telepathy in psychic phenomena. The same document was classified twice, using the basic tf-idf classifier and the new classifier, and one can see how much the classification accuracy increased with the new classifier.
The following figures show the classification results obtained by the system for some test documents. Figure A3 and Figure A4 give the five topmost results for the basic tf-idf classifier and the new classifier for an input document that belongs to the subject area of Mind in philosophy. Another test document, which falls into the subject area of Telepathy in psychic phenomena, was classified separately using the two classifiers; the five topmost results are given in Figure A7 and Figure A8.