Content area
Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subset supplied to classifiers. Traditional feature selection methods such as information gain and chi-square typically rely on the number of documents that contain a particular term (i.e., the document frequency). However, the frequency with which a given term appears within each document has not been fully investigated, even though it is a promising signal for producing accurate classifications. In this paper, we propose a new feature selection scheme based on the term-event multinomial naive Bayes probabilistic model. Under the model's assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized term by term. We then derive a per-term feature selection measure by replacing the inner parameters with their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), experimental results obtained with two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperforms representative feature selection methods.
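The abstract describes scoring each term via a factorized prediction probability ratio under a term-event multinomial naive Bayes model. As a minimal illustrative sketch (not the paper's exact measure, whose derivation is not given here), one can estimate the smoothed multinomial term probabilities per class and rank terms by the magnitude of their per-term log probability ratio; the function names and the binary-class setup below are assumptions for illustration:

```python
import math
from collections import Counter

def nb_term_scores(docs, labels, vocab):
    """Score terms by |log p(t|c=1) - log p(t|c=0)| under a multinomial
    naive Bayes event model. docs: list of Counter(term -> frequency);
    labels: 0/1 class per document; vocab: iterable of terms."""
    # Accumulate total term frequencies per class (term-event model:
    # every occurrence of a term counts, not just document presence).
    class_counts = {0: Counter(), 1: Counter()}
    for doc, y in zip(docs, labels):
        class_counts[y].update(doc)
    totals = {c: sum(class_counts[c].values()) for c in (0, 1)}
    V = len(vocab)
    scores = {}
    for t in vocab:
        # Laplace-smoothed multinomial estimates of p(t | class).
        p1 = (class_counts[1][t] + 1) / (totals[1] + V)
        p0 = (class_counts[0][t] + 1) / (totals[0] + V)
        # The NB log probability ratio factorizes into a sum of such
        # per-term contributions, which motivates a per-term score.
        scores[t] = abs(math.log(p1 / p0))
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms as the selected feature subset."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Terms whose class-conditional frequencies differ most receive the highest scores, so frequent class-discriminative terms are retained while near-uniform terms are pruned.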
Details
Word sense disambiguation;
Bayesian analysis;
Computer science;
Information storage;
Bioinformatics;
Studies;
Classification;
Neural networks;
Relevance;
Information processing;
Methods;
Algorithms;
Text editing;
Information theory;
Information technology;
Information retrieval;
Numerical experiments;
Artificial intelligence;
Laboratories;
Statistical analysis;
Probabilistic models;
Datasets;
Support vector machines;
Documents