Outlier mining is an important task for discovering data records whose behavior is exceptional compared with the remaining records in a dataset. Outliers do not conform to the behavior of the other data objects in the dataset. There are many effective approaches for detecting outliers in numerical data, but only a limited number of approaches for categorical datasets. We propose an algorithm, NAVF (Normally distributed Attribute Value Frequency), to detect outliers in categorical data. The algorithm utilizes frequent-pattern data mining, and it avoids the problem, present in previous work such as Greedy, AVF, FPOF, and FDOD, of having to supply the number of outliers k to obtain optimal accuracy in a classification model. The algorithm is applied to UCI ML Repository datasets such as Nursery, Breast Cancer, Mushroom, and Bank, with numerical attributes excluded. The experimental results show that it is efficient for outlier detection in categorical datasets.
Keywords: Outliers, Categorical, AVF, NAVF, FPOF, FDOD
1 Introduction
Outlier analysis is an important research field in many applications such as credit card fraud detection, network intrusion detection, and the medical field. This analysis concentrates on detecting infrequent data records in a dataset.
Most existing systems concentrate on numerical or ordinal attributes. Categorical attribute values can sometimes be converted into numerical values, but this process is not always preferable. In this paper a simple method for categorical data is presented.
AVF is one of the efficient methods for detecting outliers in categorical data. It calculates the frequency of each value in each attribute, converts these frequencies to probabilities, computes the attribute value frequency (AVF) score of each record by averaging those probabilities, and selects the top k outliers as the records with the lowest AVF scores. The only parameter used in this method is k, the number of outliers. FPOF is based on frequent patterns, adopted from the Apriori algorithm [1]. It computes the frequent item sets contained in each object, calculates an FPOF score from their frequencies, and reports the k records with the lowest FPOF scores as outliers. This method takes more time to detect outliers than AVF, and it needs an additional parameter σ, a threshold that decides which subsets of each data object count as frequent. Another method is based on an entropy score: Greedy [2] detects outliers from categorical data by minimizing the entropy of the remaining dataset. The approaches previously used to detect outliers fall into the following categories.
2 Existing Approaches
Statistical based
These methods adopt a parametric model that describes the distribution of the data, and the data were mostly univariate [3, 4]. The main drawbacks of this approach are the difficulty of finding a correct model for different datasets and the fact that its efficiency decreases as the number of dimensions increases [4]. To mitigate this problem the principal component method can be used. Another way to handle high-dimensional datasets is to convert the data records into layers; however, these ideas are not practical for three or more dimensions.
Distance-Based
Distance-based methods make no assumptions about the distribution of the data records, because they compute the distances between records directly. This, however, leads to high computational complexity, so these methods are not useful for large datasets. Some improvements to distance-based algorithms exist, such as that of Knorr et al. [5], who define a record as an outlier if at least a given fraction of the dataset's records lie farther from it than some threshold distance. Even so, the cost remains exponential in the number of nearest neighbours.
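As an illustration, here is a minimal Python sketch of this distance-based criterion under the usual DB(p, D) formulation; the function name and the generic dist argument are our own, not from the cited work:

def is_db_outlier(x, data, p, D, dist):
    # x is a DB(p, D)-outlier if at least a fraction p of the records
    # in the dataset lie at a distance greater than D from x.
    far = sum(1 for y in data if dist(x, y) > D)
    return far / len(data) >= p

The quadratic cost of evaluating this test for every record against every other record is exactly the complexity problem noted above.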
Density Based
These methods are based on estimating the density of the data and identifying as outliers those points lying in low-density regions. Breunig et al. calculate a local outlier factor (LOF) to determine whether an object has sufficiently many neighbours around it [6]. A record is declared an outlier when its LOF exceeds a user-defined threshold. Papadimitriou et al. presented a similar technique called the Local Correlation Integral (LOCI), which selects the minimum-points parameter (MinPts) of LOF through statistical methods [7]. Density-based methods have the advantage that they can detect outliers missed by techniques with a single, global criterion. The terminology used in this paper is given below.
3 Algorithms
Greedy algorithm
If a dataset contains outliers, it deviates from its original behavior and yields wrong results in any analysis. The Greedy algorithm is built on the idea of finding a small subset of the data records whose removal eliminates the disturbance of the dataset. This disturbance is also called entropy or uncertainty. Formally, let D be a dataset with m attributes A1, A2, ..., Am, and let d(Ai) be the domain of distinct values of attribute Ai. Then the entropy of a single attribute Aj is
E(A_j) = -\sum_{x \in d(A_j)} p(x) \log_2 p(x) \qquad (1)
Because the attributes are assumed to be independent of each other, the entropy of the entire dataset D = {A1, A2, ..., Am} is equal to the sum of the entropies of the m individual attributes, defined as follows:

E(A_1, A_2, \ldots, A_m) = E(A_1) + E(A_2) + \cdots + E(A_m) \qquad (2)
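For illustration, here is a minimal Python sketch of these two entropy computations, equations (1) and (2); the function names are ours, not from the original implementation:

import math
from collections import Counter

def attribute_entropy(values):
    # Entropy E(Aj) of one categorical attribute, equation (1).
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def dataset_entropy(records):
    # Entropy of the whole dataset as the sum of the per-attribute
    # entropies, equation (2), under the independence assumption.
    m = len(records[0])
    return sum(attribute_entropy([r[j] for r in records]) for j in range(m))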
The Greedy algorithm takes the number of outliers k as input [2]. All records in the set are initially labelled as non-outliers, the frequencies of all attribute values are computed, and from these frequencies the initial entropy of the dataset is calculated. Greedy then scans the data k times, setting aside one non-outlier in each scan. During each scan, every single non-outlier is temporarily removed from the dataset and the total entropy of the remaining dataset is recalculated. The non-outlier whose removal produces the maximum decrease in the entropy of the remaining dataset is the data point removed by the algorithm in that scan. The complexity of Greedy is O(k·n·m·d), where k is the required number of outliers, n the number of objects in the dataset D, m the number of attributes in D, and d the number of distinct attribute values per attribute. Pseudo code for the Greedy algorithm is as follows.
Algorithm: Greedy
Input: Dataset - D
Target number of outliers - k
Output: k outliers detected
Label all data points x1, x2, ..., xn as non-outliers
Calculate the initial frequency of each attribute value and update the hash table in each iteration
Calculate the initial entropy
...
However, Greedy needs k as input, and the number of outliers must be searched over several runs to reach the optimal accuracy of a classification model.
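A naive Python sketch of the Greedy procedure follows, reusing dataset_entropy from the sketch above. It recomputes the entropy from scratch for every candidate instead of maintaining the hash table of frequencies mentioned in the pseudocode, so it illustrates the logic rather than the O(k·n·m·d) implementation:

def greedy_outliers(records, k):
    # Repeat k times: temporarily leave out each remaining non-outlier,
    # recompute the entropy of the rest, and permanently remove the
    # point whose removal decreases the entropy the most.
    remaining = list(records)
    outliers = []
    for _ in range(k):
        best_i = min(range(len(remaining)),
                     key=lambda i: dataset_entropy(remaining[:i] + remaining[i+1:]))
        outliers.append(remaining.pop(best_i))
    return outliers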
AVF algorithm
The Greedy algorithm discussed above is linear with respect to the data size, but it needs k scans over the data. Other models based on frequent item set mining (FIM) need to create a large space to store the item sets and then search for these sets in every data point; such techniques become very slow when a low threshold value is chosen for finding frequent item sets.
A simpler and faster approach, which minimizes the scans over the data and needs neither extra space nor a search over combinations of attribute values or item sets, is the Attribute Value Frequency (AVF) algorithm. An outlier point xi is defined based on the AVF score below:
AVFscore(x_i) = \frac{1}{m} \sum_{l=1}^{m} f(x_{il}) \qquad (3)

where f(x_{il}) denotes the frequency with which the value x_{il} occurs in attribute l of the dataset.
In this approach [1], too, the k outliers must be recomputed many times to reach the optimal accuracy of a classification model. Pseudo code for the AVF algorithm is as follows:
...
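In the absence of the original pseudocode, the following minimal Python sketch shows the AVF computation as described above (the names are ours):

from collections import Counter

def avf_outliers(records, k):
    n, m = len(records), len(records[0])
    # Frequency of every value in every attribute.
    freq = [Counter(r[j] for r in records) for j in range(m)]
    # AVF score of each record: the average frequency of its attribute
    # values (equation (3)); low scores indicate rare value combinations.
    scores = [sum(freq[j][r[j]] for j in range(m)) / m for r in records]
    # The k records with the lowest scores are reported as outliers.
    ranked = sorted(range(n), key=lambda i: scores[i])
    return [records[i] for i in ranked[:k]]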
The complexity of AVF is lower than that of Greedy, since AVF needs only one scan over the data to detect outliers; its complexity is O(n·m). It still needs the value k as input. FPOF [8] discusses frequent-pattern-based outlier detection for finding infrequent objects; it too requires the value k, together with another threshold parameter σ, as input.
NAVF algorithm
The proposed model (NAVF) determines an optimal number of outliers in a single run, so as to obtain optimal precision in any classification model, with good precision and a low recall value. The method calculates the value of k itself, based on the frequencies. Let D be a dataset with m attributes A1, A2, ..., Am, and let d(Ai) be the domain of distinct values of attribute Ai. Let kN denote the number of outliers under the assumption that the scores are normally distributed. To obtain kN, the model uses Gaussian theory: any object whose frequency score is less than "mean − 3·SD" is treated as an outlier. The method uses the AVF score formula, but no k value is required. Let D be a categorical dataset containing n data points xi, i = 1, ..., n. If each data point has m attributes, we can write xi = [xi1, ..., xil, ..., xim], where xil is the value of the l-th attribute of xi.
Algorithm: NAVF
Input: Dataset - D
Output: kN detected outliers
Step 1: Read the dataset D
Step 2: Label all data points as non-outliers
Step 3: Calculate the normalized frequency of each attribute value for each point xi
Step 4: Calculate the frequency score Fi of each record xi as its attribute value frequency:

F_i = AVFscore(x_i) = \frac{1}{m} \sum_{l=1}^{m} f(x_{il})

Step 5: Compute the seed values a and b as b = mean(Fi) and a = b − 3·std(Fi), provided max(Fi) > 3·std(Fi)
Step 6: If Fi < a, declare xi an outlier
Step 7: Return the kN detected outliers
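A minimal Python sketch of these steps is given below. It assumes that "normalized frequency" means the value count divided by n, and it omits the guard condition of Step 5; the names are ours, not from the authors' MATLAB implementation:

import statistics
from collections import Counter

def navf_outliers(records):
    # Steps 1-2: read the data; every point starts as a non-outlier.
    n, m = len(records), len(records[0])
    # Step 3: normalized frequency of each attribute value.
    freq = [Counter(r[j] for r in records) for j in range(m)]
    # Step 4: AVF score F_i of every record.
    scores = [sum(freq[j][r[j]] / n for j in range(m)) / m for r in records]
    # Step 5: threshold a = mean(F) - 3 * std(F).
    a = statistics.mean(scores) - 3 * statistics.pstdev(scores)
    # Steps 6-7: the records scoring below the threshold are the k_N outliers.
    return [records[i] for i in range(n) if scores[i] < a]

Unlike Greedy and AVF, no k is passed in; the number of outliers returned is determined by the data itself.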
4 Experimental Results
The model has been applied to the Breast Cancer, Nursery, and Bank marketing datasets from the UCI Machine Learning Repository [9]. The approach was implemented in MATLAB, and the experiments were run on a workstation with a Pentium(R) D 2.80 GHz processor and 1.24 GB of RAM.
The Nursery dataset consists of nine attributes and 6236 records. The data were divided into two parts based on the parent attribute: the first part contains 4320 records with the usual parent type, and the second part contains 1916 records with the pretentious parent type, which is used as the outlier pool in our experiment. In the first iteration, 956 sample records were selected randomly using the Clementine tool, one of every two records (1-in-2 sampling). These 956 records were mixed with part one, and the normally distributed AVF was applied to find the outliers; the outliers found are given in Table 2. In the next iteration, 382 records were selected randomly (one of every five records), mixed with the first part, and the same process was applied; the results are also given in Table 2. The procedure was then repeated with one record selected from every eight records and from every ten records. The same method was applied to the Nursery, Breast Cancer, and Bank datasets taken from the UCI Machine Learning Repository [9] and compared across the different numbers of outliers in each sample; the comparison graph is given in Figure 3. For the Nursery data, Figure 1 shows the frequency of the different attribute values and their structure.
Figure 2 shows, in red, the outliers for the Nursery data. These red-coloured points fall below the "mean − 3·SD" line, as can be observed in the figure. Figure 2 was drawn with MATLAB for the data taken by the 1-in-2 sampling method.
In the first Nursery sample, the NAVF model found only 4.60% of the 956 outliers that were mixed with the 4320 records (5276 records in total). In the next sample, of 382 records, NAVF found 34.8% of the true outliers. For the sample of 238 records, NAVF found 239 outliers, of which 238 were correct, meaning the NAVF model found 100% of the outliers. Similarly, NAVF found 100% of the outliers in the sample of 190 records mixed with the 4320 records of part one.
In the case of the Breast Cancer dataset, the proportion of correct outliers found by the NAVF model did not reach 100%. From the benign class of the Breast Cancer data, 119, 48, 29, and 23 outliers were selected using 1-in-2, 1-in-5, 1-in-8, and 1-in-10 sampling respectively. NAVF found 35, 9, 9, and 14 correct outliers and 0, 0, 0, and 1 wrong ones out of the 119, 48, 29, and 23 injected outliers. The results are given in Table 3, and the comparison of outlier detection is shown in Figure 4.
In the Bank marketing data, only the categorical attributes were retained, and 2644, 1027, 661, and 528 outliers were selected from the records with attribute Y = "yes" using the 1-in-2, 1-in-5, 1-in-8, and 1-in-10 sampling methods respectively; the same process as above was then applied. On this data, NAVF found 274, 198, 152, and 126 correct outliers and 100, 168, 202, and 213 wrong ones from the respective random samples. The outliers found, broken down into true positives and false positives, are arranged in Table 4.
The same process was applied to all samples of the banking data; the results are summarized in Table 4, and the corresponding graph is given in Figure 5. Different classification models were then tested for accuracy on the Bank dataset after the outliers were deleted. The classifiers were tested on the 1-in-5 sample, which contains 39922 original records mixed with 1027 outliers. On this sample the NAVF model found 366 records as outliers, of which 198 were correct outliers and 168 were wrong (original records). When the Neural Network, C5, CRT, QUEST, CHAID, Linear Regression, and Decision Logic classifiers were applied to this sample, they produced the accuracies given in Table 5.
These classifiers were applied to the remaining dataset after the outliers detected by the NAVF model had been deleted. All classifiers gave good results for NAVF; only the Decision Logic classifier gave a much lower accuracy (37.058).
5 Conclusion and Future Work
To sum up, the proposed method determines the optimal number of outliers kN by itself, whereas in existing models the number of outliers must be supplied in advance. When the number of outliers is guessed, genuine records may be discarded, and any classifier modelled on such data may be wrong. Future work includes comparing the precision and recall values of this model with those of the existing models. The same method can also be applied to datasets of mixed type.
References
[1] M. E. Otey, A. Ghoting, and S. Parthasarathy, "Fast Distributed Outlier Detection in Mixed-Attribute Data Sets," Data Mining and Knowledge Discovery, 2006.
[2] Z. He, S. Deng, and X. Xu, "A Fast Greedy Algorithm for Outlier Mining," Proc. of PAKDD, 2006.
[3] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Pearson Addison-Wesley, 2005.
[4] E. Knorr, R. Ng, and V. Tucakov, "Distance-based outliers: Algorithms and applications," VLDB Journal, 2000.
[5] E. Knorr and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," Proc. of the International Conference on Very Large Data Bases, 1998.
[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," Proc. of the ACM SIGMOD International Conference on Management of Data, 2000.
[7] S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos, "LOCI: Fast outlier detection using the local correlation integral," Proc. of the International Conference on Data Engineering, 2003.
[8] Z. He, X. Xu, J. Huang, and S. Deng, "FP-Outlier: Frequent Pattern Based Outlier Detection," Computer Science and Information Systems (ComSIS), 2005.
[9] A. Frank and A. Asuncion, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science, 2010.
[10] S. Wu and S. Wang, "Information-Theoretic Outlier Detection for Large-Scale Categorical Data," IEEE Transactions on Knowledge and Data Engineering, 2011.
[11] B. Yu, M. Song, and L. Wang, "Local Isolation Coefficient-Based Outlier Mining Algorithm," International Conference on Information Technology and Computer Science, 2009.
[12] M. Gebski, A. Penev, and R. K. Wong, "Grouping Categorical Anomalies," IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2008.
[13] E. Müller, I. Assent, U. Steinhausen, and T. Seidl, "OutRank: Ranking outliers in high dimensional data," ICDE Workshop, 2008.
[14] Z. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, 1998.
[15] C. R. Palmer and C. Faloutsos, "Electricity Based External Similarity of Categorical Attributes," Proc. of the 7th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD'03), 2003.
[16] S. Q. Le and T. B. Ho, "A Conditional Probability Distribution-Based Dissimilarity Measure for Categorical Data," Proc. of the 8th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD'04), 2004.
[17] V. Cheng, C. H. Li, J. T. Kwok, and C.-K. Li, "Dissimilarity learning for nominal data," Pattern Recognition, 2004.
[18] S.-G. Lee and D.-K. Yun, "Clustering Categorical and Numerical Data: A New Procedure Using Multidimensional Scaling," International Journal of Information Technology and Decision Making, 2003.
[19] C. Li and G. Biswas, "Unsupervised learning with mixed numeric and nominal data," IEEE Transactions on Knowledge and Data Engineering, 2002.
[20] Z. Huang and M. K. Ng, "A fuzzy k-modes algorithm for clustering categorical data," IEEE Transactions on Fuzzy Systems, 1999.
[21] M. V. Mahoney and P. K. Chan, "Learning nonstationary models of normal network traffic for detecting novel attacks," Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[22] T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos, "Distributed deviation detection in sensor networks," SIGMOD Record, 2003.
[23] A. Lazarevic, L. Ertoz, A. Ozgur, V. Kumar, and J. Srivastava, "A comparative study of outlier detection schemes for network intrusion detection," Proc. of the SIAM International Conference on Data Mining, 2003.
D. LAKSHMI SREENIVASA REDDY1, B. RAVEENDRA BABU2, A. GOVARDHAN3
1 Rise Gandhi Group of Institutions, Ongole, India
2VNR VJIET, Hyderabad, India
3JNTUH, Hyderabad, India
Mr. D. LAKSHMI SREENIVASA REDDY obtained his Master's degree from JNT University, Hyderabad, where he is also pursuing a PhD in Computer Science & Engineering. He currently heads the Department of Computer Science & Engineering at Rise Gandhi Group of Institutions, Ongole. He has 11 years of teaching experience. His research interest is in data mining.
Dr. B. RAVEENDRA BABU obtained his Master's in Computer Science and Engineering from Anna University, Chennai, and his Ph.D. in Applied Mathematics from S.V. University, Tirupati. He currently heads the Department of Computer Science & Engineering at RVR & JC College of Engineering, Guntur. He has 27 years of teaching experience and several publications to his credit. His research interests include VLDB, image processing, pattern analysis, and wavelets.
Dr. A. GOVARDHAN received his BE in Computer Science and Engineering from Osmania University College of Engineering, Hyderabad, in 1992, his M.Tech from Jawaharlal Nehru University (JNU), Delhi, in 1994, and his PhD from Jawaharlal Nehru Technological University, Hyderabad (JNTUH), in 2003. He joined Jawaharlal Nehru Technological University in 1994 as an Assistant Professor, became an Associate Professor in 1999 and a Professor in 2006.