Feature Selection for Classification with Artificial Bee Colony Programming

Feature selection and classification are the most applied machine learning pro-cesses. In the feature selection, it is aimed to find useful properties containing class information by eliminating noisy and unnecessary features in the data sets and facilitating the classifiers. Classification is used to distribute data among the various classes defined on the resulting feature set. In this chapter, artificial bee colony programming (ABCP) is proposed and applied to feature selection for classification problems on four different data sets. The best models are obtained by using the sensitivity fitness function defined according to the total number of classes in the data sets and are compared with the models obtained by genetic programming (GP). The results of the experiments show that the proposed technique is accurate and efficient when compared with GP in terms of critical features selection and classification accuracy on well-known benchmark problems.


Introduction
In recent years, data learning and feature selection has become increasingly popular in machine learning researches. Feature selection is used to eliminate noisy and unnecessary features in collected data that can be expressed more reliably and high success rates are obtained in classification problems. There are several works which related to solve genetic programming (GP) in feature selected classification problem [1][2][3][4]. Since artificial bee colony programming (ABCP) is a recently proposed method, there is no work related to this field. In this chapter, we evaluated the success of classification by selecting the features of GP and ABCP automatic programming methods using different data sets.

Goals
The goal of this chapter is classify models are obtained with comparable accuracy to alternative automatic programming methods. The overall goals of chapter are set out below.
1. Evaluation of the performance of models with parameters such as classification accuracy, complexity.
2. Whether ABCP method actually can select related/linked features. 3. Evaluating training performance of automatic programming methods to determine if there is overfitting.
The organization of the chapter is as follows: background is described in Section 2, detailed description of GP and ABCP is introduced in Section 3. Then, experiments and results are presented and discussed in Section 4. The chapter is concluded in Section 5 with summarizing the observations and remarking the future work.

Feature selection
Feature selection makes it possible to obtain more accurate results by removing irrelevant and disconnected features in model prediction. The model prediction provides the functional relationship between the output parameter y and the input parameters x of the data set. Removing irrelevant features reduces the dimension of the model, thus it reduces space complexity and computation time [5,6].
Feature selection methods are examined in three main categories as filter methods, embedded methods and wrapper methods [7,8]. Filtering methods evaluate features with the selection criterion based on correlations between features (feature relevance) and redundancy and associate of features with class label vectors. Wrapper methods take into account the success of classification accuracy and decide whether or not an object will be included in the model. In order to obtain the successful model, it is not preferred in time constrained problems because the data set is trained and tested many times [9]. Embedded methods perform feature selection as part of model construction is based on identifying the best divisor.
In recent years, increasing interest in discovering potentially useful information has led to feature selection researches [10][11][12][13][14][15]. In [10], a spam detection method of binary PSO with mutation operator (MBPSO) was proposed to reduce the spam labeling error rate of non-spam email. The method performed more successful than many other heuristic methods such as genetic algorithm (GA), particle swarm optimization (PSO), binary particle swarm optimization (BPSO), and ant colony optimization (ACO). Sikora and Piramuthu suggested GA for feature selection problem using Hausdorff distance measure [11]. GA was quite successful the accuracy of prediction accuracy and computational efficiency in real data mining problems. In [12], a wrapper framework was proposed to find out the number of clusters in conjunction in the selection of features for uncontrolled learning and normalize the tendencies of feature selection criteria according size. Feature subset selection using expectation maximization clustering (FSSEM) was used as the performance criterion for the maximum likelihood. Schiezaro and Pedrini proposed a feature selection method based on artificial bee colony (ABC) [13]. The method presented better results for the majority of the data sets compared to ACO, PSO, and GA. Yu et al. showed that selecting the discriminative genes of GP and expressing the relationships between the genes as mathematical equations were proof that GP has been applied feature selector and cancer classifier [2]. Landry et al. compared k-nearest neighbor (k-NN) with decision trees generated by GP in several benchmark datasets [14]. GP was more reliable performance for feature selection and classification problems. Our chapter is the first to work with the ABCP's ability to select the necessary features in datasets.

Classification
Classification provides a number of benefits to make it easier to learn about data and to monitor the data. Several researches have been applied to solve the classification problems [15][16][17]. Fidelis et al. classified each chromosome based on GA that represented classification rules [15]. The algorithm was evaluated in different data sets and achieved successful results. A new algorithm was proposed to learn the distance measure for the closest neighbor classifier for k-nearest multi class classification in [16]. Venkatesan et al. proposed progressive technique for multi class classification can learn new classes dynamically during the run [17].
Much work has been devoted to classification using GP and ABC [18][19][20][21][22][23][24][25]. GP based feature selection age layered population structure as a new algorithm for feature selection with classification was compared with other GP versions in [18]. Lin et al. proposed the feature layered genetic programming method for feature selection and feature extraction [19]. The method, had a multilayered architecture, was built using multi population genetic programming. The experimental results show that the method achieved high success in both feature selection and feature extraction as well as classification accuracy. Ahmed et al. aimed at automatic feature selection and classification of mass spectrometry data with very high specificity and small sample representation using GP [20]. GP achieved higher success as a classification method by selecting fewer features than other conventional methods. Liu et al. designed a new GP based ensemble system to classify different cancer types where the system was used to increase the diversity of each ensemble system [21]. ABC was used data clustering on benchmark problems and was compared conventional classification techniques in [22]. Karaboga et al. applied ABC on training feed forward neural networks and classified different datasets [23]. ABC was used to improve the performance of classification in several domains avoiding the issues related to band correlation in [24]. Chung et al. proposed ABC as a new tool for data mining particularly in classification and compared evolutionary techniques, standard algorithms such as naive Bayes, classification tree and nearest neighbor (k-NN) [25]. Works showed that GP and ABC are successful in classification area. In this chapter is the first work to compare GP and recently proposed ABCP method in feature selected classification.

GP and ABCP
This section explicitly details GP and ABCP automatic programming methods.

GP
GP, most well-known method, was developed by Koza [26]. GP has been applied to solve numerous interesting problems [27][28][29]. The basic steps for the GP algorithm are similar to the steps of genetic algorithm (GA) and use the same analogy as GA. The most important difference GP and GA is representation of individuals. While GA express individuals as fixed code sequences, GP express them as parse trees. Flow chart of GP is given in Figure 1 [30].
The first step in the flow chart is the creation of the initial population. Each individual in the population is represented by a tree where each component is called node. The production of tree nodes is provided by terminals (constants or variables such as x, y, 5) and functions (arithmetic operators such as +, À/, sin, cos). Individuals are produced by the full method, the grow method, or the ramped half and half method [31]. Individuals are evaluated predetermined objective function. GP aims to increase the number of individuals with high quality survival and to decrease the number of low quality individuals. Individuals with high quality are more likely to pass on to the next generation. Individuals are developed them with exchange operators such as reproduction, crossover and mutation. Choosing the best individuals according to fitness are applied with methods like tournament, roulette wheel [32]. The crossover operator allows hybrid of two selected individuals to produce a new individual. Generally, sub-trees taken from two crossing points selected from parent trees are crossed to obtain new hybrid trees. The mutation operator provides unprecedented and unexplored individual elements [33]. Substitution of randomly selected tree instead of randomly selected node in the tree is called subtree mutation. Another method of mutation is a single point mutation. In this method, if the terminal is selected randomly from the tree, it is changed with the value selected from the terminal set. If the function is selected randomly from the tree, the value is selected from the function set. The best individuals of the previous generation are transferred to the current generation with elitism operator. The program is terminated when it is reached according to predefined stopping criteria such as the specific fitness value of the individuals, the number of generations.

ABCP
ABC algorithm was developed by Karaboga, modeling the food source search the intelligent foraging behavior of a honey bee swarm [34]. ABCP that was inspired ABC was introduced first time as a new method on symbolic regression [35]. In ABC, the positions of the food sources, i.e., solutions, are carried out with fixed size arrays and displays the values found by the algorithm for the predetermined variables as in GA. In the ABCP method, the positions of food sources are expressed in tree structure that is composed of different combinations of terminals and functions that are specifically defined for problems. The mathematical relationship of the solution model in ABCP can be represented the individuals in Figure 2 is described Eq. (1). In these notations, x is used to represent the independent, and f(x) is dependent variable.
In the ABCP model, the position of a food source is defined as a possible solution and nectar of the food source is defined for the quality of the solution. There are three different types of bees, as in the ABC: employed bee, onlooker bee and scout bee in the ABCP algorithm. Employed bees are responsible for bringing the hive of nectar from specific sources that have been previously discovered and they share information about the quality of the source with the onlooker bees. Every food source is visited by one employed bee who then takes nectar to hive. The onlooker bees monitor the employed bees in hives and turn to a new source using the information shared by the employed bees. After employed and onlooker bees complete the search processes, source are checked whether source nectars are exhausted. If a source is abandoned, the employed bee using the source becomes the scout bee and randomly searches for new sources. The main steps of ABCP algorithm is given in the flow chart of ABCP algorithm in Figure 3.
In ABCP, the production of solutions and the determination of the quality of solutions are carried out in a similar way to GP. In the initialization of the algorithm, solutions are produced by the full method, the grow method, or the ramped half and half method [26]. The quality of solutions is found by analyzing each tree according to fitness measurement procedure.
In employed bee phase, candidate solution is created using information sharing mechanism which is the most fundamental difference between ABC and ABCP [36]. In this mechanism, when a candidate solution (v i ) is generated, the neighbor node solution x k , taken from the tree, is randomly selected considering the predetermined probability p ip . The node selected from the neighbor solution x k determines what information will be shared with the current solution and how much it will be shared. Then node x i , which represents the current solution in the tree that determines how to use the neighboring node, is randomly selected in the probability distribution of p ip . The candidate solution v i is produced by replacing the nodes of the current solution node x i and the neighbor solution node x k . This sharing mechanism is shown in Figure 4. Figure 4a and b are: node x i representing the current solution and neighbor node x k taken from the tree respectively, Figure 4c neighboring information and the generated candidate solution are given in Figure 4d. After the candidate solution is generated, a greedy selection process is applied between the node x i expressing the current solution and the candidate solution v i . Candidate solution is evaluated and greedy selection is used for each employed bee.
In onlooker bee phase, employed bees come into hive and share their nectar with the onlooker bees after they complete the research process. The source selection is based on the selection probability of the solution that is based on the nectar qualities, p i is calculated Eq. (2): where fit i quality of the solution i, fit best quality of the best solution current solutions [35]. When the solutions are selected, the onlooker bees begin to look for new sources by acting like employed bees. The quality of the newly found solution is checked. If a new solution is more qualified, the solution is taken into memory and the current source is deleted from the memory.
After the employed bees and onlooker bees complete the search in each cycle, the penalty points of the respective sources are incremented by one if they cannot find more qualify sources. When a better source is found, the penalty point of that source is reset. If the penalty point exceeds the 'limit' parameter, the employed bee of that source becomes a scout bee and randomly determines new source instead of an abandoned source.

Experimental design
This section demonstrate feature selected classification ability of GP and ABCP, set of experiments conducted.

Datasets
In this chapter, the experiments are conducted on four real world datasets. All datasets are taken from UCI [37]. The first of data set is Wisconsin diagnostic breast cancer (WDBC). The dataset classifies a tumor as either benign or malignant is the diagnosis of breast cancer. It consists of 30 input parameters that determine whether the tumor of 569 patients is benign or malignant. When the data set is examined, it is observed that $60% of the benign and remainder of the tumors is malignant. The malignant tumor in the data set is defined as 1 and benign tumor is 0. The entry set contains 10 parameters for the suspected community. These input parameters are given as radius, texture, circumference, area, fluency, density, concavity, concavity points, symmetry and fractal. Dataset has an average, standard error, and worst error value for each record. Thus, there are totally 30 input parameters.
It has been used in much recent work on cancer classification of machine learning algorithms [38][39][40]. Bagui et al. tried to classify two large breast cancer data sets with many machine learning methods such as linear, quadratic, k-NN [39]. In the paper, 9 variable WBC (Wisconsin breast cancer) and 30 variable WDBC (Wisconsin diagnostics breast cancer) data sets were reduced to 6 and 7 variables, respectively. WDBC is classified J48 decision trees, multi-layer perception (MLP), naive Bayes (NB), sequential minimal optimization (SMO), distance based K nearest neighbor (IBK, instance based for K-nearest neighbor) in [40]. Kathija et al. used support vector machines (SVM) and Naive Bayes to classify WDBC in the paper [40].
The second dataset is the dermatology data set, contains 34 features, 33 of which are linear values and one of which is nominal. The differential diagnosis of erythematosquamous disease is a real problem in dermatology. Diagnosis usually requires a biopsy, but unfortunately, these diseases share many histopathological features. Patients were initially evaluated clinically in the data set. Then, skin samples were taken for evaluation of 22 histopathological features. The values of the histopathological features were determined by analysis of the samples under a microscope. There are multiple researches to diagnose dermatological diseases [41][42][43][44][45][46]. Rambhajani et al. used the Bayesian technique as a feature selection in the paper [42]. When several measures such as accuracy, sensitivity, and specificity are evaluated high successful results obtained in the model classification of 15 features for the dermatology data set with 34 features. Pappa et al. proposed a multi object GA called C4.5 that performed on six different data sets including the dermatology dataset for feature selection [46].
The other dataset is Wine which is the results of chemical analyzes of wines from three different varieties of the same region of Italy. The analysis is based on the amounts of 13 features present in each of the three wine varieties. Zhong et al. proposed a modified approach to the nonsmooth Newton method and compared with support vector algorithm called standard v-KSVCR method in wine dataset [47]. A proposed block based affine matrix for spectral clustering methods was compared with 10 different datasets including wine dataset standard classification methods in [48].
The last dataset Horse colic which reveals the presence or absence of colic disease depending on various pathological values of horse colic. Nock et al. used the symmetric nearest neighbor (SRN), which calculates the scores of the closest neighbor's relations in [49]. This chapter aims to be able to diagnose that the tumor is benign or malignant in WDBC, to identify six different dermatologic diseases in Dermatology, to recognize three varieties of wines in Wine and to presence of colic disease was investigated in Horse Colic.

Training sets and test sets
In this chapter, each dataset is split into a training set and test set to investigate feature selected classification performance of the evolved models. The number of features, training instances and test instances of the four datasets are shown in Table 1. All datasets are almost split with 70% of instances randomly selected from the datasets for training and the other 30% instances forms test set. In each run, the training and test instances are reconstructed by selecting from random instances of datasets.

Settings
Similar parameter values and functions are used for comparison with GP and ABCP. Since the real input features of the data sets were used, the results obtained from the solutions are theoretically in the range of [À∞, ∞]. Result values to be able to define discrete class values (such as class 0, class 1), it is necessary to be first drawn to a range defined earlier and be contained the total number of classes. The fitness function is defined in Eq. (3).
where N c is the number of output classes, g o is the result of the current solution. In this chapter, the fitness function is the weighted sum of the ratios of the total class numbers in the data set of correctly predicted class numbers. For example, in the binary classification, the fitness function is obtained by summing up ratio of correct predicted 0 to total number of 0 in the data set with ratio of correct predicted 1 to total number of 1 in the data set.
For binary classification problems, this function is defined as SFF (sensitivity fitness function) given in Eq. (4) [50]. where n c (i,k) is the number of correctly predicted states when compared to the k class in data set from the class k for the ith solution, n a (i,k) the number of all records in class k in the data set is the number of inputs defined in the range [0, 1] refers to a real number. The generalized version of Eq. (4) is given in Eq. (5) for multiple class problems investigated.
In general, the weight value (w) is used equally. In this case, the proportion of the ratio distribution for each class is adjusted equally. In some cases, a penalty parameter can be added to avoid misclassification in unbalanced data sets. The parameter is added to the fitness function that defined in Eq. (5) as expressed in Eq. (6). It evaluates the models obtained from the solutions. Where p is the penalty factor and N is the total number of nodes in the solution.
The data sets are evaluated according to the SFF function defined in Eq. (6). The complexity of the obtained solution is calculated as in Eq. (7) in proportion to the depth of the tree and the number of nodes.
where C is tree complexity, d is the depth of the solution tree, and n is the number of nodes at depth.
The control parameters used by the automatic programming methods are given in Table 2. The population size and the iteration size are set by the number of features and the number of classes of the data set. Dermatology has more features and classes than other datasets, therefore population size and iteration number are chosen as the highest for this dataset. As seen from Table 2, the weight value is defined in proportion to the number of classes in the output of each data set. Each class is equal importance. The penalty point given in Eq. (6) was set equal to 0.001 for all data sets. The maxx function specifies the maximum value of vector, the minx function specifies the minimum value of vector. The ifbte checks the value of left operand, if it is greater than or equal to the value of right operand, then condition becomes true. The iflte checks the value of left operand, if it is less than or equal to the value of right operand, then condition becomes true. How the functions operate condition expressions are defined in Eqs. (8) and (9).

Simulation results
For each data set, GP and ABCP are run 30 times according to configuration in Table 2. The classification success of GP and ABCP methods are given in Table 3 in terms of mean, best and worst values for each dataset. SFF and success percentage (SP) results are given in Table 3 for both training and test cases. As the SFF increased, the success rate of classification increased. The highest mean classification in training (93.43%) was obtained ABCP in Wine. Both methods showed lower SFF and classification success compared to other data sets in Horse colic.  (14), x 8 (13), x 22 (12), x 7 (11), x 14 (9), x 10 (9), x 2 (7) x 23 (15), x 1 (15),  Table 6. Number of features selected by the methods.

Analysis of evolved models
The evolved models of best classifier solutions in ABCP are shown in Table 4. It can be observed that both methods extracted successful models with few features. The methods extracted models regardless of the total number of features of the data sets. In general, ABCP has achieved higher success rate of classification than GP using less features. Table 5 shows general information about the best solution tree. Less complex models are shown in the table with bold typing. When the trees of the best models are analyzed structurally, ABCP, except for the dermatology, shows the best models with less complexity. The detailed information about the inputs of mathematical models of the best solutions in each run are presented in Table 6. Features are ordered most common in equations on the table. Equations which are most common, three features (x 7 , x 8 , x 28 ) are same in WDBC; seven features (x 7 , x 14 , x 15 , x 22 , x 27 , x 31 , x 33 ) are same in dermatology; four features (x 7 , x 10 , x 11 , x 12 ) are same in wine; eight features (x 1 , x 8 , x 10 , x 19 , x 21 , x 22 , x 23 , x 26 ) are same in horse colic in both methods. In the best models of the 30 runs, the frequently available features in both of the methods were evaluated as inputs for success of classification. For example, in total 30 runs for WDBC x 28 15 times; for dermatology x 31 were seen.

Conclusion
In this chapter, selecting features in classification problems are investigated using GP and ABCP and the literature study related to this field is included. In the performance analysis of the methods, four classification problems are used. As results of 30 runs, the features of the best models were examined. Both methods were found to extract successful models with the same features. According to the experimental results, ABCP is able to extract successful models in training set and it has comparable accuracy to GP. This chapter shows that ABCP can be used in high level automatic programming for machine learning. Several interesting automatic programming methods such as Multi-Gen GP and Multi-Hive ABCP can be further researched in the near future.