Survey on Data Mining Techniques for Disease Prediction

Durga Kinge1, S. K. Gaikwad2

1Durga Kinge, Dept. of Computer Engineering, College of Engineering Pune, Maharashtra, India
2Prof. S. K. Gaikwad, Dept. of Computer Engineering, College of Engineering Pune, Maharashtra, India
Abstract – Healthcare generates huge volumes of data every day in diverse forms such as text, images and numbers, yet few tools are available in healthcare to process this data. Data mining frameworks are used to extract knowledge from it, which medical professionals can use to plan future procedures. Heart disease is the leading cause of death in the population, so early detection and risk prediction are essential for patient treatment and for doctors' diagnoses. Data mining algorithms such as decision trees (J48), Bayesian classifiers, multilayer perceptrons, simple logistic and ensemble techniques are used to diagnose heart ailments. In this work, different data mining classification techniques are analyzed to test their accuracy and performance on a training medical data set. The classification results are visualized using techniques such as 2D diagrams and pie charts. The algorithms mentioned above are compared and evaluated on the basis of their accuracy, time consumption, area under the ROC curve and other measures.

Key Words: Data mining, decision tree, ensemble techniques, multilayer perceptron, Bayesian classifiers, simple logistic.

1. INTRODUCTION

Heart disease is one of the major causes of death and disability in the world, killing 17.5 million people every year, with more than twenty-three million deaths from cardiovascular disease anticipated by 2030. Heart disease covers various conditions that can affect the heart's function. The heart is a vital organ of the human body: if blood circulation to the body is inadequate, organs such as the brain and the heart stop working, and death occurs within minutes.
The associated risk factors are identified as age, family history, diabetes, hypertension, elevated cholesterol, tobacco smoking, alcohol intake, obesity, physical inactivity, chest pain type and poor diet [1]. The medical industry is data rich yet knowledge poor, and there is a need for an intelligent decision support system for disease prediction. Data mining strategies such as classification and regression are used to predict disease. With the computing facilities provided by advances in computer science, it is now possible to predict many conditions more accurately [15].

Data mining is the cognitive process of discovering hidden patterns in large data sets. It is widely used in applications such as financial data analysis, retail, the telecommunication industry, genome data analysis, scientific applications and healthcare systems. Data mining holds great potential to improve health systems by using data and analytics to identify the best practices that improve care and reduce cost. WEKA is an effective tool because it contains both supervised and unsupervised learning techniques [14]; we use WEKA because it lets us evaluate and compare data mining techniques (classification, clustering, regression, etc.) conveniently on real data. The objective of this work is to analyze the potential of classification-based data mining techniques such as naive Bayes, decision trees (J48), ensemble algorithms and simple logistic.

3. THEORETICAL BACKGROUND

3.1 Data mining techniques used for heart disease prediction

Data mining provides two forms of analysis algorithm: classification and prediction.

3.1.1 Classification

Classification is a supervised technique which assigns the items in a collection to target categories or classes.
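As a minimal, illustrative sketch of this supervised workflow, a decision tree (similar in spirit to WEKA's J48, though this sketch uses scikit-learn and synthetic data as assumed stand-ins for the paper's toolchain and dataset) is trained on labelled records and then predicts labels for withheld records:

```python
# Hypothetical classification sketch: fit a decision tree on labelled
# records (training phase), then predict labels for records whose class
# assignments are withheld (testing phase).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for a medical dataset: 300 records, 13 attributes
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)              # learn decision rules from labelled data
accuracy = clf.score(X_test, y_test)   # fraction of unseen records classified correctly
```

The held-out accuracy plays the same role as the cross-validated accuracy reported later for the WEKA classifiers.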
Classification may be binary or multi-class. The classification task takes as input the feature vector X and predicts a value for the outcome Y, i.e.

C(X) = Y

where X is a feature vector, Y is a response taking values in the set C, and C(X) is the predicted value in the set C. Classification is one of the few strategies able to analyze substantial datasets effectively. A classification task begins with records whose class labels are known. In the training phase, the algorithm discovers relationships between the values of the predictors and the values of the target; different classification algorithms use different techniques for discovering these relationships. The relationships are summarized in a model, which can then be applied to another data set in which the class assignments are unknown, for testing purposes.

3.1.2 Prediction

Regression is used to predict numeric or continuous values for a given dataset. Regression estimates the value of a continuous target (p) as a function (F) of one or more predictors (x1, x2, …, xn), a set of parameters (R1, R2, …, Rn), and a measure of error (e):

p = F(x1, x2, …, xn; R1, R2, …, Rn) + e

Regression helps identify how one variable behaves when other variables in the process are changed.

3.1.3 Clustering

Clustering is an unsupervised learning technique in which a set of unlabeled instances is grouped according to their characteristics. Representing the records by fewer clusters loses certain fine details but achieves simplification. Cluster analysis aims to find groups such that the inter-cluster similarity is low and the intra-cluster similarity is high.
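As an illustrative sketch of this idea, k-means (a partitioning method; scikit-learn and the toy data are assumptions of this sketch, not part of the survey's setup) groups unlabeled points purely by feature similarity:

```python
# Hypothetical clustering sketch: k-means groups unlabeled instances so
# that intra-cluster similarity is high and inter-cluster similarity is low.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two well-separated blobs of unlabeled 2-D instances
points = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                    rng.normal(5.0, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_   # one cluster id per instance; no class labels were used
```

Because the blobs are well separated, each blob ends up in its own cluster, which is exactly the low inter-cluster / high intra-cluster similarity the text describes.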
There are several distinct clustering methodologies: partitioning, hierarchical, density-based, grid-based and constraint-based methods.

3.1.4 Ensemble learning

Ensemble learning is also called committee-based learning, multiple classifier systems or classifier combination. The idea of deploying multiple models has been used for a long time: multiple classifiers are combined to solve the same problem by constructing a set of hypotheses and joining them. The training data alone might not provide adequate information for choosing a single best learner, and an ensemble helps to reduce noise, bias and variance. For a good ensemble, the base learners should be both accurate and diverse. Accuracy can be assessed by cross-validation, and diversity can be achieved by sub-sampling the training examples, manipulating attributes and outputs, and injecting randomness. The strategies used for ensemble learning are boosting, bagging and stacking; these are also called meta-algorithms.

Bagging: Bagging (bootstrap aggregation) trains similar learners on small sample populations and then takes the mean of all their predictions:

f(x) = (1/M) Σ_{m=1}^{M} f_m(x)

It is a parallel technique used to decrease variance: it trains M distinct trees on subsets of the data picked at random with replacement (bootstrap sampling) and combines them into the ensemble.

Boosting: Boosting is an ensemble technique in which the predictors are not built independently but sequentially, in order to reduce bias. Each subsequent predictor learns from the mistakes of the previous predictors, and the final prediction is a weighted average. In this way the observations have an unequal likelihood of appearing in subsequent models, and those with the highest error appear most often.
The boosting ensemble takes a weighted combination of the base learners:

F(x) = Σ_{m=1}^{M} α_m f_m(x)

where α_m is the weight assigned to base learner f_m.

Stacking: In stacking, the base-level models are trained on the complete training set, and then a meta-model is trained on the outputs of the base-level models to improve predictive power.

3.2 Algorithms used for disease prediction

3.2.1 Decision tree algorithm (J48)

J48 is a supervised learning algorithm used to predict the class or value of a target variable using decision rules. Each internal node of the tree corresponds to an attribute, and each leaf node corresponds to a class label. A record's attribute values are compared against the internal nodes of the tree until a leaf node is reached with the predicted class value. Decision trees follow a sum-of-products (SOP) representation for all the classes, and they can handle both categorical and numerical data. Attribute selection is based on information gain and the Gini index. Further features of the J48 algorithm are that it supports tree pruning, can handle missing values, and gives efficient output for prediction analysis in WEKA [11].

3.2.2 Naive Bayes

The naive Bayes classifier depends on Bayes' theorem and is especially suited to problems where the dimensionality of the input is high. Bayes' theorem works on conditional probability: the probability that an event will happen, given that another event has already happened.

P(H|E) = P(E|H) · P(H) / P(E)

where P(H) is the probability of hypothesis H being true (the prior probability), P(E) is the probability of the evidence (regardless of the hypothesis), P(E|H) is the probability of the evidence given that the hypothesis is true, and P(H|E) is the probability of the hypothesis given the evidence.

3.2.3 Support Vector Machine (SVM)

A support vector machine (SVM) is a supervised learning classifier characterized by a separating hyperplane.
The hyperplane divides the plane into two parts, with each class lying on one side. There are two sorts of SVM classifier: the linear SVM classifier and the non-linear SVM classifier. SVMs are powerful when the number of features is very large. Since the SVM algorithm works only on numeric attributes, it applies a z-score standardization to numeric attributes.

3.2.4 Random forest

Random forest constructs a randomized decision tree in each iteration of the algorithm and frequently creates excellent predictors. Every tree in the ensemble is built from a sample of the training set, and each tree gives a classification, i.e. the tree votes for that class. Accuracy tends to increase with the number of trees in the forest. The random forest algorithm can be used for both classification and regression problems: for classification, the ensemble of trees votes for the most popular class, while for regression the trees' responses are averaged to obtain an estimate of the dependent variable:

ŷ(x) = (1/B) Σ_{b=1}^{B} T_b(x)

where T_b is the b-th tree and B is the number of trees. Random forests reduce the high variance of a single decision tree by averaging the outcomes.

3.2.5 AdaBoost

The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to deliver the final prediction.

3.2.6 Simple linear regression model

Simple linear regression is a statistical method for summarizing and studying the relationship between two continuous variables. It is a model that assumes a linear relationship between the input variables (x) and a single output variable (y), so that y can be calculated from a linear combination of the input variables.
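A minimal sketch of simple linear regression, assuming scikit-learn and synthetic data (the slope 3 and intercept 2 are arbitrary illustrative values, not from the heart dataset): fitting recovers the coefficients of the linear combination described above.

```python
# Hypothetical simple-linear-regression sketch: y is generated as a
# linear combination of x plus error, and the fit recovers b0 and b1.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=(100, 1))             # single predictor
y = 3.0 * x[:, 0] + 2.0 + rng.normal(0.0, 0.1, 100)   # y = b1*x + b0 + e

model = LinearRegression().fit(x, y)
slope, intercept = model.coef_[0], model.intercept_   # estimates of b1, b0
```

With noise this small, the estimated slope and intercept land close to the generating values.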
3.2.7 Multilayer perceptron (MLP)

A multilayer perceptron uses multiple layers of a neural network; the model is built from a set of parameters selected to adjust it according to the correlation between the parameters and the prediction of the disease [16]. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next. Apart from the input nodes, every node is a neuron (processing element) with a nonlinear activation function. An MLP uses a supervised learning technique called back-propagation to train the network.

3.2.8 Logistic regression

Logistic regression is a supervised technique that measures the relationship between the dependent and independent variables by estimating probabilities using a logistic function. It predicts the likelihood of an outcome that can only take two values (i.e. a dichotomy); categorical variables are used in this model.

3.3 Performance metrics

3.3.1 Accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / P = TP / (TP + FN)
Specificity = TN / N = TN / (TN + FP)
Precision = TP / (TP + FP)
F score = 2 · (Recall · Precision) / (Recall + Precision)

3.3.2 Mean absolute error: a measure of the difference between two continuous variables.

MAE = (1/n) Σ |y_i − ŷ_i|

3.3.3 Root mean squared error: assumes that the errors are unbiased and follow a normal distribution.

RMSE = sqrt((1/n) Σ (y_i − ŷ_i)²)

3.3.4 ROC (receiver operating characteristic): a graphical representation of classifier performance.

3.3.5 Kappa statistic: the kappa statistic is frequently used to test inter-rater reliability, i.e. to compare the accuracy of the system with that of a random system.

K = (P(A) − P(E)) / (1 − P(E))

where P(A) is the observed agreement percentage and P(E) is the agreement expected by chance. K = 1 indicates complete agreement.

4 DATA ANALYSIS AND RESULTS

The dataset used is the UCI heart-disease dataset, which has 303 instances in total. It comprises 75 attributes, of which 14 are used. The attributes are of real, binary, nominal and ordered types, and are described below.

Table 1: Heart disease dataset

 1. age: real
 2. sex: binary (1 = male, 0 = female)
 3. chest pain type: nominal (4 values)
 4. resting blood pressure: real
 5. serum cholesterol: real
 6. fasting blood sugar: binary
 7. resting ECG: nominal (0, 1, 2)
 8. max heart rate achieved: real
 9. exercise induced angina: binary (1 = yes, 0 = no)
10. oldpeak: real
11. slope of peak exercise ST segment: ordered
12. number of major vessels: real
13. thal: nominal (3, 6, 7)
14. class: present / absent

4.1 Interpretation and evaluation

All tests are based on ten-fold cross-validation. This section compares the classification accuracy of seven supervised algorithms, namely naive Bayes, J48, random forest, AdaBoost, bagging, MLP and simple logistic. All simulations were performed in the WEKA machine learning environment, which consists of a collection of popular machine learning techniques that can be used for practical data mining; the results appear in Table 2. Classification metrics such as TP (true positive), FP (false positive), precision, recall, F-measure and ROC area are used to assess classifier performance.

Table 2: WEKA evaluation criteria

              Precision  Recall  F     Kappa  ROC area  Time (s)
J48           0.80       0.82    0.81  0.55   0.79      0.01
Naive Bayes   0.84       0.87    0.86  0.67   0.90      0.04
RF            0.84       0.87    0.85  0.66   0.90      0.26
AdaBoost      0.84       0.84    0.83  0.63   0.89      0.06
Bagging       0.83       0.85    0.84  0.63   0.88      0.08
MLP           0.82       0.81    0.81  0.58   0.86      0.59
S. logistic   0.84       0.87    0.85  0.66   0.90      0.32

Chart 1 shows the accuracy levels for all classifiers: naive Bayes, random forest and simple logistic achieve the best accuracy, whereas J48 shows the poorest accuracy levels.

Chart 1: Accuracy of algorithms

5 CONCLUSION

Heart disease is by its nature a fatal disease.
It causes life-threatening complications, for example heart attack and death. The significance of data mining in the medical domain is acknowledged, and steps are being taken to apply relevant techniques to disease prediction. Various research works by different authors using effective techniques were studied. This work evaluates disease classification using diverse machine learning algorithms in the WEKA tool. There are fundamentally two motivations for building an ensemble of classifiers: (i) reduced variance, since the results are less subject to the peculiarities of a single training set, and (ii) reduced bias, since a combination of multiple classifiers may learn a more expressive concept class than a single classifier.

6 REFERENCES

[1] Yanwei X., Wang J., Zhao Z., Gao Y., "Combination data mining models with new medical data to predict outcome of coronary heart disease", Proceedings of the International Conference on Convergence Information Technology, 2007, pp. 868–872.
[2] Theresa Princy R., J. Thomas, "Human heart disease prediction system using data mining techniques", International Conference on Circuit, Power and Computing Technologies (ICCPCT), 2016.
[3] Weimin Xue, Yanan Sun, Yuchang Lu, "Research and application of data mining in traditional Chinese medical clinic diagnosis", Proc. of the IEEE 8th International Conference on Signal Processing, Vol. 4, ISBN 0-7803-9736-3, 2006.
[4] Ranjit Abraham, Jay B. Simha, Iyengar, "A comparative analysis of discretization methods for medical data mining with naive Bayesian classifier", Proc. of the IEEE International Conference on Information Technology, pp. 235–236, 2006, ISBN 0-7695-2635-7.
[5] S. Palaniappan and R. Awang, "Intelligent heart disease prediction system using data mining techniques", Proc. of the IEEE/ACS Int. Conf., Doha, 2008, pp. 108–115.
[6] Jagdeep Singh, Amit Kamra, Harbhag Singh, "Prediction of heart diseases using associative classification", IEEE, 2016, 978-1-5090-0893-3/16.
[7] Seyedamin Pouriyeh, Sara Vahid, Giovanna Sannino, Giuseppe De Pietro, Hamid Arabnia, "A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease", 22nd IEEE Symposium on Computers and Communication (ISCC 2017): Workshops – ICTS4eHealth, 2017.
[8] M. Gudadhe, K. Wankhade, and S. Dongre, "Decision support system for heart disease based on support vector machine and artificial neural network", International Conference on Computer and Communication Technology (ICCCT), 2010, pp. 741–745.
[9] Munaza Ramzan, "Comparing and evaluating the performance of WEKA classifiers on critical diseases", IEEE, 2016, 978-1-4673-6984-8/16.
[10] M. A. Jabbar, Shirina Samreen, "Heart disease prediction system based on hidden naive Bayes classifier", ICECIT, Elsevier, Vol. 1, 2012, pp. 183–192.
[11] Ajinkya Kunjir, Harshal Sawant, Nuzhat Sheikh, "Data mining and visualization for prediction of multiple diseases in healthcare", IEEE, 2007.
[12] Garima Singh, Kiran Bagwe, Shivani Shanbhag, Shraddha Singh, Sulochana Devi, "Heart disease prediction using naive Bayes", International Research Journal of Engineering and Technology (IRJET), e-ISSN 2395-0056, Vol. 04, Issue 03, March 2017.
[13] Prof. Mamta Sharma, Farheen Khan, Vishnupriya Ravichandran, "Comparing data mining techniques used for heart disease prediction", International Research Journal of Engineering and Technology (IRJET), e-ISSN 2395-0056, Vol. 04, Issue 06, June 2017.
[14] Munaza Ramzan, "Comparing and evaluating the performance of WEKA classifiers on critical diseases", IEEE, 2016, 978-1-4673-6984-8/16.
[15] C. M. Velu and K. R. Kashwan, "Visual data mining techniques for classification of diabetic patients", IEEE, DOI: 10.1109/IAdCC.2013.6514375.
[16] Aditya A. Shinde, Rahul M. Samant, Atharva S. Naik, Shubham A. Ghorpade, Sharad N. Kale, "Heart disease prediction system using multilayered feed forward neural network and back propagation neural network", International Journal of Computer Applications (0975-8887), Vol. 166, No. 7, May 2017.