Center for Machine Learning and Intelligent Systems

Breast Cancer Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with

Kaizhu Huang and Haiqin Yang and Irwin King and Michael R. Lyu and Laiwan Chan. Biased Minimax Probability Machine for Medical Diagnosis. AMAI. 2004.

Then we apply it to two real-world medical diagnosis datasets, the breast cancer dataset and the heart disease dataset. 4.1. A Synthetic Dataset. A two-variable synthetic dataset is generated by the two-dimensional gamma distribution. Two classes of data are

Gavin Brown. Diversity in Neural Network Ensembles. The University of Birmingham. 2004.

critical to consider values for the strength parameter outside the originally specified range. Table 5.3 shows the classification error rates of two empirical tests, on the Wisconsin breast cancer dataset from the UCI repository (699 patterns), and the Heart disease dataset from Statlog (270 patterns). An ensemble consisting of two networks, each with five hidden nodes, was trained using NC. We use

Kristin P. Bennett and Ayhan Demiriz and Richard Maclin. Exploiting unlabeled data in ensemble methods. KDD. 2002.

experiments we used simple multilayer perceptrons with a single layer of hidden units. The networks were trained using backpropagation with a learning rate of 0.15 and a momentum value of 0.90. The datasets for the experiments are breast cancer wisconsin, pima-indians diabetes, and letter-recognition drawn from the UCI Machine Learning repository [3]. The number of units in the hidden layer for the

András Antos and Balázs Kégl and Tamás Linder and Gábor Lugosi. Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research, 3. 2002.

attributes were binary coded in a 1-out-of-n fashion. Data points with missing attributes were removed. Each attribute was normalized to have zero mean and $1/\sqrt{d}$ standard deviation. The four data sets were the Wisconsin breast cancer (n = 683, d = 9), the ionosphere (n = 351, d = 34), the Japanese credit screening (n = 653, d = 42), and the tic-tac-toe endgame (n = 958, d = 27) database.
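
The preprocessing steps named in this excerpt can be sketched roughly as follows. This is only an illustration, not the authors' code: the helper names, the use of `None` as the missing-value marker, and the choice of the population standard deviation are all assumptions.

```python
import math

def one_hot(value, categories):
    """Code a nominal value as a 1-out-of-n indicator vector."""
    return [1.0 if value == c else 0.0 for c in categories]

def drop_missing(rows, missing=None):
    """Keep only the rows in which no attribute equals the missing marker."""
    return [r for r in rows if all(v != missing for v in r)]

def normalize(rows):
    """Rescale every column to zero mean and standard deviation 1/sqrt(d)."""
    n, d = len(rows), len(rows[0])
    target = 1.0 / math.sqrt(d)
    cols = []
    for j in range(d):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        # Population standard deviation (divide by n): an assumption.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        scale = target / std if std > 0 else 0.0
        cols.append([(v - mean) * scale for v in col])
    return [list(t) for t in zip(*cols)]

# Hypothetical toy data with one incomplete row.
rows = [[1.0, 10.0], [3.0, None], [5.0, 30.0], [3.0, 20.0]]
clean = drop_missing(rows)
scaled = normalize(clean)
```

With d columns each scaled to standard deviation 1/sqrt(d), the expected squared norm of a centered data point is 1, which is presumably why this scaling was chosen.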

Michael G. Madden. Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm. CoRR, cs.LG/0211003. 2002.

and all four are equally good on the Breast Cancer dataset.

          Naïve         TAN           K2            MBBC
Chess     87.63 ± 1.61  91.68 ± 1.09  94.03 ± 0.87  97.03 ± 0.54
WBCD      97.81 ± 0.51  97.47 ± 0.68  97.17 ± 1.05  97.30 ± 1.01
LED-24    73.28 ± 0.70  73.18 ± 0.63  73.14 ± 0.73  73.14 ± 0.73
DNA       94.80 ± 0.44

Hussein A. Abbass. An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine, 25. 2002.

well, compared to the previous studies. In another study, Setiono [26] used his rule extraction from ANNs algorithm [28, 29] to extract useful rules that can predict breast cancer from the Wisconsin dataset. He needed first to train an ANN using BP and achieved an accuracy level on the test data of approximately 94%. After applying his rule extraction technique, the accuracy of the extracted rule set

Fei Sha and Lawrence K. Saul and Daniel D. Lee. Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines. NIPS. 2002.

Kernel         Polynomial      Radial
Data           k=4    k=6      σ=0.3  σ=1.0  σ=3.0
Sonar          9.6%   9.6%     7.6%   6.7%   10.6%
breast cancer  5.1%   3.6%     4.4%   4.4%   4.4%

Table 1: Misclassification error rates on the sonar and breast cancer data sets after 512 iterations of the multiplicative updates. 3.1 Multiplicative updates. The loss function in eq. (6) is a special case of eq. (1) with $A_{ij} = y_i y_j K(x_i, x_j)$ and $b_i = -1$. Thus, the
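
To make the substitution in this excerpt concrete, the sketch below builds the matrix A and vector b from labels and a kernel, then evaluates a quadratic objective. It rests on two assumptions not confirmed by the excerpt: that eq. (1) has the form F(v) = 0.5 v^T A v + b^T v, and that the radial kernel is exp(-||x - z||^2 / (2 sigma^2)). All function names are hypothetical, and the multiplicative update itself is not shown.

```python
import math

def rbf_kernel(x, z, sigma):
    """Radial basis kernel; this exact parameterisation is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def build_qp(X, y, kernel):
    """Return A with A[i][j] = y_i * y_j * K(x_i, x_j) and b with b_i = -1."""
    n = len(X)
    A = [[y[i] * y[j] * kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    b = [-1.0] * n
    return A, b

def objective(A, b, v):
    """Evaluate F(v) = 0.5 * v^T A v + b^T v (assumed form of eq. (1))."""
    n = len(v)
    quad = sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * v[i] for i in range(n))
    return 0.5 * quad + lin

# Hypothetical two-point problem with labels +1 and -1.
X = [[0.0], [1.0]]
y = [1, -1]
A, b = build_qp(X, y, lambda x, z: rbf_kernel(x, z, 1.0))
value = objective(A, b, [1.0, 0.0])
```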

Robert Burbidge and Matthew Trotter and Bernard F. Buxton and Sean B. Holden. STAR - Sparsity through Automated Rejection. IWANN (1). 2001.

available from the UCI Machine Learning Data Repository [11], are as follows. The breast cancer Wisconsin data set has 699 examples in nine dimensions and is 'noise-free'; one feature has 16 missing values, which are replaced with the feature mean. The ionosphere data set has 351 examples in 33 dimensions and is
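
Replacing missing values with the feature mean, as this excerpt describes, might look like the following minimal sketch. The function name and the `None` marker for a missing entry are assumptions made for illustration.

```python
def impute_mean(rows, missing=None):
    """Replace each missing entry with the mean of its column's observed values."""
    d = len(rows[0])
    means = []
    for j in range(d):
        observed = [r[j] for r in rows if r[j] != missing]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] == missing else r[j] for j in range(d)]
            for r in rows]

# Hypothetical toy data: the first feature is missing in the second row.
rows = [[1, 2], [None, 4], [3, 6]]
filled = impute_mean(rows)
```

Unlike deleting incomplete rows, this keeps all 699 examples, at the cost of slightly shrinking the variance of the imputed feature.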

Bernhard Pfahringer and Geoffrey Holmes and Richard Kirkby. Optimizing the Induction of Alternating Decision Trees. PAKDD. 2001.

UCI Datasets

Dataset        Instances  Missing values (%)  Numeric attributes  Nominal attributes
breast cancer  699        0.2                 9                   0
cleveland      303        0.2                 6                   7
credit         690        0.6                 6                   9
diabetes       768        0.0                 8                   0
hepatitis      155        5.4                 6                   13
hypothyroid    3772       5.4                 7                   22
ionosphere     351        0.0                 34                  0
kr-vs-kp       3196       0.0                 0                   36
labor          57         33.6

Sally A. Goldman and Yan Zhou. Enhancing Supervised Learning with Unlabeled Data. ICML. 2000.

just the initial labeled data (i.e. round 0). Our co-training procedure helped both algorithms to improve their performance. Figure 2 shows the results from one of our runs using the breast cancer data set. In this data set ID3 had the better performance. Again (as we generally see), both hypotheses were improved by the co-training. [Figure 2 axes: number of co-training rounds vs. error]

Justin Bradley and Kristin P. Bennett and Bennett A. Demiriz. Constrained K-Means Clustering. Microsoft Research Dept. of Mathematical Sciences One Microsoft Way Dept. of Decision Sciences and Eng. Sys. 2000.

the Johns Hopkins Ionosphere dataset and the Wisconsin Diagnostic Breast Cancer dataset (WDBC) [7]. The Ionosphere dataset contains 351 data points in $\mathbb{R}^{33}$ and values along each dimension

Kristin P. Bennett and Ayhan Demiriz and John Shawe-Taylor. A Column Generation Algorithm For Boosting. ICML. 2000.

LPBoost has a well defined stopping criterion that is reached in a few iterations. It uses few weak learners. There are only 81 possible stumps on the Breast Cancer dataset (9 attributes having 9 possible values), so clearly AdaBoost may require the same tree to be generated multiple times. LPBoost generates a weak learner only once and can alter the weight on that
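
The count of 81 stumps can be reproduced with a short enumeration. The sketch assumes that each of the nine attributes takes integer values from 1 to 10 (as other excerpts on this page state), so that nine distinct thresholds per attribute separate the values; the function name and the `x[attribute] <= threshold` test are illustrative choices.

```python
def candidate_stumps(n_attributes=9, low=1, high=10):
    """List (attribute, threshold) pairs; a stump tests x[attribute] <= threshold."""
    # Thresholds low .. high-1 each split the value range into two non-empty parts.
    return [(a, t) for a in range(n_attributes) for t in range(low, high)]

stumps = candidate_stumps()
print(len(stumps))  # 9 attributes * 9 thresholds
```

Because the pool is this small, any boosting method that runs for more than 81 rounds on this data must revisit stumps it has already used.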

Lorne Mason and Peter L. Bartlett and Jonathan Baxter. Improved Generalization Through Explicit Optimization of Margins. Machine Learning, 38. 2000.

chosen as the final solution. In some cases the training sets were reduced in size to make overfitting more likely (so that complexity regularization with DOOM could have an effect). In three of the datasets (Credit Application, Wisconsin Breast Cancer and Pima Indians Diabetes), AdaBoost gained no advantage from using more than a single classifier. In these datasets, the number of classifiers was

Petri Kontkanen and Petri Myllymäki and Tomi Silander and Henry Tirri and Peter Grünwald. On predictive distributions and Bayesian networks. Department of Computer Science, Stanford University. 2000.

3 we plot the performance of the methods, averaged over 100 independent test runs performed as described above, as a function of the number of the data vectors used for training in the Breast cancer dataset case. From this picture we see that in the logscore sense, the evidence-based EVU and EVJ approaches perform surprisingly well even in

Matthew Mullin and Rahul Sukthankar. Complete Cross-Validation for Nearest Neighbor Classifiers. ICML. 2000.

and Abalone-3 are two- and three-class versions of the problem, where the adjacent classes were grouped so that data was divided evenly. Abalone-3 was introduced in (Waugh, 1995). In the Breast Cancer dataset, the ID field was omitted, as was a field containing missing values. 7 Since the aim of these experiments was not to improve classification accuracy but rather to compare estimation variance and

Endre Boros and Peter Hammer and Toshihide Ibaraki and Alexander Kogan and Eddy Mayoraz and Ilya B. Muchnik. An Implementation of Logical Analysis of Data. IEEE Trans. Knowl. Data Eng, 12. 2000.

the housing value is above or below the median. Using training sets of 80% of the observations, [16] reports correct prediction rates ranging from 82% to 83.2%. Breast Cancer (Wisconsin). The dataset, compiled by O. Mangasarian and K.P. Bennett, is widely used in the machine learning community for comparing learning algorithms. It is, however, difficult to use it for rigorous comparisons since

Kai Ming Ting and Ian H. Witten. Issues in Stacked Generalization. J. Artif. Intell. Res. (JAIR), 10. 1999.

are given in Table 10, and indicate that the three methods are very competitive. 4 Stacking performs better than both arcing and bagging in three datasets (Waveform, Soybean and Breast Cancer), and is better than arcing but worse than bagging in the Diabetes dataset. Note that stacking performs very poorly on Glass and Ionosphere, two small

Lorne Mason and Jonathan Baxter and Peter L. Bartlett and Marcus Frean. Boosting Algorithms as Gradient Descent. NIPS. 1999.

Figure 2: Margin distributions for AdaBoost and DOOM II with 0% and 15% label noise for the breast cancer and splice data sets. Given that AdaBoost suffers from overfitting and minimizes an exponential cost function of the margins, this cost function certainly does not relate to test error. How does the value of our proposed

Iñaki Inza and Pedro Larrañaga and Basilio Sierra and Ramon Etxeberria and Jose Antonio Lozano and José Manuel Peña. Representing the behaviour of supervised classification learning algorithms by Bayesian networks. Pattern Recognition Letters, 20. 1999.

1,055 cases, a sufficient amount to obtain a 'not-overfitted' Bayesian network. Figure 1 summarizes the explained process. As an example, the induced simplified Bayesian network for the Breast cancer dataset can be seen in Figure 2. 3.4 Concepts for interpreting the joint behaviour. Once the Bayesian networks are induced, our aim is to extract assertions on the joint behaviour of Machine Learning

David W. Opitz and Richard Maclin. Popular Ensemble Methods: An Empirical Study. J. Artif. Intell. Res. (JAIR), 11. 1999.

ensemble. Also shown (results column 3) is the "best" result produced from all of the single network results run using all of the training data.

                 Single            Bagging     Arcing      Boosting
Data Set         Err   SD   Best   Err   SD    Err   SD    Err   SD
breast cancer w  5.0   0.7  4.0    3.7   0.5   3.5   0.6   3.5   0.3
credit-a         14.9  0.8  14.2   13.4  0.5   14.0  0.9   13.7  0.5
credit-g         29.6  1.0  28.7   25.2  0.7   25.9  1.0   26.7  0.4
diabetes         27.8

Huan Liu and Hiroshi Motoda and Manoranjan Dash. A Monotonic Measure for Optimal Feature Selection. ECML. 1998.

with unknown relevant attributes, consists of WBC - the Wisconsin Breast Cancer data set, LED-7 - data with 7 Boolean attributes and 10 classes, the set of decimal digits (0..9), Letter - the letter image recognition data, LYM - the lymphography data, and Vote - the U.S. House of

Rudy Setiono and Huan Liu. NeuroLinear: From neural networks to oblique decision rules. Neurocomputing, 17. 1997.

A. Detailed analysis 1: The University of Wisconsin Breast Cancer Dataset. This data set has been used as the test data for several studies on pattern classification methods using linear programming techniques [1, 13] and statistical techniques [23]. Each pattern is

Pedro Domingos. Context-Sensitive Feature Selection for Lazy Learners. Artif. Intell. Rev, 11. 1997.

used in the empirical study, in particular M. Zwitter and M. Soklic of the University Medical Centre, Ljubljana, for supplying the lymphography, breast cancer and primary tumor datasets, and Robert Detrano, of the V.A. Medical Center, Long Beach and Cleveland Clinic Foundation, for supplying the heart disease dataset. Please see the documentation in the UCI Repository for detailed

Kristin P. Bennett and Erin J. Bredensteiner. A Parametric Optimization Method for Machine Learning. INFORMS Journal on Computing, 9. 1997.

of the Federal Reserve Bank of Dallas [BS90], has 9 numeric features which range from 0 to 1. The data represent 4311 successful banks and 441 failed banks. Wisconsin Breast Cancer Database. This dataset is used to classify a set of 682 patients with breast cancer [WM90]. Each patient is represented by nine integral attributes ranging in value from 1 to 10. The two classes represented are benign and

Jennifer A. Blue and Kristin P. Bennett. Hybrid Extreme Point Tabu Search. Department of Mathematical Sciences Rensselaer Polytechnic Institute. 1996.

(Liver); the PIMA Indians Diabetes dataset (Diabetes), the Wisconsin Breast Cancer Database (Cancer) [23], and the Cleveland Heart Disease Database (Heart) [9]. We used 5-fold cross validation. Each dataset was divided into 5 parts. The
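
A 5-fold cross-validation split of the kind mentioned here can be sketched as below. The excerpt says only that each dataset was divided into 5 parts; the random shuffle, the fixed seed, and the function name are assumptions added for illustration.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; each of the k parts is the test set once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # shuffling is an assumption, not in the source
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Hypothetical usage on a dataset of 10 examples.
for train, test in k_fold_indices(10, k=5):
    print(len(train), len(test))
```

Every example appears in exactly one test part, so the five test-set errors can be averaged into a single estimate.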

Christophe Giraud and Tony Martinez and Christophe G. Giraud-Carrier. ILA: Combining Inductive Learning with Prior Knowledge and Reasoning. University of Bristol Department of Computer Science. 1995.

Study

Algorithm  PA    GR
ILA        82.7  .49
ILA, T=2   73.9  .20
PDL2       79.7  .66

As expected, results with T=2 show a decrease in PA (about 10%), but also a significant decrease in GR (over 59%). For three of the datasets (zoo, breast cancer and soybean-small), the decrease in PA is less than 1.1% on average, while the decrease in GR is greater than 76%. The threshold T, though not part of the basic model, provides

Andrew I. Schein and Lyle H. Ungar. A-Optimality for Active Learning of Logistic Regression Classifiers. Department of Computer and Information Science Levine Hall.

54. The lodgepole pine variety of tree happens to represent about 50% of the observations and so we merge all other tree types into a single category. The Wisconsin Diagnostic Breast Cancer (WDBC) data set consists of evaluation measurements (predictors) and final diagnosis for 569 patients. The goal is to predict the diagnosis using the measurements. The number of predictors is 30. The Thyroid Domain

Geoffrey I Webb. Learning Decision Lists by Prepending Inferred Rules. School of Computing and Mathematics Deakin University.

supported by the Australian Research Council. I am grateful to Mike Cameron-Jones for discussions that helped refine the ideas presented herein. The Breast Cancer, Lymphography and Primary Tumor data sets were compiled by M. Zwitter and M. Soklic at University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. The Audiology data set was compiled by Professor Jergen at Baylor College of

Alexander K. Seewald. Towards Understanding Stacking: Studies of a General Ensemble Learning Scheme. Dissertation submitted for the academic degree of Doctor of Technical Sciences.

[Figure captions: compressed glyph visualizations for the datasets balance-scale, breast cancer, breast-w, colic, credit-a]

Paul D. Wilson and Tony R. Martinez. Combining Cross-Validation and Confidence to Measure Fitness. fonix corporation Brigham Young University.

at the bottom of Table 1, CVC had a significantly higher average generalization accuracy on this set of classification tasks than both the static and LCV methods at a 99% confidence level or higher. Dataset: Anneal, Australian, Breast Cancer (WI), Bridges, Crx, Echocardiogram, Flag, Glass, Heart, Heart (Cleveland), Heart (Hungarian), Heart (Long Beach), Heart (More), Heart (Swiss), Hepatitis, Horse Colic, Image Segmentation

Rudy Setiono and Huan Liu. Neural-Network Feature Selector. Department of Information Systems and Computer Science National University of Singapore.

are described below. 1. The University of Wisconsin Breast Cancer Diagnosis Dataset. The Wisconsin Breast Cancer Data (WBCD) is a large data set that consists of 699 patterns of which 458 are benign samples and 241 are malignant samples. Each of these patterns consists of nine

D. Randall Wilson and Roel Martinez. Improved Center Point Selection for Probabilistic Neural Networks. Proceedings of the International Conference on Artificial Neural Networks and Genetic Algorithms.

reduction in size can be even more dramatic when there are more instances available. This is especially true when the number of instances is large compared to the complexity of the decision surface. Dataset: Anneal, Audiology, Australian, Breast Cancer (WI), Bridges, Crx, Echocardiogram, Flag, Heart (Hungarian), Heart (More), Heart, Heart (Swiss), Hepatitis, Horse-Colic, Iris, Liver-Bupa, Pima-Indians-Diabetes

Włodzisław Duch and Rafał Adamczak and Krzysztof Grąbczewski and Grzegorz Żal. A hybrid method for extraction of logical rules from data. Department of Computer Methods, Nicholas Copernicus University.

obtained from the UCI repository [14]. A. Wisconsin breast cancer data. The Wisconsin cancer dataset [17] contains 699 instances, with 458 benign (65.5%) and 241 (34.5%) malignant cases. Each instance is described by the case number, 9 attributes with integer value in the range 1-10 (for example,

Jarkko Salojärvi and Samuel Kaski and Janne Sinkkonen. Discriminative clustering in Fisher metrics. Neural Networks Research Centre Helsinki University of Technology.

and secondly through the density function estimate that generates the metric used to define the Fisherian Voronoi regions. IV. EXPERIMENTS Experiments were run with the Wisconsin breast cancer data set from the UCI machine learning repository [9]. The 569 samples consisted of 30 attributes, measured from malignant and benign tumors. We chose the ordinary k-means as the baseline reference method.

Ayhan Demiriz and Kristin P. Bennett and John Shawe-Taylor and I. Nouretdinov V. Linear Programming Boosting via Column Generation. Dept. of Decision Sciences and Eng. Systems, Rensselaer Polytechnic Institute.

criterion for stopping when an optimal ensemble is found that is reached in relatively few iterations. It uses few weak hypotheses. There are only 81 possible stumps on the Breast Cancer dataset (nine attributes having nine possible values), so clearly AdaBoost may require the same tree to be generated multiple times. LPBoost generates a weak hypothesis only once and can alter the weight on

Rafael S. Parpinelli and Heitor S. Lopes and Alex Alves Freitas. An Ant Colony Algorithm for Classification Rule Discovery. Part Four: Ant Colony Optimization and Immune Systems, Chapter X. CEFET-PR, Curitiba.

2. The numbers after the "±" symbol are the standard deviations of the corresponding accuracy rates. As shown in this table, Ant-Miner discovered rules with a better accuracy rate than C4.5 in four data sets, namely Ljubljana breast cancer, Wisconsin breast cancer, Hepatitis and Heart disease. In two data sets, Ljubljana breast cancer and Heart disease, the difference was quite small. In the other two

M. A. Galway and Michael G. Madden. Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm. Technical report NUIG-IT-011002, Department of Information Technology, National University of Ireland, Galway.

and all four are equally good on the Breast Cancer dataset.

          Naïve         TAN           K2            MBBC
Chess     87.63 ± 1.61  91.68 ± 1.09  94.03 ± 0.87  97.03 ± 0.54
WBCD      97.81 ± 0.51  97.47 ± 0.68  97.17 ± 1.05  97.30 ± 1.01
LED-24    73.28 ± 0.70  73.18 ± 0.63  73.14 ± 0.73  73.14 ± 0.73
DNA       94.80 ± 0.44

Nikunj C. Oza and Stuart J. Russell. Online Bagging and Boosting. Computer Science Division University of California.

Bagging and online bagging performed noticeably better than single decision trees on all except the Breast Cancer dataset. With Naive Bayes, bagging and online bagging never performed noticeably better than Naive Bayes, which we expected because of the stability of Naive Bayes [3]. Boosting and online boosting

John G. Cleary and Leonard E. Trigg. Experiences with OB1, An Optimal Bayes Decision Tree Learner. Department of Computer Science University of Waikato.

all the information in vote is contained in one attribute, and for iris two attributes contain all the class information (although most of this can be obtained using only one attribute). Some datasets, such as breast cancer and credit-g appear to contain very little class information. In general, we expect to see OB1 performance increase with tree depth up to a depth that captures the most

David Kwartowitz and Sean Brophy and Horace Mann. Session S2D Work In Progress: Establishing multiple contexts for student's progressive refinement of data mining.

version of WEKA became available. Students used this version to complete an end of semester project that asked them to compare and contrast three data mining techniques to analyze the Breast Cancer Data set. Students reported having little difficulty understanding how to use the software and spent most of their time making decisions about how to prepare the data for analysis and analyzing the results.

Karthik Ramakrishnan. UNIVERSITY OF MINNESOTA.

the number of output classes, and the number of continuous and discrete input features.

                                   Features
Data set         Cases  Class  Continuous  Discrete
breast cancer w  699    2      9           -
credit-a         690    2      6           9
credit-g         1000   2      7           13
glass            214    6      9           -
heart-cleveland  303    2      8           5
hypo             3772   5      7           22
ionosphere       351    2      34          -
iris             159    3      4           -

Liping Wei and Russ B. Altman. An Automated System for Generating Comparative Disease Profiles and Making Diagnoses. Section on Medical Informatics Stanford University School of Medicine, MSOB X215.

profile instead of using all attributes in the original clinical data. The results remain the same. RESULTS. We evaluated the system by applying it to heart disease, diabetes, and breast cancer. All data sets were obtained from the UCI Repository of Machine Learning databases and domain theories. 7 Heart Disease. Four clinical data sets were used. These sets consist of patients who had been referred for

M. V. Fidelis and Heitor S. Lopes and Alex Alves Freitas. Discovering Comprehensible Classification Rules with a Genetic Algorithm. UEPG, CPD CEFET-PR, CPGEI PUC-PR, PPGIA Praça Santos Andrade, s/n Av. Sete de Setembro.

in the medical domains of dermatology and breast cancer. These data sets were obtained from the UCI (University of California at Irvine) - Machine Learning Repository [17]. These data sets have been used extensively for classification tasks using different paradigms,

Chris Drummond and Robert C. Holte. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling. Institute for Information Technology, National Research Council Canada.

[Figure 3. Credit: Comparing Sampling Schemes. Axes: Cost Function PCF(+) vs. Normalized Expected Cost]
breast cancer data set from the Institute of Oncology, Ljubljana. It has 286 instances, 201 non-recurrences and 85 recurrences, with 9 nominal attributes. For this data set, C4.5 only marginally outperforms the cheapest

G. Rätsch and B. Schölkopf and Alex Smola and K.-R. Müller and T. Onoda and Sebastian Mika. Arc: Ensemble Learning in the Presence of Outliers. GMD FIRST.

[17] explains the good generalization performance of AdaBoost in the low noise regime. However, AdaBoost performs worse on noisy tasks [10, 11], such as the iris and the breast cancer benchmark data sets [1]. On the latter tasks, a large margin on all training points cannot be achieved without adverse effects on the generalization error. This experimental observation was supported by the study of

K. A. J Doherty and Rolf Adams and Neil Davey. Unsupervised Learning with Normalised Data and Non-Euclidean Norms. University of Hertfordshire.

considered were the Ionosphere, Image Segmentation (training data), Wisconsin Diagnostic Breast Cancer (WDBC) and Wine data sets. These data sets were selected to show our approach on data with a range of classes, dimensionality and data distributions. The basic characteristics of each data set are shown in table 2.
