Adult Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Return to Adult data set page.
Saharon Rosset. Model selection via the AUC. ICML. 2004.
to prefer the K-NN model most of the time. This illustrates the "bias" in using AUC to select classi#cation models, which we discuss in section 3.2 Finally, we performed experiments on a real-life data set. We used the Adult data-set available from the UCI repository (Blake & Merz, 1998). We used only the first ten variables in this data-set, to make a largescale experiment feasible, and compared
Rich Caruana and Alexandru Niculescu-Mizil. An Empirical Evaluation of Supervised Learning for ROC Area. ROCAI. 2004.
selection is done using the 1k validation sets, SVMs move slightly ahead of the neural nets.) Boosted stumps and plain decision trees are not competitive, though boosted stumps are best on the Adult data set. It is interesting to note that boosting weaker stump models is clearly inferior to boosting full decision trees on most of the test problems: boosting full decision trees yields better performance
Andrew W. Moore and Weng-Keen Wong. Optimal Reinsertion: A New Search Operator for Accelerated and More Accurate Bayesian Network Structure Learning. ICML. 2003.
if and only if an odd number of parents have value "True". The nodes are thus noisy exclusive-ors and so it is hard to learn a set of parents incrementally. Synth2 Synth3 Synth4 Figure 3. Synthetic datasets described in Section 3.1. R m AA adult 49K 15 7.7 Contributed to UCI by Ron Kohavi alarm 20K 37 2.8 Data generated from a standard Bayes Net benchmark (Beinlich et al., 1989). biosurv 150K 24 3.5
Alexander J. Smola and Vishy Vishwanathan and Eleazar Eskin. Laplace Propagation. NIPS. 2003.
# Ä # # # ##Ç # (15) with the joint minimizer being the average of the individual solutions. 5 Experiments To test our ideas we performed a set of experiments with the widely available Web and Adult datasets from the UCI repository . All experiments were performed on a 2.4 MHz 2 Note that we had to replace the equality with set inclusion due to the fact that Ü is not everywhere differentiable, hence
I. Yoncaci. Maximum a Posteriori Tree Augmented Naive Bayes Classifiers. O EN INTEL.LIG ` ENCIA ARTIFICIAL CSIC. 2003.
In the rest of the section we discuss and justify these assertions into more detail. 14 Dataset MAPTAN MAPTAN+BMA sTAN sTAN+BMA adult 17.18 ± 0.68 17.19 ± 0.71 17.60 ± 0.82 17.60 ± 0.80 australian 19.91 ± 1.14 19.62 ± 1.13 25.39 ± 1.18 24.96 ± 1.13 breast 17.23 ± 1.21 16.89 ± 1.28 8.73 ± 0.87
Nitesh V. Chawla and Kevin W. Bowyer and Lawrence O. Hall and W. Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. (JAIR, 16. 2002.
is to distinguish between nasal (class 0) and oral sounds (class 1). There are 5 features. The class distribution is 3,818 samples in class 0 and 1,586 samples in class 1. 3. The Adult dataset (Blake & Merz, 1998) has 48,842 samples with 11,687 samples belonging to the minority class. This dataset has 6 continuous features and 8 nominal features. SMOTE and SMOTE-NC (see Section 6.1)
Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. KDD. 2002.
# x### s # : the number of examples with score s that belong to class c divided by the total number of examples 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Adult Dataset NB Score Empirical class membership probability 8941 790 610 450 480 532 477 620 672 2710 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 The Insurance Company
Zhiyuan Chen and Johannes Gehrke and Flip Korn. Query Optimization In Compressed Database Systems. SIGMOD Conference. 2001.
are not compressed. TPC-H data contains 8 tables and 61 attributes, 23 of which are string-valued. The string attributes account for about 60% of the total database size. We also used a 4MB of dataset with US census data, the adult data set  for experiments on compression strategies. The adult dataset contains a single table with 14 attributes, 8 of them string-valued, accounting for about 80%
Bernhard Pfahringer and Geoffrey Holmes and Richard Kirkby. Optimizing the Induction of Alternating Decision Trees. PAKDD. 2001.
kr-vs-kp 3196 0.0 0 36 labor 57 33.6 8 8 mushroom 8124 1.3 0 22 promoters 106 0.0 0 57 sick-euthyroid 3163 6.5 7 18 sonar 208 0.0 60 0 splice 3190 0.0 0 61 vote 435 5.3 0 16 vote1 3 435 5.5 0 15 KDD Datasets coil 5822/4000 0.0 85 0 adult 32561/16281 0.2 6 8 art1 50000/50000 0.0 0 50 art2 50000/50000 0.0 25 25 art3 50000/50000 0.0 50 0 This section compares the performance of the original optimized
Gary M. Weiss and Haym Hirsh. A Quantitative Study of Small Disjuncts: Experiments and Results. Department of Computer Science Rutgers University. 2000.
were compared as the training set size was varied. Because disjuncts of a specific size for most concepts cover very few examples, statistically valid comparison were possible for only 4 of the 30 datasets (Coding, Move, Adult and Market2); with the other datasets the number of examples covered by disjuncts of a given size is too small. The results for the Coding dataset are shown in Figure 8.
Kristin P. Bennett and Ayhan Demiriz and John Shawe-Taylor. A Column Generation Algorithm For Boosting. ICML. 2000.
all points and # i measures the additional margin obtained by each point. AdaBoost also minimizes a margin cost function based on the margin obtained by each point. We ran experiments on two larger datasets: Forest and Adult from UCI(Murphy & Aha, 1992). Forest is a 54-dimension dataset with 7 possible classes. The data are divided into 11340 training, 3780 validation and 565892 testing instances.
Dmitry Pavlov and Jianchang Mao and Byron Dom. Scaling-Up Support Vector Machines Using Boosting Algorithm. ICPR. 2000.
one week of February 1998. The classification task, as we pose it, is to predict whether a user will visit the most popular site S based on his/her visiting pattern of all other sites. The Adult data set is available at UCI machine learning repository . The task is to predict if the income of a person is greater than 50K based on several census parameters, such as age, education, marital status
Jie Cheng and Russell Greiner. Comparing Bayesian Network Classifiers. UAI. 1999.
used in the experiments. Instances Dataset Attributes. Classes Train Test Adult 13 2 32561 16281 Nursery 8 5 8640 4320 Mushroom 22 2 5416 2708 Chess 36 2 2130 1066 Car 6 4 1728 CV5 Flare 10 3 1066 CV5 Vote 16 2 435 CV5 Brief descriptions of
John C. Platt. Using Analytic QP and Sparseness to Speed Training of Support Vector Machines. NIPS. 1998.
can be found in [8, 7]. The first test set is the UCI Adult data set . The SVM is given 14 attributes of a census form of a household and asked to predict whether that household has an income greater than $50,000. Out of the 14 attributes, eight are categorical
Luc Hoegaerts and J. A. K Suykens and J. Vandewalle and Bart De Moor. Subset Based Least Squares Subspace Regression in RKHS. Katholieke Universiteit Leuven Department of Electrical Engineering, ESAT-SCD-SISTA.
side our approach achieves overall a much smaller O(nm) memory cost, compared to the typical O(n 2 ) and a computational complexity of O(nm 3 ) compared to the typical O(n 2 ). The ADULT UCI data set  consists of 45222 cases having 14 input variables. The aim is to classify if the income of a person is greater than 50K based on several census parameters, such as age, education, marital
David R. Musicant and Alexander Feinberg. Active Set Support Vector Regression.
Census 30k, is a version of the US Census Bureau Adult dataset, which is publicly available from Silicon Graphics' website . This "Adult" dataset contains nearly 300,000 data points with 11 numeric attributes, and is used for predicting income levels based
Luc Hoegaerts and J. A. K Suykens and J. Vandewalle and Bart De Moor. Primal Space Sparse Kernel Partial Least Squares Regression for Large Scale Problems Special Session paper . Katholieke Universiteit Leuven Department of Electrical Engineering, ESAT-SCD-SISTA.
sample with added noise (dots). The subset consists of 5 points, marked with a full dot on the figure. The ADULT UCI data set  consists of 45222 cases having 14 input variables. The aim is to classify if the income of a person is greater than 50K based on several census parameters, such as age, education, marital
Luca Zanni. An Improved Gradient Projection-based Decomposition Technique for Support Vector Machines. Dipartimento di Matematica, Universitdi Modena e Reggio Emilia.
in . In order to analyze the behaviour of the two solvers within GPDT2 we consider three large test problems of the form (1) derived by training Gaussian SVMs on the well known UCI Adult data set , WEB data set  and MNIST data set . A detailed description of the test problems generation is reported in the Appendix. All the experiments are carried out with standard C code running
Shi Zhong and Weiyu Tang and Taghi M. Khoshgoftaar. Boosted Noise Filters for Identifying Mislabeled Data. Department of Computer Science and Engineering Florida Atlantic University.
in Table 1. Overall, BBF-I significantly outperforms BBF-II, except for low ( 20%) noise levels for the adult car, and nursery datasets. The reason BBF-II performs poorly may be that too many clean instances are weighted low. The noise filter constructed in the next round loses strong support from clean data instances, which are
William W. Cohen and Yoram Singer. A Simple, Fast, and Effective Rule Learner. AT&T Labs--Research Shannon Laboratory.
average ranks among these three are 1.8, 2.3, and 1.9. The largest ruleset produced by SLIPPER is 49 rules (for coding). Finally, we evaluated the scalability of the rule learners on several large datasets. We used adult blackjack, with the addition of 20 irrelevant noise variables; and market3, for which many examples were available. C4rules was not run, since it is known to have scalability
Grigorios Tsoumakas and Ioannis P. Vlahavas. Fuzzy Meta-Learning: Preliminary Results. Greek Secretariat for Research and Technology.
from the Machine Learning Repository at the University of Irvine, California (Blake & Merz, 1998). These were the adult and chess data sets, large enough (} 1000 examples) to simulate distributed environment. Only two domains were selected at this stage of our research to investigate the performance of the suggested methodology. The
Josep Roure Alcobe. Incremental Hill-Climbing Search Applied to Bayesian Network Structure Learning. Escola Universitria Politcnica de Mataro.
by means of a parameter nRSS. 3.2 Experimental Results In this section we compare the performance of repeatedly using the batch algorithms against the corresponding incremental approach. We used the datasets Adult 48.842 instances and 13 variables), Mushroom (8.124 inst. and 23 var.) and Nursery (12.960 inst. and 9 var.) from the UCI machine learning repository , the Alarm dataset (20.000 inst. and
Ayhan Demiriz and Kristin P. Bennett and John Shawe and I. Nouretdinov V.. Linear Programming Boosting via Column Generation. Dept. of Decision Sciences and Eng. Systems, Rensselaer Polytechnic Institute.
of the final set of weak hypotheses. This is just a very simple method of boosting multiclass problems. Further investigation of LP multiclass approaches is needed. We ran experiments on larger datasets: Forest, Adult USPS, and Optdigits from UCI(Murphy & Aha, 1992). Forest is a 54-dimension dataset with seven possible classes. The data are divided into 11340 training, 3780 validation, and 565892