Course Description:
Find the useful information hidden in your data! This course surveys computer-intensive methods for inductive classification and estimation, drawn from Statistics, Machine Learning, and Data Mining. Dr. Elder will describe the key inner workings of leading algorithms, compare their merits, and (briefly) demonstrate their relative effectiveness on practical applications. We'll first review classical statistical techniques, both linear and nonparametric, then outline the ways in which these basic tools are modified and combined into more modern methods. The course pays particular attention to four powerful topics: modern algorithms (such as Neural Networks and Decision Trees), Resampling, Visualization, and Ensembles. Actual scientific and business examples illustrate practical techniques employed by expert analysts. Along the way, major relative strengths and distinctive properties of the leading commercial software products for Data Mining will be discussed.
Handout:
Included are comprehensive notes, annotated references, and the book chapter, "A Statistical Perspective on Knowledge Discovery in Databases", by John Elder & Daryl Pregibon.
Instructor:
John Elder is Chief Scientist of Elder Research Inc., a Data Mining consulting firm in Charlottesville, Virginia. He has over twenty years of experience developing and applying adaptive, data-driven techniques to practical problems - at an engineering consulting firm, an investment management company, Rice University, and the University of Virginia. Dr. Elder has written and spoken widely on pattern discovery topics, is active on statistical and engineering journals and boards, and has authored some influential data mining tools. His practical experience with commercial applications - including credit scoring, direct marketing, sales forecasting, market timing, and fraud detection - helps illustrate the course concepts.
Intended Audience:
Those from industry and academia who work with data and wish to understand recent developments in pattern discovery, data mining, and inductive modeling. At the conclusion of this course, one should be able to discern the basic strengths of competing methods and select the appropriate tools for one's applications. Participants should have prior working experience with computers and interest in applied statistical techniques. (It helps, as well, to have a motivating application you wish to solve.)
Course Outline:
I. Pattern Discovery: An Overview
- Inducing Models from Data: Benefits and Dangers
- The Data Mining Process
- Example Projects from Science and Business
- Team of Experts Required: Problem Domain, Statistics, Algorithms, and Database
- Leading Software Tools and Vendors
II. Classical Statistical Techniques (brief review)
- Regression
- Discriminant Analysis
- Principal Components
- Scatterplot Smoothers
- Nearest Neighbors
- Kernels
III. Modern Methods
- Neural Networks
- Polynomial Networks
- Decision Trees
- ASH (Averaged Shifted Histograms)
- MARS (Multivariate Adaptive Regression Splines)
- RBF (Radial Basis Functions)
IV. Key General Tools
- Scientific Visualization: techniques, Grand Tour, Projection Pursuit, limitations
- Bootstrapping/Resampling: the single most important tool in your toolbox
- Bayes Rule
- Optimization: local and global
- Overfit Regularization: Complexity Penalties, Smoothing, Shrinking, Generalized Degrees of Freedom
V. Data Trouble-Shooting
- Case Diagnostics (Outlying, Influential, Leverage, & Missing points)
- Feature Creation and Selection
VI. Comparing and Combining Methods
- Structure search
- Matching an algorithm to your application
- Theoretical and Empirical comparisons of algorithms
- Combining models to improve accuracy (Bundling)
- Bayesian Model Averaging
- Bagging
- Boosting
- Interpreting why Bundling/Bagging/Boosting work
VII. Top 10 Data Mining Mistakes (with real-world examples)
- Lack data
- Focus on Training
- Rely on 1 technique
- Ask the wrong question
- Listen (only) to the data
- Accept leaks from the future
- Discount pesky cases
- Extrapolate (practically and theoretically)
- Answer every inquiry
- Sample without care
- Believe the best model.

A note about the course scope:
Each of the major topics discussed could clearly comprise a semester-long course if presented in full detail! What this (admittedly intensive) short course provides, however, is a broad overview of the highlights, drawing connections between major developments in the diverse fields that contribute to the emerging discipline of Data Mining.
Previous participants have found this "big picture" to be particularly useful for identifying techniques to use immediately and avenues worthy of further exploration, whether for research or practical problem-solving.
Comments from previous attendees:
- "[Dr. Elder] provided examples shedding light on complex concepts. He gave the big picture all along the way."
- "Gave real practical insights from a practitioner's point of view."
- "Finally someone told me how things are done, not just how great Data Mining is."
- "Most valuable, were the insights into the essence of various methods, their relative strengths and weaknesses, and the important open research areas."
- "Very interesting, knowledgeable, and entertaining approach."