Clustering for Data Mining: A Data Recovery Approach
Product Description
Often considered more as an art than a science, the field of clustering has been dominated by learning through examples and by techniques chosen almost through trial-and-error. Even the most popular clustering methods–K-Means for partitioning the data set and Ward’s method for hierarchical clustering–have lacked the theoretical attention that would establish a firm relationship between the two methods and relevant interpretation aids. Rather than the traditiona…



First, understand that the type of clustering being discussed in this book is the statistical technique of finding clusters of data in a collection, where the collection is typically a database. This is not about clustered microcomputers being used to work on big computational tasks as though they were a single supercomputer.
Clustering customers is a key area in data mining and knowledge discovery. You are usually trying to find groups of people with similar, but not necessarily identical, buying patterns. For instance, if you have a group of people who have purchased a book on PHP, you might want to try to sell them a book on MySQL, Apache, or Linux. These programs fit together but are not identical. Still, the customer who purchased the PHP book is more likely to want a MySQL book than an audio CD of a murder mystery.
In this book, two of the most popular clustering techniques, K-Means and Ward’s method, are presented for a reader interested in the technical aspects of data mining, whether as a theoretician or a practitioner. The author says the material is intended to be useful to a reader with no mathematical background beyond high school. But the author also says it might help if the reader is acquainted with basic notions of calculus, statistics, matrix algebra, graph theory and logic. (The author went to a different high school than I did.)
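For readers who have not met K-Means before, here is a minimal sketch of the basic idea in Python. It is not code from the book: the purchase counts are made up, the function name `k_means` is hypothetical, and the starting centers are simply picked by hand. It only shows the usual two alternating steps, assigning each point to its nearest center and then moving each center to the mean of its points.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def k_means(points, centroids, max_iter=100):
    """Plain K-Means on tuples of numbers (illustrative sketch only).

    `centroids` holds the starting centers; choosing them is left to the caller.
    Returns (labels, centroids).
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its old center
        if new_centroids == centroids:
            break  # nothing moved, so the partition is stable
        centroids = new_centroids
    return labels, centroids


# Made-up customers: (web-programming books bought, mystery audio CDs bought)
customers = [(5, 0), (4, 1), (6, 0), (0, 4), (1, 5), (0, 6)]
labels, centers = k_means(customers, centroids=[customers[0], customers[3]])
print(labels)  # the first three customers share one label, the last three the other
```

Passing the starting centers in explicitly keeps the sketch deterministic; in practice the choice of starting centers and of the number of clusters matters a great deal.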
The book describes clustering as applicable to a wide variety of fields, most of them oriented toward discovering social patterns, building biological taxonomies, machine learning, and so on. It discusses the various techniques that have been developed and gives examples of where they have been used.
Rating: 5 / 5
This book gives a smooth, well-motivated and example-rich
introduction to clustering, which is innovative in many respects.
It provides answers to important questions that are very rarely
addressed, if addressed at all.
Examples:
(a) what to do if the user has no idea of the number
of clusters and/or their location – use what is called intelligent k-means;
(b) what to do if the data contain both numeric and categorical
features – use what is called a three-step standardization procedure;
(c) how to catch anomalous patterns;
(d) how to validate clusters, etc.
Some of these may be subject to criticism; however, some motivation is always
supplied, and the results are always reproducible and thus testable.
The book introduces a number
of non-conventional cluster interpretation aids derived from a data
geometry view accepted by the author and based on what are referred
to as contribution weights, which basically show those elements of cluster
structures that distinguish clusters from the rest. These contribution
weights, applied to categorical data, appear to be highly compatible
with what statisticians such as A. Quetelet and K. Pearson were developing
in the past couple of centuries, which is a highly original and welcome
development. The book reviews a rich set of approaches being accumulated
in such hot areas as text mining and bioinformatics, and shows that
clustering is not just a set of naive methods for data processing but
forms an evolving area of data science.
I adopted the book as a text for my data mining courses at the
bachelor's and master's levels.
Rating: 5 / 5