What's in this book? This book is a step backwards, to four classical methods for clustering in small, static data sets, which have all withstood the tests of time. The youngest of the four methods is now more than 40 years old:
Gaussian Mixture Decomposition (GMD, 1898)
Hard c-means (HCM, 1956, often called "k-means")
Fuzzy c-means (FCM, 1973, reduces to HCM in a certain limit)
SAHN Clustering (principally single linkage (SL, 1909))
The dates shown are those of the first known writing (to me, anyway) about each of these four models. There are many different algorithms that attempt to optimize the first three models, which all define good clusters as parts of extrema of optimization problems defined by their objective functions. The SAHN models are deterministic, and operate in a very different way.
The expansion of cluster analysis into every corner of our modern lives (big data, social networks, streaming video, wireless sensor networks, ... the list is endless) still rests on the foundation provided by these four models. I am (with apologies to Marvel Comics) very comfortable in calling HCM, FCM, GMD and SL the fantastic four.
As in many branches of information technology, this is a vast topic. The overall picture in clustering is quite overwhelming, so any attempt to swim at the deep end of the pool in even a very specialized subfield requires a lot of training. But we all start out at the shallow end (or at least that's where we should start!), and this book is aimed squarely at teaching toddlers not to be afraid of the water. With the exception of Chapter 10, there is not a section of this book that, if explored in real depth, cannot be expanded into its own volume. So, if your needs are for an in-depth treatment of all the latest developments in any topic in this volume, the best I can do - what I will try to do anyway - is lead you to the pool, and show you where to jump in. If you are a graduate student, professor, or professional in computational science or engineering, you may already know more than this book contains, so it won't be very useful to you. My hope is that this volume will be useful to the real novice, who thinks that clustering might be a good thing to try, but who knows very little about it. To close this part of the introduction, I repeat a cautionary statement made by Marriott (1974) about the dangers of believing what a computer tells you about clusters in your data:
"If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact."
Jim received the PhD in Applied Mathematics from Cornell University in 1973. Jim is past president of NAFIPS (North American Fuzzy Information Processing Society), IFSA (International Fuzzy Systems Association) and the IEEE CIS (Computational Intelligence Society); founding editor of the Int'l. Jo. Approximate Reasoning and the IEEE Transactions on Fuzzy Systems; Life fellow of the IEEE and IFSA; and a recipient of the IEEE 3rd Millennium, CIS Fuzzy Systems Pioneer, and technical field award Rosenblatt medals, and the IPMU Kampé de Fériet Award. Jim retired in 2007, and will be coming to a university near you soon (especially if there is fishing nearby).
Keywords: Cluster Analysis, Hard Clustering, K-Means, Fuzzy Clustering, Fuzzy C-Means, Probabilistic Clustering, Gaussian Mixture Decomposition, Single Linkage Clustering, Alternating Optimization, Clustering In Big Data