Clustering is used for four major tasks:
Simplifying data for further processing. Here the data is split into groups that are then processed individually. In business, for instance, cluster analysis can reveal groups of customers who share similar features, and we can then develop a separate marketing strategy for each group. Or we can cluster the products in a specific market niche to see which kinds sell better than others and decide what to produce. Clustering is often one of the first techniques applied to a new dataset, because it gives a sense of how the data is structured.
Compressing data. We can run cluster analysis on a very large dataset and then keep only a few items from each cluster. This usually loses much less information than picking data points without clustering first. Clustering algorithms are used to compress not only large datasets but also relatively small objects such as images.
Picking out unusual data points from a dataset. This is done, for example, to detect fraudulent credit card transactions. In medicine, similar procedures can help identify new forms of illness.
Building a hierarchy of objects. This is used to classify biological organisms. Search engines also apply it to group the text documents in their collections.
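To make two of these tasks concrete, here is a minimal sketch of compression and of picking out unusual points. It assumes scikit-learn is installed; the synthetic data, the choice of five representatives per cluster, and the `eps` and `min_samples` values are illustrative assumptions, not settings taken from later chapters.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): three tight blobs
# of 100 points each, plus one point placed far away from all of them.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
X = np.vstack([blobs, [[30.0, 30.0]]])

# Compression: cluster the blobs, then keep only the 5 points
# closest to each cluster centre as representatives.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(blobs)
kept = []
for k in range(3):
    members = blobs[kmeans.labels_ == k]
    dist = np.linalg.norm(members - kmeans.cluster_centers_[k], axis=1)
    kept.append(members[np.argsort(dist)[:5]])
representatives = np.vstack(kept)

# Unusual points: DBSCAN assigns the label -1 to points that do not
# belong to any dense region.
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)
outlier_idx = np.flatnonzero(labels == -1)

print("kept", len(representatives), "of", len(blobs), "points")
print("indices flagged as unusual:", outlier_idx)
```

Here 300 points are reduced to 15 that still cover all three groups, and the isolated point at index 300 is flagged as noise. On real data, the number of clusters and the density parameters would have to be chosen with care.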
In the introductory chapter, you will find:
Different types of machine learning;
Features in datasets;
Dimensionality of datasets;
The 'curse' of dimensionality;
Dealing with underfitting and overfitting.
In the following chapters, we will put these concepts into practice, working with clustering algorithms.
This book provides detailed explanations, with visual representations, of several widely used clustering approaches:
Hierarchical agglomerative clustering;
K-means;
DBSCAN;
Neural network-based clustering.
You will learn the strengths and weaknesses of these algorithms, as well as practical strategies for overcoming the weaknesses. In addition, we will briefly touch upon some other clustering methods.
The examples are written in Python 3. We will work with several datasets, including some based on real-world data.
We will work primarily with the Scikit-learn and SciPy libraries. Our neural network for clustering, however, we will build essentially from scratch, using only NumPy arrays.
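As a preview of how these libraries are typically called, the sketch below runs three of the listed approaches on the same toy data. It is an illustration under stated assumptions rather than the code used in later chapters: the `make_blobs` data, Ward linkage, and all parameter values are chosen here only for demonstration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration): three well-separated blobs.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=0.6,
    random_state=42,
)

# Hierarchical agglomerative clustering with SciPy: build the merge
# tree, then cut it so that at most 3 flat clusters remain.
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# K-means with scikit-learn: the number of clusters is given up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN with scikit-learn: clusters are dense regions; -1 marks noise.
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

for name, labels in [
    ("hierarchical", hier_labels),
    ("k-means", kmeans_labels),
    ("DBSCAN", dbscan_labels),
]:
    n_clusters = len(set(labels.tolist()) - {-1})
    print(f"{name}: {n_clusters} clusters")
```

On data this clean, all three methods recover the same three groups. The differences between them show up on harder datasets.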