The objectives of this section are:
define density-based clustering
explain the major parts
introduce the DBSCAN algorithm
list the limitations and advantages of this method
By the time you have completed this section you will be able to:
explain the basic DBSCAN algorithm
label points into the appropriate group type
determine which scenarios this algorithm would yield good results.
Algorithm
The DBSCAN algorithm involves grouping any two core points that are close enough into the same cluster. Border points that are close enough to a core point are put into the same cluster as the core point and noise points are discarded. Figure 2 shows a high-level description of this algorithm.
But how exactly is this done? How would I implement this? Do I go through the data points each point at a time or do I deal cluster multiple core points and border points at the same time? These are a few of the questions that might come to mind. This high-level description is silent on a couple of key details that the pseudo code in Figure 3 sheds more light on.
Figure 3 below shows pseudo code for the DBSCAN algorithm.
There are two user specified parameters that are crucial to the success of this algorithm and as was previously mentioned the radius (E) cannot be chosen lightly and neither can the MinPts (the threshold number of points). These parameters are not randomly selected but instead the basic approach is to analyze the behavior of the distance from a point to its kth nearest neighbor. This distance (k-dist) is small if the point belongs to that particular cluster and large if it doesn’t. We therefore calculate the k-dist for all the data points for some k, sort them in increasing order, plot the results and from the plot we will be able to see a sharp change at the value of k-distance that corresponds to a suitable value for E. For more information on how to determine these parameters consult a traditional data mining textbook.