Parameters of HDBSCAN
HDBSCAN has a number of parameters that can be adjusted to tailor the clustering process to a specific dataset. Here are some of the main parameters:
- ‘min_cluster_size’: This parameter sets the minimum number of points required to form a cluster. Groups of points smaller than this size are not reported as clusters, and their points may be labeled as noise. Adjusting this parameter influences the granularity of the clusters found by the algorithm.
- ‘min_samples’: It sets the minimum number of samples in a neighborhood for a point to be considered a core point. Larger values make the clustering more conservative, so more points are labeled as noise.
- ‘cluster_selection_epsilon’: This parameter sets a distance threshold used when selecting clusters from the hierarchy. Clusters that would only split apart at distances below this value are merged rather than reported separately.
- ‘metric’: The distance metric to use for computing mutual reachability distance.
- ‘cluster_selection_method’: The method used to choose clusters from the condensed tree. It can be ‘eom’ (Excess of Mass, the default) or ‘leaf’ (the leaves of the cluster tree, which tends to give many small, homogeneous clusters).
- ‘alpha’: A distance scaling parameter, as used in robust single linkage, that influences how clusters are merged.
- ‘gen_min_span_tree’: If this parameter is true, the minimum spanning tree is generated and kept for later use.
- ‘metric_params’: Additional keyword arguments for the metric function.
- ‘algorithm’: The algorithm to use for the mutual reachability distance computation. Options include ‘best’, ‘generic’, ‘prims_kdtree’, and ‘boruvka_kdtree’.
- ‘core_dist_n_jobs’: The number of parallel jobs to run for the core distance calculation.
- ‘allow_single_cluster’: A boolean indicating whether to allow single cluster outputs.
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
Clustering is a machine-learning technique that divides data into groups, or clusters, based on similarity. By grouping similar data points together and placing dissimilar points in different clusters, it seeks to uncover underlying structures in datasets.
In this article, we will focus on the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) technique. Like other density-based clustering methods, HDBSCAN begins by determining the proximity of the data points, distinguishing the regions with high density from sparse regions. What distinguishes HDBSCAN from other methods is its capacity to adapt to clusters of varying density and shape in the data, producing more reliable and adaptable clustering results.