Fuzzy Clustering in R on Medical Diagnosis dataset
1. Loading Required Libraries
R
# Loading Required Libraries
library(e1071)    # for fuzzy clustering
library(ggplot2)  # for visualization
2. Loading the Dataset
We are creating a fictional dataset of patient health parameters. Synthetic data is generated for 100 patients, using three parameters: blood pressure, cholesterol, and BMI (body mass index).
R
# Loading the Dataset
set.seed(123)  # for reproducibility
patients <- data.frame(
  patient_id = 1:100,
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)
3. Data Preprocessing
This step ensures that all the variables are on the same scale, a common practice in clustering.
R
# Data Preprocessing
scaled_data <- scale(patients[, -1])  # drop patient_id, standardize the rest
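As a quick sanity check (a minimal, self-contained sketch that rebuilds the data from the steps above), each scaled column should have mean approximately 0 and standard deviation 1:

```r
# Rebuild the dataset and verify the effect of scale()
set.seed(123)
patients <- data.frame(
  patient_id = 1:100,
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)
scaled_data <- scale(patients[, -1])

round(colMeans(scaled_data), 10)   # all ~0
apply(scaled_data, 2, sd)          # all 1
```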
4. Data Selection for Clustering
This segment involves selecting the relevant variables for clustering.
R
# Data Selection for Clustering
selected_data <- scaled_data[, c("blood_pressure", "cholesterol", "bmi")]
5. Fuzzy C-means Clustering with FGK Algorithm
The Fuzzy Gustafson-Kessel (FGK) algorithm is a variant of the Fuzzy C-means (FCM) clustering algorithm designed for overlapping, non-spherical clusters: each cluster adapts its own covariance-based distance measure. (Note that e1071::cmeans, used below, implements the standard FCM algorithm with Euclidean distance.) The membership grades are determined from the distance between data points and cluster centers. The Euclidean distance measures the straight-line distance between two points in Euclidean space and is given by:
d = √[(x2 − x1)² + (y2 − y1)²]
- where (x1, y1) are the coordinates of one point,
- (x2, y2) are the coordinates of the other point,
- and d is the distance between them.
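The formula can be verified with a short snippet; base R's dist() computes the same quantity, shown here on a 3-4-5 right triangle:

```r
# Euclidean distance between (0, 0) and (3, 4): sqrt(3^2 + 4^2) = 5
p1 <- c(0, 0)
p2 <- c(3, 4)
d <- sqrt(sum((p2 - p1)^2))
d                      # 5
dist(rbind(p1, p2))    # same result via base R's dist()
```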
R
# Fuzzy C-means Clustering with FGK algorithm
set.seed(456)
fgk_clusters <- e1071::cmeans(selected_data, centers = 3, m = 2)$cluster
selected_data contains the columns chosen for clustering. The number of cluster centers is 3, and a higher value of the fuzzifier m produces fuzzier (more overlapping) clusters.
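The effect of m can be seen directly from the standard FCM membership formula, u_ik = 1 / Σ_j (d_ik / d_jk)^(2/(m−1)). A minimal dependency-free sketch (the helper function and distances are illustrative, not part of e1071):

```r
# FCM membership of one point given its distances d to each cluster center
fcm_membership <- function(d, m) {
  sapply(seq_along(d), function(k) 1 / sum((d[k] / d)^(2 / (m - 1))))
}

d <- c(1, 3)                 # the point is 3x closer to center 1
fcm_membership(d, m = 1.2)   # near-crisp: memberships close to (1, 0)
fcm_membership(d, m = 5)     # fuzzy: memberships pulled toward (0.5, 0.5)
```

In both cases the memberships sum to 1; increasing m only redistributes them more evenly across clusters.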
Data Membership Degree Matrix and the Cluster Prototype Evolution Matrices
In fuzzy clustering, each data point is assigned a degree of membership that defines how strongly it belongs to each cluster, whereas the cluster prototype matrices record the centroid positions, which evolve over the iterations.
R
# Fuzzy C-means Clustering with FGK algorithm
set.seed(456)  # for reproducibility
fuzzy_result <- e1071::cmeans(selected_data, centers = 3, m = 2)

# Access the membership matrix and cluster centers
membership_matrix <- fuzzy_result$membership
cluster_centers <- fuzzy_result$centers

# Print the membership matrix and cluster centers
print("Data Membership Degree Matrix:")
print(membership_matrix)
print("Cluster Prototype Evolution Matrices:")
print(cluster_centers)
Output:
"Data Membership Degree Matrix:"
1 2 3
[1,] 0.15137740 0.15999978 0.68862282
[2,] 0.10702292 0.19489294 0.69808414
[3,] 0.71018858 0.18352624 0.10628518
[4,] 0.21623783 0.18849017 0.59527200
[5,] 0.70780116 0.14281776 0.14938109
[6,] 0.63998321 0.23731396 0.12270283
[7,] 0.82691960 0.10470764 0.06837277
[8,] 0.33246815 0.25745565 0.41007620
[9,] 0.08219287 0.10368827 0.81411886
[10,] 0.06659943 0.83694230 0.09645826
...
[100,] 0.12656903 0.12155473 0.75187624
"Cluster Prototype Evolution Matrices:"
blood_pressure cholesterol bmi
1 0.6919000 -0.5087515 -0.4642972
2 -0.1031542 0.7724248 -0.3050143
3 -0.6279179 -0.3104457 0.8176061
Higher membership values indicate a stronger association between a data point and a cluster. Not all 100 rows are shown here; running the code prints the full matrix.
The values in the cluster centers matrix are the final centroid coordinates on the scaled variables: blood pressure, cholesterol, and bmi.
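Since memberships in each row sum to 1, a hard cluster assignment can be recovered by taking each row's largest membership. A minimal sketch with a hypothetical 3-patient membership matrix (rounded values in the spirit of the output above, not the actual run):

```r
# Hypothetical membership matrix: 3 patients x 3 clusters
membership_matrix <- rbind(
  c(0.15, 0.16, 0.69),
  c(0.71, 0.18, 0.11),
  c(0.07, 0.84, 0.09)
)

rowSums(membership_matrix)                  # each row sums to 1
hard <- apply(membership_matrix, 1, which.max)
hard                                        # hard assignments: 3 1 2
```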
6. Interpret the Clustering Results
In this step we combine the clustering results with the original data using the cbind() function. The summary() function then gives an overview of the combined data.
R
# Interpret the Clustering Results
clustered_data <- cbind(patients, cluster = fgk_clusters)
summary(clustered_data)
Output:
patient_id blood_pressure cholesterol bmi cluster
Min. : 1.00 Min. : 96.91 Min. :138.4 Min. :16.22 Min. :1.00
1st Qu.: 25.75 1st Qu.:115.06 1st Qu.:176.0 1st Qu.:22.34 1st Qu.:1.00
Median : 50.50 Median :120.62 Median :193.2 Median :25.18 Median :2.00
Mean : 50.50 Mean :120.90 Mean :196.8 Mean :25.60 Mean :2.02
3rd Qu.: 75.25 3rd Qu.:126.92 3rd Qu.:214.0 3rd Qu.:28.82 3rd Qu.:3.00
Max. :100.00 Max. :141.87 Max. :297.2 Max. :36.47 Max. :3.00
The summary() output shows the minimum, first quartile, median, mean, third quartile, and maximum of each column of the dataset. This information can help researchers study the underlying patterns in the data for further decision making.
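Per-cluster summaries are often more informative than the overall summary(). A sketch using aggregate() to compute cluster-wise means (the cluster column here is a random placeholder so the snippet is self-contained; in the article's pipeline it would be fgk_clusters):

```r
# Cluster-wise means of each health parameter (placeholder cluster labels)
set.seed(123)
clustered_data <- data.frame(
  blood_pressure = rnorm(100, 120, 10),
  cholesterol = rnorm(100, 200, 30),
  bmi = rnorm(100, 25, 5),
  cluster = sample(1:3, 100, replace = TRUE)  # placeholder assignments
)

cluster_means <- aggregate(cbind(blood_pressure, cholesterol, bmi) ~ cluster,
                           data = clustered_data, FUN = mean)
cluster_means  # one row of means per cluster
```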
Gap Index
The gap index, or gap statistic, is used to estimate the optimal number of clusters within a dataset. It identifies the number of clusters beyond which adding more clusters no longer plays a significant role in the analysis.
R
# Function to calculate the gap statistic
gap_statistic <- function(data, max_k, B = 50, seed = NULL) {
  set.seed(seed)
  # Observed within-cluster dispersion for k = 1..max_k
  wss <- numeric(max_k)
  for (i in 1:max_k) {
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  # Within-cluster dispersion for each of B reference datasets
  B_wss <- matrix(NA, B, max_k)
  for (b in 1:B) {
    ref_data <- matrix(rnorm(nrow(data) * ncol(data)), nrow = nrow(data))
    for (i in 1:max_k) {
      B_wss[b, i] <- sum(kmeans(ref_data, centers = i)$withinss)
    }
  }
  # Simplified gap: log observed dispersion minus mean reference dispersion.
  # (The standard gap statistic puts both terms on the log scale:
  #  gap(k) = mean(log(B_wss[, k])) - log(wss[k]).)
  gap <- log(wss) - apply(B_wss, 2, mean)
  return(gap)
}

# Example usage of the gap_statistic function
gap_values <- gap_statistic(selected_data, max_k = 10, B = 50, seed = 123)
print(gap_values)
Output:
[1] -286.82712 -209.32084 -163.01342 -131.98106 -112.70612 -98.07825 -87.90545
[8] -77.92460 -69.81373 -63.42550
The large negative values arise because this simplified function subtracts the mean reference dispersion (on the raw scale) from the log of the observed dispersion, so the absolute numbers are not directly interpretable. In the standard gap statistic both terms are on the log scale, and the optimal k is the smallest value whose gap is within one standard error of the gap at k + 1.
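The standard Tibshirani gap statistic, including standard errors, is available as clusGap() in the cluster package (loaded later for clusplot). A sketch on synthetic stand-in data, since the exact values depend on the dataset:

```r
library(cluster)

set.seed(123)
X <- scale(matrix(rnorm(300), ncol = 3))  # stand-in for selected_data

# B bootstrap reference datasets; extra args (nstart) are passed to kmeans
gap <- clusGap(X, FUNcluster = kmeans, K.max = 5, B = 20, nstart = 10)

# gap$Tab has columns logW, E.logW, gap, SE.sim;
# maxSE() applies the one-standard-error rule to pick k
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])
k_hat
```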
Davies-Bouldin’s index
It assesses the average similarity between clusters, taking into account both the scatter within clusters and the separation between them, which helps us estimate the quality of the clustering.
R
# Function to calculate the Davies-Bouldin index
davies_bouldin_index <- function(data, cluster_centers, membership_matrix) {
  num_clusters <- nrow(cluster_centers)
  # Membership-weighted mean distance of points to each center (cluster scatter).
  # sweep() subtracts the center from every row, and membership_matrix[, i]
  # selects the memberships of all points in cluster i.
  scatter <- numeric(num_clusters)
  for (i in 1:num_clusters) {
    dists <- sqrt(rowSums(sweep(data, 2, cluster_centers[i, ])^2))
    scatter[i] <- mean(dists * membership_matrix[, i])
  }
  # Pairwise distances between cluster centers (cluster separation)
  separation <- matrix(0, nrow = num_clusters, ncol = num_clusters)
  for (i in 1:num_clusters) {
    for (j in 1:num_clusters) {
      if (i != j) {
        separation[i, j] <- sqrt(sum((cluster_centers[i, ] - cluster_centers[j, ])^2))
      }
    }
  }
  # Davies-Bouldin index: average over clusters of the worst-case
  # (scatter_i + scatter_j) / separation_ij ratio
  db_index <- 0
  for (i in 1:num_clusters) {
    max_val <- -Inf
    for (j in 1:num_clusters) {
      if (i != j) {
        val <- (scatter[i] + scatter[j]) / separation[i, j]
        if (val > max_val) max_val <- val
      }
    }
    db_index <- db_index + max_val
  }
  db_index / num_clusters
}

# Example usage of the Davies-Bouldin index function
db_index <- davies_bouldin_index(selected_data, cluster_centers, membership_matrix)
print(paste("Davies-Bouldin Index:", db_index))
Output:
"Davies-Bouldin Index: 0.77109024677212"
The relatively low value indicates that the clusters are compact and well separated from each other.
7. Visualizing the Clustering Results
R
# Visualizing the Clustering Results
ggplot(clustered_data, aes(x = blood_pressure, y = cholesterol,
                           color = factor(cluster))) +
  geom_point(size = 3) +
  labs(title = "Clustering of Patients Based on Health Parameters",
       x = "Blood Pressure", y = "Cholesterol") +
  scale_color_manual(values = c("darkgreen", "green3", "lightgreen")) +
  theme_minimal()
Output:
In this graph, each data point represents a patient, colored by cluster assignment. The different shades of green distinguish the clusters from one another.
Data Point Cluster Representation
This representation is useful because it simplifies a complex structure into a form that is easier to understand. It can reveal underlying trends, patterns, and relationships that are hard to see in the high-dimensional original dataset, and it also helps in spotting outliers that are not easily detectable otherwise.
R
# Load the required library
library(cluster)

# Store the cluster assignment as a factor
clustered_data$cluster <- as.factor(clustered_data$cluster)

# Plot the clusters using clusplot
clusplot(selected_data, clustered_data$cluster,
         color = TRUE, shade = TRUE, labels = 2, lines = 0)
Output:
Different clusters are shown in different colors, and shading gives a clearer view of each cluster's extent. The note that the plot explains 71.02% of the point variability means that the two principal components onto which clusplot projects the data capture 71.02% of the variance in the original dataset.
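The variance-explained figure can be computed directly with prcomp(). A sketch on stand-in data (so the exact percentage will differ from the 71.02% above):

```r
set.seed(123)
X <- scale(matrix(rnorm(300), ncol = 3))  # stand-in for selected_data

pca <- prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Share of total variance captured by the first two principal components,
# i.e. the "point variability" clusplot reports
pc2_share <- cumsum(var_explained)[2]
pc2_share
```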
Variable Relationships Visualization
To visualize the relationships between the variables we plot a pairwise scatter plot of the dataset, using the pairs() function to create a scatter plot matrix.
R
# Create a scatter plot matrix (pairs() is base R graphics; no extra packages needed)
pairs(selected_data, main = "Scatter Plot Matrix of Health Parameters")
Output:
The diagonal elements label each variable, and the off-diagonal panels show the pairwise relationships between blood pressure, cholesterol, and BMI. In the context of patient health parameters, the scatter plot matrix helps us see how these variables are connected to each other; understanding these patterns can help us assess potential risk and support decision making.
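The pairwise relationships seen in the scatter plot matrix can be quantified with cor(). A self-contained sketch that regenerates the simulated variables (independent draws, so the off-diagonal correlations should be near zero):

```r
# Correlation matrix of the simulated health parameters
set.seed(123)
selected_data <- cbind(
  blood_pressure = rnorm(100, 120, 10),
  cholesterol = rnorm(100, 200, 30),
  bmi = rnorm(100, 25, 5)
)

cm <- cor(selected_data)
round(cm, 2)  # 1s on the diagonal; small off-diagonal values
```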
In this example, we created a fictional medical-diagnosis dataset and clustered it with a fuzzy C-means approach, combining the clustering results with the original data using several packages. This kind of clustering can help medical practitioners draw conclusions from the similarities between patients' histories and symptoms, which in turn makes treatment decisions easier.
Conclusion
In this article, we learned about the algorithms underlying fuzzy clustering and how it helps in various fields such as medicine, agriculture, traffic pattern analysis, and customer segmentation. We applied the technique to a dataset and plotted the clustering results for better visualization. These clustered data points help researchers identify how each observation belongs or contributes to different factors and how those factors affect the study as a whole.
Fuzzy Clustering in R
Clustering is an unsupervised machine-learning technique that is used to identify similarities and patterns within data points by grouping similar points based on their features. These points can belong to different clusters simultaneously. This method is widely used in various fields such as Customer Segmentation, Recommendation Systems, Document Clustering, etc. It is a powerful tool that helps data scientists identify the underlying trends in complex data structures. In this article, we will understand the use of fuzzy clustering with the help of multiple real-world examples.