
Perform Distributed k-Means Clustering with Intel® oneAPI Data Analytics Library (oneDAL)

Nikita_Shiledarbaxi

Authors: Nikita Sanjay Shiledarbaxi, Rob Mueller-Albrecht

Leverage the daal4py Python* API for accelerated Machine Learning with oneDAL

 

K-Means is an unsupervised machine learning algorithm for centroid-based clustering of data points. Distributed k-Means is a variant of this traditional clustering method designed for distributed computing systems such as cloud platforms. The algorithm partitions a large set of observations into k clusters, assigning each observation to the cluster whose centroid (the mean of the cluster’s points) is nearest.
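
As a rough illustration of that definition, one k-Means iteration on a single machine can be sketched in plain NumPy. This is a minimal sketch, not the oneDAL implementation; the function names assign_to_nearest and update_centroids and the toy data are made up for this example.

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    # Pairwise squared Euclidean distances, shape (n_points, n_centroids)
    diffs = points[:, None, :] - centroids[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members) > 0:  # an empty cluster keeps its old centroid
            new_centroids[j] = members.mean(axis=0)
    return new_centroids

# Two well-separated groups of 2-D points (toy data)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

labels = assign_to_nearest(points, centroids)
centroids = update_centroids(points, labels, centroids)
print(labels)     # [0 0 1 1]
print(centroids)  # rows (0, 1) and (10, 11)
```

Repeating these two steps until the centroids stop moving gives the traditional k-Means procedure that the distributed variant builds on.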

This blog discusses a code sample that performs distributed k-Means clustering using daal4py, a simplified Python* API to the Intel® oneAPI Data Analytics Library (oneDAL). The sample is a useful starting point for practical applications where efficient clustering of large datasets is crucial, such as:

  • Big data analytics: to cluster huge datasets that cannot fit into the memory of a single machine for tasks such as pattern recognition, customer segmentation, and anomaly detection.

  • Financial data analysis: to cluster financial transactions and customers’ behavior for fraud detection, market segmentation, risk assessment, and mitigation.

  • Natural Language Processing (NLP): to cluster lengthy text documents or features extracted from textual data for document categorization, topic modeling, and sentiment analysis.

  • Genomics and bioinformatics: to cluster large genomic datasets for disease classification, biological data analysis, and gene function prediction.

  • Image and video processing: to cluster images and video frames for object recognition, video summarization, and content-based image retrieval.

  • Network intrusion detection: to cluster network traffic data for detecting potential security threats.

  • Climate and environmental data analysis: to cluster and analyze climate data for weather forecasting.

Before we go into the details of the code sample available in the oneAPI GitHub repository, let us briefly look at the daal4py API, oneDAL, and the working mechanism of the distributed k-Means technique.

An Overview: Intel® oneAPI Data Analytics Library (oneDAL) and daal4py Python* API

oneDAL, available as part of the Intel® oneAPI Base Toolkit (Base Kit), is a high-performance library for building optimized AI/ML and data science workloads on Intel’s CPUs and GPUs. It supports a wide range of machine learning algorithms, including but not limited to k-Means clustering, random forest, Support Vector Machine (SVM), Gradient Boosted Trees (GBT), and linear and logistic regression. It optimizes various phases of an ML pipeline, from data ingestion to making predictions, enabling faster development of compute-intensive applications.

daal4py, included in the Intel® AI Analytics Toolkit (AI Kit), is a Python API to oneDAL designed to simplify the usage of high-performance ML algorithms and frameworks provided by the library. It allows you to leverage the constituents of oneDAL in a flexible and customizable manner. It also enables accelerated data analysis and processing via batch, distributed, and streaming processing modes.

How Distributed k-Means Clustering Works

The distributed k-Means clustering algorithm goes through the following sequence of steps:

  1. The dataset is partitioned into smaller subsets, each assigned to a distinct computing node.
  2. All the nodes simultaneously and independently compute centroids and form clusters of their local data using the traditional k-Means algorithm. Such parallel processing across the nodes expedites the overall clustering process.
  3. The cluster centroids computed on different nodes are periodically exchanged among the nodes to ensure that all the nodes converge to a global solution.
  4. The information from all the nodes is then aggregated to compute the global cluster centroids.
  5. Steps 2 through 4 are repeated until the algorithm converges, i.e., until the centroids no longer change significantly between iterations.
  6. Once the final centroids have been calculated, the final clusters are formed by assigning each data point to the cluster of its nearest global centroid.
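
The steps above can be sketched as a small single-process simulation in NumPy. This is only an illustration under stated assumptions: the "nodes" are simulated by a Python list of data chunks, each node reports per-cluster sums and counts (one common way to aggregate local results, not necessarily what oneDAL does internally), and the names local_step and distributed_kmeans are hypothetical.

```python
import numpy as np

def local_step(chunk, centroids):
    """Work done on one node: per-cluster sums and counts of its local points."""
    k, dim = centroids.shape
    sq_dists = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for j in range(k):
        members = chunk[labels == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)
    return sums, counts

def distributed_kmeans(chunks, centroids, max_iterations=20, tol=1e-9):
    """Simulate the nodes in a loop; a real run would exchange results via MPI."""
    for _ in range(max_iterations):
        # Each simulated node processes only its own chunk
        partials = [local_step(chunk, centroids) for chunk in chunks]
        # Aggregate the local results into global sums and counts
        total_sums = sum(p[0] for p in partials)
        total_counts = sum(p[1] for p in partials)
        new_centroids = centroids.copy()
        nonempty = total_counts > 0
        new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty][:, None]
        # Stop once the centroids no longer move noticeably
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break
    return centroids

# Toy data: two groups of four 2-D points, interleaved across two simulated nodes
data = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0],
                 [10.0, 10.0], [10.0, 12.0], [12.0, 10.0], [12.0, 12.0]])
chunks = [data[0::2], data[1::2]]
initial = np.array([[0.0, 0.0], [12.0, 12.0]])

final = distributed_kmeans(chunks, initial)
print(final)  # rows (1, 1) and (11, 11)
```

Because the sums and counts add up across chunks, the result matches what a single node would compute on the full dataset, which is the property that makes this kind of partitioning safe.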

About the Code Sample

The distributed k-Means code sample demonstrates how to train a distributed k-Means model and make predictions using daal4py, the Python API to oneDAL, on Intel CPUs such as Intel® Core™ processors, Intel® Xeon® processors, Intel® Xeon® Scalable processors, and Intel Atom® processors. It uses the Intel® MPI Library for efficient message passing between the nodes in a distributed environment. Through the code sample, you will learn to use the daal4py package directly and to integrate oneDAL into your development framework. The Intel® Distribution for Python included in the AI Kit delivers high-performance Python packages such as Intel® Extension for Scikit-learn*. Such optimized Python modules accelerate the training and prediction phases of the distributed k-Means model.
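
In an MPI-style run, each process typically works on its own share of the rows. The helper below is a hypothetical sketch of one way to split rows evenly; rows_for_process is not part of daal4py or the code sample. In daal4py, the process id and process count that such a helper needs are available from d4p.my_procid() and d4p.num_procs() once the distribution engine has been initialized.

```python
def rows_for_process(n_rows, n_procs, proc_id):
    """Return the half-open row range [start, stop) owned by process proc_id."""
    base, extra = divmod(n_rows, n_procs)
    # The first `extra` processes take one additional row each
    start = proc_id * base + min(proc_id, extra)
    stop = start + base + (1 if proc_id < extra else 0)
    return start, stop

# Splitting 10 rows across 3 processes
ranges = [rows_for_process(10, 3, p) for p in range(3)]
print(ranges)  # [(0, 4), (4, 7), (7, 10)]
```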

The code sample uses the kmeans_init and kmeans classes of daal4py for centroid initialization and cluster computation, respectively. Here is what the k-Means implementation using daal4py looks like:

 

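import daal4py as d4p

# X holds this process's share of the data (for example, a NumPy array).
# distributed=True assumes d4p.daalinit() was called earlier to start the distribution engine.

# compute the initial centroids
init_result = d4p.kmeans_init(nClusters=3, method="plusPlusDense", distributed=True).compute(X)

# compute the clusters/centroids, starting from the initial centroids
kmeans_result = d4p.kmeans(nClusters=3, maxIterations=5, assignFlag=True).compute(X, init_result.centroids)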
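
The method name "plusPlusDense" refers to k-means++ style seeding on dense data. As a hedged, single-machine sketch of the seeding idea only (not the oneDAL code path), the NumPy function below picks each new centroid with probability proportional to its squared distance from the centroids chosen so far. The function name kmeans_plus_plus and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_plus_plus(points, n_clusters, seed=0):
    """Pick initial centroids with the k-means++ seeding rule."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [points[rng.integers(len(points))]]
    for _ in range(n_clusters - 1):
        # Squared distance from each point to its closest chosen centroid
        diffs = points[:, None, :] - np.array(centroids)[None, :, :]
        sq_dists = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to that distance
        probs = sq_dists / sq_dists.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)

# Toy data: six distinct 2-D points in three loose groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0],
                   [10.0, 11.0], [20.0, 0.0], [21.0, 0.0]])
init = kmeans_plus_plus(points, n_clusters=3)
print(init.shape)  # (3, 2)
```

Already chosen points have zero distance to themselves and so zero probability of being picked again, which spreads the initial centroids across the data.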

 

 

Sample Output

An example output of executing the code on an Intel® CPU looks like:

Here are our centroids:

 [[ 5.46000000e+02 -3.26170648e+00 -6.15922494e+00]
 [ 1.80000000e+01 -1.00432059e+01 -8.38198798e+00]
 [ 4.10000000e+02  3.78330964e-01  8.29073839e+00]]

Here are our centroids loaded from file:

 [[ 5.46000000e+02 -3.26170648e+00 -6.15922494e+00]
 [ 1.80000000e+01 -1.00432059e+01 -8.38198798e+00]
 [ 4.10000000e+02  3.78330964e-01  8.29073839e+00]]
Here are our cluster assignments for the first 5 datapoints:

 [[1]
 [1]
 [1]
 [1]
 [1]]
[CODE_SAMPLE_COMPLETED_SUCCESFULLY]

 

What’s Next?

We encourage you to implement the distributed k-Means code sample today! Get started with oneDAL and develop high-performance optimized AI/ML solutions on Intel® architectures. Check out other oneAPI code samples for AI and analytics on GitHub. Also, explore other AI, HPC, and Rendering tools in Intel’s oneAPI-powered software portfolio.

Useful Resources

Get the Software

You can download the Intel® oneAPI Data Analytics Library (oneDAL) as part of the Intel® oneAPI Base Toolkit, and daal4py as part of the Intel® AI Analytics Toolkit.

Both the Intel AI Analytics Toolkit and Intel oneAPI Base Toolkit are available for free! Moreover, Intel® Developer Cloud provides a cloud compute platform where you can build and test optimized workloads using these toolkits and several other Intel oneAPI tools on Intel’s latest hardware.
