Purpose

This assessment task requires students to apply their skills in data clustering and dimensionality reduction. Students must demonstrate their ability in data representation and their competency in applying suitable clustering and dimensionality reduction techniques in a real-world scenario.

Instructions

Students should insert Python code or text responses into the cell that follows each question in the supplied ipynb (Jupyter Notebook) file. For answers that require discussion or explanation, a maximum of five sentences is suggested. Rename the Jupyter notebook file by appending your student ID. For example, for student ID 1234, the submitted file name should be A2_1234.ipynb. Insert your student ID and name in the appropriate cell inside that file.

Part-1: Clustering (15 marks)

The dataset and the ipynb files are provided in zip format (available via the ‘Assessment 2 – T2 2020 - Dataset’ link) in the assessment section (Assessment->Assessment 2) of the unit site.

1. Download the attached clustering.csv file. Read the file and separate the class and feature matrix. (2 marks)
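As a minimal sketch of this step, the file can be read with pandas and split into a feature matrix and a class vector. The column name `class` and the inline sample are assumptions made for illustration only; check the actual header of clustering.csv before reusing this.

```python
import io

import pandas as pd

# Hypothetical stand-in for clustering.csv; the real column names may differ.
sample_csv = io.StringIO(
    "f1,f2,class\n"
    "1.0,2.0,0\n"
    "1.2,1.9,0\n"
    "8.0,9.1,1\n"
    "7.9,9.0,1\n"
)

# For the real data, replace sample_csv with the path "clustering.csv".
df = pd.read_csv(sample_csv)

label_column = "class"  # assumption: name of the column holding the class
y = df[label_column].to_numpy()                 # class vector, shape (n_samples,)
X = df.drop(columns=[label_column]).to_numpy()  # feature matrix, shape (n_samples, n_features)

print(X.shape, y.shape)
```

Keeping the class vector separate matters here because clustering must be fitted on the features only; the classes are used afterwards for evaluation.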

2. Determine the number of clusters from the dataset. Is this the same as the actual number of classes in the dataset? (1 mark)
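One common way to estimate the number of clusters is to compare silhouette scores over a range of candidate values of k. The task does not prescribe a method, so the sketch below is only one option (the elbow method is another), and it uses synthetic data from scikit-learn's `make_blobs` as a stand-in for the real feature matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three well-separated groups.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit K-Means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The candidate with the highest silhouette score is the estimate.
best_k = max(scores, key=scores.get)
print(best_k)
```

The estimate can then be compared with the number of distinct values in the class vector to answer the second half of the question.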

3. Perform K-Means clustering on the complete dataset and report the purity score. (2 marks)
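Purity is usually defined as the fraction of samples that belong to the majority class of their assigned cluster. The sketch below implements that definition in plain Python; the function name `purity_score` and the toy labels are illustrative and not part of the supplied notebook.

```python
from collections import Counter, defaultdict


def purity_score(true_labels, cluster_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("label sequences must not be empty")

    # Count how often each true class occurs inside each cluster.
    counts = defaultdict(Counter)
    for true, cluster in zip(true_labels, cluster_labels):
        counts[cluster][true] += 1

    # Sum the size of the majority class in every cluster, then normalise.
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(true_labels)


# Toy example: cluster ids are arbitrary and need not match the class ids.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]
print(purity_score(y_true, y_pred))
```

Because purity only looks at the majority class per cluster, it does not depend on how the cluster ids are numbered, which is why it suits K-Means output.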

4. There are several distance metrics for K-Means, such as Euclidean, Squared Euclidean, Manhattan, Chebyshev, and Minkowski. [Hint: See the pyclustering library for Python.]

Your job is to compare the purity scores of K-Means clustering under the different distance metrics. (5 marks)

Select the best distance metric and explain why it is the best for the given dataset. (2 marks)
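For reference, the five metrics named above can be written out as plain Python functions, as in the sketch below. This only illustrates the definitions; it does not use the pyclustering API, and the Minkowski degree `p=3` is an arbitrary choice made for the example.

```python
def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def squared_euclidean(a, b):
    """Euclidean distance without the square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p=3):
    """Generalised form: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
for metric in (euclidean, squared_euclidean, manhattan, chebyshev, minkowski):
    print(metric.__name__, metric(a, b))
```

Comparing these on the same pair of points shows how each metric weighs large single-coordinate differences, which is a useful starting point when explaining why one metric yields a higher purity on a given dataset.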

• Posted on: September 5th, 2018

