Documentation

Getting started
Installation

The project consists of two major modules for clustering data into groups of similar objects: (1) an implementation of the DBSCAN algorithm for density-based clustering, and (2) a k-nearest neighbor procedure that determines the optimal DBSCAN parameters for a given dataset.

This thesis builds on both concepts and improves the clustering results by addressing a limitation of the original algorithm: assigning accurate cluster labels to "Border" objects within datasets.
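To make the "Border" object issue concrete, the following is a minimal pure-Python sketch of textbook DBSCAN, not the project's C++ implementation. The function names (`region_query`, `dbscan`) and the sample points are illustrative assumptions. A border object is a non-core point that lies within eps of a core point; in the classic algorithm it is labelled by whichever cluster reaches it first, which is the kind of ambiguity this work targets.

```python
import math

NOISE = -1
UNVISITED = None


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN. Returns (labels, roles) for each point.

    labels: cluster id per point, or -1 for noise.
    roles:  'core', 'border', or 'noise' per point.
    min_pts counts the point itself, as in the original DBSCAN definition.
    """
    n = len(points)
    neighbors = [region_query(points, i, eps) for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [UNVISITED] * n
    cluster = 0
    for i in range(n):
        # Only an unlabelled core point can start a new cluster.
        if labels[i] is not UNVISITED or not is_core[i]:
            continue
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] is UNVISITED:
                    # First cluster to reach q claims it (order-dependent for borders).
                    labels[q] = cluster
                    if is_core[q]:
                        frontier.append(q)  # only core points expand the cluster
        cluster += 1
    labels = [NOISE if lab is UNVISITED else lab for lab in labels]
    roles = [
        "core" if is_core[i] else ("noise" if labels[i] == NOISE else "border")
        for i in range(n)
    ]
    return labels, roles


# Illustrative data: a dense square, one point hanging off its edge, one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels, roles = dbscan(points, eps=1.1, min_pts=3)
for point, label, role in zip(points, labels, roles):
    print(point, label, role)
```

In this sample, the point (2, 1) has only one neighbor besides itself, so it is not a core point, but it sits within eps of the core point (1, 1) and is therefore a border object.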

The current implementation is a hybrid of Python and C++. The preprocessing stage uses Python libraries from the scikit-learn machine learning package. The clustering itself is implemented in C++. After the preprocessing phase, Python's subprocess module is used to launch the clustering operation.
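The hand-off from the Python stage to the compiled clustering stage could look roughly like the sketch below. The executable name `./cluster`, the argument order, and the helper names are hypothetical; the actual interface of the C++ program is not described here. So that the example runs without the compiled binary, the demonstration call uses the Python interpreter as a stand-in process.

```python
import subprocess
import sys


def build_cluster_command(binary, preprocessed_path, eps, min_pts):
    """Assemble the argument list for the clustering executable.

    The positional layout (path, eps, min_pts) is an assumption for illustration.
    """
    return [binary, preprocessed_path, str(eps), str(min_pts)]


def run_clustering(command):
    """Run an external command and return its standard output as text.

    check=True raises CalledProcessError if the process exits with a
    non-zero status, so a failed clustering run is not silently ignored.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical invocation of the compiled C++ program (not executed here):
command = build_cluster_command("./cluster", "preprocessed.csv", 0.5, 4)
print(command)

# Stand-in process so the sketch is runnable anywhere Python is installed.
print(run_clustering([sys.executable, "-c", "print('clustering finished')"]))
```

Passing the command as a list rather than a single shell string avoids quoting problems when the dataset path contains spaces.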

Setting up environment

Users must ensure their local environment has Python and a C++ compiler. To set up Python, refer to the next section, "Third-party packages", which describes a Python development environment based on Anaconda. For a C++ compiler on a Linux machine, install the GNU GCC compiler by running the following commands in a terminal window (they use apt-get, so they assume a Debian-based distribution such as Ubuntu).

sudo apt-get update
sudo apt-get install gcc
sudo apt-get install g++
sudo apt-get install build-essential

Third-party packages

This project requires several third-party packages. The scikit-learn library is used for the k-nearest neighbor computations. The scipy library is used to interpolate B-spline curves during the epsilon estimation phase. Other common packages such as numpy and pandas are also imported.
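The quantity behind the k-nearest neighbor step is the distance from each point to its k-th nearest neighbor, usually inspected as a sorted curve. The sketch below computes that curve in pure Python as a stand-in for what the project obtains from scikit-learn. The function name `k_distances` is an assumption, and the sketch does not reproduce the B-spline interpolation or the project's rule for picking epsilon from the curve.

```python
import math


def k_distances(points, k):
    """Return the ascending list of each point's distance to its k-th nearest neighbor.

    The point itself is excluded, so k=1 means the closest other point.
    This brute-force version is O(n^2 log n) and is meant only for illustration.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Three evenly spaced points and one far-away point.
sample = [(0, 0), (1, 0), (2, 0), (10, 0)]
curve = k_distances(sample, k=1)
print(curve)
```

Points inside dense regions produce small values, while isolated points produce a sharp rise at the end of the curve; a common heuristic is to choose epsilon near that rise.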

For ease of setup, users are recommended to work with the Anaconda distribution. Run the following commands to build a conda environment with all the packages this project needs.

  • Create conda environment
    conda create --name autoscan python=3.9
    conda activate autoscan
  • Install all the necessary packages.
    conda install scikit-learn scipy pandas
  • Install optional packages for visualization
    conda install matplotlib

Parameters

The following arguments are passed by the user on the command line when the program is executed. Details on every parameter are also available by running python main.py --help.

  • Dataset (Required)
    --data

    Path to the dataset file passed into the system for clustering. The file must be in .csv (comma-separated values) format with shape (n x m),
    where n is the number of data samples and m is the number of features (dimension size).

  • Output (Optional)
    --output

    Name for the final output of the clustering. The final output document is written to the {root}/output/[name] directory.
    If this option is omitted, an output file is still written to the output directory, with its filename set to the current date and time.
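As a rough illustration of how these two options might be declared, the sketch below uses Python's argparse module. It is not the project's main.py: the helper names and the exact timestamp format used for the default output name are assumptions, since only "current date and time" is specified above.

```python
import argparse
from datetime import datetime


def parse_args(argv=None):
    """Parse the --data and --output options described above."""
    parser = argparse.ArgumentParser(description="Cluster a CSV dataset.")
    parser.add_argument(
        "--data",
        required=True,
        help="Path to the input .csv file of shape (n x m).",
    )
    parser.add_argument(
        "--output",
        default=None,
        help="Name for the clustering output; defaults to the current date and time.",
    )
    return parser.parse_args(argv)


def resolve_output_name(name, now=None):
    """Return the user-supplied name, or a timestamp-based name if none was given.

    The format string is an assumed convention, chosen to be filesystem-safe.
    """
    if name:
        return name
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d_%H-%M-%S")


# Example: simulate `python main.py --data samples.csv` without --output.
args = parse_args(["--data", "samples.csv"])
print(args.data, resolve_output_name(args.output))
```

Because --data is declared with required=True, argparse prints a usage message and exits when it is missing, and the --help text is generated from the same declarations.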