Documentation
Getting started
Installation
The project consists of two major modules for clustering data into groups of similar objects: (1) an implementation of the DBSCAN algorithm for density-based clustering, and (2) a k-nearest neighbor procedure that determines the optimal DBSCAN parameters for a given dataset.
This thesis builds on these two concepts and improves the clustering results by addressing a limitation of the original algorithm: identifying accurate cluster labels for "Border" objects within datasets.
The current implementation is a hybrid of Python and C++. The preprocessing stage uses Python libraries from the scikit-learn machine learning package. The clustering itself is implemented in C++. After the preprocessing phase, the subprocess module is used to launch the clustering operation.
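The hand-off from the Python preprocessing stage to the C++ clustering stage could look roughly like the sketch below. The helper name `run_clustering`, the executable path `./dbscan`, and the argument order (data path, eps, min_pts) are assumptions made for illustration; the project's actual interface may differ.

```python
import subprocess


def run_clustering(command, data_path, eps, min_pts):
    """Run an external clustering executable and return its standard output.

    `command` is the list of leading tokens that starts the program,
    for example ["./dbscan"] (a hypothetical binary name).
    """
    cmd = list(command) + [str(data_path), str(eps), str(min_pts)]
    # check=True raises CalledProcessError if the executable exits non-zero.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical usage once the C++ program has been compiled:
# labels_text = run_clustering(["./dbscan"], "data.csv", eps=0.5, min_pts=4)
```

Passing the command as a list rather than a single shell string avoids quoting problems when the data path contains spaces.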
- Repository
wget https://bigdata.dongguk.edu/autoscan/autoscan.zip
unzip autoscan.zip
cd autoscan
python main.py --data [data directory]
Setting up environment
Users must ensure their local environment is set up with Python and a C++ compiler. To set up the Python environment, refer to the next section, "Third-party packages", which provides a guide to setting up a Python development environment through Anaconda. To set up a C++ compiler on a Linux machine, install the GNU GCC compiler by running the following commands in a terminal window.
sudo apt-get update
sudo apt-get install gcc
sudo apt-get install g++
sudo apt-get install build-essential
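After installation, a quick way to confirm that the compilers are visible on the PATH is a small check like the one below. This is only a convenience sketch; the helper name `find_compilers` is hypothetical and not part of the project.

```python
import shutil


def find_compilers(names=("gcc", "g++")):
    """Map each compiler name to its location on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_compilers().items():
        print(f"{name}: {path or 'not found'}")
```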
Third-party packages
This project requires several third-party packages. The scikit-learn library is used for the k-nearest neighbor computations. The scipy library is used to interpolate B-spline curves during the epsilon estimation phase. Other common packages such as numpy and pandas are also imported.
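To illustrate how these libraries can fit together, the sketch below computes a sorted k-distance curve with scikit-learn, smooths it with a scipy B-spline, and takes the point of maximum curvature as an epsilon estimate. The maximum-curvature rule, the `smoothing` value, and the function names are assumptions chosen for illustration; the project's own estimation procedure may differ.

```python
import numpy as np
from scipy.interpolate import splev, splrep
from sklearn.neighbors import NearestNeighbors


def k_distance_curve(X, k=4):
    """Return the sorted distance from each sample to its k-th nearest neighbor."""
    # k + 1 neighbors because each sample is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


def estimate_eps(k_dist, smoothing=1e-3):
    """Pick eps where a cubic B-spline fitted to the curve bends most sharply."""
    k_dist = np.asarray(k_dist, dtype=float)
    span = k_dist[-1] - k_dist[0]
    if span == 0:
        # Flat curve: every sample has the same k-distance.
        return float(k_dist[0])
    # Normalize both axes to [0, 1] so curvature does not depend on units.
    x = np.linspace(0.0, 1.0, len(k_dist))
    y = (k_dist - k_dist[0]) / span
    # s sets the smoothing strength; this per-sample scaling is an assumption.
    tck = splrep(x, y, k=3, s=smoothing * len(x))
    d1 = splev(x, tck, der=1)
    d2 = splev(x, tck, der=2)
    curvature = np.abs(d2) / (1.0 + d1 ** 2) ** 1.5
    return float(k_dist[int(np.argmax(curvature))])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
    eps = estimate_eps(k_distance_curve(X, k=4))
    print(f"estimated eps: {eps:.3f}")
```

Smoothing before locating the bend matters because the raw k-distance curve is a step-like sequence, and its numerical second derivative is dominated by noise.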
For ease of setup, it is recommended that users work with the Anaconda distribution. Run the following commands to build a conda environment with all the packages required for this project.
- Create conda environment
conda create --name autoscan python=3.9
conda activate autoscan
conda install scikit-learn scipy pandas
conda install matplotlib
Parameters
The following arguments are passed by the user on the command line when the program is executed. Information about all parameters can also be found by running python main.py --help on the command line.
- Dataset (Required)
--data
Directory path for the dataset passed into the system for clustering. The file must be in .csv (comma-separated values) format with shape (n x m), where n is the number of data samples and m is the number of features (dimension size).
- Output (Optional)
--output
Name for the final output of the clustering. The final output document will be found in the {root}/output/[name] directory.
If --output is not given, a final output file will still be written to the output directory, with its filename set to the current date and time.
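The two arguments described above could be declared with argparse along the lines of the sketch below. The helper names `build_parser` and `resolve_output_path` and the timestamp format are assumptions; the source only states that the default filename is the current date and time.

```python
import argparse
from datetime import datetime
from pathlib import Path


def build_parser():
    """Declare the --data and --output arguments described above."""
    parser = argparse.ArgumentParser(description="Cluster a CSV dataset.")
    parser.add_argument("--data", required=True,
                        help="Path to the input .csv dataset of shape (n x m).")
    parser.add_argument("--output", default=None,
                        help="Name of the output written under output/.")
    return parser


def resolve_output_path(name, root=Path("."), now=None):
    """Return {root}/output/[name], using a timestamp when no name is given."""
    if name is None:
        # Assumed format; any unambiguous date-and-time string would do.
        name = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(root) / "output" / name


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"reading {args.data}, writing {resolve_output_path(args.output)}")
```

Marking --data as required makes argparse print a usage message and exit when the dataset path is omitted, which matches the "Required" label above.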