AutoSCAN - Automatic Detection of DBSCAN Parameters and Efficient Clustering

Samples

Input data

The algorithm takes comma-separated values (.csv) files of shape (n x m) as input, where n is the number of instances in the dataset and m is the number of features. Datasets that store records column-wise need to be transposed before being run through this algorithm. The input data is also assumed to have a header in the first row, so the first row is skipped when the .csv file is read.
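As a rough illustration of the expected layout, the sketch below reads a small in-memory CSV, skips the header row, and reports the (n x m) shape. It is not AutoSCAN's own loader: `load_instances` and `transpose` are hypothetical helper names, and the sample values are made up for the example.

```python
import csv
import io


def load_instances(source, skip_header=True):
    """Read numeric CSV rows into a list of lists, one inner list per instance."""
    reader = csv.reader(source)
    if skip_header:
        next(reader, None)  # drop the header row, as described above
    rows = [[float(value) for value in row] for row in reader if row]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("rows have inconsistent feature counts")
    return rows


def transpose(rows):
    """Turn column-wise records into row-wise instances (hypothetical helper)."""
    return [list(column) for column in zip(*rows)]


# Made-up sample: 2 instances, 3 features, header in the first row.
sample = io.StringIO("x,y,z\n1,2,3\n4,5,6\n")
data = load_instances(sample)
n, m = len(data), len(data[0])
print(f"shape: ({n} x {m})")
```

If a dataset stores one record per column, passing the loaded rows through `transpose` and writing them back out would give the row-wise layout described above.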

Output data

After the clustering operation completes, a summary of the execution is reported on the CLI: the number of clusters detected, the number of outliers, and the total execution time. The summary also gives the location of the final output file. To examine the clustering results, head over to the /output/ directory, where the final .csv file (<output-file>) contains the original dataset with one additional feature (column) holding the newly computed cluster labels. For an input file of shape (n x m), the output file therefore has shape (n x (m+1)).
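The output file can be inspected with a few lines of Python. The sketch below counts clusters and outliers from the last column of a small in-memory sample. It rests on assumptions not stated in this document: that the label is the last column, that the output keeps a header row, and that outliers are written as `-1` (a common DBSCAN convention). `summarize_labels` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

NOISE_LABEL = "-1"  # assumption: outliers are labelled -1 in the output file


def summarize_labels(source, has_header=True):
    """Return (number of clusters, number of outliers) from the last CSV column."""
    reader = csv.reader(source)
    if has_header:
        next(reader, None)  # assumption: the output file has a header row
    counts = Counter(row[-1] for row in reader if row)
    outliers = counts.pop(NOISE_LABEL, 0)
    return len(counts), outliers


# Made-up sample output: two features plus the appended label column.
sample = io.StringIO(
    "x,y,cluster\n"
    "0.1,0.2,0\n"
    "0.2,0.1,0\n"
    "5.0,5.1,1\n"
    "9.9,0.0,-1\n"
)
n_clusters, n_outliers = summarize_labels(sample)
print(f"clusters: {n_clusters}, outliers: {n_outliers}")
```

To run it on a real result, replace the `io.StringIO` sample with an opened file handle, and adjust `NOISE_LABEL` if the labels are written differently.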

Sample Datasets Collection

Fundamental Clustering Problems Suite (FCPS)

The Fundamental Clustering Problems Suite (FCPS) offers a variety of clustering problems that any algorithm should be able to handle when facing real-world data. FCPS serves as an elementary benchmark for clustering algorithms.

FCPS consists of data sets with known a priori classifications that are to be reproduced by the algorithm. All data sets are intentionally kept simple and can be visualized in two or three dimensions. Each data set represents a certain problem that known clustering algorithms solve with varying success, which reveals the benefits and shortcomings of the algorithms in question. Standard clustering methods, e.g. single-linkage, Ward, and k-means, are not able to solve all FCPS problems satisfactorily.

The FCPS datasets can be downloaded from here, and further discussion of the dataset collection is available here.

Name	Instances	Dimensions	Properties
Atom	800	3	Different inner class distances
Chainlink	1000	3	Not linearly separable
GolfBall	4002	3	No cluster structure
Hepta	212	3	Different inner class variances
LSun	400	2	-
Target	770	2	Presence of outliers
Tetra	400	3	Small inter class distances
TwoDiamonds	800	2	Touching classes
WingNut	1016	2	Density variation within classes

Example run

cd autoscan 
python main.py --data <dataset-directory> --output <output-file-name>

After the algorithm finishes, navigate to /autoscan/output/<output-file-name>.csv and open the output file. For instructions on how to download and set up the local Linux environment, read the documentation.
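To run several sample datasets in a row, the command above can be wrapped in a small script. The following is a sketch only: `build_command` and `run_all` are hypothetical helpers, the `--data` and `--output` flags are taken from the example run, and it assumes that `--data` accepts the path to a single .csv file and that the script is started from inside the autoscan directory.

```python
import subprocess
from pathlib import Path


def build_command(dataset_path, output_name):
    """Build the example-run command for one dataset (flags as in the example run)."""
    return ["python", "main.py", "--data", str(dataset_path), "--output", output_name]


def run_all(dataset_dir, dry_run=True):
    """Build one command per .csv file in dataset_dir; execute them unless dry_run."""
    commands = [
        build_command(path, path.stem)  # output file named after the dataset
        for path in sorted(Path(dataset_dir).glob("*.csv"))
    ]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # stops at the first failing run
    return commands


# Example: show the command for a hypothetical file path without running it.
print(" ".join(build_command("data/Atom.csv", "Atom")))
```

With `dry_run=True` nothing is executed, which makes it easy to check the generated commands before starting a long batch.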