Samples

Input data

The algorithm takes comma-separated values (.csv) files of size (n x m) as input, where n is the number of instances in the dataset and m is the number of features. Datasets that store records column-wise must be transposed before being passed to the algorithm. The input is also assumed to have a header row, so the first row is skipped when the .csv file is read.
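The input convention above can be sketched with the Python standard library. This is a minimal illustration, not the project's actual loader: the function names `load_instances` and `transpose` and the sample values are hypothetical, and all features are assumed to be numeric.

```python
import csv
import io


def load_instances(source):
    """Read a headered CSV and return the data as an n x m list of floats.

    Assumption: every value after the header row is numeric.
    """
    reader = csv.reader(source)
    next(reader)  # the first row is treated as a header and skipped
    return [[float(value) for value in row] for row in reader if row]


def transpose(rows):
    """Turn column-wise records (m x n) into row-wise instances (n x m)."""
    return [list(column) for column in zip(*rows)]


# Hypothetical sample: 2 instances, 3 features, one header row.
sample = io.StringIO("x,y,z\n0.1,0.2,0.3\n1.1,1.2,1.3\n")
data = load_instances(sample)
print(len(data), len(data[0]))  # n and m

# A dataset stored column-wise (3 features x 2 instances) is flipped first.
column_wise = [[0.1, 1.1], [0.2, 1.2], [0.3, 1.3]]
row_wise = transpose(column_wise)
```

Here `row_wise` has the same (n x m) layout as `data`, which is the shape the algorithm expects.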

Output data

After the clustering operation completes, a summary of the execution is reported on the CLI: the number of clusters detected, the number of outliers, and the total execution time. The summary also gives the directory of the final output file. To examine the clustering results, go to the /output/<output-file/> directory, which contains a final .csv file holding the original dataset with one additional feature (column) for the newly computed cluster labels. For an input file of shape (n x m), the output file therefore has shape (n x (m+1)).
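As a rough sketch of how the output file could be inspected, the snippet below counts clusters and outliers from the added label column. Several details are assumptions rather than documented behavior: that the output keeps a header row, that the cluster label is the last column, and that outliers carry the label `-1` (a common convention in density-based clustering). The function name `summarize_output` and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_output(source, outlier_label="-1"):
    """Summarize a clustering output CSV of shape n x (m + 1).

    Assumptions: the file has a header row, the last column holds the
    cluster label, and `outlier_label` marks outliers.
    """
    reader = csv.reader(source)
    header = next(reader)
    rows = [row for row in reader if row]
    counts = Counter(row[-1] for row in rows)
    outliers = counts.pop(outlier_label, 0)  # outliers are not a cluster
    return {
        "instances": len(rows),
        "columns": len(header),
        "clusters": len(counts),
        "outliers": outliers,
    }


# Hypothetical output for a 4 x 2 input: one extra label column was added.
sample = io.StringIO("x,y,cluster\n0.1,0.2,0\n0.2,0.1,0\n5.0,5.1,1\n9.9,0.0,-1\n")
summary = summarize_output(sample)
print(summary)
```

If the real output uses a different outlier marker, pass it through `outlier_label` instead of relying on the default.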

Sample Datasets Collection
Fundamental Clustering and Projection Suite (FCPS)

The Fundamental Clustering Problems Suite (FCPS) offers a variety of clustering problems that any algorithm should be able to handle when facing real-world data. FCPS serves as an elementary benchmark for clustering algorithms.

FCPS consists of data sets with known a priori classifications that are to be reproduced by the algorithm. All data sets are intentionally simple and can be visualized in two or three dimensions. Each data set represents a certain problem that known clustering algorithms solve with varying success, which reveals the benefits and shortcomings of the algorithms in question. Standard clustering methods, e.g. single-linkage, Ward, and k-means, are not able to solve all FCPS problems satisfactorily.
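One way to check how well computed labels reproduce a known classification is the Rand index, which is insensitive to how the clusters happen to be numbered. The sketch below is only an illustration of that idea; the source does not say which measure, if any, the project uses, and the function name `rand_index` and the example labels are hypothetical.

```python
from itertools import combinations


def rand_index(true_labels, predicted_labels):
    """Fraction of instance pairs on which two labelings agree.

    A pair agrees when both labelings put the two instances in the same
    cluster, or both put them in different clusters.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    agreements = 0
    total = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        agreements += same_true == same_pred
        total += 1
    return agreements / total if total else 1.0


# Identical partitions with swapped label names still score 1.0.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Moving one instance into the wrong cluster lowers the score.
partial = rand_index([0, 0, 1, 1], [0, 1, 1, 1])
print(perfect, partial)
```

The pairwise loop is quadratic in the number of instances, which is acceptable for data sets of the size listed in this section.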

The FCPS datasets can be downloaded from here; further discussion of the dataset collection is available here.

Name         Instances  Dimensions  Properties
Atom         800        3           Different inner class distances
Chainlink    1000       3           Linearly not separable
GolfBall     4002       3           No cluster structure
Hepta        212        3           Different inner class variances
LSun         400        2           -
Target       770        2           Presence of outliers
Tetra        400        3           Small inter class distances
TwoDiamonds  800        2           Touching classes
WingNut      1016       2           Density variation within classes
  • Example run:

    cd autoscan
    python main.py --data <dataset-directory> --output <output-file-name>

After the algorithm finishes, navigate to /autoscan/output/<output-file-name>.csv and open the output file. For instructions on how to download and set up the local Linux environment, read the documentation.