Samples
Input data
The algorithm takes comma-separated values (.csv) files as input, of size (n x m). Here, n is the number of instances in the dataset and m is the number of features.
Datasets that store records column-wise need to be transposed before being run through the algorithm. The input data is also assumed to have a header on the first row, so the first row is skipped when the .csv file is read.
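The input conventions above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the project's actual loader: the function names `read_instances` and `transpose` are hypothetical, and the sketch assumes every value after the header row is numeric.

```python
import csv
import io


def read_instances(csv_file):
    """Read numeric rows from an open CSV stream, skipping the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # the first row is assumed to be a header and is skipped
    return [[float(value) for value in row] for row in reader if row]


def transpose(rows):
    """Turn column-wise records into row-wise instances."""
    return [list(column) for column in zip(*rows)]


# In-memory stand-in for a .csv file: 3 instances, 2 features (made-up values).
sample = io.StringIO("x,y\n1.0,2.0\n3.0,4.0\n5.0,6.0\n")
data = read_instances(sample)
n, m = len(data), len(data[0])
print(n, m)  # 3 2

# A dataset stored column-wise (2 rows, 3 records) becomes 3 instances x 2 features.
print(transpose([[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]))
```

For a file on disk, the same function can be called on `open(path, newline="")`.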
Output data
After the clustering operation completes, a summary of the execution is reported on the CLI. It shows the number of clusters detected, the number of outliers, and the total execution time. The location of the final output file is also given in the summary.
To examine the clustering results, go to the /output/<output-file/> directory, which contains a final .csv file with the original dataset plus one additional feature (column) holding the newly computed cluster labels.
As such, for an input file of shape (n x m), the output file has shape (n x (m+1)).
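As a rough way to inspect such an output file, the sketch below reads the last column as the cluster label and counts clusters and outliers. Several things here are assumptions rather than documented behaviour: that the output file keeps a header row, that the label is the last column, and that outliers carry the label `-1` (the `OUTLIER_LABEL` constant is hypothetical and should be adjusted to whatever the tool actually writes). The function name `summarize_output` is also only illustrative.

```python
import csv
import io
from collections import Counter

# Assumption: outliers are marked with this label; change it to match the real output.
OUTLIER_LABEL = "-1"


def summarize_output(csv_file):
    """Summarize an output CSV whose last column holds the cluster labels."""
    reader = csv.reader(csv_file)
    header = next(reader)  # assumed header row
    labels = [row[-1] for row in reader if row]
    counts = Counter(labels)
    outliers = counts.pop(OUTLIER_LABEL, 0)  # remove outliers from the cluster tally
    return {
        "columns": len(header),
        "instances": len(labels),
        "clusters": len(counts),
        "outliers": outliers,
    }


# In-memory stand-in for an output file: m = 2 features plus 1 label column.
sample = io.StringIO(
    "x,y,cluster\n"
    "1.0,2.0,0\n"
    "1.1,2.1,0\n"
    "9.0,9.0,1\n"
    "50.0,50.0,-1\n"
)
summary = summarize_output(sample)
print(summary)
# {'columns': 3, 'instances': 4, 'clusters': 2, 'outliers': 1}
```

The `columns` value is m+1 for an input with m features, which gives a quick check that the label column was appended.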
Sample Datasets Collection
Fundamental Clustering and Projection Suite (FCPS)
The Fundamental Clustering Problems Suite (FCPS) offers a variety of clustering problems that any algorithm should be able to handle when facing real-world data. FCPS serves as an elementary benchmark for clustering algorithms.
FCPS consists of data sets with known a priori classifications that are to be reproduced by the algorithm. All data sets are intentionally kept simple and can be visualized in two or three dimensions. Each data set represents a certain problem that known clustering algorithms solve with varying success, which reveals the benefits and shortcomings of the algorithms in question. Standard clustering methods, e.g. single-linkage, Ward, and k-means, are not able to solve all FCPS problems satisfactorily.
The FCPS datasets can be downloaded from here, and further discussion of the dataset collection is available here.
Name | Instances | Dimensions | Properties |
Atom | 800 | 3 | Different inner class distances |
Chainlink | 1000 | 3 | Not linearly separable |
GolfBall | 4002 | 3 | No cluster structure |
Hepta | 212 | 3 | Different inner class variances |
LSun | 400 | 2 | - |
Target | 770 | 2 | Presence of outliers |
Tetra | 400 | 3 | Small inter class distances |
TwoDiamonds | 800 | 2 | Touching classes |
WingNut | 1016 | 2 | Density variation within classes |
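Before running the algorithm on one of these datasets, it can be useful to confirm that the file has the shape listed in the table. The sketch below is one way to do that, under the assumptions that the dataset has been saved as a .csv with a header row and contains only the feature columns (no class labels). The names `FCPS_SHAPES` and `matches_table` are hypothetical.

```python
import csv
import io

# (instances, dimensions) copied from the table above.
FCPS_SHAPES = {
    "Atom": (800, 3),
    "Chainlink": (1000, 3),
    "GolfBall": (4002, 3),
    "Hepta": (212, 3),
    "LSun": (400, 2),
    "Target": (770, 2),
    "Tetra": (400, 3),
    "TwoDiamonds": (800, 2),
    "WingNut": (1016, 2),
}


def matches_table(name, csv_file):
    """Return True if the CSV stream has the shape listed for the dataset."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the assumed header row
    rows = [row for row in reader if row]
    shape = (len(rows), len(rows[0]) if rows else 0)
    return shape == FCPS_SHAPES[name]


# Placeholder file with the right shape for Hepta (212 x 3); the values are dummies.
fake_hepta = io.StringIO("x,y,z\n" + "0,0,0\n" * 212)
print(matches_table("Hepta", fake_hepta))  # True
```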
- Example run
cd autoscan
python main.py --data <dataset-directory> --output <output-file-name>
After the algorithm finishes, navigate to /autoscan/output/<output-file-name>.csv and open the output file. For instructions on how to download and set up the local Linux environment, read the documentation.