Clusterfun for Data Exploration

The first rule of Karpathy's Recipe for Training Neural Networks is to become one with the data. The full quote is as follows:

The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical. I like to spend copious amount of time (measured in units of hours) scanning through thousands of examples, understanding their distribution and looking for patterns. Luckily, your brain is pretty good at this. One time I discovered that the data contained duplicate examples. Another time I found corrupted images / labels. I look for data imbalances and biases. I will typically also pay attention to my own process for classifying the data, which hints at the kinds of architectures we’ll eventually explore. As an example - are very local features enough or do we need global context? How much variation is there and what form does it take? What variation is spurious and could be preprocessed out? Does spatial position matter or do we want to average pool it out? How much does detail matter and how far could we afford to downsample the images? How noisy are the labels?

In addition, since the neural net is effectively a compressed/compiled version of your dataset, you’ll be able to look at your network (mis)predictions and understand where they might be coming from. And if your network is giving you some prediction that doesn’t seem consistent with what you’ve seen in the data, something is off.

Once you get a qualitative sense it is also a good idea to write some simple code to search/filter/sort by whatever you can think of (e.g. type of label, size of annotations, number of annotations, etc.) and visualize their distributions and the outliers along any axis. The outliers especially almost always uncover some bugs in data quality or preprocessing.

Clusterfun lets you do all of the above with ease. As long as your dataframe has a column containing the path to the media you want to display, you can use the plotting functions to explore your data across every other column in the dataframe. There is no need to write code to filter, search, or sort your data: you can do this interactively in clusterfun.

If you have no numerical values to turn into a plot, you can use the grid function to display your media in a grid. From there, the filter options let you narrow the data by the values in your dataframe and discover patterns.
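
As a minimal sketch of that grid workflow, the snippet below builds the table from a folder of images and hands it to clusterfun. It assumes a hypothetical layout with one sub-folder per label, and it assumes that `grid` takes the dataframe plus a `media` argument naming the path column. The helpers `collect_media` and `show_grid` are illustrative names, not part of clusterfun.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def collect_media(root):
    """Return one record per image under root, labelled by its parent folder."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            records.append({"img_path": str(path), "label": path.parent.name})
    return records


def show_grid(records):
    """Open the records as a media grid (needs pandas and clusterfun installed)."""
    # Imported lazily so collect_media stays usable without these packages.
    import pandas as pd
    import clusterfun as clt

    df = pd.DataFrame(records)
    # Assumption: grid accepts the dataframe and the name of the media column.
    return clt.grid(df, media="img_path")
```

Calling `show_grid(collect_media("path/to/images"))` (a placeholder path) should then open the grid, where the `label` column is available for filtering.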

Another common way of exploring your data is with UMAP, a dimensionality reduction technique that projects high-dimensional features down to two dimensions so that similar samples end up close together. In the example below, we use untrained CLIP model features to cluster the MNIST dataset. We then create a scatter plot with the UMAP coordinates as x and y, and the image path as the media. This way, we can explore clusters in the data, which gives us an idea of how the data is distributed.
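
A rough sketch of the UMAP step, under a few assumptions: the features (for instance CLIP image embeddings) are already computed and passed in as `features`, the `umap-learn` package provides the projection, and `scatter` accepts `x`, `y`, `media`, and `color` column names. The helpers `attach_coordinates` and `umap_scatter` are hypothetical names introduced here for illustration.

```python
def attach_coordinates(records, coords):
    """Return copies of records with the 2-D coordinates added as x and y."""
    if len(records) != len(coords):
        raise ValueError("need exactly one coordinate pair per record")
    return [
        {**record, "x": float(x), "y": float(y)}
        for record, (x, y) in zip(records, coords)
    ]


def umap_scatter(records, features, color="label"):
    """Project features to 2-D with UMAP and open a clusterfun scatter plot."""
    # Imported lazily: needs umap-learn, pandas and clusterfun installed.
    import pandas as pd
    import umap
    import clusterfun as clt

    # features: array-like of shape (n_samples, n_dims), one row per record.
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(features)
    df = pd.DataFrame(attach_coordinates(records, coords))
    # Assumption: scatter takes column names for x, y, media and color.
    return clt.scatter(df, x="x", y="y", media="img_path", color=color)
```

Each record is expected to carry an `img_path` entry and, for colouring, a `label` entry.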

This results in the following plot: