In computer vision, every model prediction is tied to an image. You often want to plot these predictions to get a sense of how well your model is doing on your data. For example, in fraud detection, you might plot a histogram of your model scores for both your genuine and your fraudulent data, to see whether and how the two distributions overlap. In that context, it is useful to be able to quickly inspect the images behind the scores. You might, for instance, want to look at the images that were not predicted correctly: fraudulent images whose scores fall within the genuine distribution, and genuine images whose scores fall within the fraudulent distribution. However, most common plotting libraries don't let you do that. You'd typically start by creating a histogram of the two score distributions.
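As a rough illustration of such a histogram, here is a minimal sketch. It assumes scores lie in [0, 1] and that a higher score means "more likely fraudulent"; the scores themselves and the helper names `bin_scores` and `plot_score_histograms` are made up for this example. The plotting helper relies on matplotlib, which is imported only when you call it.

```python
from typing import List, Sequence


def bin_scores(scores: Sequence[float], n_bins: int = 10) -> List[int]:
    """Count scores (assumed to lie in [0, 1]) into n_bins equal-width bins."""
    counts = [0] * n_bins
    for s in scores:
        # Clamp the index so that a score of exactly 1.0 lands in the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


def plot_score_histograms(genuine: Sequence[float],
                          fraudulent: Sequence[float],
                          n_bins: int = 10) -> None:
    """Overlay the two score distributions in one plot (requires matplotlib)."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    edges = [i / n_bins for i in range(n_bins + 1)]
    plt.hist(genuine, bins=edges, alpha=0.5, label="genuine")
    plt.hist(fraudulent, bins=edges, alpha=0.5, label="fraudulent")
    plt.xlabel("model score")
    plt.ylabel("number of images")
    plt.legend()
    plt.show()


# Hypothetical scores, invented for illustration only.
genuine_scores = [0.05, 0.10, 0.15, 0.30, 0.62]
fraud_scores = [0.35, 0.70, 0.85, 0.90, 0.95]

genuine_counts = bin_scores(genuine_scores, n_bins=5)
fraud_counts = bin_scores(fraud_scores, n_bins=5)

# Bins that contain both genuine and fraudulent images are where the two
# distributions overlap, and where you would want to look at the images.
overlap_bins = [i for i in range(5) if genuine_counts[i] and fraud_counts[i]]
```

Calling `plot_score_histograms(genuine_scores, fraud_scores, n_bins=5)` draws the two distributions on top of each other. The plot shows where they overlap, but not which images sit in the overlapping bins.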
After creating this plot, you'd have to manually single out the images that you want to inspect in a separate process. In the most laborious cases, this means saving images to a directory and browsing through them there. Some libraries, such as Bokeh, make this a little easier, but their UX is not primarily designed for inspecting images: you can hover over data points to see an image, but that's about it.

Understanding data, however, is about more than glancing at one image at a time. It means being able to zoom in on an image, view multiple images next to each other, and see aggregate statistics for such a selection. Clusterfun allows you to do that. Let's dive into each of these components:
How is this different from model analysis platforms such as Scale, Encord, etc.?
pandas dataframe with, per image, a location (S3 or local) and a score.
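To make that data layout concrete, here is a minimal sketch of one row per image, holding a location and a score. The column names `img_path` and `score`, the bucket name, and the file paths are all hypothetical, and the helpers `is_s3` and `to_dataframe` exist only for this example; none of this is taken from Clusterfun's own API.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]

# One row per image: where the image lives and the score the model gave it.
# Column names, bucket name, and paths are hypothetical.
rows: List[Row] = [
    {"img_path": "s3://my-bucket/images/0001.jpg", "score": 0.12},
    {"img_path": "s3://my-bucket/images/0002.jpg", "score": 0.87},
    {"img_path": "/data/images/0003.jpg", "score": 0.45},
]


def is_s3(location: str) -> bool:
    """Return True for S3 URIs and False for local paths."""
    return location.startswith("s3://")


def to_dataframe(records: List[Row]):
    """Turn the rows into a pandas DataFrame (requires pandas)."""
    import pandas as pd  # lazy import: only needed when building the frame

    return pd.DataFrame(records)


# Which rows point at S3 and which at the local file system.
s3_flags = [is_s3(str(r["img_path"])) for r in rows]
```

Calling `to_dataframe(rows)` gives a dataframe with one `img_path` column and one `score` column, so S3 and local images can sit side by side in the same table.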