Materialistic: Selecting Similar Materials in Images


Prafull Sharma1,2, Julien Philip2, Michaël Gharbi2, William T. Freeman1, Fredo Durand1, Valentin Deschaintre2

1MIT CSAIL, 2Adobe Research


Separating an image into meaningful underlying components is a crucial first step for both editing and understanding images. We present a method capable of selecting the regions of a photograph exhibiting the same material as an artist-chosen area. Our proposed approach is robust to shading, specular highlights, and cast shadows, enabling selection in real images. As we do not rely on semantic segmentation (different woods or metals should not be selected together), we formulate the problem as a similarity-based grouping problem based on a user-provided image location. In particular, we propose to leverage unsupervised DINO features coupled with a proposed Cross-Similarity module and an MLP head to extract material similarities in an image. We train our model on a new synthetic image dataset, which we release. We show that our method generalizes well to real-world images. We carefully analyze our model's behavior on varying material properties and lighting. Additionally, we evaluate it against a hand-annotated benchmark of 50 real photographs. We further demonstrate our model on a set of applications, including material editing, in-video selection, and retrieval of object photographs with similar materials.

Material selection evaluation on real data

An array of results on the real image dataset for our method, the baselines, and the ablations is presented on this webpage:

Material selection in videos

Given a user selection on the first frame of the video, our method can be applied to select the material at the query point in each frame. Note how the selections are robust to lighting variations including specularity and shadows.

Video panels: first frame with query in red, input video, predicted material selection, output scores.

Demo video: Enabling multiple material selection

To empower artists, our interactive demo allows users to select multiple positive (first video) and negative (second video) query points. The score maps for the positive query points are combined by taking, for each pixel, the maximum of the individually predicted similarity scores, and the result is thresholded by a user-defined value in [0, 1]. The predicted scores for the negative query points are combined in the same way, by computing a per-pixel maximum across all predicted scores of negative regions, and are then thresholded by the user with a separate threshold value. The intersection of the resulting negative mask with the mask computed from the positive query points is removed from the final selection.
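The combination rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the demo's actual code: the function name `combine_selection`, its parameter names, and the choice of `>=` for thresholding are assumptions, and the score maps are taken to be arrays of per-pixel similarity scores in [0, 1], one per query point.

```python
import numpy as np

def combine_selection(pos_scores, neg_scores, pos_threshold, neg_threshold):
    """Combine per-query score maps into one boolean selection mask.

    pos_scores, neg_scores: lists of (H, W) arrays, one per query point.
    All names here are hypothetical; only the max/threshold/subtract
    logic follows the description in the text.
    """
    # Per-pixel maximum over all positive queries, then threshold.
    pos_max = np.max(np.stack(pos_scores), axis=0)
    selection = pos_max >= pos_threshold  # assumption: inclusive threshold

    if neg_scores:
        # Per-pixel maximum over all negative queries, separate threshold.
        neg_max = np.max(np.stack(neg_scores), axis=0)
        negative = neg_max >= neg_threshold
        # Remove the overlap between the negative and positive masks.
        selection = selection & ~negative
    return selection

# Toy 2x2 score maps for two positive queries and one negative query.
pos_a = np.array([[0.9, 0.2], [0.1, 0.6]])
pos_b = np.array([[0.1, 0.8], [0.2, 0.3]])
neg = np.array([[0.0, 0.7], [0.1, 0.9]])

mask = combine_selection([pos_a, pos_b], [neg], 0.5, 0.5)
mask_no_neg = combine_selection([pos_a, pos_b], [], 0.5, 0.5)
```

In this toy example, the two pixels in the right column pass the positive threshold but are removed because they also pass the negative one.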

Multiple selection

Negative selection

Image editing

Outputs from our method can be used as input for material-selection-based image editing using Photoshop, Image-Based Material Editing [Khan et al. 2006], and Stable Diffusion Inpainting [Rombach et al. 2021].

Results on grayscale images

We further evaluate our method on grayscale images and see that if textures are clearly distinct, our method can select the relevant regions despite the lack of color, showing it also considers texture to make its selection.

Selection consistency

The method produces consistent segmentation masks for different pixel selections within image regions that belong to the same material. In the query image, we show 5 different pixel selections (marked in different colors) with the resulting masks overlaid in the respective color.

Robustness to lighting variation

Our method is robust to lighting variations, including specularity and shadows. The first row shows all the input images. The query pixel is selected in the first image, at the center of the red square. The query embedding at that pixel is used to select materials in the subsequent images. The results show the robustness of our method to different lighting scenarios.

We also present the same experiment to demonstrate the robustness of our method to different lighting scenarios using the Multi-Illumination Dataset by Murmann et al.

Analysis: Albedo analysis

To analyze the behavior of the model with respect to changing albedo, we vary the hue and saturation of the albedo of a diffuse sphere. The hue is sampled in the range [0, 2*pi] and the saturation in [0, 1]. We select the central pixel on a grid of spheres with varying albedo. Thresholding the scores at 0.5 selects regions in spheres with neighboring albedo. As expected, our model selects spheres with colors closer to the selected one first. As we vary the threshold, the selection becomes limited to the central sphere (higher threshold, > 0.9) or extends to further spheres (lower threshold).
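As an illustration of how such a grid of albedos could be generated, here is a minimal sketch using Python's standard `colorsys` module. The grid size, the fixed HSV value of 1.0, the even spacing, and the function name `albedo_grid` are assumptions made for the example; the text only specifies the hue and saturation ranges. The hue endpoint 2*pi is left out because it is the same color as hue 0.

```python
import colorsys
import math

def albedo_grid(n_hue, n_sat, value=1.0):
    """Return rows of (hue_radians, saturation, (r, g, b)) albedos.

    Rows vary saturation over [0, 1]; columns vary hue over [0, 2*pi).
    Grid size and the fixed value are hypothetical choices.
    Requires n_sat >= 2.
    """
    grid = []
    for i in range(n_sat):
        saturation = i / (n_sat - 1)
        row = []
        for j in range(n_hue):
            hue_fraction = j / n_hue          # colorsys expects hue in [0, 1)
            hue_radians = 2.0 * math.pi * hue_fraction
            rgb = colorsys.hsv_to_rgb(hue_fraction, saturation, value)
            row.append((hue_radians, saturation, rgb))
        grid.append(row)
    return grid

grid = albedo_grid(n_hue=6, n_sat=3)
```

Each entry could then be assigned as the diffuse albedo of one sphere in the rendered grid.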

Analysis: Blending between materials

We now evaluate the sensitivity of our model to gradual material changes. To do so, we render nine spheres covered by a blend of two different materials, a stone wall and roof tiles. We use the Blender "shader mix" node to interpolate between the SVBRDFs and apply a different mixing factor to each sphere. The mixing is shown below, where 1.0 means 100% wall and 0.0 means 100% tiles.
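The mixing-factor convention can be illustrated with a small sketch. It makes two assumptions that are not stated in the text: that the nine factors are evenly spaced between 0.0 and 1.0, and that a per-channel linear blend of a single material value is an acceptable stand-in for what the Blender node does to the full shaders. The names `mix_factors` and `blend` are hypothetical.

```python
def mix_factors(n=9):
    """Evenly spaced mixing factors from 0.0 to 1.0 (assumed spacing)."""
    return [i / (n - 1) for i in range(n)]

def blend(wall, tiles, factor):
    """Linear blend of two per-channel values.

    factor = 1.0 gives 100% wall, factor = 0.0 gives 100% tiles,
    matching the convention in the text. This simplifies the actual
    shader mixing, which operates on full SVBRDFs.
    """
    return tuple(factor * w + (1.0 - factor) * t for w, t in zip(wall, tiles))

factors = mix_factors()

# Hypothetical base colors standing in for the two materials.
wall_color = (1.0, 0.0, 0.0)
tiles_color = (0.0, 0.0, 1.0)
blended = [blend(wall_color, tiles_color, f) for f in factors]
```

The first and last entries of `blended` reproduce the pure tiles and pure wall values, and the middle entry is an equal mix of the two.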

We then select similar materials based on a query pixel on each extreme case. As we can see, our method selects materials that are close to the query even when they are not exactly the same, and discriminates well when the mix of materials is visually far from the query.