Detecting Similar Images

Use Case

I want to know programmatically whether listings posted at Mudah.My are similar or not even though it is posted by different persons and different date.

To do this, I think the best way is to detect whether the photos are similar or not.

Why To Know Same Property Is Advertised Over Period of Time?

I would like to know whether the property
1) price change over time, signalling time to purchase it
2) possibility owner becomes desperate to let it go if advertised for quite some time. So I can get better price

Same Property But Advertised by Different Agents and Different Date

All listings refer to same property by evaluating using naked eyes.
So now, I want to detect programmatically that all the listings are referring to same property by detecting those photos are similar.

Listing AgentPosted Date
Listing 1
URL
Price: RM260,000
Dilla05/07/2019
Listing 2
URL
Price: RM260,000
Aiman05/07/2019
Listing 3
URL
Price:RM260,000
Norhayati05/07/2019
Listing 4
URL
Price:RM260,000
Fahana05/07/2019
Listing 5
URL
Price:RM260,000
Fazri22/07/2019
Listing 6:
URL
Price:RM270,000
Marina12/07/2019

Kitchen Photos

Seroja Apartment Listing 1 - Kitchen
Seroja Apartment Listing 1 – Kitchen
Seroja Apartment Listing 2 - Kitchen
Seroja Apartment Listing 2 – Kitchen
Seroja Apartment Listing 3 - Kitchen
Seroja Apartment Listing 3 – Kitchen
Seroja Apartment Listing 4 - Kitchen
Seroja Apartment Listing 4 – Kitchen
Seroja Apartment Listing 6 - Kitchen
Seroja Apartment Listing 6 – Kitchen

Bedroom Images

Seroja Apartment Listing 2 - Bedroom
Seroja Apartment Listing 2 – Bedroom
Seroja Apartment Listing 5 - Bedroom
Seroja Apartment Listing 5 – Bedroom

Technique Used

    1. Step 1: Fingerprinting the Photos

Fingerprinting the photos is using image hashing technique. In this case, DHash will be used.

    1. Step 2: Compare the Photos

After fingerprinting, image hash will be compared among listings. If the image hash are same or Levenshtein distance is less or equal to 2, then we can consider as the listings are referring to same property.

Results

PhotosSizeNoticeable FeaturesImage Hash
(DHash)
Listing 1 - Kitchen18KB
480x480
Kitchen - listing 1 ,listing 3 and listing 6 are similarcc6c7a727e7c3e77
Listing 2 - Kitchen19KB
640x480
Has door and wall fan3f37333333333339
Listing 3 - Kitchen18KB
480x480
cc6c7a727e7c3e77
Listing 4 - Kitchen17KB
480x480
Kitchen - Listing 4 color is lighter compared to listing 1, 3 & 6cc6e7a727e7c7e77
Listing 6 - Kitchen18KB
480x480
cc6c7a727e7c3e77
Listing 2 - Bedroom22KB
640x480
similar with bedroom listing 5 only size is different.e1b90d0c8ccc84c1
Listing 5 - Bedroom10KB
320x240
e1b52d0c8ccc84c1

Kitchen Photos

If we take Listing 1 as a base, we can say easily that it has same image hash with Listing 3 & Listing 6.

Listing 1 has Levenshtein distance of 2 with Listing 4.
Listing 1 has Levenshtein distance of 15 with Listing 5.

Bedroom Photos

Listing 2 and Listing 5 has Levenshtein distance of 2.

Conclusion

We can conclude photos are similar if their image hash is the same or their Levenshtein distance is less or equal to 2.

False Positive

What happens if different properties use same photos such as signage or building block? The algorithm will detect it as same property even though it is not.

To avoid this, we should establish a database of signage or building block to remove this false positive.

Photos Example

Apartment Signage
Apartment Signage
Apartment Block
Apartment Block

References:

Fingerprinting Images For Near Duplicate Detection
Python Code & Images Used
DHash Algorithm
Listing 1
Listing 2
Listing 3
Listing 4
Listing 5
Listing 6