I want to know programmatically whether listings posted on Mudah.My are similar, even when they are posted by different people on different dates.
To do this, I think the best way is to detect whether their photos are similar.
Why Know Whether the Same Property Is Advertised Over a Period of Time?
I would like to know:
1) whether the price changes over time, which can signal a good time to purchase
2) whether the owner may be getting desperate to let the property go after advertising it for quite some time, so that I can get a better price
Same Property Advertised by Different Agents on Different Dates
Judging by the naked eye, all of these listings refer to the same property.
So now, I want to detect programmatically that all the listings refer to the same property by detecting that their photos are similar.
- Step 1: Fingerprinting the Photos
The photos are fingerprinted with an image hashing technique. In this case, dHash (difference hash) will be used.
- Step 2: Compare the Photos
After fingerprinting, the image hashes are compared across listings. If two image hashes are identical, or their Levenshtein distance is less than or equal to 2, we can consider the listings to refer to the same property.
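Step 1 can be sketched as follows. This is a minimal illustration of the dHash idea, not the exact code used to produce the hashes in this post. It assumes the photo has already been shrunk to 9 pixels wide by 8 pixels high and converted to grayscale (normally done with an imaging library), and the function name `dhash_hex` is made up for this example. Bit order and comparison direction vary between dHash implementations, so other tools may produce different hex strings for the same photo.

```python
def dhash_hex(gray_rows):
    """Return a 16-character hex dHash for an 8-row x 9-column grayscale grid.

    Each bit records whether a pixel is brighter than its right-hand
    neighbour, so 8 rows x 8 comparisons give 64 bits (16 hex digits).
    """
    if len(gray_rows) != 8 or any(len(row) != 9 for row in gray_rows):
        raise ValueError("expected 8 rows of 9 grayscale values")

    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            # Assumption: a set bit means "left pixel is brighter than right".
            bits = (bits << 1) | (1 if left > right else 0)
    return f"{bits:016x}"


# Hypothetical input: every row fades from bright (left) to dark (right),
# so every comparison is true and all 64 bits are set.
fading = [[200 - 20 * col for col in range(9)] for _ in range(8)]
print(dhash_hex(fading))  # ffffffffffffffff
```

Because the hash depends only on brightness differences between neighbouring pixels, resized or recompressed copies of a photo tend to produce the same or nearly the same hash.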
|Photos||Size||Noticeable Features||Image Hash|
|Listing 1 - Kitchen||18KB||Kitchen in Listings 1, 3 and 6 is similar||cc6c7a727e7c3e77|
|Listing 2 - Kitchen||19KB||Has a door and a wall fan||3f37333333333339|
|Listing 3 - Kitchen||18KB||Similar to Listing 1||cc6c7a727e7c3e77|
|Listing 4 - Kitchen||17KB||Color is lighter than in Listings 1, 3 and 6||cc6e7a727e7c7e77|
|Listing 6 - Kitchen||18KB||Similar to Listing 1||cc6c7a727e7c3e77|
|Listing 2 - Bedroom||22KB||Similar to the Listing 5 bedroom; only the size is different||e1b90d0c8ccc84c1|
|Listing 5 - Bedroom||10KB||Similar to the Listing 2 bedroom; only the size is different|| |
If we take Listing 1 as the base, we can easily see that it has the same image hash as Listing 3 and Listing 6.
Listing 1 has a Levenshtein distance of 2 from Listing 4.
Listing 1 has a Levenshtein distance of 15 from Listing 5.
The bedroom photos of Listing 2 and Listing 5 have a Levenshtein distance of 2.
We can conclude that photos are similar if their image hashes are the same or their Levenshtein distance is less than or equal to 2.
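This rule can be checked against the kitchen hashes in the table. The sketch below is a plain dynamic-programming Levenshtein distance plus a threshold check. The names `levenshtein` and `same_photo` are made up for this example, and the threshold of 2 is the one chosen in this post.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def same_photo(hash_a, hash_b, max_distance=2):
    """Treat two photos as the same when their hashes are within max_distance edits."""
    return levenshtein(hash_a, hash_b) <= max_distance


# Kitchen hashes copied from the table.
listing_1 = "cc6c7a727e7c3e77"
listing_2 = "3f37333333333339"
listing_4 = "cc6e7a727e7c7e77"

print(levenshtein(listing_1, listing_4))  # 2
print(same_photo(listing_1, listing_4))   # True
print(same_photo(listing_1, listing_2))   # False
```

Listing 1 and Listing 4 differ in only two hex digits, so they pass the threshold, while the Listing 2 kitchen (the one with the door and wall fan) is far outside it.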
What happens if different properties use the same photos, such as a photo of the signage or the building block? The algorithm will detect them as the same property even though they are not.
To avoid this, we should establish a database of signage and building block photos so that these false positives can be removed.
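One way such a database could be used is to drop generic photos before listings are compared. The following is only a sketch under assumptions: the "database" is a plain Python set, the hash `00ff00ff00ff00ff` is an invented placeholder for a signage photo, and `drop_generic_photos` is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1 each)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def drop_generic_photos(photo_hashes, generic_hashes, max_distance=2):
    """Keep only hashes that are not within max_distance of any known generic photo."""
    return [
        photo for photo in photo_hashes
        if all(levenshtein(photo, generic) > max_distance
               for generic in generic_hashes)
    ]


# Hypothetical database of signage / building block hashes (placeholder value).
GENERIC_HASHES = {"00ff00ff00ff00ff"}

# One real kitchen hash from the table and one invented near-copy of the signage.
listing_photos = ["cc6c7a727e7c3e77", "00ff00ff00ff00fe"]

print(drop_generic_photos(listing_photos, GENERIC_HASHES))
# ['cc6c7a727e7c3e77']
```

Only the photos that survive this filter would then be compared across listings, so two different units in the same building are not matched on their shared signage photo.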