Fuzzy Matching using 'levenshtein' function

Submitted by ChrisNBC on Tue, 2014-10-14 14:26

Hi David,

Hope all is going well. I wondered if you might be able to give me some advice on how I might improve product matching.

In summary, I now have access to nearly all the merchant feeds I need for the site I have been working on. I had hoped to use UID matching but only a handful of the feeds include EAN or UPC. Matching on the product names directly does not yield many matches either. The product names generally contain similar components: the majority start with a brand and end with a colour, with a style usually sandwiched in the middle, and there may or may not be a product name.

I have been trawling through the forum looking for something that might work. I know using PT product mapping is an option but I'm hoping to implement something that will be sustainable in the long term and will maximise automation.

I noticed one post mentions 'levenshtein' fuzzy matching and the 'similar_text' function. The 'levenshtein' function looks particularly interesting and I wondered if you might have any thoughts as to whether it might work in the scenario I describe above?

Thanks in advance.

Regards
Chris

Submitted by support on Wed, 2014-10-15 15:09

Hello Chris,

The problem with levenshtein / similar_text is that they compare two strings, so they require a reference name to compare against, rather than returning a value based on a single string.
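To illustrate the point about needing a reference, here is a minimal sketch using PHP's built-in levenshtein() and similar_text(). The closestReference() helper, the reference list and the 80% threshold are all hypothetical and not part of Price Tapestry; they only show that a score exists between a feed name and some known name, never for a feed name on its own.

```php
<?php
// Hypothetical helper (not part of Price Tapestry): return the reference
// name most similar to $name, or null if nothing reaches $minPercent.
function closestReference($name, array $references, $minPercent = 80.0)
{
    $best = null;
    $bestPercent = 0.0;
    foreach ($references as $reference) {
        // similar_text() writes the similarity percentage into $percent
        similar_text(strtolower($name), strtolower($reference), $percent);
        if ($percent > $bestPercent) {
            $bestPercent = $percent;
            $best = $reference;
        }
    }
    return ($bestPercent >= $minPercent) ? $best : null;
}

// levenshtein() returns the number of single-character edits needed
// to turn one string into the other
echo levenshtein("kitten", "sitting"), "\n"; // 3
```

The threshold acts as a rough confidence level, but a suitable value would have to be found by experiment against real feed data.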

One thing I was experimenting with recently was a method to handle matching (by way of creating a virtual uid field) products where the title contains all the same words, ignoring case and "special" characters e.g. brackets, but in a different order, for example

Panasonic TX-55AS650B LED TV
Led TV Panasonic TX-55AS650B

The method creates a virtual UID that is the MD5 hash of the normalised product name: special characters removed, converted to lower case, words sorted alphabetically and all spaces removed. In other words, for both of the above:

md5("55as650bledpanasonictvtx")

I know that you already have an `ean` field registered and are using this with uidmap.php so a quick way to try the above would be to create the virtual ean if there is no value from the feed. To try this, edit includes/admin.php and look for the following comment around line 367:

  /* create dupe_hash value */

...and REPLACE with:

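  if (empty($importRecord["ean"]))
  {
    // no EAN from the feed: derive a virtual one from the normalised name
    $words = explode(" ",strtolower($normalisedName));
    // sort as strings so the result never depends on the original word order
    sort($words, SORT_STRING);
    $importRecord["ean"] = md5(implode("",$words));
  }
  /* create dupe_hash value */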
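To experiment with the idea outside the import code, the same steps can be sketched as a standalone function. virtualUid() is a hypothetical name, and the normalisation here is a simplified stand-in (anything that is not a letter or digit becomes a space) rather than Price Tapestry's own routine, so results may differ from what the import produces.

```php
<?php
// Hypothetical standalone helper; the normalisation is a simplified
// assumption, not Price Tapestry's own routine.
function virtualUid($name)
{
    // lower-case, then turn every run of non-alphanumeric characters into a space
    $normalised = preg_replace('/[^a-z0-9]+/', ' ', strtolower($name));
    // split on whitespace, discarding empty entries
    $words = preg_split('/\s+/', trim($normalised), -1, PREG_SPLIT_NO_EMPTY);
    // sort as strings so numeric-looking words are not compared as numbers
    sort($words, SORT_STRING);
    // join without spaces and hash
    return md5(implode("", $words));
}

// both orderings of the example title produce the same virtual UID
var_dump(virtualUid("Panasonic TX-55AS650B LED TV") === virtualUid("Led TV Panasonic TX-55AS650B")); // bool(true)
```

Note that a single extra or missing word (a colour, for example) still produces a completely different hash, so this only catches titles made of exactly the same words.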

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Thu, 2014-10-16 16:22

Hi David,

Thanks for your quick response. I tried the above but it didn't yield many matches. I think part of the issue is that the product names are sometimes one word different. I have been toying with the idea of running a series of filters against the product names at import to clean them up a bit, but I'm not sure that I can identify sufficient rules to make this worthwhile. Also, in doing that I'm going to create a 'moving target' which will require constant updating. I wondered, using the above, is there any way the matching could be done to a percentage confidence level? Or could you suggest any other solutions?

Thanks in advance.

Regards
Chris

Submitted by support on Sat, 2014-10-18 08:30

Hi Chris,

You're right about the "moving target" - it's a very complex area - one of those things that is obvious to a human but almost impossible for a computer to achieve!

Watch this space - I am experimenting with regular expression and price range matching (seeded with a list of model numbers for a particular product category) and if it works out I'll post the details, but it's not straightforward, I'm afraid!
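As a very rough sketch of what regular expression plus price range matching might look like, the following assumes a seed list of model numbers, each with an expected price range. The seed data, the prices and the modelPattern() / matchModel() names are all hypothetical and purely for illustration; this is not the experimental code itself.

```php
<?php
// Hypothetical seed data: model number => array(min price, max price)
$seeds = array(
    "TX-55AS650B" => array(500.00, 900.00),
    "UE40H5500"   => array(250.00, 450.00),
);

// Build a case-insensitive pattern in which each hyphen or space in the
// seed model number may appear as a hyphen, a space, or nothing at all.
function modelPattern($model)
{
    $parts = preg_split('/[\s-]+/', $model, -1, PREG_SPLIT_NO_EMPTY);
    $quoted = array_map(function ($part) { return preg_quote($part, '/'); }, $parts);
    return '/\b' . implode('[\s-]?', $quoted) . '\b/i';
}

// Return the seed model that matches both the name and the price, or null.
function matchModel($name, $price, array $seeds)
{
    foreach ($seeds as $model => $range) {
        list($min, $max) = $range;
        if (preg_match(modelPattern($model), $name) && $price >= $min && $price <= $max) {
            return $model;
        }
    }
    return null;
}
```

The price range is there to reject accessories that mention the model number in their title, for example matchModel("Panasonic TX-55AS650B wall bracket", 19.99, $seeds) returns null with the seed data above.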

Cheers,
David.
--
PriceTapestry.com

Submitted by davidre on Tue, 2014-12-23 20:25

As you say, it really is a moving target... I would be really keen to try out even experimental code :)

Thanks

Submitted by support on Tue, 2014-12-23 20:28

Hi David,

Look out for Product Mapping by Regular Expression coming very shortly....!

Release Announcement

Cheers,
David.
--
PriceTapestry.com