Scan and Set v2 - Regex Mod

Submitted by ChrisNBC on Mon, 2016-10-17 15:30

Hi David,

Hope all is well.

I wondered if you might be able to suggest how I could modify the filter below. I need to run it a couple of times on the same field. The problem is that on the second and subsequent runs, if no match is found, the field is emptied of whatever was in it. Could you suggest a way to modify the filter so that anything already in the field is left as is when no match is made? Additionally, would it be possible to change the filter into a regex-style ‘scan and set’ to give it greater flexibility?

So for example I could search: (word1|word2|word3|word4$)/i

  /*************************************************/
  /* scanSet2 */
  /*************************************************/
  $filter_names["scanSet2"] = "Scan and Set v2";
  function filter_scanSet2Configure($filter_data)
  {
    print "Scan:<br />";
    print "<select name='scan'>";
    print "<option value='self' ".($filter_data["scan"]=="self"?"selected='selected'":"").">Self</option>";
    print "<option value='name' ".($filter_data["scan"]=="name"?"selected='selected'":"").">Name</option>";
    print "<option value='namedesc' ".($filter_data["scan"]=="namedesc"?"selected='selected'":"").">Name and Description</option>";
    print "</select>";
    widget_errorGet("scan");
    print "<br />";
    print "Values:<br />";
    print "<input type='text' name='values' value='".widget_safe($filter_data["values"])."' />";
    widget_errorGet("values");
  }
  function filter_scanSet2Validate($filter_data)
  {
    if (!$filter_data["values"])
    {
      widget_errorSet("values","required field");
    }
  }
  function filter_scanSet2Exec($filter_data,$text)
  {
    global $admin_importFeed;
    global $filter_record;
    switch($filter_data["scan"])
    {
      case "self":
        $scan = $text;
        break;
      case "name":
        $scan = $filter_record[$admin_importFeed["field_name"]];
        break;
      case "namedesc":
        $scan = $filter_record[$admin_importFeed["field_name"]];
        $scan .= " ".$filter_record[$admin_importFeed["field_description"]];
        break;
    }
    $values = explode(",",$filter_data["values"]);
    foreach($values as $value)
    {
      // pass the delimiter to preg_quote() so a "/" in a value is escaped too
      if (@preg_match("/\\b(".preg_quote($value,"/").")\\b/i",$scan))
      {
        return $value;
      }
    }
    return $text;
  }

Thanks in advance.

Best regards
Chris

Submitted by support on Mon, 2016-10-17 17:53

Hi Chris,

The above should not affect the field value if no match is made, as the ...Exec() function ends with:

  return $text;

However, if using the "self" option for the Scan parameter, subsequent invocations would only see the modified version. To resolve this and always use the feed value instead of the value modified by any previous filter, first edit html/admin.php and look for the following code around line 142:

    global $filter_record;

...and REPLACE with:

    global $filter_record;
    global $filter;

And then in your filter's ...Exec() function, where you have the following code:

      case "self":
        $scan = $text;
        break;

...REPLACE with:

      case "self":
        global $filter;
        $scan = $filter_record[$admin_importFeed["field_".$filter["field"]]];
        break;
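
To illustrate why a second "Self" pass can miss a word that was in the feed, here is a minimal standalone sketch. scanSetOnce() and the sample value are made up for illustration and are not part of Price Tapestry; the sketch only mimics a single-value Scan and Set pass:

```php
<?php
// Hypothetical stand-in for one "Scan and Set" pass: returns $value when it is
// found in $scan as a whole word, otherwise returns $text unchanged.
function scanSetOnce($value, $scan, $text)
{
  if (preg_match("/\\b(" . preg_quote($value, "/") . ")\\b/i", $scan))
  {
    return $value;
  }
  return $text;
}

$feedValue = "Word1 Word2 jacket";   // made-up value as it arrives in the feed

// First filter, Scan = Self: finds Word1 and overwrites the field.
$afterFirst = scanSetOnce("Word1", $feedValue, $feedValue);

// Second filter scanning the already modified field: Word2 is no longer there,
// so the field keeps the value set by the first filter.
$scanModified = scanSetOnce("Word2", $afterFirst, $afterFirst);

// Second filter scanning the original feed value instead: Word2 is found.
$scanOriginal = scanSetOnce("Word2", $feedValue, $afterFirst);
```

Here $scanModified stays at the first filter's result, whereas $scanOriginal picks up Word2, which is the behaviour the modification above is aiming for.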

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Tue, 2016-10-18 09:56

Hi David,

Thanks for the quick reply and the solution above. I wondered if you could possibly suggest how I could modify the filter so that it can handle regex in the field value so for example I could use:

(word1|word2|word3|word4$)/i

Thanks in advance.

Best regards
Chris

Submitted by support on Tue, 2016-10-18 10:53

Hello Chris,

The above would already support multiple words, as the Values field can be comma separated, so if you use:

Word1,Word2,Word3,Word4

...the filter will scan for Word1 (and set if found), then Word2 etc. The test is already using preg_match with the "/i" modifier so it will be case insensitive. Apologies if I've misunderstood, let me know if you're still not sure of course...
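
As a rough standalone sketch of that behaviour (scanFirstMatch() is a hypothetical helper name, and the sample strings are made up), the loop from the ...Exec() function can be tried outside of Price Tapestry like this:

```php
<?php
// Standalone sketch of the matching loop in filter_scanSet2Exec().
// scanFirstMatch() is a hypothetical name, not a Price Tapestry function.
function scanFirstMatch($values, $scan, $text)
{
  foreach (explode(",", $values) as $value)
  {
    // \b word boundaries plus the /i flag: whole word, case-insensitive
    if (preg_match("/\\b(" . preg_quote($value, "/") . ")\\b/i", $scan))
    {
      return $value; // the first listed value that matches wins
    }
  }
  return $text; // no match: the field is returned as it was
}
```

For example, scanFirstMatch("Word1,Word2", "a WORD2 and word1 here", "orig") returns "Word1", because Word1 is listed first, and the returned text is the configured value rather than the text as it appears in the scan value.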

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Tue, 2016-10-18 13:14

Hi David,

Thanks for your message re the filter. I didn’t realise the filter was case-insensitive, but now you mention it, I can see it is. I have used this filter a lot on all my sites already (it’s probably one of my favourites!). I think I may not have explained well what I’m using the filter for on this site: the merchant categories declared in the feeds are too broad, so I’m ‘scanning’ product names and descriptions for keywords that I can add to the ‘categories’ field to make it more granular. I’m doing this in a way that results in the longest match, by matching the shortest version first and the longest version last (which intentionally overwrites the shorter version).

For example
word1
word1 word2
word1 word2 word3
word1 word2 word3 word4

Some products might have the keywords 'word1' and 'word4' present, but using the ‘scan and set’ filter as it is, I will just end up with 'word1' in the field I append to the end of the category rather than 'word1 word4'. Another scenario is where 'word1' and 'word2' have another word between them, which results in no match.

I think if I could specify regex style matching for the filter values the result would be much more granular and I would need a much reduced set of matching values/rules since I could match

word1.*word2.$
word1.*word4.$

I searched the forum and couldn’t spot any alternative filters which seemed to do what I’m looking for. Does the above sound feasible or is there another way to go to achieve the same result?

Would value your thoughts on this.

Thanks in advance.

Best regards
Chris

Submitted by support on Tue, 2016-10-18 13:51

Hello Chris,

That would require a new filter but is straightforward to implement. Have a go with:

  /*************************************************/
  /* scanSetRegExp */
  /*************************************************/
  $filter_names["scanSetRegExp"] = "Scan and Set RegExp";
  function filter_scanSetRegExpConfigure($filter_data)
  {
    print "Scan:<br />";
    print "<select name='scan'>";
    print "<option value='self' ".($filter_data["scan"]=="self"?"selected='selected'":"").">Self</option>";
    print "<option value='name' ".($filter_data["scan"]=="name"?"selected='selected'":"").">Name</option>";
    print "<option value='namedesc' ".($filter_data["scan"]=="namedesc"?"selected='selected'":"").">Name and Description</option>";
    print "</select>";
    widget_errorGet("scan");
    print "<br />";
    print "RegExp:<br />";
    print "<input type='text' name='regexp' value='".widget_safe($filter_data["regexp"])."' />";
    widget_errorGet("regexp");
    print "<br />";
    print "Set:<br />";
    print "<input type='text' name='set' value='".widget_safe($filter_data["set"])."' />";
    widget_errorGet("set");
  }
  function filter_scanSetRegExpValidate($filter_data)
  {
    if (!$filter_data["regexp"])
    {
      widget_errorSet("regexp","required field");
    }
    if (!$filter_data["set"])
    {
      widget_errorSet("set","required field");
    }
  }
  function filter_scanSetRegExpExec($filter_data,$text)
  {
    global $admin_importFeed;
    global $filter_record;
    switch($filter_data["scan"])
    {
      case "self":
        $scan = $text;
        break;
      case "name":
        $scan = $filter_record[$admin_importFeed["field_name"]];
        break;
      case "namedesc":
        $scan = $filter_record[$admin_importFeed["field_name"]];
        $scan .= " ".$filter_record[$admin_importFeed["field_description"]];
        break;
    }
    if (@preg_match($filter_data["regexp"],$scan))
    {
      return $filter_data["set"];
    }
    return $text;
  }

This will add a new "Scan and Set RegExp" filter, with the same Scan options as your Scan and Set 2 filter (self, name, name+description). It will then run preg_match() with the configured RegExp and, if matched, return the Set value...
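
As a minimal sketch of how the match-then-set step behaves on its own (scanSetRegExpDemo() and the sample values are hypothetical, used only for illustration):

```php
<?php
// Standalone version of the final step of filter_scanSetRegExpExec().
// scanSetRegExpDemo() is a hypothetical name, not part of Price Tapestry.
function scanSetRegExpDemo($regexp, $set, $scan, $text)
{
  // $regexp must be a complete pattern, including delimiters and flags
  if (@preg_match($regexp, $scan))
  {
    return $set;  // pattern matched: return the configured Set value
  }
  return $text;   // no match (or invalid pattern): leave the field alone
}
```

With a made-up name such as "Word1 blue WORD4 thing", a RegExp of /word1.*word4/i and a Set value of "word1 word4" would return "word1 word4", even though other words sit between the two keywords.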

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Tue, 2016-10-18 14:45

Hi David,

Thanks for the attached. Quick question though.....please could you confirm what set value I should use to return the matched words? I tried $1 but that just put $1 in all fields!

Thanks in advance.

Best regards
Chris

Submitted by support on Tue, 2016-10-18 16:40

Hi Chris,

I've given this a good run-through on my test server - to add back-reference support to the "Set" value, where you have the following code in the ...Exec() function of the above:

    if (@preg_match($filter_data["regexp"],$scan))
    {
      return $filter_data["set"];
    }

...REPLACE with:

    if (@preg_match($filter_data["regexp"],$scan,$matches))
    {
      $retval1 = $filter_data["set"];
      $retval2 = $filter_data["set"];
      foreach($matches as $k => $match)
      {
        if (!$k) continue;
        $search = "\$".$k;
        $replace = $matches[$k];
        $retval2 = str_replace($search,$replace,$retval2);
      }
      if ($retval2 != $retval1)
      {
        return $retval2;
      }
    }

How this works: if there is a match (preg_match() returns 1), a copy of the "Set" value is taken. Then, any $n back-references within the Set value are swapped out with the corresponding matched group. If the modified Set value is different to the original then the modified version is returned. Otherwise the field remains unchanged, which also means a Set value containing no $n back-reference leaves the field as it is...
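
For anyone wanting to experiment outside the filter, here is a hedged standalone sketch of the same substitution idea. expandSet() is a hypothetical name, and it swaps in strtr() in place of the str_replace() loop: strtr() tries the longest keys first, so a $10 would not be partly consumed by $1 if a pattern ever had ten or more groups:

```php
<?php
// Hypothetical standalone variant of the back-reference step; not the code
// used in the filter above, which uses a str_replace() loop.
function expandSet($regexp, $set, $scan, $text)
{
  if (@preg_match($regexp, $scan, $matches))
  {
    $pairs = array();
    foreach ($matches as $k => $match)
    {
      if (!$k) continue;           // skip index 0, as the filter does
      $pairs['$' . $k] = $match;   // e.g. '$1' => text captured by group 1
    }
    // strtr() with an array tries the longest keys first and never
    // re-scans text it has already replaced
    $result = $pairs ? strtr($set, $pairs) : $set;
    if ($result != $set)
    {
      return $result;
    }
  }
  return $text; // no match, or nothing was substituted
}
```

So a RegExp of /(word1).*(word4)/i with a Set value of "$1 $2" would return both captured words separated by a space, with the casing they have in the scanned text.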

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Tue, 2016-10-18 17:47

Hi David,

I've just run the filter above but nothing seems to be matched. I set the regexp field value to:

(word1|word2|word3) and I tried $0 and $1 (in separate runs) as the return value.

I think maybe I have specified either the field value regexp or the return value wrongly.

Would be grateful if you could confirm if I'm using the correct syntax.

Thanks in advance.

Best regards
Chris

Submitted by support on Tue, 2016-10-18 18:02

Hi Chris,

The RegExp field must be a full regular expression, including delimiters and any flags required. Have a go with:

/(Word1|Word2|Word3)/i

In this case, $1 should return Word1 or Word2 or Word3. $0 would normally refer to the entire matched string, but the replacement loop above skips index 0, so $0 is not substituted and should not be used...
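
As a small illustration of why the delimiters matter (the scan value is made up), PHP also accepts bracket-style delimiters, so a pattern written as (word1|word2|word3) is read with "(" and ")" as the delimiters rather than as a capture group. That may be why nothing appeared to be set with $1:

```php
<?php
$scan = "blue word2 widget"; // made-up scan value

// "(" and ")" are treated as bracket-style delimiters here, so the pattern
// is just word1|word2|word3: case-sensitive, and with no capture group 1.
$noDelims = @preg_match('(word1|word2|word3)', $scan, $m1);

// Full pattern with "/" delimiters and the i flag: group 1 is captured.
$withDelims = preg_match('/(Word1|Word2|Word3)/i', $scan, $m2);

// Closing ")" followed by "/i": "/" is not a valid modifier, so preg_match()
// fails and returns false (the warning is hidden by the @ operator).
$badFlags = @preg_match('(word1|word2|word3|word4$)/i', $scan);
```

In the first case the match succeeds but $m1 only contains index 0, so there is nothing for $1 to be replaced with and the field is left unchanged.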

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Wed, 2016-10-19 09:59

Hi David,

Thanks for the clarification. The filter works perfectly.

Best regards
Chris

Submitted by ChrisNBC on Wed, 2016-10-19 15:21

Hi David,

Sorry to keep coming back to this one. When I replied this morning I thought the above was working perfectly, but I have since been trying to expand the search criteria using regexp and the words I expected to match are not returned.

I first tried the filter below:

/(word3|word1.*word2.$|word4)/i

That didn’t return the words which should have been matched by the word1 word2 portion, so I tried the filter below, but this didn’t return anything either.

/( word1.*word2.$)/i

Would be grateful if you could confirm the regex I’m using is correct as I’m not sure now if it’s a regexp issue or something is up with the filter.

Thanks in advance.

Best regards
Chris

Submitted by support on Wed, 2016-10-19 16:19

Hi Chris,

Considered on its own:

word2.$

...would match word2 + any one character + end of line (scan value), so I'm wondering if you meant to use just:

/(word1.*word2$)/i

This would still require word2 to be at the end of the scan value, whereas

/(word1.*word2)/i

...would match "word1 [anything] word2" anywhere in the scan value...
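
A quick way to compare the three variants side by side is a sketch like the following. tryPatterns() is a hypothetical helper and the scan values are made up:

```php
<?php
// The three patterns discussed above, in the same order.
$patterns = array(
  '/(word1.*word2.$)/i', // word2, then exactly one more character, then end
  '/(word1.*word2$)/i',  // word2 at the end of the scan value
  '/(word1.*word2)/i',   // word2 anywhere after word1
);

// Hypothetical helper: returns one true/false per pattern, in order.
function tryPatterns($patterns, $scan)
{
  $results = array();
  foreach ($patterns as $pattern)
  {
    $results[] = (preg_match($pattern, $scan) === 1);
  }
  return $results;
}

$endsWithWord2 = tryPatterns($patterns, "word1 blue word2");
$oneCharAfter  = tryPatterns($patterns, "word1 blue word2s");
$moreTextAfter = tryPatterns($patterns, "word1 blue word2 extra");
```

Only the last pattern matches all three sample values. Note also that "." does not match a line break by default, so if the description contains line breaks between the two words, adding the "s" flag (for example /(word1.*word2)/is) may be needed.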

Hope this helps!

Cheers,
David.
--
PriceTapestry.com