You are here:  » preg replace All if else leave as is - Filter syntax/rules


preg replace All if else leave as is - Filter syntax/rules

Submitted by ChrisNBC on Thu, 2013-11-14 14:43 in

Hi David,

Hope all is going well.

The site I have been working on is now almost finished but I'm having some difficulty 'cleaning up' some of the feeds for use in pt. One of the issues is some merchants have used csv with quote text identifiers but then within a single field they have then placed a number of pieces of data which I need to split in order to use them. I have already successfully applied a number of filters but I have got stuck and wondered if you may be able to advise me how to use the "preg replace All if else leave as is" filter. To give you an example, I need to separate the xxGB from the field below...

Example data:

EE 1M SIMO Unl Mins 1GB Data

Using a commercially available regex tool I generated the expression below:

([0-9]+)(GB) Data

I have tried using the expression in the 'preg expression' box in the filter but when I run the import a page of errors is generated. I wondered if you could explain how I should enter the attached expression and also what I should put in the 'Replace' box to replace the fields contents with the "xxGB" part of field?

Thanks in advance.

Regards

Chris

Submitted by support on Thu, 2013-11-14 15:12

Hi Chris,

I'm not entirely sure which filter you are using - if it was a custom one documented on the forum if you could let me know, or alternatively copy the code from you includes/filter.php for the specific filter in your reply so I can see the exact code in use if necessary; but otherwise the expression to match the n of nGB would be:

/.*([1-9])+GB.*/

..with a Return Index of 1, or to return nGB (including the "GB") then

/.*([1-9]+GB).*/

...again with a Return Index of 1....

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Thu, 2013-11-14 15:52

Hi David,

Thanks for the quick response.

The filter that I'm using is off the forum and is:

{code saved}

I have just tried the regex you suggest above and get and error message:

Warning: preg_match() [function.preg-match]: Delimiter must not be alphanumeric or backslash in /[link removed]]/pt/includes/filter.php on line 479

I have not used the filter above for anything else so I guess it could be the filter or the regex above.

Would be grateful if you could let me know what might be going wrong.

Am I correct in thinking that the regex used by PT is the Perl varient?

Thanks in advance.

Regards
Chris

Submitted by support on Thu, 2013-11-14 16:10

Hi Chris,

Thanks - found the code, I think that covers a different scenario however - if a field contains any string that matches the regular expression then replace with the replacement, regardless of the regular expression matched.

But in this scenario, just to confirm, given an input of:

EE 1M SIMO Unl Mins 1GB Data

...should return

1

...however

EE 1M SIMO Unl Mins Data

...should return

EE 1M SIMO Unl Mins Data

...because there was no matching nGB in the string?

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Thu, 2013-11-14 16:37

Hi David,

The scenario you describes matches the requirement perfectly!

In summary,

I need to check the full field for "xxGB Data" and if found the "xxGB" should be used to replace the contents of the field. If "xxGB" is not found the field should be left as it is. Also, I can't rely on tests just of "GB" or "Data" as I have noticed the merchant also placing other information in the field such as "WiFi Data".

Could you suggest any filters that I could use to extract the data I need?

Thanks in advance.

Regards

Chris

Submitted by support on Thu, 2013-11-14 17:02

Hi Chris,

Got it - here's a filter to apply a regular expression to the input text, and if matched, return the n'th matched bracketed sub-expression, else return the original text unmodified. Here's the code to add to your includes/filter.php

  /*************************************************/
  /* RegExp Return or Ingore */
  /*************************************************/
  $filter_names["regExpReturnOrIgnore"] = "RegExp Return or Ingore";
  function filter_regExpReturnOrIgnoreConfigure($filter_data)
  {
    print "Regular Expression:<br />";
    print "<input type='text' name='match' value='".widget_safe($filter_data["match"])."' />";
    widget_errorGet("match");
    print "<br /><br />";
    print "Return Index: (zero based)<br />";
    print "<input type='text' size='3' name='index' value='".widget_safe($filter_data["index"])."' />";
    widget_errorGet("index");
  }
  function filter_regExpReturnOrIgnoreValidate($filter_data)
  {
    if (!$filter_data["match"])
    {
      widget_errorSet("match","required field");
    }
    if (!$filter_data["index"])
    {
      widget_errorSet("index","required field");
    }
  }
  function filter_regExpReturnOrIgnoreExec($filter_data,$text)
  {
    if (preg_match($filter_data["match"],$text,$matches))
    {
      return $matches[$filter_data["index"]];
    }
    else
    {
      return $text;
    }
  }

With that in place, for this scenario add a new "RegExp Return or Ingore" filter to your field using the following Regular Expression:

/.*([0-9]+GB).*/

...and a Return Index of 1.

For the input value "EE 1M SIMO Unl Mins 1GB Data" this will return "1GB", or if not matched, return the input value unmodified.

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Fri, 2013-11-15 00:43

Hi David,

Many thanks for the new filter above which works perfectly!...

I wondered if you could confirm where the filters are applied...am I correct in thinking they are applied before the data is 'transferred' into the products table?

Thanks in advance.
Regards
Chris

Submitted by support on Fri, 2013-11-15 10:46

Hello Chris,

That's correct - in includes/admin.php in the import record handler function - look for the comment "apply user filters" around line 218. Between user filters and the product actually being imported the product and category mappings are applied, and the price field "decimalised"...

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Fri, 2013-11-15 12:08

Hi David,

Thanks for the update and for your help above. The reason I asked is that I have some merchant feeds where a decimal value is contained within a text field along with other product data. I wondered if you might be able to suggest the best way to extract decimal data from a text field so that it can be imported into a decimal filed within the PT product table. I have a similar issue with other fields where a number of pieces of (text) information are contained within a single feed field, so I just import it into a number of PT fields within the product table and then trim it with various filters. The problem I have with the decimal feed data is how best to extract it to place just the decimal value into the correct field. I had thought I might be able to use the regex filter you created for me above but I don't think it will work as I would need to have a text formatted field in the feed table and decimal formatted field in the products table and I don't think PT allows this? I wondered if there are any filters which would allow me to load the full contents of the field and then run a filter on that field which would extract the 24 (using a regex as in the filter above) but then allow me to specify into which field the return value should be placed?

An example of the data is:

"24 Month Your Plan 11 100 Minutes"

Thanks in advance.

Regards
Chris

Submitted by support on Fri, 2013-11-15 13:20

Hi Chris,

Assuming the decimal fields don't have any content originally then it should work using a sequence of "Text Before" using the %fieldname% placeholder specifying the field from which to copy the content from; and then the "RegExp Return or Ignore" from this thread.

So if you have a `minutes` field, and the text in your example is in a field in the feed called DESCRIPTION (note the %placeholders% refer to the exact feed field names, not the internal products table field names) then you would start with Text Before:

%DESCRIPTION%

Followed by RegExp Return or Ignore with the Regular Expression:

/.*(.*) Minutes.*/

...and Return Index of 1

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Mon, 2013-11-18 11:12

Hi David,

Hope you had a good weekend. Thanks for the solution above.

Just one question...please could you confirm where I need to place the return index. In some of the filters I can see there is an input box but for the 'text before' filter there is not.

Thanks in advance.

Regards
Chris

Submitted by support on Mon, 2013-11-18 12:31

Hi Chris,

The suggestion (if suitable) is to apply a sequence of filters, the first being a "Text Before" filter, and in the text box on the configuration page for this filter you would enter %DESCRIPTION% to initially populate the field with the full text of the description.

Then, a second filter, "RegExp Return or Ignore" should be added, and this is where there are settings on the configuration page, the expression to apply:

/.*(.*) Minutes.*/

...and Return Index:

1

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Wed, 2013-11-20 15:42

Hi David,

Thanks for the clarification. The import is now working perfectly!

Regards
Chris

Submitted by ChrisNBC on Wed, 2013-11-20 17:38

Hi David,

I wonder if you could possibly help me out with the regex for picking out the cashback value from a field such as the one below:

£300 Auto Cashback

I have spent the last two hours hacking around with different expressions but without success. A few of the variations I have tried are:

/.*([0-9]+ Auto Cashback).*/
and
/.*(([0-9]+) [a-z]{2}.[a-z].[a-z]{8}).*/

Thanks in advance.
Regards
Chris

Submitted by support on Wed, 2013-11-20 17:55

Hi Chris,

Brackets must always completely enclose any specific return value you want - to match the first sequence of numbers only (0-9) in a string, try an expression of:

/([0-9]+)/

..and a return index of 1 - should do the trick.

To make it precede only the text "Auto Cashback" then use:

/([0-9]+) Auto Cashback/

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Wed, 2013-11-20 18:27

Thanks David....the above did the trick!

Regards
Chris

Submitted by ChrisNBC on Mon, 2014-01-06 18:16

Hi David,

Hope all is going well.

I'm using the filter above (RegExp Return or Ingore) to extract the xxxMB value from a field. The filter below is already in place an working

/.*([1-9]+GB).*/
with return index of 1

But when I modify it to:

/.*([1-9]+MB).*/
with return index of 1

Nothing is found. I have double checked everything including the data which contains many instances which should be found, such as:

Orange 24 Month 500 Minutes 250MB Data

I have tried variations of the regexp but without success. I wondered if you might have any suggestions why the filter does not work?

Thanks in advance.

Regards
Chris

Submitted by support on Mon, 2014-01-06 18:42

Hi Chris,

In the GB range a regexp restricted to 1-9 is probably OK, but in the MB range zero should also be permitted - have a go with:

/.*([0-9]+MB).*/

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Tue, 2014-01-07 10:25

Hi David,

Thanks as always for your quick response.

I changed the regexp to the one you suggested and reloaded the data. The fields containing a MB value appear are updated with 0MB. Mosst of the values in the feed are 100's so for example 250MB, 500MB etc. Please could you confirm if the regexp above just takes the last digit+MB? I'm wondering if I need to add something to the regexp to handle the larger values?

Thanks in advance.
Regards
Chris

Submitted by support on Tue, 2014-01-07 10:36

Sorry Chris, the "U"ngreedy modifier is required - have a go with:

/.*([0-9]+MB).*/U

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Tue, 2014-01-07 10:49

Thanks David, that works perfectly!

Regards
Chris

Submitted by ChrisNBC on Tue, 2014-01-07 16:48

Hi David,

Sorry to bother you again...

I wondered if you could tell me if there are any existing filters in PT like the /* RegExp Return or Ingore */ filter above but which would delete the contents of the field if no match could be made?

The reason I ask is that I have a feed containing a field where I want to extract the GBP value from something like 'Save £225 cashback by redemption'. I am using the regexp:

/.*([0-9]+ cashback by redemption).*/U

This seems to work but since the dB field the feed loads into is formatted as a decimal the non matched data held in other records e.g the 360 from 'X Box 360 is also loaded'.

Thanks in advance.
Regards
Chris

Submitted by support on Wed, 2014-01-08 09:26

Hi Chris,

In the current filter based on the above code, where you have:

    if (preg_match($filter_data["match"],$text,$matches))
    {
      return $matches[$filter_data["index"]];
    }
    else
    {
      return $text;
    }

...simply REPLACE with:

    if (preg_match($filter_data["match"],$text,$matches))
    {
      return $matches[$filter_data["index"]];
    }
    else
    {
      return "";
    }

And that would have the effect of either matching the expression and returning the return index, or returning nothing which will clear out the field. If you need as a completely separate filter - here's the same code with the above modification with a different filter name "RegExp Return or Delete":

  /*************************************************/
  /* RegExp Return or Delete */
  /*************************************************/
  $filter_names["regExpReturnOrReturn"] = "RegExp Return or Delete";
  function filter_regExpReturnOrReturnConfigure($filter_data)
  {
    print "Regular Expression:<br />";
    print "<input type='text' name='match' value='".widget_safe($filter_data["match"])."' />";
    widget_errorGet("match");
    print "<br /><br />";
    print "Return Index: (zero based)<br />";
    print "<input type='text' size='3' name='index' value='".widget_safe($filter_data["index"])."' />";
    widget_errorGet("index");
  }
  function filter_regExpReturnOrReturnValidate($filter_data)
  {
    if (!$filter_data["match"])
    {
      widget_errorSet("match","required field");
    }
    if (!$filter_data["index"])
    {
      widget_errorSet("index","required field");
    }
  }
  function filter_regExpReturnOrReturnExec($filter_data,$text)
  {
    if (preg_match($filter_data["match"],$text,$matches))
    {
      return $matches[$filter_data["index"]];
    }
    else
    {
      return "";
    }
  }

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Wed, 2014-01-08 15:49

Hi David,

Thanks for the attached which I have added to the filters. I'm not sure if there is a problem with the filter or my regexp (most likely my regexp!) but I'm having a few issues getting rid of the numeric values that I don't need. In summary, I'm loading one of the merchant feed fields into two new PT fields which I have added. One field flags the type of cash back, either 'automatic' or 'redemption' and the other should hold the GBP value of the cash back.

I previously thought the filter I had added to the numeric field (formatted as Decimal 10,2 in the dB) was extracting the numeric values. However, after some experimentation, I have realised that SQL must be recognising the numeric value in the feed and is loading it.

I created the regexp below to use with the filter you kindly wrote for me above but all that happens is all of the numeric values are deleted.

/.*([0-9])+ Automatic Cashback.*/U

I wondered if you might have any suggestions how to resolve this issue or if you could confirm if I'm using the correct regexp?

Thanks in advance.
Regards
Chris

Submitted by support on Thu, 2014-01-09 11:41

Hi Chris,

What I would do is first make sure that your cashbackamount field is DECIMAL(10,2) and your cashbacktype field is VARCHAR(255). If that's not already the case, you can change the field types with the following dbmod.php script:

<?php
  set_time_limit
(0);
  require(
"includes/common.php");
  
$sql "ALTER TABLE `".$config_databaseTablePrefix."products`
            CHANGE `cashbackamount` DECIMAL(10,2),
            CHANGE `cashbacktype` VARCHAR(255)
            "
;
  
database_queryModify($sql,$result);
  print 
"Done.";
?>

(replacing cashbackamount and cashbacktype with your actual field names of course)

Then for the cashback amount field, create a new Decimalise filter with the following code:

  /*************************************************/
  /* Decimalise */
  /*************************************************/
  $filter_names["decimalise"] = "Decimalise";
  function filter_decimaliseConfigure($filter_data)
  {
    print "<p>There are no additional configuration parameters for this filter.</p>";
  }
  function filter_decimaliseValidate($filter_data)
  {
  }
  function filter_decimaliseExec($filter_data,$text)
  {
    return tapestry_decimalise($text);
  }

...and apply to the cashbackamount field. This will run the value through the tapestry_decimalise() function which strips all non numeric characters and should return the value (or 0.00).

Then for cashbacktype, if your data is suitably consistent, perhaps the Scan and Set filter from this thread will do the trick using a scan list of

Automatic,Redemption

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Thu, 2014-01-09 12:53

Hi David,

Thanks for your quick response. I double checked both the fields and they are set correctly. Apologies, I think maybe my message above was not clear....I already have a solution in place that sets the cashbacktype to either 'Automatic' or 'Redemption'. The cashbackamount field is also populated and appears to be decimalised. The issue I have is that in the feed the field containing the value sometimes contain cashback information but can also contain other information such as a free gift. In some instances the PT cashbackamount field is populated correctly with the cashback GBP value (decimalised) and in some instances it's populated with other values contained within the feed field, for example when the feed field contains something like "Marks & Spencer Gift Card £50" PT populates the cashbackamount field with 50.00, similarly for a field containing "Xbox 360 4GB" the PT field is populated with 3604.00. When there is cashback in the field it's always in the format "Save £336 cashback by redemption" or "£220 Automatic Cashback". What I need to do is populate the cashbackamount field when there is a 'real' cashback amount and delete the content when it's not. I had hoped my regexp above together with the filter (RegExp Return or Delete) would allow me to only populate the cashbackamount field with a cashback amount. Would be grateful for any advice you could give me on how I might make this work.

Thanks in advance.
Regards

Chris

Submitted by support on Thu, 2014-01-09 16:22

Hi Chris,

Have a go with something like this, an Extract Cashback filter by adding the following code to includes/filter.php:

  /*************************************************/
  /* Extract Cashback */
  /*************************************************/
  $filter_names["extractCashback"] = "Extract Cashback";
  function filter_extractCashbackConfigure($filter_data)
  {
    print "<p>There are no additional configuration parameters for this filter.</p>";
  }
  function filter_extractCashbackValidate($filter_data)
  {
  }
  function filter_extractCashbackExec($filter_data,$text)
  {
    $cashbackBeforeRegexp = "(cashback|save)";
    $cashbackAfterRegexp = "(automatic|cashback)";
    if (preg_match('/'.$cashbackBeforeRegexp.' ([\.0-9]*)/i',$text,$matches))
    {
      return tapestry_decimalise($matches[2]);
    }
    if (preg_match('/([\.0-9]*) '.$cashbackAfterRegexp.'/i',$text,$matches))
    {
      return tapestry_decimalise($matches[1]);
    }
    return "0.00";
  }

Edit the 2 regular sub-expressions $cashbackBeforeRegexp and $cashbackAfterRegexp as required, containing a list of strings to match either before or after the cashback amount...

Hope this helps!
Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Fri, 2014-01-10 12:07

Hi David,

Thanks for the new filter above. I've added it to the installation and set the strings to:

$cashbackBeforeRegexp = "(Automatic Cashback|cashback by redemption)";
$cashbackAfterRegexp = "(automatic|cashback)";

However, when I import records the cashback value field is just set to '0.00'. Just to test the filter, I also applied it to another PT field (which is formatted as text) which I set to be populated with the same feed data and I get the same result.

I wondered if you might be able to suggest what the problem might be?

Thanks in advance.
Regards
Chris

Submitted by support on Fri, 2014-01-10 13:40

Hi Chris,

There was an error in this line:

  if (preg_match('/([\.0-9]*) '.$cashbackAfterRegexp.'/i',$desc,$matches))

...which should be:

  if (preg_match('/([\.0-9]*) '.$cashbackAfterRegexp.'/i',$text,$matches))

...corrected above. However if that's not it, it occurred to me that there may be currency characters getting in the way of the regexp, e.g. the following would not match:

Cashback £100.00

If that could be the case, delete the Extract Cashback filter and sequence in before it any Search and Replace filters necessary to remove the currency, for example:

£

&pound;

(replacing with nothing of course)

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by ChrisNBC on Fri, 2014-01-10 15:01

Hi David,

Thanks for that, it's done the trick! and now works beautifully.

Have a great weekend.

Regards
Chris

Submitted by richard on Wed, 2016-06-29 17:53

Hi David,

I am looking to extract the height, width and depth dimensions and populate three columns with the data.

The data can be with or without decimal places .e.g.

H90cm x W40cm x D60cm or
H89 x W59.5 x D52.2cm

Using the regex /.*(W+[\.0-9]{2}).*/U for width I only extract W59 as oppsoed to W59.5. I did try /.*(W+[\.0-9]{4}).*/U but that was not the right solution for integers.

I have used the above filter

  /*************************************************/
  /* RegExp Return or Ingore */
  /*************************************************/
  $filter_names["regExpReturnOrIgnore"] = "RegExp Return or Ingore";
  function filter_regExpReturnOrIgnoreConfigure($filter_data)
  {
    print "Regular Expression:<br />";
    print "<input type='text' name='match' value='".widget_safe($filter_data["match"])."' />";
    widget_errorGet("match");
    print "<br /><br />";
    print "Return Index: (zero based)<br />";
    print "<input type='text' size='3' name='index' value='".widget_safe($filter_data["index"])."' />";
    widget_errorGet("index");
  }
  function filter_regExpReturnOrIgnoreValidate($filter_data)
  {
    if (!$filter_data["match"])
    {
      widget_errorSet("match","required field");
    }
    if (!$filter_data["index"])
    {
      widget_errorSet("index","required field");
    }
  }
  function filter_regExpReturnOrIgnoreExec($filter_data,$text)
  {
    if (preg_match($filter_data["match"],$text,$matches))
    {
      return $matches[$filter_data["index"]];
    }
    else
    {
      return "";
    }
  }

How can I allow for both scenarios?

Best regards,

Richard

Submitted by support on Thu, 2016-06-30 16:44

Hello Richard,

Unusually, its the "U"ngreedy flag getting in the way here, which prevents matching of decimalised or non decimalised values. For extraction of Wnn or "Wnn.nn" have a go with the regex:

/.*(W[0-9]+[\.0-9]*).*/

(and similarly for H / D as required)

Cheers,
David.
--
PriceTapestry.com

Submitted by richard on Thu, 2016-06-30 17:00

Hi David,

Many thanks. I tried working through http://regexr.com/ but I would never have solved it!

Best regards,

Richard