You are here:  » Drop if Duplicate


Drop if Duplicate

Submitted by jmrwv on Tue, 2006-10-03 02:56 in

Hello,

I'm trying to think of a way to create a filter to drop records that have duplicate fields, like if the description field is identical for multiple products, then only import one of those products. I feel like it's probably hurting more than helping in the search engines to have a whole group of products with the same description, as if often the case with CJ merchant's feeds.

I could do this within the scripts (after importing) by doing a SELECT DISTINCT(description) type of thing, but it seems inefficient to import so much data that will never be used. So I'm wondering if it sounds reasonable to do a "Drop if Duplicate" filter on a selected field. Any thoughts on this? I'm more of a code hacker than a code writer, but if you point me in the right direction I might be able to work something out.

Best Regards,
Jim

Submitted by support on Tue, 2006-10-03 07:16

Hi Jim,

You can add the description field to the dupe filter; although this will only detect duplicates in conjuction with the product name. Look for the following code starting at line 240 of includes/admin.php:

    /* create dupe_hash value */
    $dupe_key = $admin_importFeed["merchant"];
    // uncomment any additional fields that you wish to filter duplicates on (description not recommended)
    $dupe_key .= $record[$admin_importFeed["field_name"]];
    // $dupe_key .= $record[$admin_importFeed["field_description"]];
    // $dupe_key .= $record[$admin_importFeed["field_image_url"]];
    // $dupe_key .= $record[$admin_importFeed["field_buy_url"]];
    // $dupe_key .= $record[$admin_importFeed["field_price"]];
    $dupe_hash = md5($dupe_key);

To add the description field, uncomment the following line from that block:

    // $dupe_key .= $record[$admin_importFeed["field_description"]];

If you are confident that your merchants don't have duplicate products , the easiest thing to do would be to make the changes described above but then to comment out the product name field from the dupe filter, leaving you with this:

    /* create dupe_hash value */
    $dupe_key = $admin_importFeed["merchant"];
    // uncomment any additional fields that you wish to filter duplicates on (description not recommended)
    // $dupe_key .= $record[$admin_importFeed["field_name"]];
    $dupe_key .= $record[$admin_importFeed["field_description"]];
    // $dupe_key .= $record[$admin_importFeed["field_image_url"]];
    // $dupe_key .= $record[$admin_importFeed["field_buy_url"]];
    // $dupe_key .= $record[$admin_importFeed["field_price"]];
    $dupe_hash = md5($dupe_key);

If this is not sufficient, you need to establish a unique index against the description field - that way, MySQL will silently drop any duplicate records (the INSERT will fail). Unfortunately, you cannot create a UNIQUE index against a field of the TEXT data type; so you would have to establish a second dupe hash field and give it a value that is the MD5 hash of the description.

To do this, you will need to re-install Price Tapestry (or at least De-Register all your feeds) so that you have an empty database. The first SQL statement needed to add a the new field is:

ALTER TABLE `products` ADD `desc_hash` VARCHAR( 32 ) NOT NULL

Secondly, to make that field UNIQUE:

ALTER TABLE `products` ADD UNIQUE (`desc_hash`)

The easiest way to do this is to make the changes through phpMyAdmin or other MySQL admin script; however you can write a script to do this:

dbupdate.php:

<?php
  set_time_limit
(0);
  
ignore_user_abort();
  require(
"includes/common.php");
  require(
"includes/header.php");
  
$sql "ALTER TABLE `products` ADD `desc_hash` VARCHAR( 32 ) NOT NULL";
  
database_queryModify($sql,$rows);
  
$sql "ALTER TABLE `products` ADD UNIQUE (`desc_hash`)";
  
database_queryModify($sql,$rows);
  print 
"<p>Done.</p>";
  require(
"includes/footer.php");
?>

Finally, you need to modify includes/admin.php to setup and insert the value for the new desc_hash field.

Just below that block of code described above; add the following line (so it will be inserted into line 249 of the distribution):

  $desc_hash = md5($record[$admin_importFeed["field_description"]]);

Then finally, modify the line that constructs the SQL as follows, adding the new desc_hash field:

    $sql = sprintf("INSERT INTO `".$config_databaseTablePrefix."products` SET
                    merchant='%s',
                    name='%s',
                    description='%s',
                    image_url='%s',
                    buy_url='%s',
                    price='%s',
                    search_name='%s',
                    category='%s',
                    brand='%s',
                    dupe_hash='%s',
                    desc_hash='%s'
                    ",
                    database_safe($admin_importFeed["merchant"]),
                    database_safe($record[$admin_importFeed["field_name"]]),
                    database_safe(isset($record[$admin_importFeed["field_description"]])?$record[$admin_importFeed["field_description"]]:""),
                    database_safe(isset($record[$admin_importFeed["field_image_url"]])?$record[$admin_importFeed["field_image_url"]]:""),
                    database_safe($record[$admin_importFeed["field_buy_url"]]),
                    database_safe($record[$admin_importFeed["field_price"]]),
                    database_safe($searchName),
                    database_safe($category),
                    database_safe($brand),
                    $dupe_hash,
                    $desc_hash
                    );

Hope this helps,
Cheers,
David.

Submitted by jmrwv on Tue, 2006-10-03 08:04

although this will only detect duplicates in conjuction with the product name

Just to clarify, are you saying it would only detect duplicate descriptions if it had already detected a duplicate product name? Both would have to be satisfied in order to drop the record?

If I comment out the product name field from the dupe filter, what happens in the event there is a duplicate product name?

Thanks,
Jim

Submitted by support on Tue, 2006-10-03 08:32

> Just to clarify, are you saying it would only detect duplicate descriptions if it had already detected a duplicate product name? Both would have to be satisfied in order to drop the record?

Correct. If you have products of different names but with the same descriptions then BOTH would be imported and you would still be left with duplicate descriptions.

> If I comment out the product name field from the dupe filter, what happens in the event there is a duplicate product name?

The results would be somewhat unpredictable i'm afraid - so you could only do this if you were sure that you have high quality feeds that do not have duplicate product names. The effects would include things like showing the same merchant twice in the prices list for example.

I would not really recommend the first method, it's not a perfect solution to what you want to do. The database mods however will guarantee that there can be no duplicate descriptions in your database.

Cheers,
David.

Submitted by jmrwv on Tue, 2006-10-03 19:48

I did the database mods and that seemed to go fine, but I'm not sure this is working right. With one of my larger feeds, the exact same number of records is being imported, as expected, because the description fields are all unique. So that's ok. But in another feed where there are numerous repeated description fields, about half the number of records are being imported. That would be ok, but as soon as I look at the resulting import I'm seeing duplicate description fields. So lots of stuff is getting taken out, but I don't know what. Any suggestions?

Regards,
Jim

Submitted by support on Tue, 2006-10-03 19:56

Hi Jim,

That sounds strange - i'll have to run through the mods on my test server with your feed to figure this one out. Can you let me know the URL of the feed that is causing the duplicates (e.g. http://www.yoursite.com/feeds/merchant.xml) - if you reply to your registration code email or forum signup email it will get to me - and i'll then download the feed and take a look for you...

If you could also let me know how you've registered the feed (field mapping) - and an example of two products that are being imported with the same description that would help....

Cheers,
David.

Submitted by Milano on Fri, 2014-11-07 14:41

Hi,

I have another question... how to disable this duplicate filtering in newest PT :) I want to import all items (also with the same name) from csv without this duplicate filter.
After this, if items will heve the same name, to url structure will be added some number?

Submitted by support on Fri, 2014-11-07 14:46

Hi Milano,

In fact, the best way to permit duplicate results per (merchant + product) combination is to drop the dupe_filter index - the following dbmod.php script will do the trick:

<?php
  set_time_limit
(0);
  require(
"includes/common.php");
  
$sql "DROP INDEX dupe_filter ON `".$config_databaseTablePrefix."products`";
  
database_queryModify($sql,$result);
  print 
"Done.";
?>

Restoring the index requires a TRUNCATE of the products table, since creating a unique index will fail if any of the values to be indexed are not unique. The following dbmod.php script will restore the dupe_filter index:

<?php
  set_time_limit
(0);
  require(
"includes/common.php");
  
$sql "TRUNCATE `".$config_databaseTablePrefix."products`";
  
database_queryModify($sql,$result);
  
$sql "CREATE UNIQUE INDEX dupe_filter ON `".$config_databaseTablePrefix."products` (dupe_hash)";
  
database_queryModify($sql,$result);
  print 
"Done.";
?>

Cheers,
David.
--
PriceTapestry.com

Submitted by Milano on Mon, 2014-11-10 13:53

Hi David,

I use this first script and then re-register product feed but import only ~20% products from csv. And still only one product with duplicate name was showing. I don't know what I did but since then PT stop importing any products. In panic, I decided to delete database/files and install clean PT... but nothing change - PT register new feed but can't import any product :(

What a day. Do you know what I did wrong or how to fix it?

Submitted by support on Mon, 2014-11-10 13:59

Hi Milano,

Did you add any custom fields that were still configured in config.advanced.php ($config_fieldSet) but not set-up on the database after re-installation?

The following dbmod.php script will scan all fields and apply the ALTER TABLE SQL required for all custom fields:

<?php
  
require("includes/common.php");
  
$ignore = array("name","description","image_url","buy_url","price","category","brand");
  foreach(
$config_fieldSet as $field => $v)
  {
    if (
in_array($field,$ignore)) continue;
    
$sql "ALTER TABLE `".$config_databaseTablePrefix."feeds`
              ADD `field_"
.$field."` VARCHAR(255) NOT NULL";
    
database_queryModify($sql,$result);
    
$sql "ALTER TABLE `".$config_databaseTablePrefix."products`
              ADD `"
.$field."` VARCHAR(255) NOT NULL";
    
database_queryModify($sql,$result);
  }
  print 
"Done.";
?>

Another thing to try if you're not sure what's gone wrong is database debug mode which you can enable by changing line 6 in config.advanced.php as follows;

  $config_databaseDebugMode = FALSE;

...then attempt to register / import a feed. If an extended MySQL error message is displayed but you're not sure from the text where the problem lies let me know what is displayed and I'll check it out...

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by Milano on Wed, 2014-11-12 12:33

Hi David,

my fault... all products didn't was importing correctly because some products had empty field 'buy_url'. After deleting require filds (except 'name') now all products are importing ok:) But still don't display all products with the same name - only last product is displaying.

Yes, I added few custom fields. I set up it on config.advanced.php and in database by script:

<?php
  
require("includes/common.php");
  
$sql "ALTER TABLE `".$config_databaseTablePrefix."feeds`
            ADD `field_keywords` VARCHAR(255) NOT NULL"
;
  
database_queryModify($sql,$result);
  
$sql "ALTER TABLE `".$config_databaseTablePrefix."products`
            ADD `keywords` VARCHAR(255) NOT NULL"
;
  
database_queryModify($sql,$result);
  print 
"Done.";
?>

Any another way to completely remove this duplicate filter?
Thanks!

Submitted by support on Wed, 2014-11-12 14:18

Hello Milano,

An alternative way to generate the dupe hash value that will permit all products would be to key it off the buy_url field instead of merchant. To try this, if you edit includes/admin.php and look for the following code at line 360:

  $dupe_key = $importRecord["merchant"];

...and REPLACE with:

  $dupe_key = $importRecord["buy_url"];

...and then re-import, and that should also yeild all products (with unique Buy URL of course) as an alternative to dropping the dupe_filter unique index...

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by Milano on Wed, 2014-11-12 14:46

Well... the problem is that I don't have unique buy_url or any other fields. I use PT not as price comparison script but as directory script :)
One thing... in datebase I see column 'dupe_hash' with unique key.

Submitted by support on Wed, 2014-11-12 15:50

Hi Milano,

Ah OK, could you perhaps try rolling back to the previous method of dropping the unique key on dupe_hash would then permit duplicate product names to be imported - it's this dbmod script from above...

<?php
  
require("includes/common.php");
  
$sql "DROP INDEX dupe_filter ON `".$config_databaseTablePrefix."products`";
  
database_queryModify($sql,$result);
  print 
"Done.";
?>

Don't forget to re-import all feeds afterwards...

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by Milano on Thu, 2014-11-13 10:55

Well... all products (also with the same name) correctly imported to database. But in search results still display only one product - rest of duplicate name product just don't display it. I also try to change includes/admin.php in section /* create dupe_hash value */ but no effect :(

Submitted by support on Thu, 2014-11-13 11:15

Hello Milano,

search.php uses a GROUP BY search_name in the SELECT SQL as part of the comparison functionality, so if you actually want multiple results of the same name to show up in search results you can just remove this. In that file, simply perform a Search and Replace using your text editor of

GROUP BY search_name

...and replace with nothing.

That should get you close - and if necessary, you'll be able to have unique product pages by combining merchant and product name in the URL, let me know if that was what you wanted to do...

Cheers,
David.
--
PriceTapestry.com

Submitted by Milano on Thu, 2014-11-13 13:01

David, after removing GROUP BY search_name now in every categories/merchants/brands show me only one (last) product :(

Submitted by support on Thu, 2014-11-13 13:26

Hello Milano,

An additional replacement of the summary clauses (before the WHERE in the SQL) will also be required - I've just tried this on my test server for you)... using your text editor's Search and Replace feature make the following replacements in search.php

1. Search:

COUNT(id) AS numMerchants

Replace:

1 AS numMerchants

2. Search:

MIN(price) as minPrice

Replace:

price as minPrice

That should give you all results de-grouped as single results per product. At this point, if you're not intending to have product pages and just want to link directly to the merchant, simply look for the following code around line 393:

$searchresults["products"][$k]["productHREF"] = tapestry_productHREF($searchresults["products"][$k]);

...and REPLACE with:

$searchresults["products"][$k]["productHREF"] = tapestry_buyURL($searchresults["products"][$k]);

Hope this gets you close to what you're looking for!

Cheers,
David.
--
PriceTapestry.com

Submitted by Milano on Mon, 2014-11-17 12:49

Hi,

you'll be able to have unique product pages by combining merchant and product name in the URL

I have added product id to product url structure (product-id.html). Everything is ok but I've noticed that in sitemap.php don't display this unique id. How to add product id also in sitemap?

Submitted by support on Mon, 2014-11-17 13:13

Hello Milano,

The `id` field isn't included in the field list used by the SELECT query in sitemap.php so won't be available for tapestry_productHREF() to use. To add the field, look for the following code at line 16:

$sql = "SELECT normalised_name FROM `".$config_databaseTablePrefix."products` WHERE filename='".database_safe($_GET["filename"])."' LIMIT ".$start.",".$limit;

...and REPLACE with:

$sql = "SELECT id,normalised_name FROM `".$config_databaseTablePrefix."products` WHERE filename='".database_safe($_GET["filename"])."' LIMIT ".$start.",".$limit;

Cheers,
David.
--
PriceTapestry.com

Submitted by tobyhage on Tue, 2017-06-13 18:53

David,

I want to try to prevent duplicate products with the same image url. In includes/admin.php i have added / changed following code.

    /* create dupe_hash value */
    $dupe_key = $importRecord["merchant"];
    $dupe_key .= tapestry_mb_strtolower($searchName);
    $dupe_key .= $importRecord["image_url"];
    $dupe_hash = md5($dupe_key);

But it is not working, what is wrong?

Submitted by support on Wed, 2017-06-14 07:37

Hi Toby,

With different product names adding the image_url value to the dupe_key will still result in a different hash - what you would need to do is put a UNIQUE key on the image_url field. The following dbmod.php script will do the trick - run once from the top level of your Price Tapestry installation and then DELETE the file since it is necessary to TRUNCATE (empty) the products table before adding a unique key as the process would fail if there are duplicate values in the database...

<?php
  set_time_limit
(0);
  require(
"includes/common.php");
  
$sql "TRUNCATE `".$config_databaseTablePrefix."products`";
  
database_queryModify($sql,$result);
  
$sql "CREATE UNIQUE INDEX image_url ON `".$config_databaseTablePrefix."products` (image_url)";
  
database_queryModify($sql,$result);
  print 
"Done.";
?>

...and then re-import all feeds to restore.

Cheers,
David.
--
PriceTapestry.com

Submitted by tobyhage on Wed, 2017-06-14 08:19

Thank you!