Hello,
I'm trying to think of a way to create a filter to drop records that have duplicate fields, like if the description field is identical for multiple products, then only import one of those products. I feel like it's probably hurting more than helping in the search engines to have a whole group of products with the same description, as if often the case with CJ merchant's feeds.
I could do this within the scripts (after importing) by doing a SELECT DISTINCT(description) type of thing, but it seems inefficient to import so much data that will never be used. So I'm wondering if it sounds reasonable to do a "Drop if Duplicate" filter on a selected field. Any thoughts on this? I'm more of a code hacker than a code writer, but if you point me in the right direction I might be able to work something out.
Best Regards,
Jim
although this will only detect duplicates in conjuction with the product name
Just to clarify, are you saying it would only detect duplicate descriptions if it had already detected a duplicate product name? Both would have to be satisfied in order to drop the record?
If I comment out the product name field from the dupe filter, what happens in the event there is a duplicate product name?
Thanks,
Jim
> Just to clarify, are you saying it would only detect duplicate descriptions if it had already detected a duplicate product name? Both would have to be satisfied in order to drop the record?
Correct. If you have products of different names but with the same descriptions then BOTH would be imported and you would still be left with duplicate descriptions.
> If I comment out the product name field from the dupe filter, what happens in the event there is a duplicate product name?
The results would be somewhat unpredictable i'm afraid - so you could only do this if you were sure that you have high quality feeds that do not have duplicate product names. The effects would include things like showing the same merchant twice in the prices list for example.
I would not really recommend the first method, it's not a perfect solution to what you want to do. The database mods however will guarantee that there can be no duplicate descriptions in your database.
Cheers,
David.
I did the database mods and that seemed to go fine, but I'm not sure this is working right. With one of my larger feeds, the exact same number of records is being imported, as expected, because the description fields are all unique. So that's ok. But in another feed where there are numerous repeated description fields, about half the number of records are being imported. That would be ok, but as soon as I look at the resulting import I'm seeing duplicate description fields. So lots of stuff is getting taken out, but I don't know what. Any suggestions?
Regards,
Jim
Hi Jim,
That sounds strange - i'll have to run through the mods on my test server with your feed to figure this one out. Can you let me know the URL of the feed that is causing the duplicates (e.g. http://www.yoursite.com/feeds/merchant.xml) - if you reply to your registration code email or forum signup email it will get to me - and i'll then download the feed and take a look for you...
If you could also let me know how you've registered the feed (field mapping) - and an example of two products that are being imported with the same description that would help....
Cheers,
David.
Hi,
I have another question... how to disable this duplicate filtering in newest PT :) I want to import all items (also with the same name) from csv without this duplicate filter.
After this, if items will heve the same name, to url structure will be added some number?
Hi Milano,
In fact, the best way to permit duplicate results per (merchant + product) combination is to drop the dupe_filter index - the following dbmod.php script will do the trick:
<?php
set_time_limit(0);
require("includes/common.php");
$sql = "DROP INDEX dupe_filter ON `".$config_databaseTablePrefix."products`";
database_queryModify($sql,$result);
print "Done.";
?>
Restoring the index requires a TRUNCATE of the products table, since creating a unique index will fail if any of the values to be indexed are not unique. The following dbmod.php script will restore the dupe_filter index:
<?php
set_time_limit(0);
require("includes/common.php");
$sql = "TRUNCATE `".$config_databaseTablePrefix."products`";
database_queryModify($sql,$result);
$sql = "CREATE UNIQUE INDEX dupe_filter ON `".$config_databaseTablePrefix."products` (dupe_hash)";
database_queryModify($sql,$result);
print "Done.";
?>
Cheers,
David.
--
PriceTapestry.com
Hi David,
I use this first script and then re-register product feed but import only ~20% products from csv. And still only one product with duplicate name was showing. I don't know what I did but since then PT stop importing any products. In panic, I decided to delete database/files and install clean PT... but nothing change - PT register new feed but can't import any product :(
What a day. Do you know what I did wrong or how to fix it?
Hi Milano,
Did you add any custom fields that were still configured in config.advanced.php ($config_fieldSet) but not set-up on the database after re-installation?
The following dbmod.php script will scan all fields and apply the ALTER TABLE SQL required for all custom fields:
<?php
require("includes/common.php");
$ignore = array("name","description","image_url","buy_url","price","category","brand");
foreach($config_fieldSet as $field => $v)
{
if (in_array($field,$ignore)) continue;
$sql = "ALTER TABLE `".$config_databaseTablePrefix."feeds`
ADD `field_".$field."` VARCHAR(255) NOT NULL";
database_queryModify($sql,$result);
$sql = "ALTER TABLE `".$config_databaseTablePrefix."products`
ADD `".$field."` VARCHAR(255) NOT NULL";
database_queryModify($sql,$result);
}
print "Done.";
?>
Another thing to try if you're not sure what's gone wrong is database debug mode which you can enable by changing line 6 in config.advanced.php as follows;
$config_databaseDebugMode = FALSE;
...then attempt to register / import a feed. If an extended MySQL error message is displayed but you're not sure from the text where the problem lies let me know what is displayed and I'll check it out...
Hope this helps!
Cheers,
David.
--
PriceTapestry.com
Hi David,
my fault... all products didn't was importing correctly because some products had empty field 'buy_url'. After deleting require filds (except 'name') now all products are importing ok:) But still don't display all products with the same name - only last product is displaying.
Yes, I added few custom fields. I set up it on config.advanced.php and in database by script:
<?php
require("includes/common.php");
$sql = "ALTER TABLE `".$config_databaseTablePrefix."feeds`
ADD `field_keywords` VARCHAR(255) NOT NULL";
database_queryModify($sql,$result);
$sql = "ALTER TABLE `".$config_databaseTablePrefix."products`
ADD `keywords` VARCHAR(255) NOT NULL";
database_queryModify($sql,$result);
print "Done.";
?>
Any another way to completely remove this duplicate filter?
Thanks!
Hello Milano,
An alternative way to generate the dupe hash value that will permit all products would be to key it off the buy_url field instead of merchant. To try this, if you edit includes/admin.php and look for the following code at line 360:
$dupe_key = $importRecord["merchant"];
...and REPLACE with:
$dupe_key = $importRecord["buy_url"];
...and then re-import, and that should also yeild all products (with unique Buy URL of course) as an alternative to dropping the dupe_filter unique index...
Hope this helps!
Cheers,
David.
--
PriceTapestry.com
Well... the problem is that I don't have unique buy_url or any other fields. I use PT not as price comparison script but as directory script :)
One thing... in datebase I see column 'dupe_hash' with unique key.
Hi Milano,
Ah OK, could you perhaps try rolling back to the previous method of dropping the unique key on dupe_hash would then permit duplicate product names to be imported - it's this dbmod script from above...
<?php
require("includes/common.php");
$sql = "DROP INDEX dupe_filter ON `".$config_databaseTablePrefix."products`";
database_queryModify($sql,$result);
print "Done.";
?>
Don't forget to re-import all feeds afterwards...
Hope this helps!
Cheers,
David.
--
PriceTapestry.com
Well... all products (also with the same name) correctly imported to database. But in search results still display only one product - rest of duplicate name product just don't display it. I also try to change includes/admin.php in section /* create dupe_hash value */ but no effect :(
Hello Milano,
search.php uses a GROUP BY search_name in the SELECT SQL as part of the comparison functionality, so if you actually want multiple results of the same name to show up in search results you can just remove this. In that file, simply perform a Search and Replace using your text editor of
GROUP BY search_name
...and replace with nothing.
That should get you close - and if necessary, you'll be able to have unique product pages by combining merchant and product name in the URL, let me know if that was what you wanted to do...
Cheers,
David.
--
PriceTapestry.com
David, after removing GROUP BY search_name now in every categories/merchants/brands show me only one (last) product :(
Hello Milano,
An additional replacement of the summary clauses (before the WHERE in the SQL) will also be required - I've just tried this on my test server for you)... using your text editor's Search and Replace feature make the following replacements in search.php
1. Search:
COUNT(id) AS numMerchants
Replace:
1 AS numMerchants
2. Search:
MIN(price) as minPrice
Replace:
price as minPrice
That should give you all results de-grouped as single results per product. At this point, if you're not intending to have product pages and just want to link directly to the merchant, simply look for the following code around line 393:
$searchresults["products"][$k]["productHREF"] = tapestry_productHREF($searchresults["products"][$k]);
...and REPLACE with:
$searchresults["products"][$k]["productHREF"] = tapestry_buyURL($searchresults["products"][$k]);
Hope this gets you close to what you're looking for!
Cheers,
David.
--
PriceTapestry.com
Hi,
you'll be able to have unique product pages by combining merchant and product name in the URL
I have added product id to product url structure (product-id.html). Everything is ok but I've noticed that in sitemap.php don't display this unique id. How to add product id also in sitemap?
Hello Milano,
The `id` field isn't included in the field list used by the SELECT query in sitemap.php so won't be available for tapestry_productHREF() to use. To add the field, look for the following code at line 16:
$sql = "SELECT normalised_name FROM `".$config_databaseTablePrefix."products` WHERE filename='".database_safe($_GET["filename"])."' LIMIT ".$start.",".$limit;
...and REPLACE with:
$sql = "SELECT id,normalised_name FROM `".$config_databaseTablePrefix."products` WHERE filename='".database_safe($_GET["filename"])."' LIMIT ".$start.",".$limit;
Cheers,
David.
--
PriceTapestry.com
David,
I want to try to prevent duplicate products with the same image url. In includes/admin.php i have added / changed following code.
/* create dupe_hash value */
$dupe_key = $importRecord["merchant"];
$dupe_key .= tapestry_mb_strtolower($searchName);
$dupe_key .= $importRecord["image_url"];
$dupe_hash = md5($dupe_key);
But it is not working, what is wrong?
Hi Toby,
With different product names adding the image_url value to the dupe_key will still result in a different hash - what you would need to do is put a UNIQUE key on the image_url field. The following dbmod.php script will do the trick - run once from the top level of your Price Tapestry installation and then DELETE the file since it is necessary to TRUNCATE (empty) the products table before adding a unique key as the process would fail if there are duplicate values in the database...
<?php
set_time_limit(0);
require("includes/common.php");
$sql = "TRUNCATE `".$config_databaseTablePrefix."products`";
database_queryModify($sql,$result);
$sql = "CREATE UNIQUE INDEX image_url ON `".$config_databaseTablePrefix."products` (image_url)";
database_queryModify($sql,$result);
print "Done.";
?>
...and then re-import all feeds to restore.
Cheers,
David.
--
PriceTapestry.com
Hi Jim,
You can add the description field to the dupe filter; although this will only detect duplicates in conjuction with the product name. Look for the following code starting at line 240 of includes/admin.php:
/* create dupe_hash value */
$dupe_key = $admin_importFeed["merchant"];
// uncomment any additional fields that you wish to filter duplicates on (description not recommended)
$dupe_key .= $record[$admin_importFeed["field_name"]];
// $dupe_key .= $record[$admin_importFeed["field_description"]];
// $dupe_key .= $record[$admin_importFeed["field_image_url"]];
// $dupe_key .= $record[$admin_importFeed["field_buy_url"]];
// $dupe_key .= $record[$admin_importFeed["field_price"]];
$dupe_hash = md5($dupe_key);
To add the description field, uncomment the following line from that block:
// $dupe_key .= $record[$admin_importFeed["field_description"]];
If you are confident that your merchants don't have duplicate products , the easiest thing to do would be to make the changes described above but then to comment out the product name field from the dupe filter, leaving you with this:
/* create dupe_hash value */
$dupe_key = $admin_importFeed["merchant"];
// uncomment any additional fields that you wish to filter duplicates on (description not recommended)
// $dupe_key .= $record[$admin_importFeed["field_name"]];
$dupe_key .= $record[$admin_importFeed["field_description"]];
// $dupe_key .= $record[$admin_importFeed["field_image_url"]];
// $dupe_key .= $record[$admin_importFeed["field_buy_url"]];
// $dupe_key .= $record[$admin_importFeed["field_price"]];
$dupe_hash = md5($dupe_key);
If this is not sufficient, you need to establish a unique index against the description field - that way, MySQL will silently drop any duplicate records (the INSERT will fail). Unfortunately, you cannot create a UNIQUE index against a field of the TEXT data type; so you would have to establish a second dupe hash field and give it a value that is the MD5 hash of the description.
To do this, you will need to re-install Price Tapestry (or at least De-Register all your feeds) so that you have an empty database. The first SQL statement needed to add a the new field is:
ALTER TABLE `products` ADD `desc_hash` VARCHAR( 32 ) NOT NULL
Secondly, to make that field UNIQUE:
ALTER TABLE `products` ADD UNIQUE (`desc_hash`)
The easiest way to do this is to make the changes through phpMyAdmin or other MySQL admin script; however you can write a script to do this:
dbupdate.php:
<?php
set_time_limit(0);
ignore_user_abort();
require("includes/common.php");
require("includes/header.php");
$sql = "ALTER TABLE `products` ADD `desc_hash` VARCHAR( 32 ) NOT NULL";
database_queryModify($sql,$rows);
$sql = "ALTER TABLE `products` ADD UNIQUE (`desc_hash`)";
database_queryModify($sql,$rows);
print "<p>Done.</p>";
require("includes/footer.php");
?>
Finally, you need to modify includes/admin.php to setup and insert the value for the new desc_hash field.
Just below that block of code described above; add the following line (so it will be inserted into line 249 of the distribution):
$desc_hash = md5($record[$admin_importFeed["field_description"]]);
Then finally, modify the line that constructs the SQL as follows, adding the new desc_hash field:
$sql = sprintf("INSERT INTO `".$config_databaseTablePrefix."products` SET
merchant='%s',
name='%s',
description='%s',
image_url='%s',
buy_url='%s',
price='%s',
search_name='%s',
category='%s',
brand='%s',
dupe_hash='%s',
desc_hash='%s'
",
database_safe($admin_importFeed["merchant"]),
database_safe($record[$admin_importFeed["field_name"]]),
database_safe(isset($record[$admin_importFeed["field_description"]])?$record[$admin_importFeed["field_description"]]:""),
database_safe(isset($record[$admin_importFeed["field_image_url"]])?$record[$admin_importFeed["field_image_url"]]:""),
database_safe($record[$admin_importFeed["field_buy_url"]]),
database_safe($record[$admin_importFeed["field_price"]]),
database_safe($searchName),
database_safe($category),
database_safe($brand),
$dupe_hash,
$desc_hash
);
Hope this helps,
Cheers,
David.