Importing large feeds

Submitted by AD_Mega on Tue, 2007-02-13 01:37

I get an Internal Server Error when I attempt to import a large feed. I'm hosted with 1and1 and have been trying for weeks to solve the problem. They sent me a php.ini file to put in my root directory but it still doesn't work. I don't have access to SSH. It seems that it does import some of the data, but it does not show me the date imported or the file size. This is what I changed the php.ini file to:

upload_max_filesize = 20M
max_execution_time = 300
memory_limit = 40M
post_max_size = 8M
SMTP = localhost
register_globals = Off

I have adjusted the max size but it still gives an internal server error.

Submitted by support on Tue, 2007-02-13 05:25

Hi,

I'm not sure about using the max_execution_time = 300 setting as this will limit the amount of time that PHP has to run, and for a large feed you really need this to be unlimited, which would be a value of zero. Can you try setting this to zero? Normally, Apache / PHP would need to be restarted in order for new php.ini settings to take effect - do you have a method to "restart" your server?
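In php.ini, that change would simply be the following line (a value of zero means no limit):

max_execution_time = 0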

Please see the following page for more info on handling this...

http://www.pricetapestry.com/node/582

Hope this helps,
Cheers,
David.

Submitted by AD_Mega on Tue, 2007-02-13 05:55

I tried setting max_execution_time to 0 but still get the error. I do not have a way to restart the server; I don't have access to SSH, so I can't do it manually.

Submitted by support on Tue, 2007-02-13 05:58

Hi,

Is the error occurring after almost exactly the same time (30 seconds?) each time, or does it vary; and if so - by how much?

Cheers,
David.

Submitted by AD_Mega on Tue, 2007-02-13 06:23

Well, it's the same, but if I change the value for post_max_size the time varies, as well as how many products get into the database.

Submitted by support on Tue, 2007-02-13 06:38

Hi,

I'm not sure why post_max_size has a bearing. Have you tried setting that to zero (unlimited)?

Cheers,
David.

Submitted by AD_Mega on Tue, 2007-02-13 06:53

I just tried that and I still get the Internal Server Error. Where should this file be? In the root directory, or in the folder where the import.php file is?

Submitted by support on Tue, 2007-02-13 06:59

Hi,

I actually think these settings need to be in .htaccess rather than php.ini, preceded by php_value (note that with php_value there is no equals sign).

In .htaccess in the top level of your site, try adding the following, based on what your host has asked you to put in php.ini:

php_value upload_max_filesize 0
php_value max_execution_time 0
php_value memory_limit 0
php_value post_max_size 0

Cheers,
David.

Submitted by AD_Mega on Tue, 2007-02-13 07:37

I'm still getting the internal server error.

Submitted by support on Tue, 2007-02-13 08:41

Hi,

It's just occurred to me that internal server errors are normally recorded in the error_log. Do you have, or can you create, an error log on your account, and see if this provides any more information?

Cheers,
David.

Submitted by AD_Mega on Wed, 2007-02-14 08:14

I just got an answer from my hosting company, 1and1, as to why I'm getting the error. Here is their reply:

"It seems your website is taking too much time to process the import of your xml file. With this account we only give you 6 seconds 100% CPU if you need addition time to process these database imports
you must upgrade your package to a developer package which gives you
additional CPU time. "

I have the Home package, which gives me all the space I need, and I don't want to pay more money for space I don't need. Does anyone know of a host that allows more than 6 seconds of processing time?

Submitted by support on Wed, 2007-02-14 08:18

Hi,

Have a look at the following thread where users have recommended hosts...

http://www.pricetapestry.com/node/173

Cheers,
David.

Submitted by bwhelan1 on Tue, 2007-03-06 03:42

David,

I am trying to import a 42M feed using a cron job. It should have produced 47635 products but it stops consistently at 14447. I get the email that the job completed, and the admin panel shows the date and time that the feed was imported. The only indication that something went wrong (besides the missing products) is the error log:

[Mon Mar 5 19:15:51 2007] [error] PHP Warning: mysql_num_rows(): supplied argument is not a valid MySQL result resource in /home/mysite/public_html/includes/database.php on line 26
[Mon Mar 5 19:15:51 2007] [error] PHP Warning: mysql_fetch_array(): supplied argument is not a valid MySQL result resource in /home/mysite/public_html/includes/database.php on line 21

The host is a Linux box with Apache version 1.3.37 (Unix), PHP version 4.4.3 and MySQL version 4.0.27-standard. Any ideas?

Bill Whelan

Computer Store

Submitted by support on Tue, 2007-03-06 09:10

Hi Bill,

If the product count is consistently 14447 and the date and time imported is being updated, then the process will have finished correctly. Therefore, what this probably means is that the majority of records were either missing required fields (name, Buy URL and price) or were duplicates. By default, duplicates are considered to be the same product name from the same merchant.
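As a quick sanity check, if the feed is a tab-separated text file with the product name in the first column (adjust the filename and column number to suit - feed.txt here is just a placeholder), something like this at the command line would show how many product names are repeated:

cut -f1 feed.txt | sort | uniq -d | wc -l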

I'm not sure why you are seeing the error report. To understand this I'd need to know a little more about how you have set cron up and what else you are calling in your automation script. Does it always happen? One thing to test would be to remove the feed that is only importing 14447 products and run the import again, although I don't think this will make any difference...

Cheers,
David.

Submitted by bwhelan1 on Tue, 2007-03-06 15:29

Here's the string that's set to run in cron:

php /home/ebooksb/public_html/scripts/import.php @ALL

Bill Whelan

Computer Store

Submitted by support on Tue, 2007-03-06 15:35

Thanks Bill.

Are you able to log in to the command line and manually run the import of the feed that is only importing the limited number of products, e.g.:

$ cd /home/ebooksb/public_html/scripts/
$ php import.php merchant.xml

This should display a progress report, and will also generate the same error message if the problem is a result of this feed. If it completes to the "Done." message, that almost certainly indicates corrupt, invalid or duplicate records in the feed...

Cheers,
David.

Submitted by bwhelan1 on Tue, 2007-03-06 17:31

David,

I don't have command line access currently. I did try importing the feed after varying the registered fields for "Product Name" and "Brand". With every variation, I still ended up with 14447 products. Is there anything that would cause a record to be discarded besides the duplicate records or a "drop record filter"?

I also imported the file into Excel then sorted by each column used for Name, Brand and Merchant and didn't spot many duplicates.

Bill Whelan

Computer Store

Submitted by support on Tue, 2007-03-06 17:39

Hi Bill,

The requirements are:

* Valid, Unique Product Name
* Valid Buy URL
* Valid Price

It could also be caused by corruption in the feed - this happens if the process of creating the feed by the merchant went wrong. Thanks for the URL - I've downloaded the feed from your website and am taking a look at it now...

Cheers,
David.

Submitted by support on Tue, 2007-03-06 18:09

Hi again,

The e-books feed has imported 46495 products on my test server. Now, the imported time on the admin page is only updated once the import has completed. Can you confirm that this _is_ being updated on your server?

If it is, that implies that the script has not aborted and has processed the entire feed (although it sounds like some of the INSERT queries are failing) before updating the record count.

One possibility is that you are exceeding the maximum queries allowed threshold on your MySQL server connection, although I would never expect this to result in exactly the same number of products being imported each time. You would need to find out from your host if a limit is set, but again I don't think this is the cause - otherwise the final query to set the imported time and product count would fail also (see previous question).

I registered the feed (based on the autodetected file type) selecting the fields as follows:

Product Name: NAME
Product Description: DESCRIPTION
Image URL: IMAGEURL
Buy URL: BUYURL
Price: PRICE

If this differs from how you have registered the feed, let me know and I'll try again with your settings. Note that a zero price, in the format found in the SALEPRICE and RETAILPRICE columns of this feed, would cause a record not to be imported...

Cheers,
David.

Submitted by bwhelan1 on Tue, 2007-03-06 18:35

Thanks for all your help David.

Submitted by bwhelan1 on Tue, 2007-03-06 21:16

David,

The hosting company was kind enough to run the command for me. Their response is below:

Hi Bill,
The task has been completed and the results were:
root@csx002 [/home/ebooksb/public_html/scripts]# php import.php 10281551-eBooks_com.txt
importing 10281551-eBooks_com.txt...[done]
backfilling reviews...[done]
Regards,
Mark,
WebHSP Support

Any ideas?

Bill Whelan

Computer Store

Submitted by support on Tue, 2007-03-06 21:24

Hi Bill,

No errors when they ran the command manually; and again it's only imported 14440 products. This is somewhat baffling!

Is this a brand new un-modified installation of Price Tapestry?

Cheers,
David.

Submitted by support on Tue, 2007-03-06 21:51

Hi Bill,

I've got the details from your last post and I'll be able to log in and investigate first thing tomorrow...

Cheers,
David.

Submitted by bwhelan1 on Wed, 2007-03-07 00:01

I really appreciate it David.

Submitted by support on Wed, 2007-03-07 10:06

Hi Bill,

I've investigated this through the following process:

- Downloaded your modified files to my test server - imported all products
- Uploaded a fresh control copy of Price Tapestry to your server in /test/ using table prefix "test" - imported all products
- Copied your modified files into the control copy - imported all products
- Modified your config.php to use a new table prefix (ebooksb) and re-ran setup.php. I then re-registered the feed and set up the same filters - imported all products!

Conclusion:

Your original products table had become corrupted. This will occasionally happen with any database server, and could be down to a disk fault or something like that.

The situation now is that your site is working, with the eBooks feed having imported all 46,495 products. Your cron job should work in the same way. For now, I would leave your site using the ebooksb_ prefix as it is currently configured...
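For future reference, if you ever suspect a table has become corrupted, MySQL's mysqlcheck utility can usually confirm and repair it from the command line - just a sketch, with the username, database name and table name below standing in for your own:

mysqlcheck --auto-repair -u dbuser -p your_database ebooksb_products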

Cheers,
David.

Submitted by bwhelan1 on Wed, 2007-03-07 14:00

Thanks David

Bill Whelan

Computer Store

Submitted by bwhelan1 on Mon, 2007-03-26 15:11

David,

When importing a 500MB feed from the command line I get the following error:

# /usr/bin/php /home/mysite/public_html/scripts/import.php barnesandnoble.txt
X-Powered-By: PHP/4.4.6
Content-type: text/html
importing barnesandnoble.txt...[2800/0]Segmentation Fault in 17170, waiting for debugger

Do you think it's a problem with the feed or something else? I tried it multiple times and it stops at the same place.

Bill Whelan

Computer Store

Submitted by support on Mon, 2007-03-26 15:25

Hi Bill,

I've not come across that error before, but then again very few users try and work with 500MB feeds! How long does it take to reach this point, and is that the same every time? Have you considered the other aspects above regarding script execution time etc.?

Cheers,
David.

Submitted by bwhelan1 on Mon, 2007-03-26 20:19

David,

I got the following response from WebHSP support:

Hi Bill,
Thank you for your patience. After attaching a debugger to the script and analyzing the results we traced the problem back to the script.
Due to the sheer number of records it has to import (a 430MB text file), your script tries to utilize as much memory as possible. As a result the server starts to run dangerously low on available memory, and the kernel then kills whatever program is eating the excess usage in order to prevent the whole system from going down.
You can contact the developers of the script to see if they have a way to stop the script from trying to use so much memory.
Regards,
Mark,
WebHSP Support

Bill Whelan

Computer Store

Submitted by bwhelan1 on Sat, 2007-03-31 01:27

Any ideas how to deal with the above mentioned problem?

Bill Whelan

Computer Store

Submitted by support on Sat, 2007-03-31 07:59

Hi Bill,

Whilst Price Tapestry does not have any preset limit on the size of feed that can be imported, ultimately there has to be one. 430MB is a huge feed, and very few users will ever encounter or try to work with feeds of that size.

With the amount of processing involved, the memory leak is most likely nothing to do with the Price Tapestry code but actually elsewhere in the chain, such as the MySQL client library, so at this time I'm afraid it is impossible to provide a solution.

I understand that this is a book feed that you are working with - and as they often contain several hundred thousand products, it gets to the stage where I'm afraid I have to concede that it is beyond the capabilities of Price Tapestry. Even if you were able to import that many products, the limitations of PHP and MySQL as a platform would cause further problems, with search queries either failing or taking an unacceptable time.

Apologies for the inconvenience. I would of course offer to refund your money in this situation, but I know that you have other sites successfully running Price Tapestry in a less complex environment. If you're happy not to run those sites then I will of course provide a refund, but I'm afraid I can't provide a solution for this size of feed.

Cheers,
David.

Submitted by bwhelan1 on Sat, 2007-03-31 20:36

David,

I understand that the B&N feed is quite large but I am also having the same issue with one that is a little over 100M. I am not asking for a refund because, as you mention, your script is serving me very well on other sites.

I have not found anything else that comes close to your script and the time and effort you pour into this support forum exceeds what I could expect for the price.

However, if you are a bit curious about the 100M feed not working, it is the HomeClick feed available on SAS. I have it up here if you have the time to test it on your server: (link saved)

Bill Whelan

Computer Store

Submitted by support on Mon, 2007-04-02 08:10

Hello Bill,

Thanks for the large feed URL. I will download it to my test server and give it a go...

Cheers,
David.

Submitted by madison on Tue, 2007-04-03 10:29

Just for info, I have one feed that is over 500 MB and it all works fine.

It's on a dedicated server though.

Submitted by bwhelan1 on Tue, 2007-04-03 14:10

Thanks for the info. I suspect the web host is implementing the Linux version of application pooling on the box, and your post pretty much confirmed that for me.

Bill Whelan

Computer Store

Submitted by bageshsingh on Sat, 2007-05-12 06:09

Hello David, I got an error no. 500.

What is the solution?

I can't filter it because it doesn't import fully, so I can't make changes.

Please reply.

Thanks

Submitted by support on Sat, 2007-05-12 07:30

Hello Bagesh,

500 is an internal server error, which sounds like a timeout issue. Check the following thread for discussion of this:

http://www.pricetapestry.com/node/582

Cheers,
David.

Submitted by bwhelan1 on Sun, 2007-05-13 19:19

David,

Moving to a dedicated server fixed a lot of problems; however, I am still experiencing issues with those big feeds. They import fine, but whenever you click on that merchant from the "Merchants" link, or on a category into which I have placed all the products from that merchant, CPU usage goes up and "/tmp" fills up. Any ideas or best practices when dealing with this type of thing?

Bill Whelan

Computer Store

Submitted by support on Sun, 2007-05-13 19:22

Hi Bill,

It sounds like you want to allocate more memory to mysqld. Have a look at the following thread, which discusses MySQL configuration for millions of products...

http://www.pricetapestry.com/node/732
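As a rough sketch only (the right values depend entirely on how much RAM your server has, so treat these numbers and the tmpdir path as placeholders), the kind of my.cnf settings that help with large tables and on-disk temporary results look like this:

[mysqld]
# index cache for MyISAM tables
key_buffer_size = 256M
# per-connection sort buffer used by ORDER BY / GROUP BY
sort_buffer_size = 4M
# allow larger temporary tables in memory; the effective limit is the
# smaller of tmp_table_size and max_heap_table_size, so raise both
tmp_table_size = 64M
max_heap_table_size = 64M
# point temporary files at a partition with more space than /tmp
tmpdir = /var/mysqltmp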

Cheers,
David.

Submitted by Deanh01 on Wed, 2008-04-09 00:17

Hi David
I was also getting the 500 error. I tried the code given for .htaccess but then just got errors all the time. I called the hosting company and was told to add a php.ini file and use an ext_time script, but they could not tell me the script. Have you heard of this?
Thanks Dean

Submitted by support on Wed, 2008-04-09 08:21

Hi Dean,

Some hosts allow you to specify your own PHP ini settings by putting a php.ini file in the top level directory of your site. I think they mean the max_execution_time setting, so if you just create a file called php.ini in your main directory, containing simply:

max_execution_time = 0

(a value of 0 means no limit)

...that is I think what they are asking you to do...

Cheers,
David.

Submitted by Deanh01 on Wed, 2008-04-09 10:41

David
I tried that but got the same results

Thanks
Dean

Submitted by support on Wed, 2008-04-09 11:12

Hello Dean,

It sounds like a web server timeout (500 Internal Server Error) rather than PHP, but it was worth trying as your host is the one most familiar with your installation.

A couple of things: do you have access to an error log for your server, as this might give more information? Sometimes when you FTP into your site, you start in a directory one level above the web root, from where you may also see a folder called "logs" - this may contain an error log. If you are able to download it, you should then be able to open it in Notepad and scroll to the end to see if there is any additional information.

Secondly, further to the main thread with info on import timeouts, is logging in via Telnet or SSH an option for you?

Another option that can help in some cases is an alternative version of admin/feeds_import.php which you can download from here (unzip, then upload to your /admin/ folder). This version of the file generates continuous output during the import, and if the error is occurring because of a lack of output from the script (some Apache web servers are configured in this way) then this should help...
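(For anyone curious, the general idea behind that version is periodic output flushing. This is just an illustration of the technique rather than the actual file - the loop below simply stands in for the real import work:)

<?php
// Illustration only: produce a little output and flush it regularly so
// that the web server sees activity during a long-running process.
for ($i = 1; $i <= 50000; $i++) {
    // ... process one feed record here ...
    if ($i % 1000 == 0) {
        print ".";                       // one character of progress
        if (ob_get_level()) ob_flush();  // flush PHP's output buffer, if any
        flush();                         // push the output through to Apache
    }
}
print "done\n";
?>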

Cheers,
David.

Submitted by jonny5 on Wed, 2008-04-09 11:18

This happened all the time when I set my large feeds up; I tried all the things posted and it kept happening.

I get round it very easily (I think David may have told me this, as I wouldn't have known it myself).

What I do is upload the feed to the server, register it and do a trial import so that 10 records import - you should be able to do this with any big feed. Now the feed is registered but not imported; if you then click import you will get the timeout error.

Just leave the feed registered, and then when your import script runs during the night all the records should be imported with no problem.
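(My nightly cron entry is just the standard import command - something like the following, with the paths and time adjusted to your own setup:)

# run the full import at 2am every night
0 2 * * * /usr/bin/php /path/to/pt/scripts/import.php @ALL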

Submitted by richard on Fri, 2008-04-25 14:33

Hi David

I am running http://currentoffers.co.uk on a dedicated business server at 1and1. I have set up a cron job to fetch zipped feeds from the various networks, unzip and then import any modified records - used notes from http://www.pricetapestry.com/node/198 (these worked first time for a novice!)

Everything appears to work OK except the import function. Whenever the file exceeds 8,000 products the feed is not imported by the cron job. However, if I run the import manually using PuTTY I have no problems whatsoever.

I have read through these threads and tried all the options:
http://www.pricetapestry.com/node/1556
http://www.pricetapestry.com/node/1230
http://www.pricetapestry.com/node/874

but none resolve my issue.

I have spoken to 1and1 technical help and they say they have no restrictions on processing times.

Any ideas?

Many thanks

Richard

Submitted by support on Fri, 2008-04-25 16:30

Hi Richard,

Is your cron process set up to send you an email with the output of the various commands? This might indicate what is happening.
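(If it isn't, adding a MAILTO line at the top of the crontab will make cron email you the output of each job - the address and paths below are just placeholders:)

MAILTO=you@example.com
0 2 * * * /bin/sh /path/to/fetch.sh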

If a feed process aborts mid-way through, the effect should be that a merchant search only returns the number of products imported up to that point. Does this reveal approx. 8,000? The product count on the admin page will remain at the original value, as this is not updated until the import is completed. Is this the case on your server?

I'm assuming that it starts to import but does not complete....

Cheers,
David.

Submitted by richard on Fri, 2008-04-25 18:18

Hi David,

I got to the 8,000 figure through trial and error today, but now, using the extra bit of knowledge from your post, I can see the import stopped at 8,858 for John Lewis.

So yes, you are right - it starts but does not complete.

Regards

Richard

PS I have just added path/to/logfile.txt and changed to @ALL - so far it has not stopped, despite importing a feed from BT with 18,000 records. I will let you know if I encounter a problem, but I spent a long time today testing this. Oh well, it looks like it is working now.

Submitted by support on Fri, 2008-04-25 18:21

Hi Richard,

If you are not able to see the output of a cron process via email, another option is to pipe the output of the command that appears to be timing out to a file, which can then be studied to see if it contains any clues as to why the script aborted.

As you followed the instructions in the guide, you will presumably have a line similar to the following in your fetch.sh script (or whatever you have called it)...

/path/to/php /path/to/pt/scripts/import.php @MODIFIED

To pipe the results to a temporary file, try changing this as follows:

/path/to/php /path/to/pt/scripts/import.php @MODIFIED > /tmp/import.txt
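If the file turns out to be empty or cut short, redirecting error output as well will capture any warnings or fatal errors:

/path/to/php /path/to/pt/scripts/import.php @MODIFIED > /tmp/import.txt 2>&1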

After the cron process has completed, log in to your server (via PuTTY), and then view the contents of the temp file using:

cat /tmp/import.txt

If that has worked, could you post the last few lines of the output? That should help identify the cause of the timeout...

Cheers,
David.

Submitted by richard on Fri, 2008-04-25 18:57

Hi David

The only thing I did was to empty the database, use @ALL and append path/to/logfile.txt to import.php.

It worked error-free! I do not understand why, but it works now.

Thanks for your time looking into the problem

Regards

Richard