Support Forum



How to only fetch and process feeds that are updated.

Submitted by Convergence on Fri, 2013-04-19 15:06

Greetings,

We currently "fetch" hundreds and hundreds of feeds, across multiple networks, 18 hours a day, from networks that do not "send" updated feeds. Only one network is left that sends you updated feeds - Commission Junction (GAN/Google Affiliate Network is shutting down).

We run separate Automation Tools on multiple installs. One processes feeds sent (by networks) to our central feed repository on the server and the others fetch feeds directly from a half-dozen networks. The problem is that it is a waste of time and resources to "fetch" feeds that have not been updated. There is no need to download an outdated feed and unzip it when the same feed is still in the feeds directory.

Is there a way we can determine if a feed has been updated? We know this is a stretch because each network is different. We just want to eliminate unnecessary bandwidth and processing resources.

Thanks!

Submitted by support on Mon, 2013-04-22 08:43

Hi,

This would actually be relatively straightforward if a significant number of the networks that you are working with support the HTTP 304 (Not Modified) response! Before diving into the code changes, I've created a test script that fetches one feed per unique host for which a job is configured in the Automation Tool, and reports the HTTP response code returned when the request includes an If-Modified-Since header.

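<?php
  require("../includes/common.php");

  header("Content-Type: text/plain");

  // temporary file that receives each response body
  $tmp = $config_feedDirectory."304testtmp";

  $sql = "SELECT * FROM `".$config_databaseTablePrefix."jobs` WHERE status='OK' ORDER BY filename";
  if (database_querySelect($sql,$jobs))
  {
    $tested = array();
    foreach($jobs as $job)
    {
      // test only one feed per unique host
      $parts = parse_url($job["url"]);
      if (!isset($parts["host"]) || in_array($parts["host"],$tested)) continue;
      $tested[] = $parts["host"];
      print $job["url"];
      $fp = fopen($tmp,"w");
      $ch = curl_init($job["url"]);
      // make the request conditional (If-Modified-Since), set explicitly
      // rather than relying on the default time condition
      curl_setopt($ch,CURLOPT_TIMECONDITION,CURL_TIMECOND_IFMODSINCE);
      curl_setopt($ch,CURLOPT_TIMEVALUE,time());
      curl_setopt($ch,CURLOPT_HEADER,0);
      curl_setopt($ch,CURLOPT_FILE,$fp);
      curl_exec($ch);
      fclose($fp);
      // report the HTTP response code for this feed
      print ":".curl_getinfo($ch,CURLINFO_HTTP_CODE);
      curl_close($ch);
      print "\n";
    }
  }
  // remove the temporary file if any feed was fetched
  if (file_exists($tmp)) unlink($tmp);
?>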

Create and upload the script as admin/304test.php to an installation with a number of jobs configured in the Automation Tool, and browse to admin/304test.php. The output will be a list of <url>:<HTTP response code> lines, e.g.

http://www.example.com/feed.asp?id=1234:200
http://www.example.net/getfeed.php:304

If you see anything return 304, we're in business!
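
To give a rough idea of where a 304 could take us, here is a minimal sketch of a conditional fetch. It is not the Automation Tool's actual code: the function names feed_conditionTime, feed_shouldReplace and feed_fetchIfModified are hypothetical, and it assumes the modification time of the local feed file is a reasonable value to send as If-Modified-Since. The empty-body check is a precaution, since with a time condition set libcurl may skip the body itself while the response code still reads 200.

```php
<?php
// Timestamp to send as If-Modified-Since: the local copy's modification
// time, or 0 (the UNIX epoch) when there is no local copy yet.
function feed_conditionTime($localPath)
{
    clearstatcache();
    return is_file($localPath) ? (int)filemtime($localPath) : 0;
}

// Replace the local copy only for a 200 response that carried a body.
function feed_shouldReplace($httpCode, $bytes)
{
    return ($httpCode == 200) && ($bytes > 0);
}

// Hypothetical helper: download $url to $localPath only if it is newer.
// Returns true if a new copy was saved, false if the old copy was kept.
function feed_fetchIfModified($url, $localPath)
{
    $tmp = $localPath.".tmp";
    $fp = fopen($tmp, "w");
    if ($fp === false) return false;
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
    curl_setopt($ch, CURLOPT_TIMEVALUE, feed_conditionTime($localPath));
    curl_setopt($ch, CURLOPT_FILE, $fp);
    $ok = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    fclose($fp);
    clearstatcache();
    if ($ok && feed_shouldReplace($code, filesize($tmp)))
    {
        // new feed received: swap it in for the old copy
        return rename($tmp, $localPath);
    }
    // 304, error or empty body: keep the existing feed
    unlink($tmp);
    return false;
}
```

A job would then call something like feed_fetchIfModified($job["url"], $config_feedDirectory.$job["filename"]) and skip the unzip and import steps whenever it returns false.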

Cheers,
David.
--
PriceTapestry.com

Submitted by Convergence on Mon, 2013-04-22 15:01

Hi David,

Returned one 304 from each network we have in the automation tool (out of several dozen feeds)!

Now what do we do?

Submitted by support on Mon, 2013-04-22 16:23

Hi,

That sounds promising, but before we get too excited, it would be worth performing the converse test to confirm that the feed is returned as expected when an historic date is sent. In your 304test.php script, look for this line:

      curl_setopt($ch, CURLOPT_TIMEVALUE, time());

...and REPLACE with:

      curl_setopt($ch, CURLOPT_TIMEVALUE, 0);

This will request each feed with an "If-Modified-Since" header of the UNIX epoch (1st Jan 1970), so every test case (it will test one feed from each network as before) should return the full feed and an HTTP 200 (OK) response...
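
For reference, this small sketch shows what a timestamp looks like once formatted as an HTTP date, which is the form the If-Modified-Since value takes on the wire. The helper name example_httpDate is made up for illustration; cURL does this formatting itself, so nothing here needs adding to the test script.

```php
<?php
// Format a UNIX timestamp as an HTTP date (always expressed in GMT).
function example_httpDate($timestamp)
{
    return gmdate("D, d M Y H:i:s", $timestamp)." GMT";
}

// A time value of 0 is the UNIX epoch:
print "If-Modified-Since: ".example_httpDate(0)."\n";
// prints: If-Modified-Since: Thu, 01 Jan 1970 00:00:00 GMT
```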

Cheers,
David.
--
PriceTapestry.com

Submitted by Convergence on Mon, 2013-04-22 16:46

Hi David,

There is no difference in the results - identical to the first.

Submitted by support on Mon, 2013-04-22 22:52

Hi,

That's interesting, all my experiments so far have returned only 200s!

Would it be OK for you to email me a couple of example URLs so that I can check them out?

Thanks,
David.
--
PriceTapestry.com