Split extra large data feed files into smaller sections by Category?

Submitted by BobL on Sun, 2013-06-16 13:37

Hi David,

Was wondering if there is a way to use PT to split huge data feeds into smaller feeds per category.
What I've been trying to do locally is prepend the category to the front of the output filename as I read through the merchant feed, writing each record to the appropriate cat_merchant.txt file.
Examples would be:
Accessories_Merchant.txt
Appliances_Merchant.txt
Electronics_Merchant.txt
Home-Decor_Merchant.txt
etc.

With small feeds my idea works fine, but with huge files it never seems to finish.
I just don't have your Midas Touch, David.

If this isn't possible, just delete this, but I had to ask. One feed I tried to unzip was 18 GB gzipped.
Even with 32 GB of RAM in the Lian Li, it wouldn't unzip...
Thanks in Advance
Bob L.

Submitted by support on Mon, 2013-06-17 08:26

Hello Bob,

That's a BIG feed! It would almost certainly be impractical to manage a feed of that size through the web interface, so ideally one would use the command line scripts to fetch / import. Before that, however, Register and Trial Import the feed.

If you're not sure whether or not you have command line access, it's worth giving it a go - in most cases it will be the same access details as you use for FTP. A popular client for Windows is PuTTY. Launch PuTTY, enter your FTP host name in the Host Name (or IP address) box, and click to connect. You will then be asked to accept the public key (first time only) and then receive a login as: prompt. Enter your FTP username, and then your password when prompted.

If that all looks good, the next thing to do is change directory into the /scripts/ folder of your Price Tapestry installation. If you normally change directory into a folder called public_html when using FTP, it is almost certainly the same here, e.g.

cd public_html/scripts/

And from here, you can run import.php to import Merchant.txt without any time / resource / proxy server timeout limitations using

php import.php Merchant.txt

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by BobL on Mon, 2013-06-17 10:41

Bob L.

I do have command line access, but I stay away from that.
My thought was to have large feeds broken down by category so that each niche site didn't have to parse the same large file over and over.
I gave up on the 18 GB feed when I couldn't extract it. It took about 11 hours just to retrieve it from LinkShare. lol

Submitted by support on Mon, 2013-06-17 11:21

Hello Bob,

This is something I've helped a few users with: it's a simple script that parses the file and outputs each row to a separate file named [category]_Merchant.txt. If you'd like to give it a go, email me a link to the installation where you have saved the feed so far and I'll fetch just the first few MB (I won't fetch the whole thing, to save all our bandwidth!); from that I'll be able to modify my splitter script accordingly...
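
For a rough idea of the shape of it, here's a minimal sketch (illustrative only, not the actual script - it assumes a pipe-delimited Merchant.txt with the category in the first column, both of which vary by feed):

<?php
  // Minimal sketch of the splitter idea: read each row, derive a
  // filesystem-safe category name, and append the row to that
  // category's output file. Delimiter and category column are
  // assumptions that need adjusting per feed.
  $fp = fopen("Merchant.txt","r");
  if (!$fp) die("Could not open Merchant.txt\n");
  $handles = array();
  while($line = trim(fgets($fp)))
  {
    $fields = explode("|",$line);
    $category = preg_replace('/[^A-Za-z0-9]/',"-",$fields[0]);
    if (!isset($handles[$category]))
    {
      $handles[$category] = fopen($category."_Merchant.txt","w");
    }
    fwrite($handles[$category],$line."\n");
  }
  foreach($handles as $h) fclose($h);
  fclose($fp);
?>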

Cheers,
David.
--
PriceTapestry.com

Submitted by BobL on Wed, 2013-06-19 02:13

Bob L.

Great support, David.
That works really well.
I knew you had that Midas Touch. :)

Thanks for all the great support you have given everyone here.

Submitted by BobL on Fri, 2013-06-28 11:35

Bob L.

Hi David hope all is well.
Ran into a merchant that has 4 and sometimes 5 feeds,
for example merchant.txt, merchant-canada.txt, merchant-top-10.txt, merchant-hot-products.txt.
I was trying to combine them into one feed, as Google doesn't seem to like merchants that only have a couple of items, looking at WMT. Niche mod... latest version of PT.
This is what I tried, but it gave me errors.

<?php
  header("Content-Type: text/plain");
  require("../includes/common.php");
  require("../includes/MagicParser.php");
  $outputDirectory = (isset($config_feedDirectory)?$config_feedDirectory:"../feeds");
  if (!is_writable($outputDirectory)) die($outputDirectory." not writable!");
    global $fps;
    global $filename;
    global $outputDirectory;
$file1 = "";
$file2 = "";
$file3 = "";
$file = "";
$file = "Musicians-Friend_CJ_1496477.txt";
$fhandle = fopen($outputDirectory.$file, 'a');
$file1 = "Musicians-Friend_Canadian_CJ_1496477.txt";
$fhandle1 = file_get_contents($outputDirectory.$file1);
// if($file1){print $fhandle1; exit();}
fwrite($fhandle,$file1);
fclose($fhandle1);
$File2 = "Musicians-Friend_Hot_CJ_1496477.txt";
$fhandle2 = file_get_contents($outputDirectory.$file2);
fwrite($fhandle,$file2);
fclose($fhandle2);
$File3 = "Musicians-Friend_Top_CJ_1496477.txt";
$fhandle3 = file_get_contents($outputDirectory.$file3);
fwrite($fhandle,$file3);
fclose($fhandle3);
fclose($fhandle);
?>

I just don't have your Golden Touch....:)
Thanks In Advance...

Submitted by support on Fri, 2013-06-28 12:02

Hi Bob,

file_get_contents() returns a string rather than a file handle, so probably all you need to do is something like this:

<?php
  header("Content-Type: text/plain");
  require("../includes/common.php");
  require("../includes/MagicParser.php");
  $outputDirectory = (isset($config_feedDirectory)?$config_feedDirectory:"../feeds");
  if (!is_writable($outputDirectory)) die($outputDirectory." not writable!");
  $file = "Musicians-Friend_CJ_1496477.txt";
  $fhandle = fopen($outputDirectory.$file,'a');
  if (!$fhandle) die("Could not open ".$outputDirectory.$file);
  $file1 = "Musicians-Friend_Canadian_CJ_1496477.txt";
  fwrite($fhandle,file_get_contents($outputDirectory.$file1));
  $file1 = "Musicians-Friend_Hot_CJ_1496477.txt";
  fwrite($fhandle,file_get_contents($outputDirectory.$file1));
  $file1 = "Musicians-Friend_Top_CJ_1496477.txt";
  fwrite($fhandle,file_get_contents($outputDirectory.$file1));
  fclose($fhandle);
?>

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by BobL on Fri, 2013-06-28 12:59

Bob L.

Worked like a charm, David. As always.
Best support on the net.
Thank you very much. I worked through several different versions of my code yesterday trying not to bother you,
and got nowhere.

Always nice to have someone you can count on.

Submitted by Rocket32 on Wed, 2013-07-10 00:10

Hello Dave. Do you have the code to the splitter script that we can use on large feeds? We may be able to separate by category, brand or similar?

Submitted by support on Wed, 2013-07-10 07:41

Hi Rocket32,

I'll look at creating a generic version for general download but in the mean time, sure - if you'd like to email me a link to your installation and the filename of the feed you would like to split I'll fetch it from your /feeds/ folder so that as above, I can modify the script to work with your feed and then forward the files to you.

Cheers,
David.
--
PriceTapestry.com

Submitted by BobL on Mon, 2015-03-16 10:15

Bob L.

Hi David,

Hope all is well.
Was wondering if there's a way to include the header row from the original feed in each output category file?
As it stands, I have to edit the feeds table and change the field names to field numbers and the format to CSV...

Thanks in advance.
New version rocks...

Submitted by support on Mon, 2015-03-16 11:36

Hello Bob,

Thank you for your comments!

That should be straightforward to do. It's been a while since I sent you the script by email (the code sample above, I believe, refers to the opposite requirement: combining rather than splitting), so if you could email me the latest version of the category extraction script that you're using, I'll show you how to modify it to include the header row in each extract...
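
For reference, the essence of the change is to read the header row once before the main loop and then write it out whenever a new category file is opened - something like this sketch (file name, delimiter and category column are assumptions, as before):

<?php
  // Sketch only: as the splitter sketch earlier in the thread, but
  // writing the original header row as the first line of each
  // per-category output file.
  $fp = fopen("Merchant.txt","r");
  if (!$fp) die("Could not open Merchant.txt\n");
  $header = trim(fgets($fp)); // capture the header row first
  $handles = array();
  while($line = trim(fgets($fp)))
  {
    $fields = explode("|",$line);
    $category = preg_replace('/[^A-Za-z0-9]/',"-",$fields[0]);
    if (!isset($handles[$category]))
    {
      $handles[$category] = fopen($category."_Merchant.txt","w");
      fwrite($handles[$category],$header."\n"); // header row in each extract
    }
    fwrite($handles[$category],$line."\n");
  }
  foreach($handles as $h) fclose($h);
  fclose($fp);
?>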

Cheers,
David.
--
PriceTapestry.com

Submitted by BobL on Wed, 2015-03-18 13:42

Bob L.

Just Awesome!

Mr. David.

Just Awesome!

You rock and your support is "Just Awesome!"

Submitted by philstone on Thu, 2015-12-17 00:03

Hi David

I have a feed that has around 800,000 products and I was interested in looking at the script you use to split large feeds; this one is approx 1.6 GB.

{link saved} is where the feed can be found

For this feed I think either ATTRIBUTECLASS/AUTHOR or ATTRIBUTECLASS/PUBLISHER should be used to split it into [category]_Merchant.xml files, as the Category field for all entries is 'Book'.

Do you think this would be possible?

Thanks
Phil

Submitted by support on Thu, 2015-12-17 09:00

Hello Phil,

To be reliably split, XML files would need to be parsed and each record reconstructed (which isn't a problem; see the sketch below). However, in this case there are potentially thousands of distinct author / publisher values, so if there is not actually any problem handling the feed I would suggest leaving it as-is. For registration, use only "Register and Trial Import" so that only 10 records are imported; for a full import, wait for the next scheduled CRON process, use the Slow Import Tool, or import from the command line, e.g.

cd public_html/scripts
php import.php Hive.xml
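
For what it's worth, this is roughly what the parse / reconstruct approach would involve (illustrative only - the "Product" and "Author" element names are assumptions that would need checking against the actual feed):

<?php
  // Sketch only: stream the XML, and for each record append its
  // reconstructed markup to a per-author file wrapped in a root
  // element. With thousands of distinct authors this creates
  // thousands of files / open handles - hence the advice above
  // to leave the feed as-is.
  $xml = new XMLReader();
  if (!$xml->open("Hive.xml")) die("Could not open Hive.xml\n");
  $handles = array();
  while($xml->read())
  {
    if (($xml->nodeType == XMLReader::ELEMENT) && ($xml->name == "Product"))
    {
      $record = $xml->readOuterXml();
      $node = simplexml_load_string($record);
      $author = preg_replace('/[^A-Za-z0-9]/',"-",(string)$node->Author);
      if (!isset($handles[$author]))
      {
        $handles[$author] = fopen($author."_Merchant.xml","w");
        fwrite($handles[$author],"<Products>\n");
      }
      fwrite($handles[$author],$record."\n");
    }
  }
  foreach($handles as $fp) { fwrite($fp,"</Products>\n"); fclose($fp); }
?>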

Cheers,
David.
--
PriceTapestry.com

Submitted by sirmanu on Sun, 2017-12-10 18:57

Hi David!
I have an enormous CSV, about 14 GB!
It is almost unmanageable and too slow to process, I guess because MagicParser has to move forward through the file on each iteration, so the end almost never arrives.
Is there any practical solution? Maybe split it into several .csv files?
Thank you very much!

Submitted by support on Mon, 2017-12-11 09:43

Hi,

CSV is much easier to split than XML - have a go with the following;

<?php
  $header = TRUE;
  $rows = 50000;
  $filename = $argv[1];
  if (!$fp = @fopen($filename,"r")) die("Could not open ".$filename."\n");
  if ($header)
  {
    $header = trim(fgets($fp));
  }
  $count = 0;
  $newcount = 0;
  while($line = trim(fgets($fp)))
  {
    if (!isset($newfp) || (!($count % $rows)))
    {
      if (isset($newfp)) fclose($newfp);
      $newcount++;
      $newfilename = str_replace(".","_".$newcount.".",$filename);
      $newfp = fopen($newfilename,"w");
      if ($header)
      {
        fwrite($newfp,$header."\n");
      }
    }
    fwrite($newfp,$line."\n");
    $count++;
  }
  fclose($newfp);
?>

Change $header at line 2 to FALSE if your CSV does not have a header row, edit $rows at line 3 to the number of products you want per file, save the script in the same folder as the feed (e.g. feeds/split.php), and then from the command line;

php split.php BigFeed.csv

...and the script will make

BigFeed_1.csv
BigFeed_2.csv
etc..

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by sirmanu on Mon, 2017-12-11 10:14

Excellent! And do you have any ideas about how to integrate this script into the cron process?
For example, files that are bigger than 5 GB (or manually selected) would be automatically split and imported.

Submitted by support on Mon, 2017-12-11 14:59

It's something that would (I think) be more appropriately incorporated into the fetching process - for example, as a local proxy script that fetches the feed URL and saves it as the individual _1, _2, etc. versions (see the sketch below). However, bear in mind that each part would then need to be registered separately, and any filters for this source would need to be applied to all instances...
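
Purely to illustrate the shape, such a proxy might look like the following sketch (it just re-uses the chunking logic from split.php above, reading from a URL instead of a local file - the URL argument and output naming are assumptions, and allow_url_fopen would need to be enabled):

<?php
  // Sketch of a local proxy: fetch a remote CSV and save it directly
  // as 50,000-row parts feed_1.csv, feed_2.csv, etc.
  $url = $argv[1];
  $rows = 50000;
  if (!$fp = @fopen($url,"r")) die("Could not fetch ".$url."\n");
  $header = trim(fgets($fp));
  $count = 0;
  $newcount = 0;
  while($line = trim(fgets($fp)))
  {
    if (!isset($newfp) || (!($count % $rows)))
    {
      if (isset($newfp)) fclose($newfp);
      $newcount++;
      $newfp = fopen("feed_".$newcount.".csv","w");
      fwrite($newfp,$header."\n");
    }
    fwrite($newfp,$line."\n");
    $count++;
  }
  fclose($newfp);
?>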

So I'm not quite sure what the benefit would be - a huge CSV file can still be registered and trial imported almost immediately - auto-detection of the CSV file format only involves reading the first 2 lines. The full import can then be left to CRON where it shouldn't really make any difference whether processing 1 feed of 14GB or 14 feeds of 1GB...

If you're not sure or you had something else in mind let me know and I'll have a look at the options further with you of course...

Cheers,
David.
--
PriceTapestry.com

Submitted by Rocket32 on Mon, 2020-08-17 05:55

Hey David.

Is there a way to pull certain products out of a data feed that is 2 GB? I only want particular products. Let's say pull out all the shoe products, or one category, instead of the entire shopping mall. Can the CSV file be parsed on the desktop first, possibly pulling all the records with shoes into a separate CSV file? Is there a desktop version of Magic Parser, or does it only work on the web? I tried opening the file with Excel and it kept spinning (loop). It would just be a niche site, but I'm trying to limit space on hosting rather than rely on product mapping.

Submitted by support on Mon, 2020-08-17 08:20

Hi,

You can use Drop Record If Not / Drop Record If Not RegExp to selectively import products from a large feed.

Let's say you only wanted to import a single category, "Shoes", this could be done with a Drop Record If Not filter on the Category field and simply "Shoes" (without the quotes) in the text box on the configuration page.

To import only selected categories, Drop Record If Not RegExp can be used, for example:

/(Shoes|Trainers|Boots)/

In this example, the "/" characters are the delimiters (flags can go after the closing delimiter, e.g. "i" to make the match case-insensitive), and the expression itself is a pipe-separated list of words to match, all enclosed in brackets.
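
If you want to sanity-check an expression before saving it as a filter, a quick standalone test (just an illustration, not part of Price Tapestry) would be:

<?php
  // Quick local test of a Drop Record If Not RegExp pattern against
  // sample category values - "keep" means the record would be imported.
  $pattern = "/(Shoes|Trainers|Boots)/i";
  $samples = array("Running Shoes","Hiking Boots","Handbags");
  foreach($samples as $category)
  {
    print $category.": ".(preg_match($pattern,$category)?"keep":"drop")."\n";
  }
?>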

Hope this helps!

Cheers,
David.
--
PriceTapestry.com