How to handle accents in feed names

Submitted by Convergence on Sat, 2013-07-20 22:02

Greetings,

We have a merchant whose name has an accent mark over the e, like so: Caché

When the network sends us the feed via FTP to our feed repository on our server, the server changes it to a question mark, like so: Cach?

We are then unable to use the automation tool to "fetch" the feed from our feed repository folder. We have tried the file name as "Cache", "Caché", and "Cache?" to no avail.

We would prefer to have something that's a "global" solution and not merchant specific to address the accent.

Any thoughts?

Thanks!

Submitted by support on Sun, 2013-07-21 09:36

Hi Convergence,

This sounds like a bit of a tricky one whereby (I think) the filename is utf-8 encoded, but the mechanism through which you are browsing the file system is using iso-8859-1 (or at least _not_ utf-8!) encoding, making it tricky to obtain the exact filename to use when constructing your Automation Tool job URL.
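
As a quick illustration of the mismatch, here's a minimal sketch (assuming the mbstring extension is available - this is for illustration only, not part of the fix):

<?php
  // In UTF-8 the é of "Caché" is the two-byte sequence 0xC3 0xA9.
  $name = "Caché";
  echo bin2hex($name), "\n"; // 43616368c3a9 - note the two trailing bytes
  // Anything that forces the name into plain ASCII has no slot for é,
  // so the character is substituted - by default with a question mark.
  echo mb_convert_encoding($name, "ASCII", "UTF-8"), "\n"; // Cach?
?>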

A generic solution would, I think, involve running a script on your feed receiving server to rename all files, stripping anything outside of the normal ASCII range (so in this case /somedir/Caché would be renamed to /somedir/Cach). I need to have a think about how to go about that - a bit of regexp magic in a bash script should do the trick.

In the meantime, the FileZilla FTP client might be able to work out the correct URL for you. Using FileZilla, make a connection to the FTP address of your feed receiving server, then in the remote window browse to the folder containing the Caché file. However it happens to be displayed in the remote window, right-click on the file and select Copy URL(s) to clipboard. This will generate the link that you will be familiar with entering into the Automation Tool as the URL of a new job, e.g.

ftp://username@ftp.example.com/path/to/Caché

Note that FileZilla generated links don't include the password, so immediately after using Copy URL(s) to clipboard, paste the link into the URL field of a new Automation Tool job (or the one you are editing) and then insert the password, e.g.

ftp://username:password@ftp.example.com/path/to/Caché
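
If the literal é still causes trouble in the job URL, the percent-encoded form of its UTF-8 bytes may be worth a try - an assumption to test rather than a guaranteed fix. PHP's rawurlencode() generates it:

<?php
  // Percent-encode the filename so the URL itself is plain ASCII;
  // %C3%A9 is the UTF-8 byte pair for é.
  echo "ftp://username:password@ftp.example.com/path/to/" . rawurlencode("Caché");
  // ftp://username:password@ftp.example.com/path/to/Cach%C3%A9
?>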

Hope this helps!

Cheers,
David.
--
PriceTapestry.com

Submitted by Convergence on Mon, 2013-07-22 18:11

Thanks, David!

Unfortunately, that did not work.

Any other suggestions?

Thanks!

Submitted by support on Tue, 2013-07-23 08:53

Hi Convergence,

Does the URL copied from FileZilla look OK? If you add your password and paste the link directly into the address bar of your Web Browser does that fetch the feed successfully?

Could you perhaps swap the username and domain name for example values and post the URL in a reply to this thread, so I can see what is in place of the accented character?

Thanks,
David.
--
PriceTapestry.com

Submitted by Convergence on Tue, 2013-07-23 20:20

Hi David,

Timing is everything :)

Feeds from Commission Junction are "sent" from the network to our feed repository, THEN we use the automation tool to fetch them from there, in their .gz format, to the proper install/s - THEN a separate CRON deletes the feeds from the repository every hour. This prevents us from having GIGS and GIGS of duplicate feeds, as well as saving bandwidth by only fetching CURRENT feeds.

The problem is that when the feed is received by the server, the accented e is changed to a ? - the "fetch" script doesn't recognize the ? - definitely sounds like a server handling problem.

The URL looked fine; again, we couldn't put the URL in the browser because the feed doesn't exist any more.

This is an example of the URL:

ftp://USERNAME:PASSWORD@example.com/public_html/feeds/Cach?-Cache_Product_Catalog.txt.gz

As you can see, the accented e is replaced with a ? -

I will set this merchant up on a separate/dummy download schedule so we can test further without waiting for the merchant to update their feed and send it to us again (which is usually around 2:00AM MST).

We may also try a different delivery method for this merchant, whereby we fetch the feed directly from the network (like we do with other networks) instead of waiting for them to update and send it to us.

Stay tuned...

Submitted by support on Wed, 2013-07-24 08:30

Hi Convergence,

A quick search shows this to be a common FTP issue - it looks (as you say) to be server related, i.e. the FTP server running on your server that receives the feed is not handling the accented characters correctly.

The only work-around I can think of is to run a script that strips any non-ASCII characters from the filenames of every file in a given folder. If you want to give this method a try, let me know and I'll work out the code; it will be a .sh script just like the one you're currently using to unzip etc., and would need to be scheduled to run _after_ receiving the file from CJ but _before_ your other server fetches the feed. The job would of course have to be updated, so in this case you would be looking to fetch

example.com/public_html/feeds/Cach-Cache_Product_Catalog.txt.gz

Cheers,
David.
--
PriceTapestry.com

Submitted by Convergence on Wed, 2013-07-24 22:13

Hi David,

The .sh script sounds like a winner.

We could just add the .sh script before our fetch CRON, like so:

/bin/bash /home/username/public_html/path/to/scripts/characterstrip.sh;/usr/bin/wget -O /dev/null "http://www.path/to/scripts/fetch.php?password=XXXXX$&filename=@ALL"

Am I on the right track?

Submitted by support on Thu, 2013-07-25 08:20

Hi Convergence,

characterstrip.sh would have to run on your feed receiving server (let's say server1) just before your fetch.php job runs on server2, e.g.

03:50 - server1 runs characterstrip.sh
04:00 - server2 runs fetch.php (to fetch your files from server1, with clean filenames)

Would that be viable to set up? The only problem would be if the incoming FTP from CJ is not predictable and could happen at 03:55, for example.

Cheers,
David.
--
PriceTapestry.com

Submitted by Convergence on Thu, 2013-07-25 09:03

Hi David,

Think there may be a little confusion. Our central feed repository is on the same server as our installs. We just have a separate location for feeds from CJ because they send them whenever a merchant has an update, while other networks do not - and there are several hundred CJ merchants we use.

Right now each install fetches CJ feeds once an hour; if no feed is there (based on minimum file size in the automation tool) life is good. If there is a file there to fetch, then it's fetched/renamed to the appropriate install. The different installs fetch no later than :03 after the hour.

At :05 after the hour we have another CRON that deletes all files in the repository - preventing us from having gigs and gigs of space being used by feeds that are now in their installs. If one feed is used in multiple installs, then the install with the earliest import becomes the master, and we fetch the unzipped feeds and rename them for that install.
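
For illustration, that hourly cycle might look something like the following crontab entries - the paths, domain, and delete command here are hypothetical, just to show the timing:

# :03 - each install fetches any CJ feeds waiting in the repository
3 * * * * /usr/bin/wget -O /dev/null "http://www.example.com/scripts/fetch.php?password=XXXXX&filename=@ALL"
# :05 - clear the repository so already-fetched feeds don't pile up
5 * * * * /bin/rm -f /home/username/public_html/feeds/*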

Furthermore, we run separate CRONs throughout the day to unzip/decompress any feeds in the installs BEFORE the imports begin. This decreases server load at import. We do not unzip/decompress any feeds in an install while that install's import is running.

Our CRONs are like an orchestra and so far it's working. As of yet we haven't had any feeds slip by in that 2 minute window - and since the repository is on the same server, and we recently put in a Solid State Drive, it's really fast copying from one part of the server to another. Again, since we do this 24 hours a day and not all merchants update daily, there's never a real build-up of feeds to fetch. (CJ is only one of 6 networks we are currently using - you should see the automation tools/fetch CRON cycles for all those!) Then in a couple of months we hope to launch the site's .co.uk counterpart.

Probably more info than what was needed - sorry.

Unless I'm mistaken about how this could work, running characterstrip.sh as the first part of the fetch CRON should be seamless.

Thanks!

Submitted by support on Thu, 2013-07-25 10:30

Hi Convergence,

Ah - no problem then!

I'm more comfortable suggesting this as a .php script rather than a bash script, so here's the code to strip anything EXCEPT a-z, A-Z, 0-9, space, underscore, hyphen or period:

characterstrip.php

<?php
  // Move into the folder that receives the feeds.
  chdir("/home/username/public_html/path/to/feeds/")
    or die("Could not change directory!");
  if ($dh = opendir("."))
  {
    // Check every entry in the folder...
    while (($f = readdir($dh)) !== false)
    {
      // ...and if its name contains anything outside the safe set,
      if (!preg_match("/^[a-zA-Z0-9_\.\- ]{1,}$/",$f))
      {
        // rename it with the offending characters stripped out.
        $new = preg_replace("/[^a-zA-Z0-9_\.\- ]/","",$f);
        rename($f,$new);
      }
    }
    closedir($dh);
  }
?>

Edit the directory path in the chdir() call to point to the folder containing the feeds where the corrupt filenames may exist, and save the file in your scripts folder, e.g.

/home/username/public_html/path/to/scripts/characterstrip.php

With that in place, your CRON job then becomes:

/usr/bin/php /home/username/public_html/path/to/scripts/characterstrip.php;/usr/bin/wget -O /dev/null "http://www.path/to/scripts/fetch.php?password=XXXXX$&filename=@ALL"

Don't forget that you will need to edit any jobs for the feeds containing the corrupt characters so that they match the renamed versions - you'll only have to do that once, of course, for any feeds that have this issue...

Cheers,
David.
--
PriceTapestry.com

Submitted by Convergence on Thu, 2013-07-25 11:02

Hi David,

Thanks!

Will slip this in after the current CRON is done (we run the install that this merchant would go into, twice a day).

Thanks again!