Hi David,
Another day, another issue...
Are there any differences between importing via feeds_import.php and import.php?
They don't seem to apply filters the same way, which is weird because, as far as I can see, they call the same function, admin_import.
The filter that is causing problems is "UTF8 Decode". As I run a French site, I have to deal with a lot of accented letters in product names, descriptions and categories, so I use "UTF8 Decode" with almost every feed.
Here's an example of the issue with one product which has "UTF8 Decode" applied to the name and description:
- if I import it via the admin (feeds_import.php), the product name becomes: ******** 6ème 5ème ******* (description field: ******** 6ème 5ème *******)
- if I import it via SSH (import.php), the product name becomes: ******** 6eme 5eme ******* (description field: ******** 6?-5? *******)
So it's a big mess when I'm doing product and category mapping, as I do not get the same results with the 2 methods. The better results are with feeds_import.php, but I've got a cron job running every night which messes everything up...
Could it be an issue with the tapestry_hyphenate and tapestry_normalise functions?
I've had to add some lines to them to deal with accents, but I'm not quite sure about the differences between those 2 functions.
Can you tell me exactly what those two functions do and where/when they are called?
Thanks a lot,
Hugo
OK, so maybe the problem is that my website was set up with PHP 5 enabled, while the php binary seems to be PHP 4 (see below)?
I've run your script; in both cases it returns "6?". I then modified it to use "6& #232;me" (without the space) instead of "6ème":
- from the web I get "6ème", which looks perfect, but that must be my browser's rendering, as in the page source I see "6& #232;me"
- from SSH I get "6& #232;me"
I wonder if another problem I have with PHP on the command line is related: when I run a php command I get the messages below, and then the script runs:
PHP Warning: mime_magic: type regex BEGIN[[:space:]]*[{] application/x-awk invalid in Unknown on line 0
PHP Warning: mime_magic: type search/400 \\input text/x-tex invalid in Unknown on line 0
PHP Warning: mime_magic: type search/400 \\section text/x-tex invalid in Unknown on line 0
PHP Warning: mime_magic: type search/400 \\setlength text/x-tex invalid in Unknown on line 0
PHP Warning: mime_magic: type search/400 \\documentstyle text/x-tex invalid in Unknown on line 0
PHP Warning: mime_magic: type search/400 \\chapter text/x-tex invalid in Unknown on line 0
PHP Warning: mime_magic: type search/400 \\documentclass text/x-tex invalid in Unknown on line 0
X-Powered-By: PHP/4.4.8_pre20070816-pl1-gentoo
Content-type: text/html
I have edited my last post; the UTF-8 code wasn't displayed, even with the code tag.
Anyway, I've found how to resolve my issue!
When calling PHP from the command line, instead of calling it with:
$php
or
$/usr/bin/php
or
$/usr/local/bin/php
(was using this one)
I now call php5 directly with $/usr/local/php5/bin/php
Thanks to your explanation, I understand a little better how this works and was able to fix my problem by myself.
Maybe I will have other questions later on the normalise and hyphenate functions, as I have to mess around with them a lot to deal with special characters and all the accents of the French language...
Thanks,
Hugo
Ooops, third post in a row...
In fact, it's not quite fixed, there's still a weird and huge issue of the same kind.
I will take the same example as in my first post:
- if I import it via the admin (feeds_import.php), the product name becomes: ******** 6ème 5ème ******* (description field: ******** 6ème 5ème *******)
- if I import it via SSH (import.php), the product name becomes: ******** 6?e 5?e ******* (description field: ******** 6ème-5ème *******)
Notice that the UTF8 Decode filter is working well for the product description (and also for the category, I assume, as they were all correctly mapped), but it's "acting weird" for the product name (and thus leading to a 404 when clicking on it, because of the ? in the URL).
I'm out of ideas right now...
Hi David,
This morning I had a surprise: my cron job for auto-import worked well, and all the words in the product names and descriptions were decoded correctly.
The only thing I've changed in my .sh script is the last line, which now calls php5:
/usr/local/php5/bin/php /home/*******/www/scripts/import.php @ALL
So I thought that was great: I would just have to create a script containing only that last line and call it from the command line instead of typing the command myself.
But when I tried to call my cron script directly with:
$sh ******.sh
It behaved as if I had called import.php directly (as in my last post).
So the only way I have to make it work is to execute the .sh script through the "Cron task" module in Webmin...
Any hint on how I could make import.php work properly directly from the command line?
Is the first line of the files in the /scripts/ folder mandatory?
I ask because when I call those files I already have to put the path to php in the command line.
Anyway, I've tried changing it to the path for php5, but it didn't make any difference.
Thanks,
Hugo
Hi Hugo,
The first line only applies if you mark the script as executable and then fire it off on its own rather than via another command (like php itself).
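A minimal sketch of that distinction, using a throwaway script (the name shebang_demo.sh and the temporary directory are made up for illustration): the first line is consulted only when the file itself is executed, and is treated as an ordinary comment when the file is handed to an interpreter.

```shell
#!/bin/sh
# Hypothetical demo script created in a temporary directory.
demo_dir=$(mktemp -d)
cat > "$demo_dir/shebang_demo.sh" <<'EOF'
#!/bin/sh
echo "demo script ran"
EOF

# Case 1: hand the file to an interpreter. The first line is only a comment here,
# so the file does not even need to be executable.
via_interpreter=$(sh "$demo_dir/shebang_demo.sh")

# Case 2: mark it executable and run it on its own. Now the system reads the
# first line to decide which interpreter to start.
chmod +x "$demo_dir/shebang_demo.sh"
via_first_line=$("$demo_dir/shebang_demo.sh")

echo "via interpreter: $via_interpreter"
echo "via first line:  $via_first_line"
rm -rf "$demo_dir"
```

The same reasoning would apply to the files in /scripts/: when they are started as "/path/to/php import.php", the interpreter named on the command line is the one that runs, whatever the first line says.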
It sounds like there might be some kind of permission thing going on. If I understand you correctly, are you saying that exactly the same script works via CRON, but not if you execute it manually from the command line??
Cheers,
David.
"It sounds like there might be some kind of permission thing going on. If I understand you correctly, are you saying that exactly the same script works via CRON, but not if you execute it manually from the command line??"
Yep, it's exactly that.
And in both cases it's executed as root...
Thanks,
Hugo
I don't know if it's related, but it may be something you need to know.
Even if it is marked as executable and has #!/bin/sh as its first line, I cannot run my .sh script with the command $fetch.sh.
I have to type:
$sh fetch.sh
or
$/bin/sh fetch.sh
to make it work.
But from CRON the command I call is /home/*******/fetch.sh and it works well.
(Wanted to edit my first post, but I can't right now)
In fact, I can run my .sh script from the command line without typing $sh ...
I just have to type the full path:
$/home/******/fetch.sh
But it still has the UTF8 Decode issue, of course...
Hi Hugo,
The "current directory" is not normally in the executable path on Linux; so you need to prefix the command with ./ to indicate the current directory, for example:
$./fetch.sh
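As a rough illustration of that lookup rule, here is a small sketch (fetch_demo.sh and the temporary directory are hypothetical stand-ins, and the PATH value is just an example that leaves out the current directory):

```shell
#!/bin/sh
# Hypothetical demo: fetch_demo.sh stands in for a script like fetch.sh.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho "fetch ran"\n' > "$demo_dir/fetch_demo.sh"
chmod +x "$demo_dir/fetch_demo.sh"

# A bare name is searched for only in the directories listed in PATH,
# so with a PATH that omits the current directory the lookup fails.
bare_result=$(cd "$demo_dir" && PATH=/usr/bin:/bin && fetch_demo.sh 2>/dev/null || echo "not found")

# A name containing a slash is treated as a path and is not searched for in PATH.
dot_result=$(cd "$demo_dir" && ./fetch_demo.sh)

echo "bare name: $bare_result"
echo "with ./:   $dot_result"
rm -rf "$demo_dir"
```

This is also why a full path such as /home/.../fetch.sh works from CRON: it contains slashes, so no PATH search is involved.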
Cheers,
David.
Hi David,
OK, thank you for your explanation.
So:
1) Do you know how I could make import.php work "fully" from the command line?
2) I just found another weird but minor issue with UTF8 Decode: every time a word contains "ît" it gets transformed to "ô"; for example, the word "boîtier" (which means case) becomes "boôier" (which means nothing).
Do you know why this happens and how to fix it?
(An easy way could be a filter which replaces "î" with "i" automatically during the import, I assume?)
Thanks,
Hugo
Hi Hugo,
Is that only in the case of the command line also - with cron, "it" does not get replaced?
Cheers,
David.
Hi David,
"Is that only in the case of the command line also - with cron, "it" does not get replaced?"
It happens in both cases.
Please note that it only happens with "ît", with a circumflex accent.
For my "original" issue, I will run a few import tests this afternoon, as it doesn't seem to happen with all feeds?!?
Thanks,
Hugo
Hi David,
So it seems this happens with only 2 feeds.
One of them is declared as "utf-8" (I can't parse it directly on my PC as it's a 100MB+ file).
The second one is declared as "ISO-8859-15" with CDATA tags, but if I don't apply UTF8 Decode to its fields, I get undecoded UTF-8 characters (i.e. "é" "Ú").
- When I import via the admin panel (with UTF8 Decode filters), all the characters seem to be decoded, but the character " ' " stays in the product name and URL (so it leads to a 404) in spite of this line I put in the tapestry_hyphenate and tapestry_normalise functions: $text = str_replace("'","",$text);
- When I import via the command line (with the UTF8 Decode filter), all characters seem to be decoded correctly in the product description, but in the product name I get "D?r" instead of "D'or" (or "Dor"), "Coll?e" instead of "Collège", "El?entaire" instead of "Elémentaire", "Myst?e" instead of "Mystêre"...
[If needed I can send you this second feed, it's only 2MB uncompressed]
I hope this helps you find a way to help me...
Thanks,
Hugo
Hello Hugo,
By all means drop me an email with the feed that is causing these problems - I suspect it is something to do with a character encoding mis-match in the feed; although this normally stops the parser dead in its tracks as it is impossible to maintain XML state once the character encoding has become corrupted.
Perhaps the easiest thing to do is if you could send me a link to the compressed file (for example in your /feeds/ or a temp folder) and I'll download it directly to my test server in order to take a look...
Cheers,
David.
Hello Hugo,
tapestry_normalise() makes data safe for use throughout the site, primarily for use within URLs. tapestry_hyphenate() is basically just an str_replace with SPACE for a "-" in order to make search engine friendly URLs from words containing spaces.
Regarding the differing functionality between the web based import and the command line import - whilst the PHP code is of course identical, it is a completely different compilation of PHP that is used. For web based requests, the Apache module (almost always) is used, whereas for command line it is the php binary. Normally they are compiled together against the same libraries - but it sounds in this instance like they have been compiled against different libraries and this is affecting the way UTF-8 / iso-8859-1 characters are handled.
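One way to check which build each command-line binary actually is might look like the sketch below. The helper name report_php is made up for illustration, and the candidate paths are simply the ones mentioned in this thread; they may not exist on another server.

```shell
#!/bin/sh
# Hypothetical helper: print the first line of `php -v` for each candidate
# binary that exists, so different builds can be compared side by side.
report_php() {
    for bin in "$@"; do
        if [ -x "$bin" ]; then
            # The version banner normally names the PHP version and the
            # build type (for example cli or cgi).
            printf '%s: %s\n' "$bin" "$("$bin" -v 2>/dev/null | head -n 1)"
        else
            printf '%s: not found\n' "$bin"
        fi
    done
}

# Candidate paths taken from this thread; adjust them for your own server.
report_php /usr/bin/php /usr/local/bin/php /usr/local/php5/bin/php
```

If a binary prints startup warnings before its banner, the first line shown may be a warning rather than the version, in which case running "php -v" by hand on that binary is clearer. The web side can be compared by loading a page that calls phpversion() or phpinfo().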
Could you perhaps try this script both from the web and then from the command line to see if utf8_decode function is behaving differently:
test.php
<?php
$text = "6ème";
print utf8_decode($text);
?>
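Because "?" on screen can hide what was really produced, it may help to compare raw bytes rather than displayed characters. The sketch below is only a reference point, assuming the iconv and od utilities are available: it writes "6ème" as explicit UTF-8 bytes and shows what a correct UTF-8 to ISO-8859-1 conversion, which is the job utf8_decode() does, should yield.

```shell
#!/bin/sh
# "6ème" written as explicit UTF-8 bytes, so the result does not depend on how
# this file or the terminal is encoded: è is the two bytes C3 A8 (octal 303 250).
utf8_hex=$(printf '6\303\250me' | od -An -tx1 | tr -d ' \n')

# Convert UTF-8 to ISO-8859-1; è should collapse to the single byte E8.
latin1_hex=$(printf '6\303\250me' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')

echo "UTF-8 bytes:      $utf8_hex"
echo "ISO-8859-1 bytes: $latin1_hex"
```

Piping the output of test.php through "od -An -tx1" for each php binary and comparing it with the ISO-8859-1 line (36 e8 6d 65) should show whether a given build really decodes differently, or whether the input was not UTF-8 to begin with (for example if test.php itself was saved as ISO-8859-1).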
Cheers,
David.