I AM GROOT
Or, languages are really hard.
So I was handing over some CSV export functionality to a client who loaded the file into Excel as is, without using the import wizard. As a result, Excel misinterpreted the UTF-8 as Windows-1252. I quickly wrote this little function (error handling omitted for brevity):
```php
<?php
function uconv($text) {
    // Index 0 is the child's stdin (we write to it), index 1 its stdout (we read from it).
    $descriptorspec = array(array("pipe", "r"), array("pipe", "w"));
    $process = proc_open("/usr/bin/uconv --add-signature", $descriptorspec, $pipes);
    // Feed the text to uconv, then close stdin so it sees end of input.
    fwrite($pipes[0], $text);
    fclose($pipes[0]);
    // Collect the converted output.
    $text = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($process);
    return $text;
}
?>
```
A quick test of the function showed it working, so I patched the CSV export to call it, deployed it on the dev server and... it died on the first accented character. I checked on the dev server from the command line and it worked. W.T.F. I compared the mbstring ini values, all the same. W.T.F, no, really, this can't be.
Well, there must be something different, right? What could it be? Locale? But what's a locale? Environment variables. Hrm, `proc_open` accepts environment variables too. Well then, let's see whether my shell feeds something into this script that makes it work: `env -i php x.php`. It breaks! Yay! It's always such a relief when I can reproduce a bug that refuses to be reproduced. The solution is always easy after that. The `LANG` environment variable is `en_US.utf8` in the shell and `C` in Apache, so the fix is to pass it to `proc_open` explicitly:
```php
<?php
proc_open("/usr/bin/uconv --add-signature", $descriptorspec, $pipes, NULL, array('LANG' => 'en_US.utf8'));
?>
```
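Put together, the whole thing might look roughly like the sketch below. It is an illustration, not the code that was deployed: `run_filter()` and its parameters are hypothetical names, and the checks are the kind of error handling the first listing left out. The part that matters is the fifth argument to `proc_open`: when it is an array, the child gets those variables instead of inheriting PHP's environment.

```php
<?php
// Sketch only: run_filter() is a hypothetical helper, not from the original code.
// It pipes $text through $command, giving the child the environment in $env.
function run_filter(string $command, string $text, array $env): string
{
    $descriptorspec = [
        0 => ['pipe', 'r'], // child's stdin, we write to it
        1 => ['pipe', 'w'], // child's stdout, we read from it
    ];
    $process = proc_open($command, $descriptorspec, $pipes, null, $env);
    if (!is_resource($process)) {
        throw new RuntimeException("could not start: $command");
    }
    // Caveat: writing everything before reading is fine for small inputs,
    // but can block if the child fills its output pipe first.
    fwrite($pipes[0], $text);
    fclose($pipes[0]);
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    $status = proc_close($process);
    if ($status !== 0) {
        throw new RuntimeException("$command exited with status $status");
    }
    return $output;
}

// The child sees the LANG value that was passed in, whatever the parent has:
echo run_filter('echo "LANG=$LANG"', '', ['LANG' => 'C']); // prints LANG=C
```

With that helper, the export would call something like `run_filter('/usr/bin/uconv --add-signature', $csv, ['LANG' => 'en_US.utf8'])`, where `$csv` stands for the exported text.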
P.S. Curiously enough, `-f utf-8` as a `uconv` argument didn't help, but `-f utf-8 -t utf-8` did. Moral of the story: `uconv` defaults to the encoding from `LANG` for both the source and the target. This is not documented and it's very hard to discover.
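To take `LANG` out of the picture, the two flags from the P.S. can be spelled out in the command itself. A minimal sketch, assuming UTF-8 on both sides; `uconv_command()` is a hypothetical helper name, not part of the original code:

```php
<?php
// Sketch only: uconv_command() is a hypothetical helper.
// It builds a uconv command line that names both encodings explicitly,
// so the conversion does not depend on the LANG of the calling process.
function uconv_command(string $from = 'utf-8', string $to = 'utf-8'): string
{
    return '/usr/bin/uconv --add-signature'
        . ' -f ' . escapeshellarg($from)  // source encoding
        . ' -t ' . escapeshellarg($to);   // target encoding
}

echo uconv_command(), "\n";
```

The resulting string can be handed to `proc_open` in place of the hard-coded command. Passing `LANG` as well does no harm, it just stops being the thing the output depends on.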