this is a bug in the encoding procedure

Hi,

I am Chinese and I am using log4cxx as the logging facility in my project (the locale on my Linux server has been set to "zh_CN.GBK").

When I switched to the 0.10.0 release (I used version 0.97 beta before), I came across a problem: none of the Chinese log messages produced by my program were displayed correctly.

Therefore, I decided to examine the source, and I found something which I suspect is the cause of my problem. The suspected code is:

    std::string Transcoder::encodeCharsetName(const LogString& val) {
        char asciiTable[] = { ' ', '!', '"', '#', '$', '%', '&', '\'',
                              '(', ')', '*', '+', ',', '-', '.', '/',
                              '0', '1', '2', '3', '4', '5', '6' , '7',
                              '8', '9', ':', ';', '<', '=', '>', '?',
                              '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G',
                              'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
                              'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',
                              'X', 'Y', 'Z', '[', '\\', ']', '^', '_',
                              '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
                              'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
                              'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
                              'x', 'y', 'z', '{', '|', '}', '~', ' ' };
        std::string out;
        for(LogString::const_iterator iter = val.begin();
            iter != val.end();
            iter++) {
            if (*iter >= 0x30 && *iter < 0x7F) {
                out.append(1, asciiTable[*iter - 0x30]); // this is the problematic line of code for me.
            } else {
                out.append(1, LOSSCHAR);
            }
        }
        printf(out.c_str());
        return out;
    }

I replaced the line "out.append(1, asciiTable[*iter - 0x30]);" with "out.append(1, *iter);", and my problem was solved. (The input argument of this function is "GBK" on my system. Before I hacked the code, this function returned "12;"; after the hack, it returns "GBK", which is my desired result.)

I don't understand why we need to change the charset name at all. (Out of fear of non-ASCII charset names? Even with that fear, I can't see the need to change "GBK" to "12;".)
Re: this is a bug in the encoding procedure

I'm thinking the constant should be 0x20, not 0x30. The code was an attempt to handle non-ASCII platforms like EBCDIC, but it looks like it was mangled, and it was written without access to a non-ASCII platform. It was just trying to do enough decoding to get the encoding name needed to load a full charset.

On Aug 20, 2009, at 9:06 PM, shadow king wrote:
> Hi,
>
> I am Chinese and I am using log4cxx as the logging facility in my
> project (the locale on my Linux server has been set to "zh_CN.GBK").
>
> When I switched to the 0.10.0 release (I used version 0.97 beta
> before), I came across a problem: none of the Chinese log messages
> produced by my program were displayed correctly.
>
> Therefore, I decided to examine the source, and I found something
> which I suspect is the cause of my problem. The suspected code is:
>
> std::string Transcoder::encodeCharsetName(const LogString& val) {
>     char asciiTable[] = { ' ', '!', '"', '#', '$', '%', '&', '\'',
>                           '(', ')', '*', '+', ',', '-', '.', '/',
>                           '0', '1', '2', '3', '4', '5', '6' , '7',
>                           '8', '9', ':', ';', '<', '=', '>', '?',
>                           '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G',
>                           'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
>                           'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',
>                           'X', 'Y', 'Z', '[', '\\', ']', '^', '_',
>                           '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
>                           'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
>                           'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
>                           'x', 'y', 'z', '{', '|', '}', '~', ' ' };
>     std::string out;
>     for(LogString::const_iterator iter = val.begin();
>         iter != val.end();
>         iter++) {
>         if (*iter >= 0x30 && *iter < 0x7F) {
>             out.append(1, asciiTable[*iter - 0x30]); // this is the problematic line of code for me.
>         } else {
>             out.append(1, LOSSCHAR);
>         }
>     }
>     printf(out.c_str());
>     return out;
> }
>
> I replaced the line "out.append(1, asciiTable[*iter - 0x30]);" with
> "out.append(1, *iter);", and my problem was solved. (The input
> argument of this function is "GBK" on my system. Before I hacked
> the code, this function returned "12;"; after the hack, it returns
> "GBK", which is my desired result.)
>
> I don't understand why we need to change the charset name at all.
> (Out of fear of non-ASCII charset names? Even with that fear, I
> can't see the need to change "GBK" to "12;".)
Re: this is a bug in the encoding procedure

Thanks for your reply.

BTW, is there a convenient way to switch off all the "decoding & encoding" completely? I don't want the performance penalty imposed by the related functions.

For me, I cannot see the benefit of using Unicode as the internal charset in log4cxx; I just want log4cxx to log the messages without any charset conversion.

On Fri, Aug 21, 2009 at 12:00 PM, Curt Arnold <carnold@...> wrote:
> I'm thinking the constant should be 0x20, not 0x30. The code was an
> attempt to be able to handle non-ASCII platforms like EBCDIC but
> looks like it was mangled and was done without access to a non-ASCII
> platform. Was just trying to do enough decoding to get the encoding
> name to load a full charset.
Re: this is a bug in the encoding procedure

On Aug 21, 2009, at 4:32 AM, shadow king wrote:
> Thanks for your reply.
>
> BTW, is there a convenient way to switch off all the "decoding &
> encoding" completely? I don't want the performance penalty imposed
> by the related functions.
>
> For me, I cannot see the benefit of using Unicode as the internal
> charset in log4cxx; I just want log4cxx to log the messages without
> any charset conversion.
>
> On Fri, Aug 21, 2009 at 12:00 PM, Curt Arnold <carnold@...> wrote:
>> I'm thinking the constant should be 0x20, not 0x30. The code was an
>> attempt to be able to handle non-ASCII platforms like EBCDIC but
>> looks like it was mangled and was done without access to a non-ASCII
>> platform. Was just trying to do enough decoding to get the encoding
>> name to load a full charset.

You can hardwire the assumed encoding with one of:

    ./configure --with-charset=utf-8
    ./configure --with-charset=usascii
    ./configure --with-charset=iso-8859-1

All three will replace conversion with glorified copy operations.

Specifying usascii will replace all non-ASCII characters with a loss character ('?'), but if you specify a particular encoding for a file, the resulting file will be valid.

utf-8 will blast characters directly into the internal representation. If you do not specify an encoding on any file appenders, the output file will have the same charset as the platform. Filters, XML files, SocketAppenders, and anything with a specified encoding may be invalid. However, if you are just going straight through to a file, you won't end up with loss characters.

iso-8859-1 will convert characters to utf-8. If you do not specify any encoding, the output file will have the same encoding as the platform. It won't result in illegal byte sequences the way specifying UTF-8 can, but any explicit encoding may result in character substitution.