Character encoding for APT files

When Doxia generates HTML from APT, it appears to force the HTML file to use ISO-8859-1, regardless of the original APT's encoding. I don't really understand why, since the Maven Doxia Converter supposedly generates all files in UTF-8:

http://maven.apache.org/doxia/doxia-tools/doxia-converter/index.html

I found another user who's having a similar problem:

http://www.mailinglistarchive.com/users@.../ msg21983.html

He demonstrated a technique that appears to tell Doxia which encoding to use:

    <plugin>
      <artifactId>maven-site-plugin</artifactId>
      <configuration>
        <outputEncoding>UTF-8</outputEncoding>
      </configuration>
    </plugin>

But this has no effect for me. Is there any way to force Doxia to produce UTF-8 HTML for my APT files?

Thanks,

Trevor
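A minimal sketch of a configuration that sets both directions explicitly. It assumes the maven-site-plugin version in use supports an `inputEncoding` parameter alongside `outputEncoding`; nothing in this thread confirms that it resolves the problem described above.

```xml
<!-- Sketch only: assumes a maven-site-plugin version that supports both parameters. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-site-plugin</artifactId>
  <configuration>
    <!-- Encoding used to read the APT sources (assumed parameter: inputEncoding). -->
    <inputEncoding>UTF-8</inputEncoding>
    <!-- Encoding used to write the generated HTML. -->
    <outputEncoding>UTF-8</outputEncoding>
  </configuration>
</plugin>
```

The point of separating the two is that `outputEncoding` alone only describes how the HTML is written. If the APT source is read with a different charset, the characters may already be wrong before any output is produced.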
|
|
Re: Character encoding for APT files

The doxia-converter is not used by the site plugin; it is meant to be a stand-alone tool. There has been a lot of work on encoding issues since I last worked on Doxia, and I'm not up to date with the exact status. Maybe Herve or Vincent can clarify?

Cheers,
-Lukas

Trevor Harmon wrote:
> When Doxia generates HTML from APT, it appears to force the HTML file
> to use ISO-8859-1, regardless of the original APT's encoding. I don't
> really understand why, since the Maven Doxia Converter supposedly
> generates all files in UTF-8:
>
> http://maven.apache.org/doxia/doxia-tools/doxia-converter/index.html
>
> I found another user who's having a similar problem:
>
> http://www.mailinglistarchive.com/users@.../ msg21983.html
>
> He demonstrated a technique that appears to tell Doxia which encoding
> to use:
>
>     <plugin>
>       <artifactId>maven-site-plugin</artifactId>
>       <configuration>
>         <outputEncoding>UTF-8</outputEncoding>
>       </configuration>
>     </plugin>
>
> But this has no effect for me. Is there any way to force Doxia to
> produce UTF-8 HTML for my APT files? Thanks,
>
> Trevor
|
|
Re: Character encoding for APT files

On Jan 13, 2009, at 5:27 AM, Lukas Theussl wrote:
> There has been a lot of work regarding encoding issues since I last
> worked on Doxia and I'm not up-to-date with the exact status. Maybe
> Herve or Vincent can clarify?

I haven't heard from them. Should I file a bug on this issue?

Trevor
|
|
Re: Character encoding for APT files

Yes please, a minimal test project that illustrates the problem will certainly help.

Thanks!
-Lukas

Trevor Harmon wrote:
> On Jan 13, 2009, at 5:27 AM, Lukas Theussl wrote:
>
>> There has been a lot of work regarding encoding issues since I last
>> worked on Doxia and I'm not up-to-date with the exact status. Maybe
>> Herve or Vincent can clarify?
>
> I haven't heard from them. Should I file a bug on this issue?
>
> Trevor
|
|
Re: Character encoding for APT files

On Jan 22, 2009, at 10:16 AM, Lukas Theussl wrote:
> yes please, a minimalistic test project that illustrates the problem
> will certainly help.

http://jira.codehaus.org/browse/DOXIA-278

Trevor
|
|
Re: Character encoding for APT files

On Thursday, January 22, 2009, Trevor Harmon wrote:
> On Jan 22, 2009, at 10:16 AM, Lukas Theussl wrote:
> > yes please, a minimalistic test project that illustrates the problem
> > will certainly help.
>
> http://jira.codehaus.org/browse/DOXIA-278
>
> Trevor

Sorry, I was working on other things and missed this discussion.
I just commented on the issue (and closed it as "Not A Bug" :) ).

Regards,

Hervé
|
|
Re: Character encoding for APT files

On Jan 22, 2009, at 4:50 PM, Hervé BOUTEMY wrote:
> Sorry, I was working on other things and missed this discussion.
> I just commented (and closed as "Not A Bug" :) ) the issue.

I agree that autodetecting is not a bullet-proof feature, but an absolute guarantee is not required in this case. I share Jason van Zyl's view: "If it's right most of the time, and it saves the user from having to know or worry about it then yes I would use it." [1]

Another issue is that without autodetection, supporting more than one type of character encoding for the APT files in a Maven project is impossible.

That said, if autodetection is simply out of the question, let me suggest a different tack. Doxia appears to require ISO-8859-1 for APT files by default. This is a Western-centric encoding that lacks support for Asian languages. It is also deprecated. According to Wikipedia:

"The ISO/IEC working group responsible for maintaining eight-bit coded character sets disbanded and ceased all maintenance of ISO 8859, including ISO 8859-1, in order to concentrate on the Universal Character Set and Unicode." [2]

I would also say that with the increasing popularity of UTF-8, the number of encoding problems encountered by users due to Doxia favoring ISO-8859-1 is already larger than any problems that might occur due to bad autodetection. In other words, autodetection might be wrong some of the time, but for many users, ISO-8859-1 is wrong all of the time.

In light of this, I suggest changing Doxia's APT handling so that it defaults to UTF-8 rather than ISO-8859-1. Not only will this help UTF-8 users (who may be a majority), it will also help increase Maven's acceptance in the Asian world, a trend that is already happening [3].

I can work on a patch for this, if there's a chance it will be accepted.

Trevor

[1] http://www.nabble.com/Re%3A--VOTE--POM-Element-for-Source-File-Encoding-p16566779.html
[2] http://en.wikipedia.org/wiki/ISO_8859-1
[3] http://blogs.sonatype.com/people/2008/07/apache-maven-the-definitive-chinese-guide/
|
|
Re: Character encoding for APT files

I knew this would cause another discussion: encoding choices are always like this :)

On Friday, January 23, 2009, Trevor Harmon wrote:
> On Jan 22, 2009, at 4:50 PM, Hervé BOUTEMY wrote:
> > Sorry, I was working on other things and missed this discussion.
> > I just commented (and closed as "Not A Bug" :) ) the issue.
>
> I agree that autodetecting is not a bullet-proof feature, but an
> absolute guarantee is not required in this case. I share Jason van
> Zyl's view: "If it's right most of the time, and it saves the user
> from having to know or worry about it then yes I would use it." [1]

The problem with such auto-detection in a tool like Doxia, used by maven-site-plugin, is that if the guessed encoding is not right, you can't do anything (or you have to configure it, which is what you wanted to avoid). That is not the case in a GUI such as a web browser, where a user can change the encoding in a couple of clicks if there is a problem.

> Another issue is that without autodetection, supporting more than one
> type of character encoding for the APT files in a Maven project is
> impossible.

Same remark as before: what if the encoding guessed from a file is wrong?

> That said, if autodetection is simply out of the question, let me
> suggest a different tack. Doxia appears to require ISO-8859-1 for APT
> files by default. This is a Western-centric encoding that lacks
> support for Asian languages. It is also deprecated. According to
> Wikipedia:
>
> "The ISO/IEC working group responsible for maintaining eight-bit coded
> character sets disbanded and ceased all maintenance of ISO 8859,
> including ISO 8859-1, in order to concentrate on the Universal
> Character Set and Unicode." [2]
>
> I would also say that with the increasing popularity of UTF-8, the
> number of encoding problems encountered by users due to Doxia favoring
> ISO-8859-1 is already larger than any problems that might occur due to
> bad autodetection. In other words, autodetection might be wrong some
> of the time, but for many users, ISO-8859-1 is wrong all of the time.

[...] problematic for a lot of people. A proposal was implemented in a lot of Maven plugins to make encoding easily configurable: see [4]. When the question of the default encoding came up, there was a large poll (you'll find links in the proposal), which concluded that the default source encoding should be the platform encoding.

The configuration part of the proposal was taken into account in maven-site-plugin 2.0-beta-7 on 03 Jul 2008 (see MSITE-314), but the default encoding wasn't changed: MSITE-326 tracks that question, so people can vote there if they want the platform encoding (= the full proposal, which is platform-dependent) instead of ISO-8859-1. There doesn't seem to be real traction...

A lot of Maven plugins today complain if you don't configure a default encoding: it is a simple property to add to your POM. Doesn't that meet your needs?

> In light of this, I suggest changing Doxia's APT handling so that it
> defaults to UTF-8 rather than ISO-8859-1. Not only will this help
> UTF-8 users (who may be a majority),

Do you have figures, or is it a guess? AFAIK, the Windows default encoding is still CP-1252 for Western European languages. I don't know if this has changed with Vista. So I doubt everybody has switched to UTF-8. There is no ideal default encoding: only configuration fixes the issue.

> it will also help increase
> Maven's acceptance in the Asian world, a trend that is already
> happening [3].
>
> I can work on a patch for this, if there's a chance it will be accepted.
>
> Trevor
>
> [1] http://www.nabble.com/Re%3A--VOTE--POM-Element-for-Source-File-Encoding-p16566779.html
> [2] http://en.wikipedia.org/wiki/ISO_8859-1
> [3] http://blogs.sonatype.com/people/2008/07/apache-maven-the-definitive-chinese-guide/

[4] http://docs.codehaus.org/display/MAVENUSER/POM+Element+for+Source+File+Encoding
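For illustration, the "simple property" from the source-encoding proposal referenced above might look like the following in a POM. The property name `project.build.sourceEncoding` is the one the proposal defines; whether a given plugin version actually reads it is an assumption to verify per plugin.

```xml
<!-- Sketch of a POM fragment; all other POM elements are omitted. -->
<project>
  <properties>
    <!-- Declares the encoding of the project's source files, so plugins
         that honor this property do not fall back to their own default. -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
</project>
```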
|
|
Re: Character encoding for APT files

On Jan 23, 2009, at 3:24 PM, Hervé BOUTEMY wrote:
> the problem with such an auto-detection in a tool like Doxia used by
> maven-site-plugin is that if the guessed encoding is not right, you
> can't do anything

I was thinking that manually specifying a particular encoding would override the autodetection feature.

> (or you have to configure it, which is what you wanted to avoid)

If autodetection guesses wrong (and I maintain that it would seldom guess wrong), having to configure it those few times would be better than having to configure it all the time, which is what UTF-8 users have to do now.

>> Another issue is that without autodetection, supporting more than one
>> type of character encoding for the APT files in a Maven project is
>> impossible.
> same remark as before: and what if guessed encoding from a file
> is wrong?

The error rate would go from all the time to some of the time, which is still a win. Again, I'm assuming that autodetection is optional and enabled by default; if it causes problems it could be disabled, reverting to the same behavior as before.

> There are a lot of Maven plugins today that complain if you don't
> configure default encoding: it is a simple property to add in your POM.
> Doesn't it meet your needs?

The problem is that I have many dozens of POMs, and I have to declare the encoding in all of them. Is there some way of configuring the encoding globally, perhaps in settings.xml?

>> In light of this, I suggest changing Doxia's APT handling so that it
>> defaults to UTF-8 rather than ISO-8859-1. Not only will this help
>> UTF-8 users (who may be a majority),
> do you have figures, or is it a guess?

It's a guess, though there's circumstantial evidence pointing to the rise of UTF-8. It's definitely growing on the web [1], and text editors I've used, such as Eclipse on Linux and TextMate on Mac OS X, default to UTF-8. I'm actually surprised UTF-8 hasn't been adopted more quickly because it solves so many issues. But I worry that we're never going to get there if modern applications continue to require native file encodings by default.

Trevor

[1] http://www.w3.org/QA/2008/05/utf8-web-growth.html
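On the question of declaring the encoding in dozens of POMs: Maven properties declared in a parent POM are inherited by its child modules, so one possible approach is to declare the encoding once in a shared parent. The sketch below is not an answer given in this thread, and the coordinates (`com.example`, `company-parent`, `my-module`) are hypothetical.

```xml
<!-- File 1: hypothetical shared parent POM (packaging "pom"). -->
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>company-parent</artifactId>
  <version>1.0</version>
  <packaging>pom</packaging>
  <properties>
    <!-- Declared once here; child modules inherit it. -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
</project>

<!-- File 2: a child module's POM. It only references the parent and
     does not repeat the encoding property. -->
<project>
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.example</groupId>
    <artifactId>company-parent</artifactId>
    <version>1.0</version>
  </parent>
  <artifactId>my-module</artifactId>
</project>
```

This only helps for projects that can share a parent. It does not provide the per-machine default that a settings.xml entry would.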