Re: Non-Roman characters in TLDs and domain names
On 11/03/2009 08:42 PM, Sidney Markowitz wrote:
>
> However, what does this mean for detecting URLs in plain text messages
> in which a URL string can be in a non-ASCII charset and MUAs might
> (eventually) parse them as URLs?
>

It seems clear that we will need to flatten/encode any URI domain to punycode for URIBL lookups.

http://search.cpan.org/search?query=punycode&mode=all

There exist some Punycode handling libs in CPAN. We might be best off standardizing on a particular library so URIBLs can use the same methodology for encoding their punycode listings.

The unclear part is whether we will need to decode URIs prior to punycode encoding. I suspect we will be forced to decode. Why?

* Encoding punycode with binary garbage input might be poorly defined and unstandardized.
* Some sites using SpamAssassin decode everything by preference, while most others do not decode. This means you could be querying URIBLs with two different flattened punycode strings.

Please correct me if my understanding is incorrect.

Warren Togami
wtogami@...
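For illustration only, the flattening step described above could look roughly like this. It is a minimal sketch that assumes the domain has already been decoded to Unicode text, and it uses Python's built-in "idna" codec (IDNA 2003 rules) as a stand-in for whichever library would actually be standardized on. The function name to_lookup_domain is hypothetical.

```python
def to_lookup_domain(domain: str) -> str:
    """Return the ASCII (punycode) form of an already-decoded domain name.

    Assumption: Python's built-in "idna" codec stands in for whatever
    punycode library a real implementation would standardize on.
    """
    # Drop a trailing root dot and lowercase, so that equivalent spellings
    # of the same name produce the same lookup string.
    normalized = domain.rstrip(".").lower()
    # The codec converts each non-ASCII label to its "xn--" form and
    # leaves plain ASCII labels unchanged.
    return normalized.encode("idna").decode("ascii")


if __name__ == "__main__":
    for name in ("日本語.テスト", "例え.テスト", "example.com"):
        print(name, "->", to_lookup_domain(name))
```

If both the list operators and the lookup clients applied the same normalization before encoding, they would arrive at the same listing key, which is the reason for agreeing on one library.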
|
|
Re: Non-Roman characters in TLDs and domain names
Sidney,

> News of an ICANN decision to allow international character
> sets in domain names was reported last week, for example

IDN and punycode have been around for a while below the TLD level, but so far the few IDN TLDs were only for testing. We came across it in:

http://marc.info/?t=123928717600002

> I'm concerned that it might have a big impact on SpamAssassin's parsing
> of headers and URLs.

It is quite possible there is still some too-strict regexp lying around. I know I fixed some in a dkim plugin.

> However, what does this mean for detecting URLs in plain text messages
> in which a URL string can be in a non-ASCII charset and MUAs might
> (eventually) parse them as URLs?

Slippery road ahead... Can't hurt to open a PR as a placeholder for concerns and ideas.

Mark
|
|
Re: Non-Roman characters in TLDs and domain names
Warren Togami wrote, On 4/11/09 3:27 PM:
> It seems clear that we will need to flatten/encode any URI domain to
> punycode for URIBL lookups.

I agree with that -- if something has non-ASCII characters then punycode is the canonical form to use to look it up.

> The unclear part is if we will need to decode URI's prior to punycode
> encoding. I suspect we will be forced to decode.

I'm not sure exactly what you mean, but the big issue that I see is how to determine that a string is a URL (where it starts and where it stops) that needs to be encoded to punycode. Is that what you are talking about?

The rule of thumb that I used when working on code to extract URLs from plain text is that if some common MUA hot links it, then we want to treat it as a URL. Perhaps the answer is to wait until MUAs support these URLs and then follow that rule of thumb.

-- sidney
|
|
Re: Non-Roman characters in TLDs and domain names
On 11/03/2009 09:50 PM, Sidney Markowitz wrote:
> Warren Togami wrote, On 4/11/09 3:27 PM:
>> It seems clear that we will need to flatten/encode any URI domain to
>> punycode for URIBL lookups.
>
> I agree with that -- if something has non-ASCII characters then punycode
> is the canonical form to use to look it up.
>
>> The unclear part is if we will need to decode URI's prior to punycode
>> encoding. I suspect we will be forced to decode.
>
> I'm not sure exactly what you mean, but the big issue that I see is how
> to determine that a string is a URL (where it starts and where it stops)
> that needs to be encoded to punycode. Is that what you are talking
> about? The rule of thumb that I used when working on code to extract
> URLs from plain text is that if some common MUA hot links it, then we
> want to treat it as a URL. Perhaps the answer is to wait until MUAs
> support these URLs and then follow that rule of thumb.
>
> -- sidney

http://日本語.テスト/

Did your MUA turn that into a clickable link?

Thunderbird yes
Evolution no
GMail yes
Squirrelmail no
Roundcubemail no

Yes, it might be hard to figure out the beginning and end of a URL without decoding the entire message. Determining if we can do it without full body decoding will be an important first step before deciding what else we will do.

I suspect we will be forced to decode arbitrary encodings before punycode flattening because URI domains can be encoded in different ways. The following examples are not correct, but they demonstrate the problem:

ASCII without decoding the domain sent as UTF-8
http://日本語.テスト/

ASCII without decoding the domain sent as ISO-2022-JP
http://$BF|K\8l(B.$B%F%9%H(B/

Both of these strings are the same domain name. But if not decoded before punycode flattening they will query as different strings in the URIBL lookup.

Warren Togami
wtogami@...
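The point about decoding first can be sketched as follows. This is only an illustration under stated assumptions: the charset of each message is known, Python's built-in codecs ("utf-8", "iso2022_jp", "idna") stand in for whatever a real implementation would use, and the helper name lookup_key is hypothetical.

```python
DOMAIN = "日本語.テスト"

# The same domain as it would arrive on the wire in two common Japanese
# mail encodings. The raw byte strings differ.
as_utf8 = DOMAIN.encode("utf-8")
as_jis = DOMAIN.encode("iso2022_jp")


def lookup_key(raw_domain: bytes, charset: str) -> str:
    """Decode the raw bytes using the message charset, then flatten to punycode."""
    text = raw_domain.decode(charset)
    return text.encode("idna").decode("ascii")


if __name__ == "__main__":
    print("raw bytes equal:  ", as_utf8 == as_jis)
    print("key from UTF-8:   ", lookup_key(as_utf8, "utf-8"))
    print("key from ISO-2022:", lookup_key(as_jis, "iso2022_jp"))
```

Once each form is decoded with its own charset, both yield the same punycode string, so a single URIBL listing would cover both.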
|
|
Re: Non-Roman characters in TLDs and domain names
Warren Togami wrote, On 4/11/09 4:20 PM:
> http://日本語.テスト/
>
> Did your MUA turn that into a clickable link?
>
> Thunderbird yes
> Evolution no
> GMail yes
> Squirrelmail no
> Roundcubemail no

To me this says that SpamAssassin should see it as a URL and check the domain when doing URIBL testing. Especially if Outlook and/or Outlook Express parse it as a URL. But only if clicking on the hot link in the MUA results in a browser actually going to the site. In other words, if it is useful to spammers to get people to go to the URL then we should recognize it as a URL.

> The following examples are not correct, but it demonstrates the problem:
>
> ASCII without decoding the domain sent as UTF-8
> http://日本語.テスト/
>
> ASCII without decoding the domain sent as ISO-2022-JP
> http://$BF|K\8l(B.$B%F%9%H(B/

My Thunderbird only interprets the first one completely as a URL string; the second one it ends at the pipe character, making it useless for a spammer. The first one is clickable, but I don't see that Firefox, at least, is interpreting it as the proper domain. Except that I don't understand the encodings enough to say what it is doing.

Can you show me the equivalent for the following URL, which is a real site? That way we can easily answer the question "If the MUA makes it a hot link, is it a link that works?"

http://例え.テスト/

And what about this?

http://例え.テスト/メインページ

-- sidney
|
|
Re: Non-Roman characters in TLDs and domain names
Sidney Markowitz wrote, On 4/11/09 6:21 PM:
> http://例え.テスト/
>
> And what about this?
>
> http://例え.テスト/メインページ

This is interesting... With Thunderbird and Firefox, if I click on those links Firefox ends up sticking in a "www." prefix and saying that it can't find the server. But if I right-click, copy the link, and paste it into Firefox then it works fine.

-- sidney
|
|
Re: Non-Roman characters in TLDs and domain names
On 11/04/2009 12:21 AM, Sidney Markowitz wrote:
>
>> The following examples are not correct, but it demonstrates the problem:
>>
>> ASCII without decoding the domain sent as UTF-8
>> http://日本語.テスト/
>>
>> ASCII without decoding the domain sent as ISO-2022-JP
>> http://$BF|K\8l(B.$B%F%9%H(B/
>
> My Thunderbird only interprets the first one completely as a URL string,
> the second one it ends at the pipe character, making it useless for a
> spammer. The first one is clickable, but I don't see that Firefox, at

My point was lost here. I pasted these URLs as an example of what the SpamAssassin URI parser might see without decoding. The above two examples are http://日本語.テスト/ in two common encodings of Japanese e-mail. Since they are not decoded by SpamAssassin, they might become two different punycode strings and two different URIBL lookups. This is why we may need to always decode before punycode encoding.

>
> Can you show me the equivalent for the following URL, which is a real
> site? That way we can easily answer the question "If the MUA makes it a
> hot link, is it a link that works?"

Whether a link is clickable today is not relevant. MUAs and browsers in the future will adapt to support these international TLDs. Prominent clients like Thunderbird and Gmail today already make them clickable. I suspect the other clients don't make them clickable today only because they are unknown TLDs or they don't recognize non-ASCII domains as valid URIs. Yet.

Warren Togami
wtogami@...
|
|
Re: Non-Roman characters in TLDs and domain names
On Wed 04 Nov 2009 04:20:25 CET, Warren Togami wrote
> http://日本語.テスト/
> Did your MUA turn that into a clickable link?
>
> Thunderbird yes
> Evolution no
> GMail yes
> Squirrelmail no
> Roundcubemail no

horde imp yes

-- xpoint
|
|
Re: Non-Roman characters in TLDs and domain names
On Wed 04 Nov 2009 06:21:17 CET, Sidney Markowitz wrote
> http://例え.テスト/

works

> http://例え.テスト/メインページ

works partly; the subdir is not a link

-- xpoint
|
|
Re: Non-Roman characters in TLDs and domain names
Warren Togami <wtogami@...> writes:
> http://日本語.テスト/
>
> Did your MUA turn that into a clickable link?
>
> Thunderbird yes
> Evolution no
> GMail yes
> Squirrelmail no
> Roundcubemail no

gnus (trunk, emacs22): yes (and it looked ok), but when handed to firefox did not work (url is not valid and cannot be loaded).
|
|
Re: Non-Roman characters in TLDs and domain names
Warren Togami wrote, On 4/11/09 7:17 PM:
> My point was lost here. I pasted these URL's as an example of what the
> spamassassin URI parser might see without decoding

Let's see if I understand this correctly: The message consists of a sequence of bytes which encode characters in a certain charset. When host names and domain names were restricted to 7-bit ASCII, they could be parsed out by SpamAssassin by looking at the raw bytes without regard to the charset. Now we would have to convert the entire byte stream from the raw bytes to wide characters according to the charset of the message before we could be sure to parse and handle text URLs correctly. Does that sum it up?

I haven't paid attention to the issue of charset encoding and wide characters. How much are we getting away with assuming that most emails are in one-byte character codes, or at least in codes that represent the ASCII set as one byte, so that we can just apply rules to the raw byte strings and it works most of the time? How badly does SpamAssassin fall down if mail is encoded in a charset that violates that assumption?

> Clickable link today is not relevant. MUA and browsers in the future
> will adapt to support these international TLD's.

It is relevant to what we should handle right now versus what we plan to handle in the future when MUAs are changed.

-- sidney
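As a rough illustration of the raw-bytes question above, the following sketch compares an ASCII-only pattern applied to the undecoded body with a Unicode-aware pattern applied after decoding. Both regular expressions and both function names are hypothetical simplifications written for this example; they are not SpamAssassin's actual URI parsing rules.

```python
import re

# Hypothetical ASCII-only host pattern applied to the undecoded byte stream.
ASCII_ONLY_URL = re.compile(rb"https?://[A-Za-z0-9.-]+")

# Hypothetical pattern applied after decoding; in a str pattern, \w also
# matches non-ASCII letters such as kanji and kana.
DECODED_URL = re.compile(r"https?://[\w.-]+")


def find_urls_raw(body: bytes) -> list:
    """Scan the raw bytes without regard to the message charset."""
    return ASCII_ONLY_URL.findall(body)


def find_urls_decoded(body: bytes, charset: str) -> list:
    """Decode the body with its declared charset, then scan the text."""
    return DECODED_URL.findall(body.decode(charset))


if __name__ == "__main__":
    body = "see http://例え.テスト/ now".encode("utf-8")
    print(find_urls_raw(body))
    print(find_urls_decoded(body, "utf-8"))
```

In this sketch the byte-level scan keeps working for plain ASCII hosts but finds nothing for the internationalized one, while the decoded scan recovers the host in a form that can then be converted to punycode.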
|
|
Re: Non-Roman characters in TLDs and domain names
Warren Togami wrote:
> On 11/03/2009 09:50 PM, Sidney Markowitz wrote:
>> Warren Togami wrote, On 4/11/09 3:27 PM:
>>> It seems clear that we will need to flatten/encode any URI domain to
>>> punycode for URIBL lookups.
>>
>> I agree with that -- if something has non-ASCII characters then
>> punycode is the canonical form to use to look it up.
>>
>>> The unclear part is if we will need to decode URI's prior to
>>> punycode encoding. I suspect we will be forced to decode.
>>
>> I'm not sure exactly what you mean, but the big issue that I see is
>> how to determine that a string is a URL (where it starts and where it
>> stops) that needs to be encoded to punycode. Is that what you are
>> talking about? The rule of thumb that I used when working on code to
>> extract URLs from plain text is that if some common MUA hot links it,
>> then we want to treat it as a URL. Perhaps the answer is to wait
>> until MUAs support these URLs and then follow that rule of thumb.
>>
>> -- sidney
>
> http://日本語.テスト/
>
> Did your MUA turn that into a clickable link?
>
> Thunderbird yes
> Evolution no
> GMail yes
> Squirrelmail no
> Roundcubemail no

knode yes.

/Per Jessen, Zürich
|
|
Re: Non-Roman characters in TLDs and domain names
On 11/04/2009 12:34 PM, Sidney Markowitz wrote:
> Warren Togami wrote, On 4/11/09 7:17 PM:
>> My point was lost here. I pasted these URL's as an example of what the
>> spamassassin URI parser might see without decoding
>
> Let's see if I understand this correctly: The message consists of a
> sequence of bytes which encode characters in a certain charset. When
> host names and domain names were restricted to 7-bit ASCII then they
> could be parsed out by SpamAssassin by looking at the raw bytes without
> regard to the charset. Now we would have to convert the entire byte
> stream from the raw bytes to wide characters according to the charset of
> the message before we could be sure to parse and handle text URLs
> correctly. Does that sum it up?
>
> I haven't paid attention to the issue of charset encoding and wide
> characters. How much are we getting away with assuming that most emails
> are in one-byte character codes or at least in codes that represent the
> ASCII set as one byte and so we can just apply rules to the raw byte
> strings and it works most of the time? How badly does SpamAssassin fall
> down if mail is encoded in a charset that violates that assumption?
>
>> Clickable link today is not relevant. MUA and browsers in the future
>> will adapt to support these international TLD's.
>
> It is relevant to what we should handle right now versus what we plan to
> handle in the future when MUAs are changed.
>
> -- sidney
>

What SpamAssassin handles right now is fine. Punycode domain names (without the soon-to-be-ratified IDN TLDs) are rare because clients do not support them. I thought this thread was about the future, after clients begin supporting IDN TLDs.

Warren