How to translate utf-8 hex to unicode hex?


by Sean-130


Hi,

My input is from HTTP, 3 hard-coded bytes of UTF-8 hex value.
What I want is 2 bytes unicode.

For example:
let input = "%E9%A6%AC"
let output = "99AC"

Based on the output, I can then get the real CJK: 馬.

Is it possible to do it from within Vim?

Thanks

Sean
--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---


Re: How to translate utf-8 hex to unicode hex?

by Tony Mechelynck-2


On 10/11/09 19:44, Sean wrote:

>
> Hi,
>
> My input is from HTTP, 3 hard-coded bytes of UTF-8 hex value.
> What I want is 2 bytes unicode.
>
> For example:
> let input = "%E9%A6%AC"
> let output = "99AC"
>
> Based on the output, I can then get the real CJK: 馬.
>
> Is it possible to do it from within Vim?
>
> Thanks
>
> Sean

You can do it the hard way, with arithmetic computations which I shall
explain below.

Or you can do it the easy way, by writing the bytes to disc as if they
were Latin1 (see ":help ++opt") and reading them back as UTF-8.

Or you can use the iconv() function (q.v.).
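For comparison outside Vim, here is a minimal Python sketch of the same conversion: percent-escapes to bytes, bytes read as UTF-8, code point printed as hex. The function name percent_to_codepoints is made up for this illustration.

```python
from urllib.parse import unquote_to_bytes


def percent_to_codepoints(escaped):
    """Return the code points (as hex strings) encoded by a percent-escaped UTF-8 string."""
    raw = unquote_to_bytes(escaped)   # "%E9%A6%AC" -> b'\xe9\xa6\xac'
    text = raw.decode("utf-8")        # raises UnicodeDecodeError on invalid UTF-8
    return [format(ord(ch), "04X") for ch in text]


print(percent_to_codepoints("%E9%A6%AC"))  # ['99AC']
```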


UTF-8 bytes are divided into "watertight" (mutually exclusive) categories as follows:

- Bytes 0x00 to 0x7F are "single" bytes, they each represent a single
codepoint in the exact same format as in Latin-1 or 7-bit US-ASCII.

- Bytes 0xC0 to (currently) 0xF4 or (as originally foreseen and still
supported by Vim) 0xFD are "header" bytes in a multibyte sequence. Such
a byte MUST be the first byte of its sequence and the number of "one"
bits above the topmost "zero" bit indicates the number of bytes
(including this one) in the whole sequence.

- Bytes 0x80 to 0xBF are "trailer" bytes in a multibyte sequence. They
can be any byte in the sequence except the first.

- Bytes 0xFE and 0xFF are always invalid anywhere in UTF-8 text.

- In the bytes of a multibyte sequence, all bits after the topmost
"zero" bit in each byte constitute the "payload": they are data bits,
and in UTF-8 the most significant bits always come first.
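The categories above can be sketched as a small Python classifier. It follows the list as written, that is, the original wide definition with header bytes up to 0xFD; the current standard (RFC 3629) is stricter and also rejects 0xC0, 0xC1 and everything above 0xF4. The names classify_utf8_byte and sequence_length are invented for this example.

```python
def classify_utf8_byte(b):
    """Classify one byte value (0..255) using the categories listed above."""
    if b <= 0x7F:
        return "single"
    if b <= 0xBF:
        return "trailer"
    if b <= 0xFD:
        return "header"
    return "invalid"          # 0xFE and 0xFF


def sequence_length(header):
    """Count the one bits above the topmost zero bit of a header byte."""
    n = 0
    mask = 0x80
    while header & mask:
        n += 1
        mask >>= 1
    return n


for b in (0xE9, 0xA6, 0xAC):
    print(hex(b), classify_utf8_byte(b))
print(sequence_length(0xE9))  # 3
```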


Your example translates as follows:

0xE9 = 1110.1001 binary
        header byte
        the sequence is of three bytes
        payload: 1001
0xA6 = 1010.0110 binary
        trailer byte
        payload: 100110
0xAC = 1010.1100 binary
        trailer byte
        payload: 101100
Result (concatenated payload bits) 1001.1001.1010.1100 binary, or U+99AC
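The same arithmetic, written out as a Python sketch: strip the marker bits from each byte and concatenate the payloads, most significant first. The function name decode_utf8_sequence is an assumption of this example, and the sketch does not check for overlong encodings.

```python
def decode_utf8_sequence(seq):
    """Decode one multibyte UTF-8 sequence (header + trailers) to a code point."""
    header = seq[0]
    # The number of one bits above the topmost zero bit is the sequence length.
    length = 0
    mask = 0x80
    while header & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6 or length != len(seq):
        raise ValueError("not a complete multibyte sequence")
    # Header payload: the bits below the topmost zero bit.
    codepoint = header & (mask - 1)
    for trailer in seq[1:]:
        if trailer & 0xC0 != 0x80:
            raise ValueError("expected a trailer byte (0x80..0xBF)")
        # Each trailer contributes its low six bits.
        codepoint = (codepoint << 6) | (trailer & 0x3F)
    return codepoint


print(format(decode_utf8_sequence(bytes([0xE9, 0xA6, 0xAC])), "04X"))  # 99AC
```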

Note that some hanzi are above U+20000; the UTF-8 code for them consists
of four bytes, not three: e.g. 𠄣 = U+20123 = UTF-8 0xF0 0xA0 0x84 0xA3
= %F0%A0%84%A3 in "percent-escaped" HTTP coding.
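The four-byte case can be checked in the other direction with a short Python sketch that splits U+20123 into 3 + 6 + 6 + 6 payload bits. The function name encode_four_bytes is made up here, and the sketch covers only the four-byte range.

```python
def encode_four_bytes(cp):
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("outside the four-byte range")
    return bytes([
        0xF0 | (cp >> 18),            # header: 11110xxx
        0x80 | ((cp >> 12) & 0x3F),   # trailer: 10xxxxxx
        0x80 | ((cp >> 6) & 0x3F),    # trailer: 10xxxxxx
        0x80 | (cp & 0x3F),           # trailer: 10xxxxxx
    ])


encoded = encode_four_bytes(0x20123)
print("".join("%%%02X" % b for b in encoded))  # %F0%A0%84%A3
```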


The Unicode code space had originally been foreseen as ranging from
U+0000 to U+7FFFFFFF, but the current standards end it at U+10FFFF;
also, codepoints whose hex representation is xxFFFE or xxFFFF (where xx
is anything) have been expressly designated as noncharacters, never to
be used for actual text.
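The two rules in the paragraph above can be sketched as a Python check. It covers only those two rules; Unicode reserves a few other ranges that are not discussed here. The function name is_usable_codepoint is an assumption of this example.

```python
def is_usable_codepoint(cp):
    """Apply the two rules above: inside the code space, and not xxFFFE/xxFFFF."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # The low 16 bits are FFFE or FFFF exactly when masking with FFFE gives FFFE.
    if (cp & 0xFFFE) == 0xFFFE:
        return False
    return True


print(is_usable_codepoint(0x99AC))   # True
print(is_usable_codepoint(0x1FFFF))  # False
```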


Best regards,
Tony.
--
Putt's Law:
        Technology is dominated by two types of people:
                Those who understand what they do not manage.
                Those who manage what they do not understand.



Re: How to translate utf-8 hex to unicode hex?

by Sean-130


Hi Tony,

I thought I had enough knowledge on UNICODE and UTF8, but it is
nothing after reading your message.

Now, I get what I want:

let input = "\xE9\xA6\xAC"
let output=iconv(input, "utf-8", "utf8")

Bingo!  The output is real ==> '馬'

Thanks again.

Sean



Re: How to translate utf-8 hex to unicode hex?

by Tony Mechelynck-2


On 10/11/09 22:24, Sean wrote:

>
> Hi Tony,
>
> I thought I had enough knowledge on UNICODE and UTF8, but it is
> nothing after reading your message.
>
> Now, I get what I want:
>
> let input = "\xE9\xA6\xAC"
> let output=iconv(input, "utf-8", "utf8")
>
> Bingo!  The output is real ==>  '馬'
>
> Thanks again.
>
> Sean

The particular parameters you give to iconv() make it an identity
conversion: converting from UTF-8 to UTF-8 changes nothing.

When I do

        :echo "\xE9\xA6\xAC"

(with 'encoding' set to "utf-8") the result is

        馬


Best regards,
Tony.
--
"I am, in point of fact, a particularly haughty and exclusive person,
of pre-Adamite ancestral descent.  You will understand this when I tell
you that I can trace my ancestry back to a protoplasmal primordial
atomic globule.  Consequently, my family pride is something
inconceivable.  I can't help it.  I was born sneering."
                -- Pooh-Bah, "The Mikado", Gilbert & Sullivan



Re: How to translate utf-8 hex to unicode hex?

by Sean-130


Hi Tony,

The last mile to go:

This always worked:  (output is 馬)
-------------------------------------------
let input = "\xE9\xA6\xAC"
let output = iconv(input, "UTF-8", "UTF-8")
-------------------------------------------

However, this failed:  (output is '\xE9\xA6\xAC')
-------------------------------------------
let input = "%E9%A6%AC"
let input = substitute(input, '%', '\\x', 'g')
let output = iconv(input, "UTF-8", "UTF-8")
-------------------------------------------

Now, the key becomes how to translate string (single quoted) to string
(double quoted). I guess that "\x" is meaningful only within double
quote.

Thanks

Sean


Re: How to translate utf-8 hex to unicode hex?

by Tony Mechelynck-2


On 11/11/09 02:06, Sean wrote:

>
> Hi Tony,
>
> The last mile to go:
>
> This always worked:  (output is 馬)
> -------------------------------------------
> let input = "\xE9\xA6\xAC"
> let output = iconv(input, "UTF-8", "UTF-8")
> -------------------------------------------
>
> However, this failed:  (output is '\xE9\xA6\xAC')
> -------------------------------------------
> let input = "%E9%A6%AC"
> let input = substitute(input, '%', '\\x', 'g')
> let output = iconv(input, "UTF-8", "UTF-8")
> -------------------------------------------
>
> Now, the key becomes how to translate string (single quoted) to string
> (double quoted). I guess that "\x" is meaningful only within double
> quote.
>
> Thanks
>
> Sean
> >
>

What about (untested):

function HttpToString(str)
        " Replace each %XX by evaluating the double-quoted literal "\xXX".
        return substitute(a:str, '%\(\x\x\)',
                \ '\=eval(''"\x'' . submatch(1) . ''"'')', 'g')
endfunction

If I haven't goofed,

        :echo HttpToString('%E9%A6%AC')

ought to return

        馬


Note the use of pairs of single quotes to represent actual single quotes
in a single-quoted string.

The use of a continuation line assumes 'nocompatible'.

See also
        :help sub-replace-expression
        :help eval()
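Why the plain substitute() attempt failed: replacing '%' with '\x' only produces the four literal characters backslash, x and two hex digits. The escape is interpreted only when a double-quoted string literal is parsed, which is what the eval() call above forces for each match. For comparison, a Python sketch of the same per-match replacement, converting the hex digits directly instead of evaluating a literal (the function name http_to_string is invented for this example):

```python
import re


def http_to_string(escaped):
    """Replace every %XX escape with the byte it names, then read the result as UTF-8."""
    raw = re.sub(
        rb"%([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),  # b"E9" -> b"\xe9"
        escaped.encode("ascii"),                 # assumes the input itself is pure ASCII
    )
    return raw.decode("utf-8")


print(http_to_string("%E9%A6%AC") == "\u99ac")  # True
```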


Best regards,
Tony.
--
Lizzie Borden took an axe,
And plunged it deep into the VAX;
Don't you envy people who
Do all the things _YOU_ want to do?



Re: How to translate utf-8 hex to unicode hex?

by Sean-130

Hi Tony,

You are real genius!  It simply worked without modification!

I added it as part of VimIM plugin online:
http://maxiangjiang.googlepages.com/vimim.vim.html
    " ================================ }}}
    " ====  VimIM SoGou Cloud IM  ==== {{{
    " ====================================

Now, let me show you the power of Vim:

input in PinYin => woyouyigeqiguaidemeilidemeng
output in Chinese => 我有一个奇怪的美丽的梦
It is meaningless :) => "I have a strange but beautiful dream."

This is my gift to you:
http://maxiangjiang.googlepages.com/dream.png

Thanks

Sean



Re: How to translate utf-8 hex to unicode hex?

by winterTTr-2



On Wed, Nov 11, 2009 at 12:15 PM, Sean <maxiangjiang@...> wrote:
> Hi Tony,
>
> You are real genius!  It simply worked without modification!
>
> I added it as part of VimIM plugin online:
> http://maxiangjiang.googlepages.com/vimim.vim.html
>    " ================================ }}}
>    " ====  VimIM SoGou Cloud IM  ==== {{{
>    " ====================================
>
> Now, let me show you the power of Vim:
>
> input in PinYin => woyouyigeqiguaidemeilidemeng
> output in Chinese => 我有一个奇怪的美丽的梦
> It is meaningless :) => "I have a strange but beautiful dream."
>
> This is my gift to you:
> http://maxiangjiang.googlepages.com/dream.png
>
> Thanks
>
> Sean

Are you the author of VimIM? Nice to see you here.

And you want to take the Sogou cloud input result and translate the
content from the URL into Vim text?
Really amazing thought ;-)






Re: How to translate utf-8 hex to unicode hex?

by Tony Mechelynck-2


On 11/11/09 05:15, Sean wrote:

> Hi Tony,
>
> You are real genius!  It simply worked without modification!
>
> I added it as part of VimIM plugin online:
> http://maxiangjiang.googlepages.com/vimim.vim.html
>      " ================================ }}}
>      " ====  VimIM SoGou Cloud IM  ==== {{{
>      " ====================================
>
> Now, let me show you the power of Vim:
>
> input in PinYin =>  woyouyigeqiguaidemeilidemeng
> output in Chinese =>  我有一个奇怪的美丽的梦
> It is meaningless :) =>  "I have a strange but beautiful dream."

Meaningless? It evokes powerful meanings to me; I link it with Martin
Luther King's famous "I have a dream" speech, and, maybe less known,
the Hymn ("La Espero", i.e. "Hope") of the Esperantist movement, which
ends in words meaning "Our diligent colleagues won't tire in a labour of
peace, till the beautiful dream of mankind shall come true for eternal
blessing".

>
> This is my gift to you:
> http://maxiangjiang.googlepages.com/dream.png
>
> Thanks

My thanks to you, for the beautiful sentence in hanzi.

>
> Sean

Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
187. You promise yourself that you'll only stay online for another
      15 minutes...at least once every hour.



Re: How to translate utf-8 hex to unicode hex?

by Lily-7


This is what I am using (needs +python):
"   %xx -> corresponding character (shown as a message)[[[2
function Lilydjwg_hexchar()
  let chars = Lilydjwg_get_pattern_at_cursor('\(%[[:xdigit:]]\{2}\)\+')
  if chars == ''
    " Report that no %xx string was found at the cursor.
    echohl WarningMsg
    echo 'No %-encoded hex string found at the cursor!'
    echohl None
    return
  endif
  " Turn %E9 into \xE9 so that Python parses it as a byte escape.
  let str = substitute(chars, '%', '\\x', 'g')
  " Python 2 print statement, hence the +python requirement.
  exe 'py print ''' . str . ''''
endfunction
"  Get the match at the cursor[[[2
" This is a function I borrowed from another plugin.
function Lilydjwg_get_pattern_at_cursor(pat)
  let col = col('.') - 1
  let line = getline('.')
  let ebeg = -1
  let cont = match(line, a:pat, 0)
  while (ebeg >= 0 || (0 <= cont) && (cont <= col))
    let contn = matchend(line, a:pat, cont)
    if (cont <= col) && (col < contn)
      let ebeg = match(line, a:pat, cont)
      let elen = contn - ebeg
      break
    else
      let cont = match(line, a:pat, contn)
    endif
  endwhile
  if ebeg >= 0
    return strpart(line, ebeg, elen)
  else
    return ""
  endif
endfunction

nmap <silent> t% :call Lilydjwg_hexchar()<CR>

After adding this to my .vimrc, I can move the cursor onto a %xx string
and press t% to see the decoded string.
Or you can just use a program called ascii2uni, e.g.:

echo %E9%A6%AC | ascii2uni -q -a J

and the output is 馬. You can combine this with the filter (:h filter)
feature of Vim to get lines of characters converted directly from
within Vim.
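The cursor lookup done by Lilydjwg_get_pattern_at_cursor() above boils down to "find the match that covers a given column". A rough Python sketch of that idea, assuming a 0-based column; the function name pattern_at_column is hypothetical:

```python
import re


def pattern_at_column(line, pattern, col):
    """Return the match of pattern in line that covers 0-based column col, or ''."""
    for m in re.finditer(pattern, line):
        if m.start() <= col < m.end():
            return m.group(0)
    return ""


line = "see %E9%A6%AC here"
print(pattern_at_column(line, r"(?:%[0-9A-Fa-f]{2})+", 6))  # %E9%A6%AC
```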
