"Save As" vs. storeAsURL() text export filter difference

by David Lu-3


Hi All,

I have some Java code that converts Microsoft Word documents
to plain text.  It works fine.  However, I do have a problem
with some special characters such as the left and right double
quotes, the trademark and copyright symbols, etc., that are
not working as I expect.

Basically, when I open a Word document in the GUI and use "Save As"
with the "Text (encoded)" file type and the "UTF-8" character set,
I get all the special characters in the output file.

However, when I use Java and call storeAsURL() with the same
input file, using "Text (encoded)" for FilterName and "UTF-8"
for FilterOptions, some of the characters, namely the trademark
and copyright symbols, and a few others, are saved as question
marks.

I've also tried using "Windows-1252/WinLatin 1" as the encoding
with the same results.

The "Save As" from GUI seems to work "better" than calling
storeAsURL() in terms of preserving more characters.  But the
documentation for storeAsURL() seems to indicate it's the same
as "Save As".  So do I need to specify additional properties
for storeAsURL()?

The API documentation for the call is pretty good:

http://api.openoffice.org/docs/common/ref/com/sun/star/frame/XStorable.html#storeAsURL

But is there documentation specific to each of the filters?

For what it's worth, in Java I'm connecting to OO using UNO.
This is on Linux and I run it with -headless.  For GUI tests
I run the same installation of OO on the same machine with the
same user ID.  I'm using OO 3.1.1.

Thank you for any help you can provide.

                     - David -


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@...
For additional commands, e-mail: dev-help@...


Re: "Save As" vs. storeAsURL() text export filter difference

by T. J. Frazier-2

Hi, David,

Try "UTF8" instead of "UTF-8".

Working in Basic, I hit a similar problem. After a successful
load/store, the file itself (I looked with the IDE) had these
interesting strings, which I copied:

aArgs(2).Name = "FilterName"
aArgs(2).Value = "Text (encoded)"
aArgs(3).Name = "FilterOptions"
aArgs(3).Value = "UTF8,CRLF,Times New Roman,en-US,"

Otherwise, you might have more luck posting your question on the
dev@... list.
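
For Java callers, the FilterOptions string shown above can be assembled
from its parts. The following is only a sketch: the class and method names
are hypothetical, and the meaning of each comma-separated field (encoding,
line ending, font, language) is an assumption read off the recorded string,
not something confirmed in this thread.

```java
class TextFilterOptions {
    // Joins the four tokens with commas and keeps the trailing comma
    // seen in the recorded string. Field meanings are assumed.
    static String build(String encoding, String lineEnding,
                        String font, String language) {
        return String.join(",", encoding, lineEnding, font, language) + ",";
    }

    public static void main(String[] args) {
        // Reproduces the string copied from the Basic code above.
        String options = build("UTF8", "CRLF", "Times New Roman", "en-US");
        System.out.println(options);
        // The result would be used as the Value of the "FilterOptions"
        // property passed to storeAsURL(), next to "FilterName".
    }
}
```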

HTH
--
/tj/

T. J. Frazier
Melbourne, FL

(TJFrazier on OO.o)




Re: "Save As" vs. storeAsURL() text export filter difference

by David Lu-3


Hi T. J.,

I already tried "UTF8" (based on some Google searches) and it
actually performed worse, as it did not translate any non-ASCII
characters at all.  Characters such as the left and right double
quotes were changed to ?.

I stumbled upon "UTF-8" after noticing in the GUI that the
character code encoding says "UTF-8" instead of "UTF8".  It
worked better, preserving the left and right double quotes,
but it did not preserve the trademark and copyright symbols.
A Save As from the GUI, by contrast, preserves all of those
characters.
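
To pin down exactly which characters are lost, the two outputs can be
compared code point by code point. This is a rough Java sketch with
hypothetical names. It assumes both texts have already been read as UTF-8
strings and that each lost character was replaced by exactly one "?", so
the two texts stay aligned.

```java
import java.util.ArrayList;
import java.util.List;

class ExportDiff {
    // Lists positions where the GUI text has a non-ASCII code point
    // and the API text has a question mark at the same index.
    static List<String> lostCharacters(String guiText, String apiText) {
        int[] gui = guiText.codePoints().toArray();
        int[] api = apiText.codePoints().toArray();
        List<String> lost = new ArrayList<>();
        int n = Math.min(gui.length, api.length);
        for (int i = 0; i < n; i++) {
            if (gui[i] > 0x7F && api[i] == '?') {
                lost.add(String.format("index %d: U+%04X", i, gui[i]));
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for the two output files.
        String fromGui = "\u201CAcme\u2122\u201D \u00A9 2009";
        String fromApi = "\u201CAcme?\u201D ? 2009";
        for (String entry : lostCharacters(fromGui, fromApi)) {
            System.out.println(entry);
        }
    }
}
```

With the sample strings, the quotes match and only the trademark and
copyright positions are reported, which mirrors the "UTF-8" case above.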

Based on your suggestion, I also tried various fonts and they
don't seem to make any difference.

I'll give the dev@... list a try.  Thank you!

                   - David -



Re: "Save As" vs. storeAsURL() text export filter difference

by Stephan Bergmann

On 11/05/09 06:31, David Lu wrote:

> However, when I use Java and call storeAsURL() with the same
> input file, using "Text (encoded)" for FilterName and "UTF-8"
> for FilterOptions, some of the characters, namely the trademark
> and copyright symbols, and a few others, are saved as question
> marks.
>
> I've also tried using "Windows-1252/WinLatin 1" as the encoding
> with the same results.

This sounds strange, and I suggest you file a bug for it.

Windows-1252 has additional characters compared to ISO 8859-1, in the
range 0x80--0x9F, and at first it sounded like that fact might somehow
be related to the problem.  However, trademark (U+2122) is in that area
(0x99) while copyright (U+00A9) is not (0xA9), yet you say that both
have the problem...
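
The byte values above can be checked with a few lines of plain Java. This
is an illustrative sketch only (the class name is made up), and it says
nothing about which encoder OOo itself uses. It relies on
String.getBytes(Charset), which substitutes the charset's replacement byte
for unmappable characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetProbe {
    // Encodes the string in the given charset and renders the bytes as hex.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String trademark = "\u2122";
        String copyright = "\u00A9";

        System.out.println("TM windows-1252: " + hex(trademark, cp1252));                      // 99
        System.out.println("TM ISO-8859-1:   " + hex(trademark, StandardCharsets.ISO_8859_1)); // 3F, i.e. '?'
        System.out.println("TM UTF-8:        " + hex(trademark, StandardCharsets.UTF_8));      // E2 84 A2
        System.out.println("(C) windows-1252: " + hex(copyright, cp1252));                     // A9
        System.out.println("(C) ISO-8859-1:   " + hex(copyright, StandardCharsets.ISO_8859_1)); // A9
        System.out.println("(C) UTF-8:        " + hex(copyright, StandardCharsets.UTF_8));     // C2 A9
    }
}
```

Only the trademark sign becomes a question mark under ISO 8859-1, so a
plain Latin-1 fallback would not explain the copyright sign being lost.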

-Stephan
