Re: Unimarc, marc21, Unicode, and MARC::File::XML

On 3/20/06, Paul POULAIN <paul.poulain@...> wrote:
> Mike Rylander wrote:
> > I tested with the record you sent Ed and me, and everything seems to
> > work for me ...
> > As you can see, I tested several variants of the UNIMARC flag, and
> > even tested not sending the encoding to new_from_xml() ... it all
> > seems to work for me, and I'm not sure what problems you're seeing.
> > Perhaps you just needed to set your binmode for the XML source?
>
> strange, strange...
>
> What my script does:
> * retrieve the MARC::Record from zebra
> * read some data from mysql
> * build a page with HTML::Template
> * send the page to the browser

Are you getting XML or binary MARC from zebra?

> I added 3 lines to save the record in a file after reading from zebra.
> Adding binmode(F,':utf8');
> before saving my record in F gives me correct UTF-8.
> Without binmode, it's not OK.
>
> But when I put the MARC::Record in a page built with HTML::Template,
> it's wrong.
> The HTML is utf-8 (html page encoding).
> It also contains some strings from mySQL, and all strings from mySQL
> appear as correct utf8 while all strings coming from the MARC::Record
> coming from zebra do not!
>
> I can add "binmode()" to the template output, but then everything goes
> wrong with strings from mySQL.

Are you using decode_utf8($mysql_string) to let Perl know that the
database is UTF8 encoded? IIRC, MySQL doesn't know how to tell Perl
about that, and the DBD::mysql maintainer hasn't added that
functionality to the module yet.

> Any suggestion welcomed!
> --
> Paul POULAIN and Henri Damien LAURENT
> Independent consultants
> in free software and library science (http://www.koha-fr.org)

--
Mike Rylander
mrylander@...
GPLS -- PINES Development
Database Developer
http://open-ils.org

_______________________________________________
Koha-zebra mailing list
Koha-zebra@...
http://lists.nongnu.org/mailman/listinfo/koha-zebra

Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Hello Mike,

I'll answer the second question, since I worked with Paul on
Perl/MySQL and UTF-8...

On Mon, 20 Mar 2006 09:59:32 -0500
"Mike Rylander" <mrylander@...> wrote:

> Are you using decode_utf8($mysql_string) to let Perl know that the
> database is UTF8 encoded? IIRC, MySQL doesn't know how to tell Perl
> about that, and the DBD::mysql maintainer hasn't added that
> functionality to the module yet.

We don't use decode_utf8. Just after creating the database handle, we
force the connection to use UTF-8 with the SQL query "set names 'UTF8'".
As we know our data is stored as UTF-8 and we want UTF-8 out, all works
fine.

Bye

--
Pierrick LE GALL
INEO media system

Re: Unimarc, marc21, Unicode, and MARC::File::XML

Mike Rylander wrote:
> On 3/20/06, Paul POULAIN <paul.poulain@...> wrote:
>> What my script does:
>> * retrieve the MARC::Record from zebra
>> * read some data from mysql
>> * build a page with HTML::Template
>> * send the page to the browser
>
> Are you getting XML or binary MARC from zebra?

XML. The test.xml I sent you on Friday was the record returned by
$raw = $rs->record(0)->raw();

> Are you using decode_utf8($mysql_string) to let Perl know that the
> database is UTF8 encoded? IIRC, MySQL doesn't know how to tell Perl
> about that, and the DBD::mysql maintainer hasn't added that
> functionality to the module yet.

I thought we had to decode_utf8($mysql_string), and began to investigate
a lot. But after many hours of digging and running into problems, I now
have mySQL working in utf8 for all of Koha, without any binmode or
decode_utf8 ...
And it seems joshua & Tümer (Turkey) have reached the same conclusion:
no more problems with mySQL & Perl.
We all use a recent version of mySQL, even though the DBD::mysql
maintainer (from mysql.com; joshua sent him a mail but got no answer)
has done nothing on the CPAN package.

--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)

Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

On 3/20/06, Pierrick LE GALL <pierrick@...> wrote:
> We don't use decode_utf8. Just after the database handler creation, we
> force communication to be UTF-8 with "set names 'UTF8'" SQL query. As
> we know our data are UTF-8 stored and we want UTF-8, all works fine.

Except that Perl doesn't know that the data is already UTF8 ... which
is the problem. Perl /does/ know that the MARC data is UTF8, and it
has to convert one string or the other on output. If you explicitly
use binmode() to set the PerlIO state to utf8, then the MARC::Record
strings, which are known good UTF8, are not transformed, but the MySQL
data, of which Perl has no encoding notions, gets "transformed", and
thus broken.

The only consistent and correct way to deal with UTF8 data in Perl is
to let PerlIO handle it by marking all sources as either providing
UTF8 data or not. You can do that with binmode(), open() and several
other ways, including the open pragma in modern Perls
( http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm ). Because
DBD::mysql doesn't give you a way to mark its socket as UTF8, you need
to be a little underhanded and tell Perl as soon as possible using
decode(), or by making utf8 the default mode for all PerlIO channels.

There really isn't any way around this if you want to claim real UTF8
support and be able to use components that really do support UTF8
natively, like MARC::File::XML and MARC::Record.

It's unfortunate that the DBD::mysql people won't fix their module,
but there really is a right way to do this, even without their help.
Is there a performance penalty with decode()? Yep. Would that go
away with a fix to the DBD::mysql module? Mostly, so you really need
to bug them.

--
Mike Rylander
mrylander@...
GPLS -- PINES Development
Database Developer
http://open-ils.org

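The failure described in the message above can be reproduced outside Perl. The following is a minimal sketch in Python rather than the thread's Perl, so that it runs standalone; the function names are invented for illustration, and decoding the undecoded bytes as Latin-1 is an assumption that models what happens when byte strings with no encoding information are mixed with properly decoded text.

```python
# Sketch only: models the mixed decoded/undecoded string problem in Python.
# render_page and as_undecoded_text are hypothetical names, not Koha code.

marc_title = "T\u00fcmer"                    # decoded text, as a UTF-8-aware source returns it
mysql_raw = "T\u00fcmer".encode("utf-8")     # raw UTF-8 bytes the runtime knows nothing about


def as_undecoded_text(raw: bytes) -> str:
    """Treat every byte as one character (Latin-1), which models
    byte strings that were never marked as UTF-8."""
    return raw.decode("latin-1")


def render_page(*parts: str) -> bytes:
    """Join text fragments and encode once on output
    (the equivalent of setting a UTF-8 output layer)."""
    return " | ".join(parts).encode("utf-8")


# The database string is encoded a second time and comes out as mojibake.
broken = render_page(marc_title, as_undecoded_text(mysql_raw))

# Decoding at the boundary, as soon as the data arrives, fixes it.
fixed = render_page(marc_title, mysql_raw.decode("utf-8"))
```

Decoding once at the point where data enters the program, and encoding once on output, is the design the message argues for; every string in between is then plain text with no encoding guesswork.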
Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

On Mon, 20 Mar 2006 10:54:08 -0500
"Mike Rylander" <mrylander@...> wrote:

> Except that Perl doesn't know that the data is already UTF8 ... which
> is the problem. [...]

You're completely right, I understand the difference. We made UTF8 work
from MySQL but we didn't try to process the data coming from MySQL. Just
"select ..." and "print". So it works, but we are limited when it comes
to string processing.

> It's unfortunate that the DBD::mysql people won't fix their module,
> but there really is a right way to do this, even without their help.
> Is there a performance penalty with decode()? Yep. Would that go
> away with a fix to the DBD::mysql module? Mostly, so you really need
> to bug them.

The problem with decode() is the impact. Adding this step for each
string retrieved from MySQL means touching hundreds of lines of code.
Not so hard to modify, but the solution is not /elegant/.

Being able to flag data coming from MySQL as UTF8 to Perl would be the
/elegant/ solution, as you said. Maybe we should try harder to get this
feature from the DBD::mysql developers.

Thanks for the clarification.

Bye

--
Pierrick LE GALL
INEO media system

Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Tümer Garip wrote:
> Hi,
>
> This problem if I understood it correctly has got nothing to do with
> mysql or perl it has to do with ZEBRA unless it is to do with UNIMARC
> which I am not familiar with.
> As you know (Paul) I have an utf-8 version working.
>
> I had the same problem from records coming from zebra and found out that
> it is not doing what it is supposed to do unless you explicitly set it
> to utf-8. You have to explicitly put "encoding utf-8" in all your zebra
> config files especially the zebra.cfg and your .abs . Otherwise unlike
> the documentation saying that zebra character code is automatically set
> by the xml encoding it DOES NOT.

I can't reproduce this (bug). Care to share a config+example that
illustrates this (Inserts an XML record from Perl in UTF-8) ?

> Perl send xml to zebra with encoding utf-8 on the header and utf-8 data
> in it. Zebra saves all the data in utf-8 but returns an xml saying
> encoding iso8859-1 at the header and utf-8 characters in data. No module
> can correct this as it is stupid.

Just need to know when the stupidity starts:-)

/ Adam

> I corrected the problem by adding encoding:UTF-8 in zebra.cfg,
> record.abs, sort-string.chr
>
> Hope it solves yours,
>
> Tumer

RE: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Hi Adam,

You seem a bit offended. That was not my intention; frustration
sometimes makes me use harsh words, and translating them into English
may make them too harsh.

I do not need to send you any config+examples because I tested this with
your default config files. I am attaching an XML record in UTF-8.

Briefly: I had default configuration files and built zebra with XML
records. When I noticed the problem I used yaz-client to see what was
going on. In my log I could see that the data going into zebra had
encoding utf-8, while yaz-client was returning XML with headers saying
iso-8859-1, even though I could actually see the utf-8 characters, as
they show as hex in yaz-client.

I have retried this procedure just now and it behaves the same. Just
adding encoding:UTF-8 to zebra.cfg and restarting the server gives the
correct header and correct data. Please note that the server has to be
restarted but the zebra database does not have to be rebuilt.

Thanks
Tumer

Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Tümer Garip wrote:
> Hi Adam,
> You seem a bit offended that was not my intention, just frustation
> sometimes makes me use harsh words and translanting them to english
> may be too harsh.
>
> I do not need to send you any config+examples cause I tested this with
> your default config files. I am attaching an xml record in utf-8

If you're to receive help from me, you need to tell me which zebra.cfg
you're using. And show me the record + the way you indexed it (zebraidx
update ?)

> Briefly I had default configuration files and build zebra with xml
> records. When I noticed the problem
> I used yaz-client to see what was going on. On my log I could see data
> going in the zebra was with encoding utf-8
> While yaz client was returning xml with headers saying iso-8859-1 while
> I could actually see the utf-8 characters as they show as hex in yaz
> client.

I also need to know what you see, and what you'd expect to see.

/ Adam

> I have retried this procedures just now and it seems the same. Just
> adding encoding:UTF-8 to zebra.cfg and restarting the server you get
> correct heading and correct data. Please note that server has to be
> restarted but zebradb does not have to be rebuilt.
>
> Thanks
> Tumer

RE: Re: Unimarc, marc21, Unicode, and MARC::File::XML

I thought I explained it but here it is again:

I do not think which method you use is relevant here, but just try
this:

In the release version of ZEBRA, in the test/usmarc folder, change the
zebra.cfg to read
    recordType: grs.xml
In the tabs folder change marc21.abs to read record.abs
Use zebraidx to create the database with the single XML record I sent to
you.
Start zebrasrv at the required port.
Use yaz-client:
    f @attr 1=1016 book
    format xml
    show

I see the XML record header saying
    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

Further down you'll see utf-8 characters with the correct hex, such as
    \XC5\X9F

Now stop the server.
Add the line encoding:utf-8 to your zebra.cfg
Restart the server.
Do the same search and you get
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Conclusion:
The database does keep the data in UTF-8 as expected.
The server does not know about the database character set or the XML
record that was parsed in, and unless it is specifically set to UTF-8 in
zebra.cfg, the server goes ahead and changes the header, or in fact
produces its own header saying iso-8859-1 while giving out utf-8
characters.

I did not ask for any help on this, thanks. Just clearing up some issues
with Paul's problem.

Tumer

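The mismatch reported above (a header claiming ISO-8859-1 over a body carrying UTF-8 bytes) can be detected on the client side before a record is handed to a parser. This is a rough sketch in Python, not part of Zebra, YAZ or Koha; the sample record is made up around the \xC5\x9F bytes quoted in the message, and the function names are hypothetical. Note that the check is only a heuristic: any byte sequence is technically valid Latin-1, so the code can say "this looks like UTF-8", not prove it.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute in an XML declaration.
DECL_RE = re.compile(rb'^<\?xml[^>]*\bencoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(xml_bytes: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, if there is one."""
    match = DECL_RE.match(xml_bytes)
    return match.group(1).decode("ascii") if match else None


def is_valid_utf8(data: bytes) -> bool:
    """True when the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical record shaped like the one described in the message.
record = (b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
          b'<record><title>\xc5\x9f</title></record>')

header = declared_encoding(record)
body_is_utf8 = is_valid_utf8(record)
has_non_ascii = any(byte > 0x7F for byte in record)

# Suspicious: non-ASCII bytes that form valid UTF-8 under a non-UTF-8 header.
mismatch = (header is not None and header.lower() != "utf-8"
            and body_is_utf8 and has_non_ascii)
```

A client that finds `mismatch` true could log the record and decode it as UTF-8 instead of trusting the header, which is a workaround rather than a fix for the server-side behaviour.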
Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Tümer Garip wrote:
> [...]
> Conclusion:
> The database does keep the data in UTF-8 as expected.
> Server does not know about database character set or the xml record taht
> was parsed in and unless specificly set to UTF-8 in Zebra.cfg srever
> goes ahead and changes the header or in fact it produces itself a header
> saying iso-8859-1 while giving out utf-8 characters.

Correct. I was unable to reproduce this fault because my XML test
record could be represented in ISO-8859-1. Your sample can NOT.

Conversion from UTF-8 to ISO-8859-1 fails in Zebra, and in this case
Zebra keeps the data as is, but unfortunately alters the header anyway.
That's the mistake. Better behavior would probably be for Zebra to not
return the data at all, but return a surrogate diagnostic for the
record.

As you say, Zebra can be forced to use utf-8 in the retrieval phase in
the configuration. You can also specify utf-8 via the Z39.50 protocol
(charset utf-8 in yaz-client), and you should be able to achieve the
same with ZOOM-Perl.

For Zebra 1.3 we kept Latin-1 as the default character set because a
number of installations use it. For Zebra 1.4 the default is UTF-8, so
there should not be a problem with that, in this case.

/ Adam

> I did not ask any help on this thanks. Just clearing some issues with
> Paul's problem.
> Tumer

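To make the point about conversion concrete: the bytes \xC5\x9F quoted earlier in the thread are the UTF-8 form of U+015F, a character that ISO-8859-1 has no slot for, so no correct conversion exists. The sketch below is Python, not Zebra's code; convert_record is a hypothetical helper that illustrates the "return a diagnostic instead of mislabelled data" behaviour suggested in the message.

```python
from typing import Optional, Tuple


def convert_record(text: str, target: str) -> Tuple[Optional[bytes], Optional[str]]:
    """Encode text into the target charset.

    Returns (data, None) on success, or (None, diagnostic) when some
    character cannot be represented, instead of passing the original
    bytes through under a header that no longer matches them.
    """
    try:
        return text.encode(target), None
    except UnicodeEncodeError as err:
        bad = text[err.start:err.end]
        return None, f"cannot represent {bad!r} (U+{ord(bad[0]):04X}) in {target}"


# A record that fits in Latin-1 converts without trouble ...
latin_ok, latin_diag = convert_record("caf\u00e9", "iso-8859-1")

# ... while one containing U+015F cannot be converted at all.
data, diagnostic = convert_record("\u015f", "iso-8859-1")
```

This also explains why the fault did not reproduce with a test record limited to Latin-1 characters: the conversion only fails when the record contains a character outside that repertoire.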
Re: Unicode, XML, Zebra, Windows

Hi Adam,

Well, I am pleased you managed to reproduce this bug. Here are a few
things to consider going towards 1.4.

The Zebra documentation says:
"Generally, the files are simple ASCII files, which can be maintained
using any text editor. "

And it also says:
"encoding encodingname
This directive specifies character encoding for external records. For
records such as XML that specifies encoding within the file via a header
this directive is ignored. If neither this directive is given, nor an
encoding is set within external records, ISO-8859-1 encoding is assumed. "

In fact this is not the case. As you have seen, the XML file I sent you
had a header saying UTF-8 and also had utf-8 characters in it. In the
Windows environment that was a genuine utf-8 document. Zebra not only
failed to detect that but also ignored the header saying utf-8.

We have a similar problem with .chr files as well. In the Windows
environment Notepad is the simplest text editor. Write a sort.chr file
with Notepad with some utf8 characters, put encoding utf-8 in it, and
zebraidx gives a syntax error. To overcome that I had to produce a
sort.chr file on a colleague's Unix machine and use that. When I look
at that file with Notepad it is unreadable, so it is very difficult to
maintain.

I think what's happening with this utf-8 thing is that Windows and Unix
use different ways of marking whether a file is unicode. Since you also
provide a binary for Windows, I think zebra should stop checking
characters according to Unix and rely on things like XML headers or the
encoding directives of the configuration files (which, by the way, it
says it does, but as you have seen it does not).

I should also say that I am testing 1.4 and it is much more efficient
in terms of cpu and memory usage, but this problem remains. Well, we
can't have them all, or can we?

Regards,
Tumer

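One possible explanation for the Notepad behaviour described above, offered as an assumption since the thread does not confirm it: Notepad saves "UTF-8" files with a leading byte order mark (bytes EF BB BF), and its "Unicode" option writes UTF-16, whereas Unix editors normally write UTF-8 with no mark. A parser that expects plain text from the first byte could then report a syntax error. The Python sketch below shows how such a file can be recognised and cleaned; the file content is a placeholder, not real .chr syntax.

```python
import codecs


def sniff_bom(data: bytes) -> str:
    """Guess the encoding from a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "unknown"


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a UTF-8 byte order mark so the file starts with real content."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Placeholder content standing in for a config file saved from Notepad.
notepad_file = codecs.BOM_UTF8 + b"encoding utf-8\n"
unix_file = b"encoding utf-8\n"

notepad_kind = sniff_bom(notepad_file)      # "utf-8-sig"
unix_kind = sniff_bom(unix_file)            # "unknown": no mark to go on
cleaned = strip_utf8_bom(notepad_file)      # identical to unix_file
```

If the mark does turn out to be the cause, stripping it once after editing on Windows would avoid having to create the file on a Unix machine.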