RE: [Koha-devel] Building zebradb


by Tümer Garip

Hi,

To clear up some issues with Sebastian:
>>Again, I'd be keen to know which Zebra version you're running?
1- I am using Zebra version 1.3.34
>>Why not ask for the records in ISO2709 from Zebra if that's what you
>>want to work with (when records are loaded in MARCXML, it can spit out
>>either 2709 or XML)? Or is it just that you want to have them in
>>MARC::Record? At any rate, ISO2709 is a more compact exchange format, so
>>it seems to make sense to use it.
2- Correct in theory, but as I keep saying, when you feed Zebra with
MARCXML you CANNOT get MARC records back. All you get back is the leader
and lots of blank space. I tried it with YAZ-client as well.
I have reported this as a bug on the indexdata list but never got an
answer.
>>I get update times around .5-.7 seconds,
3- .4 or .5 seconds is what we get as well. Oddly enough, if you write
the file to disk and use zebraidx, you get .04-.09 seconds or even
faster. What magic zebraidx uses that ZOOM does not know about, I don't
know.
>>Now *that* is nasty. Do you have *any* way of consistently recreating
>>the problem? Even if you don't, I'm sure Adam and the Zebra crew would
>>like to see some stack traces of those crashes!!
4- The problem is intermittent. I'll try to provide details when it
recurs. The message is "Fatal error: inconsistent register", and from
then on you have to throw the Zebra database away and rebuild everything.
You cannot fall back to the last working state, which is what the shadow
system was supposed to allow. If something goes wrong during a commit,
Zebra should not corrupt the whole database; it should refuse the commit
and let you discard the last update. But you cannot do that. So unless I
have missed something, shadow files are useless.

Here is something more to think about regarding Zebra:
1- The MARCXML we feed into Zebra is a
<collection><record></record></collection> package. When you get it back
it is still like that, as it should be. But if you feed Zebra ISO2709
MARC records and ask for XML records back, you get a <record></record>
package with no <collection> wrapper around it. Although this is not a
problem currently, I still do not like Zebra doing that. Its MARC-to-XML
conversion should follow standards rather than create its own.

Thanks for your quick response
Regards,

Tumer

-----Original Message-----
From: Joshua Ferraro []
Sent: Sunday, March 12, 2006 7:06 AM
To: koha-devel@...; tgarip@...
Subject: [Koha-devel] Building zebradb


Tumer,

Sebastian's been good enough to respond to your post (I forwarded this
to the koha-zebra list). If you get a chance, could you join koha-zebra
(if you're not already on it) and follow up -- I've a feeling it could
prove to be a very productive thread.

Cheers,

Joshua

----- Forwarded message from Sebastian Hammer <quinn@...> -----


Hi Joshua,

Thanks for this feedback, it's very interesting. Clearly some of the
issues you describe (i.e. a lack of stability around updates) indicate
software problems, but there are also some interesting ideas for possible
refinements or new developments which I think would be really useful to
get into the general development plans for the software... there are
more folks at ID involved in Zebra development that I'd like to get into
these thoughts... I dunno if a wiki or just a larger zebra-dev list is
in order, but it's something to think about.

Joshua Ferraro wrote:

>----- Forwarded message from Tümer Garip <tgarip@...> -----
>
>
>Hi,
>We have now put the zebra into production level systems. So here is
>some experience to share.
>
>Building the zebra database from single records is a veeeeery looong
>process. (100K records 150k items)
>
>
Yes, that confirms my expectations. We could think about building some
kind of buffering in for first-time updating, or else the logic has to
be in the application, as you've seen... the situation is particularly
grim if shadow-indexing is enabled during indexing and every record is
committed, since this causes a sync of the disks, which could take up to
a whole second.
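
As a rough back-of-the-envelope check, the Python sketch below multiplies that out. It assumes the worst case of a full second per commit and the 100K records mentioned above; the function name is made up for illustration.

```python
def commit_overhead_hours(records: int, seconds_per_commit: float) -> float:
    """Total time spent in per-record commits, in hours."""
    return records * seconds_per_commit / 3600.0


# Assumption: every one of 100K records triggers its own ~1 s disk sync.
print(f"{commit_overhead_hours(100_000, 1.0):.1f} hours")  # prints "27.8 hours"
```

Even at a tenth of a second per commit that is still close to three hours, which is why batching matters for the initial build.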

Also, I'm not sure which version of Zebra you're using? I've been doing
some performance-testing of Zebra for the Internet Archive, and noted
quite a difference between 1.3 and 1.4 (the CVS version) which is really
where all the development happens.

>Best method we found:
>
>1- Change zebra.cfg file to include
>
>iso2079.recordType:grs.marcxml.collection
>recordType:grs.xml.collection
>
>2- Write (or hack export.pl) to export all the MARC records as one big
>chunk to the correct directory with the extension .iso2079, then make
>the system call "zebraidx -g iso2079 -d <dbnamehere> update records -n".
>
>This ensures that Zebra knows it's reading MARC records rather than XML
>and builds 100K+ records at zooming speed. Your ZOOM module always uses
>the grs.xml filter, while you can update or reindex any big chunk of
>the database at any time as long as you have MARC records.
>
>
Good strategy, I think.. but of course it's weird and awkward to have to
use two different formats, especially when they both have limitations.
We really must look into handling ISO2709 from ZOOM.

Mind you, version 1.4 should be able to read multiple collection-wrapped
MARCXML records in one file, but only (AFAIK) in conjunction with the
new XSLT-based index rules. I *would* like to try to develop a good way
to work with bibliographic data in that framework.

>3- We are still using the old API, so we read the XML and use
>MARC::Record->new_from_xml( $xmldata ). Note that we did not have
>to upgrade MARC::Record or MARC::Charset at all. Any MARC created
>within Koha is UTF-8, and any MARC imported into Koha (old
>marc_subfield_tables) was correctly decoded to UTF-8 with char_decode of
>biblio.
>
>
Why not ask for the records in ISO2709 from Zebra if that's what you
want to work with (when records are loaded in MARCXML, it can spit out
either 2709 or XML)? Or is it just that you want to have them in
MARC::Record? At any rate, ISO2709 is a more compact exchange format, so
it seems to make sense to use it.

>4- We modified circ2.pm and the items table to have an item onloan field
>and mapped it to the MARC holdings data. Now our OPAC search does not
>call MySQL except for the branchname.
>
>5- Average updates per day is about 2000 (circulation+cataloger). I can
>say that the speed of the ZOOM search, which slows down during a commit
>operation, is acceptable considering the speed gain we have on the
>search.
>
>
Again, I'd be keen to know which Zebra version you're running?

Because the Internet Archive will be doing similar point-updates for
records (only a *lot* more often than 2000 times per day) I have been
looking a lot at the update speed for these small changes that only
affect a single term in a record (like a circulation code).. In my test
database of 150K records, 10K average size, I get update times around
.5-.7 seconds, which just seems intuitively slower than it should have
to be. I'm going to nudge the coders to see if we can possibly do this
better.

>6- Zebra behaves very well with searches but is very temperamental with
>updates. A queue of updates sometimes crashes the zebraserver. When the
>database crashes we cannot save anything even though we are using shadow
>files. I'll be reporting on this issue once we can isolate the
>problems.
>
>
Now *that* is nasty. Do you have *any* way of consistently recreating
the problem? Even if you don't, I'm sure Adam and the Zebra crew would
like to see some stack traces of those crashes!!

--Sebastian

>Regards,
>Tumer
>
>
>
>_______________________________________________
>Koha-devel mailing list
>Koha-devel@...
>http://lists.nongnu.org/mailman/listinfo/koha-devel
>
>----- End forwarded message -----
>
>
>

--
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853



----- End forwarded message -----

--
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf@... |Full Demos at http://liblime.com/koha |1(888)KohaILS



_______________________________________________
Koha-zebra mailing list
Koha-zebra@...
http://lists.nongnu.org/mailman/listinfo/koha-zebra

Re: RE: [Koha-devel] Building zebradb

by Sebastian Hammer

Tümer Garip wrote:

>Hi,
>
>To clear up some issues with Sebastian:
>
>>>Again, I'd be keen to know which Zebra version you're running?
>
>1- I am using Zebra version 1.3.34

It might be a good idea to take 1.4 for a spin.. there have been some
changes to the ISAM system. It should otherwise do everything the
current server does, and more.

>>>Why not ask for the records in ISO2709 from Zebra if that's what you
>>>want to work with (when records are loaded in MARCXML, it can spit out
>>>either 2709 or XML)? Or is it just that you want to have them in
>>>MARC::Record? At any rate, ISO2709 is a more compact exchange format,
>>>so it seems to make sense to use it.
>
>2- Correct in theory, but as I keep saying, when you feed Zebra with
>MARCXML you CANNOT get MARC records back. All you get back is the leader
>and lots of blank space. I tried it with YAZ-client as well.
>I have reported this as a bug on the indexdata list but never got an
>answer.

I wouldn't be too surprised if this was caused by the <collection>
wrapper.. Zebra 1.4 can be configured to look for data at a certain
level of the DOM tree -- 1.3 assumed the root element is the root
element of the record.

>>>I get update times around .5-.7 seconds,
>
>3- .4 or .5 seconds is what we get as well. Oddly enough, if you write
>the file to disk and use zebraidx, you get .04-.09 seconds or even
>faster. What magic zebraidx uses that ZOOM does not know about, I don't
>know.

The magic is simply updating multiple records at the same time, in one
go -- something that is not possible through ZOOM today.
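
A minimal Python sketch of that idea, assuming the records are already available as raw ISO2709 byte strings: collect them into one file, then index the whole directory in a single zebraidx run. The function names, the file name and the database name are made up for illustration; the command line is the one quoted earlier in this thread, and the sketch only builds it without running it.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_batch(records: Iterable[bytes], path: Path) -> int:
    """Concatenate raw ISO2709 records into one file; return the count."""
    count = 0
    with path.open("wb") as out:
        for rec in records:
            out.write(rec)
            count += 1
    return count


def zebraidx_command(group: str, database: str, directory: str) -> List[str]:
    """Build the zebraidx call from this thread (built only, not executed)."""
    return ["zebraidx", "-g", group, "-d", database, "update", directory, "-n"]


# Placeholder byte strings stand in for real MARC records here.
with tempfile.TemporaryDirectory() as tmp:
    written = write_batch([b"rec-one\x1d", b"rec-two\x1d"], Path(tmp) / "all.iso2079")
    print(written, zebraidx_command("iso2079", "mydb", tmp))
```

The point of the design is that the index is touched once per batch rather than once per record.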

>>>Now *that* is nasty. Do you have *any* way of consistently recreating
>>>the problem? Even if you don't, I'm sure Adam and the Zebra crew would
>>>like to see some stack traces of those crashes!!
>
>4- The problem is intermittent. I'll try to provide details when it
>recurs. The message is "Fatal error: inconsistent register", and from
>then on you have to throw the Zebra database away and rebuild everything.
>You cannot fall back to the last working state, which is what the shadow
>system was supposed to allow. If something goes wrong during a commit,
>Zebra should not corrupt the whole database; it should refuse the commit
>and let you discard the last update. But you cannot do that. So unless I
>have missed something, shadow files are useless.

Well, they are useless when the problem is caused by a bug in Zebra, as
seems to be the case here. Clearly, the error is not detected until it's
too late. It would be very helpful if we could discover some sequence of
updates that reliably recreated this..

But I'd see this as another good reason for taking 1.4 for a spin. As I
say, there have been notable changes made to the indexing subsystem --
it may be that this is a problem that has been fixed or eliminated by
some other change.

>Here is something more to think about regarding Zebra:
>1- The MARCXML we feed into Zebra is a
><collection><record></record></collection> package. When you get it back
>it is still like that, as it should be. But if you feed Zebra ISO2709
>MARC records and ask for XML records back, you get a <record></record>
>package with no <collection> wrapper around it. Although this is not a
>problem currently, I still do not like Zebra doing that. Its MARC-to-XML
>conversion should follow standards rather than create its own.

Check the standard. The <collection> wrapper is optional. For my part, I
never use the <collection> wrapper when I deal with single records. I
don't know if Zebra does the right thing with it, but it seems to work
for you otherwise..

--Seb

--
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853





Re: RE: [Koha-devel] Building zebradb

by Sebastian Hammer-2

Oh, btw. The daily snapshots of 1.4 can be found at

ftp://ftp.indexdata.dk/pub/snapshot

--Seb



RE: RE: [Koha-devel] Building zebradb

by Tümer Garip

Hi again Sebastian,

You are a gem. You are absolutely right about the <collection> wrapper.
The MARC::File::XML module produces MARCXML with this wrapper. Having
removed this wrapper, I can now get ISO2709 out of Zebra with no problem
at all.
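
For anyone wanting to do the same, here is a small Python sketch of stripping the wrapper with the standard library. It assumes the input is a MARCXML string whose root is either <collection> or a bare <record>; the function names are made up, and this is not how MARC::File::XML or Zebra do it internally.

```python
import xml.etree.ElementTree as ET
from typing import List

# MARCXML namespace; registering it keeps the output free of ns0: prefixes.
MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", MARC_NS)


def local_name(tag: str) -> str:
    """Return an element name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def unwrap_collection(xml_text: str) -> List[str]:
    """Return each <record> as its own XML string, dropping <collection>."""
    root = ET.fromstring(xml_text)
    if local_name(root.tag) == "record":
        records = [root]  # already unwrapped, nothing to strip
    else:
        records = [el for el in root if local_name(el.tag) == "record"]
    for rec in records:
        rec.tail = None  # drop whitespace that followed the record
    return [ET.tostring(rec, encoding="unicode") for rec in records]


wrapped = "<collection><record><leader>x</leader></record></collection>"
print(unwrap_collection(wrapped))
```

Each returned string can then be sent to Zebra as a single record.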

I'll play with version 1.4 during this week.

And here is something more for those using the Windows platform.
I used to report that sorting does not work. Well, it does work. The
problem was that I have UTF-8 characters in my sort.chr table. Windows
Notepad puts a hidden character at the beginning of a file if it
contains UTF-8 characters. Zebra does not like that and gives a syntax
error. So I used another package to produce my sort.chr. Everything is
now OK.
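
The hidden character is most likely the UTF-8 byte order mark (the bytes EF BB BF) that Notepad writes at the start of a file. A small Python sketch for checking a file and stripping the mark in place, assuming that is indeed the culprit; the function name and the demo file are made up for illustration.

```python
import codecs
import tempfile
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from the file; return True if one was found."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


# Demo on a throwaway file standing in for sort.chr.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "sort.chr"
    demo.write_bytes(codecs.BOM_UTF8 + b"placeholder content\n")
    print(strip_utf8_bom(demo), strip_utf8_bom(demo))
```

Saving the file from an editor that can write UTF-8 without a BOM avoids the problem in the first place.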

I have to get back and start using 1.4 now

Thanks
Tumer

-----Original Message-----
From: Sebastian Hammer [mailto:quinn@...]
Sent: Sunday, March 12, 2006 5:03 PM
To: Tümer Garip
Cc: koha-zebra@...; Adam Dickmeiss
Subject: Re: [Koha-zebra] RE: [Koha-devel] Building zebradb



Re: RE: [Koha-devel] Building zebradb

by Sebastian Hammer

Tümer Garip wrote:

>Hi again Sebastian,
>
>You are a gem. You are absolutely right about the <collection> wrapper.
>The MARC::File::XML module produces MARCXML with this wrapper. Having
>removed this wrapper, I can now get ISO2709 out of Zebra with no problem
>at all.

Glad you got that sorted.

>I'll play with version 1.4 during this week.
>
>And here is something more for those using the Windows platform.
>I used to report that sorting does not work. Well, it does work. The
>problem was that I have UTF-8 characters in my sort.chr table. Windows
>Notepad puts a hidden character at the beginning of a file if it
>contains UTF-8 characters. Zebra does not like that and gives a syntax
>error. So I used another package to produce my sort.chr. Everything is
>now OK.

:-)

--Seb

>I have tp get back and start using 1.4 now
>
>Thanks
>Tumer
>
>-----Original Message-----
>From: Sebastian Hammer [mailto:quinn@...]
>Sent: Sunday, March 12, 2006 5:03 PM
>To: Tümer Garip
>Cc: koha-zebra@...; Adam Dickmeiss
>Subject: Re: [Koha-zebra] RE: [Koha-devel] Building zebradb
>
>
>Tümer Garip wrote:
>
>  
>
>>Hi,
>>
>>To clear some issues with Sebastian:
>>
>>
>>    
>>
>>>>Again, I'd be keen to know which Zebra version you're running?
>>>>    
>>>>
>>>>        
>>>>
>>1- I am using Zebra version 1.3.34
>>
>>
>>    
>>
>It might be a good idea to take 1.4 for a spin.. there's been some
>changes to the ISAM system. It should otherwise do everythiung the
>current server does, and more.
>
>  
>
>>>>Why not ask for the records in ISO2709 from Zebra if that's what you
>>>>want to work with (when records are loaded in MARCXML, it can spit
>>>>out either 2709 or XML)? Or is it just that you want to have them in
>>>>MARC::Record? At any rate, ISO2709 is a more compact exchange format,
>>>>so it seems to make sense to use it.
>>>>
>>2- Definitely correct and theoretically yes, but I keep saying that
>>when you feed zebra with MARCXML you CAN NOT get back MARC records. All
>>you get back is the leader and lots of blank space. I tried it with
>>YAZ-client as well. I have reported this as a bug on indexdata list but
>>never got an answer.
>>
>I wouldn't be too surprised if this was caused by the <collection>
>wrapper.. Zebra 1.4 can be configured to look for data at a certain
>level of the DOM tree -- 1.3 assumed the root element is the root
>element of the record.
>
>>>>I get update times around .5-.7 seconds,
>>>>
>>3- .4 or .5 is what we get as well. Stupidly enough, write the file to
>>disk, use zebraidx, and you get .04-.09 or even faster times. What magic
>>zebraidx uses that ZOOM does not know, I don't know.
>>
>The magic is simply updating multiple records at the same time, in one
>go -- something that is not possible through ZOOM today.
>
>>>>Now *that* is nasty. Do you have *any* way of consistently recreating
>>>>the problem? Even if you don't, I'm sure Adam and the Zebra crew
>>>>would like to see some stack traces of those crashes!!
>>>>
>>4- The problem is intermittent. I'll try to provide details when it
>>reoccurs. What it says is "Fatal error: inconsistent register"; from
>>then on you have to throw the zebra database away and rebuild
>>everything. You cannot fall back to the working state. The shadow system
>>was supposed to do this. If something goes wrong when you commit, it
>>should not go and mess up the whole database but just refuse to do it
>>and let you wipe off the last update commit operation. But you can not
>>do that. So unless I have missed something, shadow files are useless.
>>
>Well, they are useless when the problem is caused by a bug in Zebra, as
>seems to be the case here. Clearly, the error is not detected until it's
>too late. It would be very helpful if we could discover some sequence of
>updates that reliably recreated this..
>
>but I'd see this as another good reason for taking 1.4 for a spin. As I
>say, there have been notable changes made to the indexing subsystem --
>it may be that this is a problem that has been fixed or eliminated by
>some other change.
>
>>Here is some more to think about zebra:
>>1- The MARCXML we feed into zebra is a
>><collection><record></record></collection> package. When you get it
>>back it is still like that, as it is supposed to be. But if you feed
>>zebra with iso2709 marc records and ask zebra for xml records, you get
>>back a <record></record> package with no <collection> wrapping around
>>it. Although this is not a problem currently, I still do not like zebra
>>doing that. Its MARC to XML conversion should follow standards rather
>>than create its own.
>>
>Check the standard. The <collection> wrapper is optional. For my part, I
>never use the <collection> wrapper when I deal with single records. I
>don't know if Zebra does the right thing with it, but it seems to work
>for you otherwise..
>
>--Seb
>
>>Thanks for your quick response
>>Regards,
>>
>>Tumer
>>
>>-----Original Message-----
>>From: Joshua Ferraro []
>>Sent: Sunday, March 12, 2006 7:06 AM
>>To: koha-devel@...; tgarip@...
>>Subject: [Koha-devel] Building zebradb
>>
>>
>>Tumer,
>>
>>Sebastian's been good enough to respond to your post (I forwarded this
>>to the koha-zebra list). If you get a chance, could you join koha-zebra
>>(if you're not already on it) and follow up -- I've a feeling it could
>>prove to be a very productive thread.
>>
>>Cheers,
>>
>>Joshua
>>
>>----- Forwarded message from Sebastian Hammer <quinn@...>
>>-----
>>
>>
>>Hi Joshua,
>>
>>Thanks for this feedback, it's very interesting. Clearly some of the
>>issues you describe (i.e. a lack of stability around updates) indicate
>>software problems, but there are also some interesting ideas for possible
>>refinements or new developments which I think would be really useful to
>>get into the general development plans for the software... there are
>>more folks at ID involved in Zebra development that I'd like to get into
>>these thoughts... I dunno if a wiki or just a larger zebra-dev list is
>>in order, but it's something to think about.
>>
>>Joshua Ferraro wrote:
>>
>>>----- Forwarded message from Tümer Garip <tgarip@...> -----
>>>
>>>
>>>Hi,
>>>We have now put the zebra into production level systems. So here is
>>>some experience to share.
>>>
>>>Building the zebra database from single records is a veeeeery looong
>>>process. (100K records 150k items)
>>>
>>>
>>Yes, that confirms my expectations. We could think about building some
>>kind of buffering in for first-time updating, or else the logic has to
>>be in the application, as you've seen... the situation is particularly
>>grim if shadow-indexing is enabled during indexing and every record is
>>committed, since this causes a sync of the disks, which could take up to
>>a whole second.
>>
>>Also, I'm not sure which version of Zebra you're using? I've been doing
>>some performance-testing of Zebra for the Internet Archive, and noted
>>quite a difference between 1.3 and 1.4 (the CVS version) which is really
>>where all the development happens.
>>
>>>Best method we found:
>>>
>>>1- Change zebra.cfg file to include
>>>
>>>iso2079.recordType:grs.marcxml.collection
>>>recordType:grs.xml.collection
>>>
>>>2- Write (or hack export.pl) to export all the marc records as one big
>>>chunk to the correct directory with an extension .iso2079 And system
>>>call "zebraidx -g iso2079 -d <dbnamehere> update records -n".
>>>
>>>This ensures that zebra knows it's reading marc records rather than xml
>>>and builds 100K+ records in zooming speed. Your zoom module always uses
>>>the grs.xml filter while you can anytime update or reindex any big
>>>chunk of the database as long as you have marc records.
>>>
>>Good strategy, I think.. but of course it's weird and awkward to have
>>to use two different formats, especially when they both have limitations.
>>We really must look into handling ISO2709 from ZOOM.
>>
>>Mind you, version 1.4 should be able to read multiple collection-wrapped
>>MARCXML records in one file, but only (AFAIK) in conjunction with the
>>new XSLT-based index rules. I *would* like to try to develop a good way
>>to work with bibliographic data in that framework.
>>
>>>3-We are still using the old API, so we read the xml and use
>>>MARC::Record->new_from_xml( $xmldata ) Note that we did not have
>>>to upgrade MARC::Record or MARC::Charset at all. Any marc created
>>>within KOHA is UTF8 and any marc imported into KOHA (old
>>>marc_subfield_tables) was correctly decoded to utf8 with char_decode of
>>>biblio.
>>>
>>Why not ask for the records in ISO2709 from Zebra if that's what you
>>want to work with (when records are loaded in MARCXML, it can spit out
>>either 2709 or XML)? Or is it just that you want to have them in
>>MARC::Record? At any rate, ISO2709 is a more compact exchange format,
>>so it seems to make sense to use it.
>>
>>>4- We modified circ2.pm and the items table to have an item onloan
>>>field and mapped it to marc holdings data. Now our opac search does not
>>>call mysql except for the branchname.
>>>
>>>5- Average updates per day is about 2000 (circulation+cataloger). I can
>>>say that the speed of the zoom search, which slows down during a commit
>>>operation, is acceptable considering the speed gain we have on the
>>>search.
>>>
>>Again, I'd be keen to know which Zebra version you're running?
>>
>>Because the Internet Archive will be doing similar point-updates for
>>records (only a *lot* more often than 2000 times per day) I have been
>>looking a lot at the update speed for these small changes that only
>>affect a single term in a record (like a circulation code).. In my test
>>database of 150K records, 10K average size, I get update times around
>>.5-.7 seconds, which just seems intuitively faster than it should have
>>to be. I'm going to nudge the coders to see if we can possibly do this
>>better.
>>
>>>6- Zebra behaves very well with searches but is very temperamental with
>>>updates. A queue of updates sometimes crashes the zebraserver. When the
>>>database crashes we can not save anything even though we are using
>>>shadow files. I'll be reporting on this issue once we can isolate the
>>>problems.
>>>
>>Now *that* is nasty. Do you have *any* way of consistently recreating
>>the problem? Even if you don't, I'm sure Adam and the Zebra crew would
>>like to see some stack traces of those crashes!!
>>
>>--Sebastian
>>
>>>Regards,
>>>Tumer
>>>
>>>_______________________________________________
>>>Koha-devel mailing list
>>>Koha-devel@...
>>>http://lists.nongnu.org/mailman/listinfo/koha-devel
>>>
>>>----- End forwarded message -----
>>>
>

--
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853




_______________________________________________
Koha-zebra mailing list
Koha-zebra@...
http://lists.nongnu.org/mailman/listinfo/koha-zebra

Re: RE: [Koha-devel] Building zebradb

by Paul POULAIN-2

Tümer Garip wrote:
> Hi again Sebastian,
>
> You are a gem. You are absolutely right about the <collection> wrapper.
> The MARC::File::XML module produces MARCXML with this wrapper. Having
> removed this wrapper now I can get iso2709 out of zebra with no problem
> at all.

Hi Tümer,

Could you detail how you "remove the <collection> wrapper"? I
understand what it could mean, but I don't know what to do to give it a try.

However, I agree that with <collection> I can't get iso2709 in
yaz-client; I get an empty record instead (except for the leader).

Thanks again for your help !
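
One possible way to try this is sketched below in Python. It is an assumption about what "removing the wrapper" could look like, not the script Tümer used. It takes a MARCXML document that may or may not have a <collection> root and returns each <record> as its own standalone XML string. The function name is made up for the example; the namespace URI is the standard MARC21 slim one.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# Serialise MARC21 elements with a default namespace instead of "ns0:".
ET.register_namespace("", MARC_NS)


def _local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def unwrap_collection(marcxml):
    """Return every <record> in `marcxml` as a separate XML string.

    Works for a <collection> of records and for a bare <record>,
    with or without the MARC21 slim namespace.
    """
    root = ET.fromstring(marcxml)
    if _local_name(root.tag) == "record":
        records = [root]
    else:
        records = [el for el in root if _local_name(el.tag) == "record"]
    # Each record becomes the root element of its own document.
    return [ET.tostring(rec, encoding="unicode").strip() for rec in records]
```

Each returned string could then be sent to Zebra as a single record, so that the record's root element is <record>, which is what Sebastian says version 1.3 assumes.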
--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)



Re: [Koha-devel] Building zebradb

by Paul POULAIN-2

Tümer Garip wrote:
> Hi,

Hello Tümer,

> We have now put the zebra into production level systems. So here is some
> experience to share.
> Building the zebra database from single records is a veeeeery looong
> process. (100K records 150k items)
>
> Best method we found:
>
> 1- Change zebra.cfg file to include
>
> iso2079.recordType:grs.marcxml.collection
> recordType:grs.xml.collection
If I understand correctly, you now have 2 types of records in your DB
(or 2 different representations of a record)?

> 2- Write (or hack export.pl) to export all the marc records as one big
> chunk to the correct directory with an extension .iso2079 And system
> call "zebraidx -g iso2079 -d <dbnamehere> update records -n".

Could you send us the code for export.pl ?

> This ensures that zebra knows it's reading marc records rather than xml
> and builds 100K+ records in zooming speed.
> Your zoom module always uses the grs.xml filter while you can anytime
> update or reindex any big chunk of the database as long as you have marc
> records.

Great, I think I understand.
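
To make step 2 concrete, here is a rough Python sketch of the batch approach. It is not the export.pl mentioned above, which is not shown in this thread. It assumes the raw ISO2709 records are already available as byte strings, writes them into one file with the .iso2079 extension used in the quoted zebra.cfg, and builds the zebraidx argument list exactly as quoted (option order copied from the message, not checked against the zebraidx manual). The function names and the default file name are hypothetical.

```python
from pathlib import Path

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def write_batch(records, directory, filename="export.iso2079"):
    """Concatenate raw ISO2709 records into one file under `directory`."""
    out = Path(directory) / filename
    with out.open("wb") as fh:
        for rec in records:
            # A well-formed ISO 2709 record ends with the terminator byte.
            if not rec.endswith(RECORD_TERMINATOR):
                raise ValueError("record is missing the ISO 2709 terminator")
            fh.write(rec)
    return out


def zebraidx_command(dbname, directory, group="iso2079"):
    """Build the argument list for the zebraidx call quoted in the thread."""
    return ["zebraidx", "-g", group, "-d", dbname,
            "update", str(directory), "-n"]
```

The list returned by `zebraidx_command` could be handed to `subprocess.run` from Zebra's configuration directory; the sketch only builds the list, so nothing is indexed unless it is run explicitly.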

> 3-We are still using the old API, so we read the xml and use
> MARC::Record->new_from_xml( $xmldata )
> Note that we did not have to upgrade MARC::Record or MARC::Charset
> at all. Any marc created within KOHA is UTF8 and any marc imported into
> KOHA (old marc_subfield_tables) was correctly decoded to utf8 with
> char_decode of biblio.

Could it be possible to use this zebra.cfg to manage iso2709 through
Perl-ZOOM ?
If yes, we could avoid marc => xml => zoom and zoom => xml => marc
transformations.

> 4- We modified circ2.pm and items table to have item onloan field and
> mapped it to marc holdings data. Now our opac search do not call mysql
> but for the branchname.

Could you send us/me the code too ?

> 5- Average updates per day is about 2000 (circulation+cataloger). I can
> say that the speed of the zoom search which slows down during a commit
> operation is acceptable considering the speed gain we have on the
> search.
>
> 6- Zebra behaves very well with searches but is very temperamental with
> updates. A queue of updates sometimes crashes the zebraserver. When the
> database crashes we can not save anything even though we are using shadow
> files. I'll be reporting on this issue once we can isolate the problems.

You're definitely a gem too ;-)

--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)



RE: [Koha-devel] Building zebradb

by Tümer Garip

Hi Paul,
I have sent the scripts you requested to you directly, as I don't know
whether the list accepts attachments.

Regarding your questions below:
1- Well yes, I am feeding zebra with 2 different kinds of records
(iso2709 and xml) and I do not see any problem with this, as ZEBRA
changes everything to its own format anyway. This way I can create
100K+ records in around 5 minutes in ZEBRA (this time excludes the time
it takes to export my whole database, which is around 15 min). I can use
ZOOM with XML to update, delete or add new records.

2- Regarding whether we can use perl-ZOOM with iso2709 records, it seems
at the moment NO! And it is the wish of all of us that indexdata
incorporates this facility for us at some stage.
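
To put the figures from this thread side by side, here is a small back-of-the-envelope calculation in Python. It assumes the .4-.5 second per-record ZOOM update time reported earlier would also hold for a first-time build, which is only an approximation; the 5 minute batch figure is the one given above and excludes the export.

```python
def hours_for_single_updates(n_records, seconds_per_update):
    """Estimated wall-clock hours when every record is its own update."""
    return n_records * seconds_per_update / 3600.0


N_RECORDS = 100_000   # "100K+ records" from this thread
BATCH_MINUTES = 5     # reported zebraidx build time, export excluded

for per_update in (0.4, 0.5):  # reported per-record update times, in seconds
    hours = hours_for_single_updates(N_RECORDS, per_update)
    ratio = hours * 60 / BATCH_MINUTES
    print(f"{per_update:.1f} s/update: about {hours:.1f} h, "
          f"roughly {ratio:.0f}x the batch time")
```

Under those assumptions, loading the records one at a time comes out at roughly 11 to 14 hours, against about 5 minutes for the batch run.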

Best of luck,

Tumer

-----Original Message-----
From: Paul POULAIN [mailto:paul.poulain@...]
Sent: Wednesday, March 15, 2006 7:19 PM
To: Tümer Garip
Cc: koha-devel@...; koha-zebra@...
Subject: Re: [Koha-devel] Building zebradb


Tümer Garip wrote:
> Hi,

Hello Tümer,

> We have now put the zebra into production level systems. So here is
> some experience to share. Building the zebra database from single
> records is a veeeeery looong process. (100K records 150k items)
>
> Best method we found:
>
> 1- Change zebra.cfg file to include
>
> iso2079.recordType:grs.marcxml.collection
> recordType:grs.xml.collection
If I understand correctly, you now have 2 types of records in your DB
(or 2 different representations of a record)?

> 2- Write (or hack export.pl) to export all the marc records as one big
> chunk to the correct directory with an extension .iso2079 And system
> call "zebraidx -g iso2079 -d <dbnamehere> update records -n".

Could you send us the code for export.pl ?

> This ensures that zebra knows it's reading marc records rather than xml
> and builds 100K+ records in zooming speed. Your zoom module always
> uses the grs.xml filter while you can anytime update or reindex any
> big chunk of the database as long as you have marc records.

Great, I think I understand.

> 3-We are still using the old API, so we read the xml and use
> MARC::Record->new_from_xml( $xmldata ) Note that we did not have
> to upgrade MARC::Record or MARC::Charset at all. Any marc created
> within KOHA is UTF8 and any marc imported into KOHA (old
> marc_subfield_tables) was correctly decoded to utf8 with char_decode
> of biblio.

Could it be possible to use this zebra.cfg to manage iso2709 through
Perl-ZOOM ?
If yes, we could avoid marc => xml => zoom and zoom => xml => marc
transformations.

> 4- We modified circ2.pm and items table to have item onloan field and
> mapped it to marc holdings data. Now our opac search do not call mysql
> but for the branchname.

Could you send us/me the code too ?

> 5- Average updates per day is about 2000 (circulation+cataloger). I
> can say that the speed of the zoom search which slows down during a
> commit operation is acceptable considering the speed gain we have on
> the search.
>
> 6- Zebra behaves very well with searches but is very temperamental with
> updates. A queue of updates sometimes crashes the zebraserver. When
> the database crashes we can not save anything even though we are using
> shadow files. I'll be reporting on this issue once we can isolate the
> problems.

You're definitely a gem too ;-)

--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)


