Import Speed

Hi all,

So we've finally got records importing into Zebra, but there are some
serious performance issues, some related to MARC::Charset and some (I
suspect) to Perl-ZOOM/Zebra. So ... let me lay out what we're doing and
then show some benchmarks I've done before I get to my questions.

CONNECTION OBJECT

All of our connections to Zebra are handled in our Context module. My
tests indicate that it maintains the same connection throughout all the
operations. The connection maintainer checks the connection like this:

    if (defined($context->{"Zconn"})) {
        $Zconn = $context->{"Zconn"};
        $rs=$Zconn->search_pqf('@attr 1=4 mineral');
        if ($Zconn->errcode() != 0) {
            $context->{"Zconn"} = &new_Zconn();
            return $context->{"Zconn"};
        }
        return $context->{"Zconn"};
    } else {
        $context->{"Zconn"} = &new_Zconn();
        return $context->{"Zconn"};
    }

Of course, in the context of batch import, this gets called for every
record import (with '.xml' as the storage format) ... just wondering if
there's a better way ... see below for details.

IMPORT PROCESS

We've got a script called 'bulkmarcimport' that is able to take a file
full of MARC records and import them into Zebra as well as some MySQL
fields. It can be set to perform a 'commit' operation after a certain
number of records, but I haven't seen any performance issues related to
when the 'commit' operation happens (i.e., if you do it once every 100
records it seems about the same speed as if you do it once every 10000).
The process looks something like this:

  1. Grab a MARC record with MARC::Batch
  2. Add some local use fields, etc.
  3. Feed it to old-koha routines to put some stuff in MySQL
  4. Hand it to MARC::File::XML to create an XML object
  5. MARC::File::XML converts from MARC-8 to UTF-8 where necessary
  6. Feed it to Zebra via perl-zoom
  7. Perform a commit operation every so often

BENCHMARKS

I recently tried the import with 5000 records. The first few hundred
were rolling along at about 3-5 per second but this eventually dropped
to 1-2 per second and finally to about 1 per 2 secs. The whole batch
imported in 40604 seconds. Running a benchmark on an import of 100
records yields the following:

    sam:/home/koha/testing/cvsrepo/koha# perl -d:DProf misc/migration_tools/bulkmarcimport.pl -d -n 100 -commit 100 -file /home/jmf/koha.mrc
    deleting biblios
    COMMIT OPERATION SUCCESSFUL
    100 MARC records imported in 59.907173871994 seconds
    sam:/home/koha/testing/cvsrepo/koha# dprofpp tmon.out
    Exporter::export_ok_tags has -1 unstacked calls in outer
    AutoLoader::AUTOLOAD has -1 unstacked calls in outer
    Exporter::Heavy::heavy_export has 12 unstacked calls in outer
    bytes::AUTOLOAD has -1 unstacked calls in outer
    Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
    POSIX::__ANON__ has 1 unstacked calls in outer
    POSIX::load_imports has 1 unstacked calls in outer
    Exporter::export has -12 unstacked calls in outer
    Storable::thaw has 1 unstacked calls in outer
    bytes::length has 1 unstacked calls in outer
    POSIX::AUTOLOAD has -2 unstacked calls in outer
    Total Elapsed Time = 39.96495 Seconds
      User+System Time = 18.82495 Seconds
    Exclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c  Name
     19.4   3.657 20.081  25211   0.0001 0.0008  MARC::Charset::marc8_to_utf8
     17.1   3.223  9.841 264274   0.0000 0.0000  MARC::Charset::Table::get_code
     13.6   2.566  2.566 264257   0.0000 0.0000  Storable::mretrieve
     11.3   2.141  0.000 264257   0.0000 0.0000  Storable::thaw
     8.57   1.613  2.934 528514   0.0000 0.0000  Class::Accessor::__ANON__
     8.05   1.516  1.516 264274   0.0000 0.0000  SDBM_File::FETCH
     7.07   1.331  2.669 264257   0.0000 0.0000  MARC::Charset::Code::char_value
     7.02   1.321  1.321 528514   0.0000 0.0000  Class::Accessor::get
     6.80   1.281 11.123 264274   0.0000 0.0000  MARC::Charset::Table::lookup_by_ma
                                                 rc8
     4.81   0.906  0.906 264257   0.0000 0.0000  MARC::Charset::_process_escape
     2.34   0.441  0.837  14532   0.0000 0.0001  MARC::Record::field
     2.10   0.396  0.396 264274   0.0000 0.0000  MARC::Charset::Table::db
     2.09   0.393  0.393 196731   0.0000 0.0000  MARC::Field::tag
     1.83   0.344  0.344  25714   0.0000 0.0000  MARC::File::XML::escape
     1.66   0.313 21.170    503   0.0006 0.0421  MARC::File::XML::record

So ... that's all good, mostly MARC::* to blame for speed issues there.
However, that doesn't explain the dramatic slowdown over a period of
time. So I did another benchmark, this time with 5000 records. Here were
the results:

    # dprofpp tmon.out
    Exporter::export_ok_tags has -1 unstacked calls in outer
    AutoLoader::AUTOLOAD has -1 unstacked calls in outer
    Exporter::Heavy::heavy_export has 12 unstacked calls in outer
    bytes::AUTOLOAD has -1 unstacked calls in outer
    Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
    POSIX::__ANON__ has 1 unstacked calls in outer
    POSIX::load_imports has 1 unstacked calls in outer
    Exporter::export has -12 unstacked calls in outer
    utf8::AUTOLOAD has -1 unstacked calls in outer
    utf8::SWASHNEW has 1 unstacked calls in outer
    Storable::thaw has 1 unstacked calls in outer
    bytes::length has 1 unstacked calls in outer
    POSIX::AUTOLOAD has -2 unstacked calls in outer
    Total Elapsed Time = 37695.61 Seconds
      User+System Time = 33524.23 Seconds
    Exclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c  Name
     97.9   32848 32848.  10170   3.2299 3.2299  Net::Z3950::ZOOM::connection_searc
                  515                            h_pqf
     0.44   147.7 788.26 103492   0.0001 0.0008  MARC::Charset::marc8_to_utf8
     0.37   124.2 400.37 126313   0.0000 0.0000  MARC::Charset::Table::get_code
     0.36   121.2 121.20 126295   0.0000 0.0000  Storable::mretrieve
     0.20   68.49 68.491 126313   0.0000 0.0000  SDBM_File::FETCH
     0.20   67.63  0.000 126295   0.0000 0.0000  Storable::thaw
     0.17   56.76 56.767 252590   0.0000 0.0000  Class::Accessor::get
     0.17   56.11 112.88 252590   0.0000 0.0000  Class::Accessor::__ANON__
     0.15   48.71 449.09 126313   0.0000 0.0000  MARC::Charset::Table::lookup_by_ma
                  1                              rc8
     0.12   41.82 95.098 126295   0.0000 0.0000  MARC::Charset::Code::char_value
     0.10   33.27 33.274 126295   0.0000 0.0000  MARC::Charset::_process_escape
     0.06   18.81 18.811 126313   0.0000 0.0000  MARC::Charset::Table::db
     0.05   17.07 32.295 728288   0.0000 0.0000  MARC::Record::field
     0.05   15.44 15.443 802346   0.0000 0.0000  MARC::Field::tag
     0.04   13.86 828.46  25241   0.0005 0.0328  MARC::File::XML::record

QUESTIONS

So I think we could gain some import speed if we were to preprocess the
set of MARC records by doing the MARC-8 to UTF-8 and fixing position 9
in the leader _before_ we begin the import. But that still leaves us
with an import that starts out fairly fast and then starts seriously
lagging. And if I'm not mistaken, it's the connection that is taking so
much time over the long haul. So, my questions:

Can we just process the raw MARC? Why did we choose the '.xml' storage
method in Zebra and is it a good choice? Would '.sgml' or '.marc' be a
better choice (because we could batch import directly instead of
'.xml's one-at-a-time)? Could we somehow use '.marc' for the import and
then switch to '.xml'?

Any suggestions on how to handle the connection in a more efficient
way?

Cheers,

-- 
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf@...                |Full Demos at http://liblime.com/koha |1(888)KohaILS

_______________________________________________
Koha-zebra mailing list
Koha-zebra@...
http://lists.nongnu.org/mailman/listinfo/koha-zebra

Re: Import Speed

Josh,

Just to let you know that Adam and I are both at the SRU meeting in
Holland at the moment, and will be until the weekend; so apologies in
advance if you don't get a useful response before next week.

 _/|_    ___________________________________________________________________
/o ) \/  Mike Taylor  <mike@...>  http://www.miketaylor.org.uk
)_v__/\  "Historically, Taunton is part of Minehead already" -- Monty
         Python's Flying Circus.

Re: Import Speed

Joshua,

Importing records one at a time when first building a database, or when
doing a batch update that is a substantial percentage of the size of the
database, is not a good idea. The software has no way to optimize the
layout of the index files, so for each record update, things get
shuffled around, resulting in very sluggish update performance and a
less-than-ideal layout inside the index files.

It would be highly advisable to do at least the initial import from the
command-line. I think it would make a lot of sense if this could be done
well from the protocol, but AFAIK, the extended service interface at the
moment only allows you to insert one record at a time.

>Can we just process the raw MARC? Why did we choose the '.xml'
>storage method in Zebra and is it a good choice? Would '.sgml' or
>'.marc' be a better choice (because we could batch import directly
>instead of '.xml's one-at-a-time)? Could we somehow use '.marc' for
>the import and then switch to '.xml'?

That's a good question. You use .xml because extended services only
work with XML. It *may* be possible to ingest records from the
command-line as grs.marcxml (which reads MARC records and renders them
internally as MARCXML), then do subsequent updates as XML, doing the
conversion on the client side. I say *may*, because I haven't tried
that, but I think it'd be worth a shot and it should be easy to make the
experiment:

1: Start with a sample of MARC records

2: Build the initial index like so:

    % zebraidx init
    % zebraidx -f 10 -n -t grs.marcxml update recordfile

   (-n disables the shadow system for this update)

   This should run pleasantly fast compared to what you see now.

3: Try to update some records as MARCXML.

--Seb

>Any suggestions on how to handle the connection in a more efficient
>way?
>
>Cheers,

-- 
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853

Re: Import Speed

> Date: Thu, 02 Mar 2006 11:05:44 -0500
> From: Sebastian Hammer <quinn@...>
>
> Importing records one at a time when first building a database, or
> when doing a batch update that is a substantial percentage of the
> size of the database is not a good idea. The software has no way to
> optimize the layout of the index files, so for each record update,
> things get shuffled around, resulting in very sluggish update
> performance and a less-than-ideal layout inside the index files.

Sure, but ...

> It would be highly advisable to do at least the initial import from
> the command-line. I think it would make a lot of sense if this could
> be done well from the protocol, but AFAIK, the extended service
> interface at the moment only allows you to insert one record at a
> time.

But -- ?? What magic does the command-line import have access to that
ZOOM update doesn't? Clearly it's using some kind of in-memory caching
to hugely reduce the frequency of disk-writes, but why shouldn't that
also be used when doing a ZOOM update? Isn't that (part of) the purpose
of delaying the "commit" call? If not, then we need to add

    $conn->option("updateCacheSize" => 100*1024*1024);

 _/|_    ___________________________________________________________________
/o ) \/  Mike Taylor  <mike@...>  http://www.miketaylor.org.uk
)_v__/\  "If I write in C++ I probably don't use even 10% of the language,
         and in fact the other 90% I don't think I understand" -- Brian
         W. Kernighan.

Re: Import Speed

Mike Taylor wrote:

>>Date: Thu, 02 Mar 2006 11:05:44 -0500
>>From: Sebastian Hammer <quinn@...>
>>
>>Importing records one at a time when first building a database, or
>>when doing a batch update that is a substantial percentage of the
>>size of the database is not a good idea. The software has no way to
>>optimize the layout of the index files, so for each record update,
>>things get shuffled around, resulting in very sluggish update
>>performance and a less-than-ideal layout inside the index files.
>
>Sure, but ...
>
>>It would be highly advisable to do at least the initial import from
>>the command-line. I think it would make a lot of sense if this could
>>be done well from the protocol, but AFAIK, the extended service
>>interface at the moment only allows you to insert one record at a
>>time.
>
>But -- ?? What magic does the command-line import have access to that
>ZOOM update doesn't? Clearly it's using some kind of in-memory
>caching to hugely reduce the frequency of disk-writes, but why
>shouldn't that also be used when doing a ZOOM update? Isn't that (part
>of) the purpose of delaying the "commit" call? If not, then we need
>to add $conn->option("updateCacheSize" => 100*1024*1024)

'commit' has nearly nothing to do with it.

When you insert a new record, it is scanned, and keys are extracted from
the record according to the indexing rules found in the .abs file. These
keys are written to the disk, sorted, and then merged into the index. If
you update a thousand records at the same time, in a single operation,
then all of those keys are extracted and written into the indexes in a
single pass, which is much more efficient than doing it in a thousand
passes. When you do a first-time update of a thousand records, things
are even more efficient because the merging can be dropped entirely --
keys are written to the indexes just about as fast as the disk can eat
them with minimum seeks.

Doing multiple, single updates between commits doesn't help here. Keys
are still extracted from the records and merged into the indexes once
for each individual update operation; the shadow files merely record and
defer the physical 'write' operations, so they can be executed later,
independently of the 'read' operations. Update a thousand records one at
a time between commits and you just end up with some horribly complex
shadow files, and the system will probably have to work pretty hard to
do the commit step.

It would be much better if we had a new stage between the updating of
records and the commit... something to allow us to transfer a large
number of records (preferably more than one per operation to cut down on
the round-trip traffic), THEN index them, THEN commit the changes. Then
we'd be able to do remote updates as efficiently as we can do them
locally.

Something like

1. Update, update, update, update.....
2. Index
3. Commit

Etc.

The problem, as far as I can tell, is that you can only transfer records
one at a time in the present extended services system, and they're
extracted and indexed one at a time. This is a fine way to update, well,
one record at a time, but it's just about the worst way possible to
update 1000 records at a time.

It shouldn't be hard to implement either -- all the hard work has
already been done, the logic just needs to be implemented differently.

Mike, you can speak to Adam about this over lunch if you get a chance.
It is possible that I misrepresent what happens -- but this reflects my
understanding.

--Seb

-- 
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853
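
The difference between a thousand merge passes and a single one can be
illustrated with a toy cost model. This is only a sketch under stated
assumptions, not a description of Zebra's internals: it assumes that
merging k new keys into an index already holding n keys touches n + k
entries, and the figure of 50 keys per record is invented for the
example. The names merge_cost and total_cost are hypothetical.

```perl
use strict;
use warnings;

# Toy model (an assumption, not Zebra's real algorithm): merging
# $new_keys sorted keys into an index of $index_size keys rewrites
# the whole sorted run, so it touches $index_size + $new_keys entries.
sub merge_cost {
    my ($index_size, $new_keys) = @_;
    return $index_size + $new_keys;
}

# Total entries touched when loading $records records of
# $keys_per_record keys each, with one merge pass per $batch_size records.
sub total_cost {
    my ($records, $keys_per_record, $batch_size) = @_;
    die "batch size must be at least 1\n" if $batch_size < 1;
    my ($index_size, $cost) = (0, 0);
    for (my $done = 0; $done < $records; $done += $batch_size) {
        my $in_batch = $records - $done;
        $in_batch = $batch_size if $in_batch > $batch_size;
        my $new_keys = $in_batch * $keys_per_record;
        $cost       += merge_cost($index_size, $new_keys);
        $index_size += $new_keys;
    }
    return $cost;
}

# 1000 records, 50 keys each (hypothetical figures).
my $one_at_a_time = total_cost(1000, 50, 1);
my $single_pass   = total_cost(1000, 50, 1000);
printf "one record per pass: %d entries touched\n", $one_at_a_time;
printf "one pass for all:    %d entries touched\n", $single_pass;
printf "ratio:               %.1f\n", $one_at_a_time / $single_pass;
```

Under these assumptions the cost of each single-record update grows with
the size of the index, which is at least consistent with the import rate
falling as the database fills up.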

Re: Import Speed

On Thu, Mar 02, 2006 at 04:40:16PM +0000, Mike Taylor wrote:
> > Date: Thu, 2 Mar 2006 07:44:22 -0800
> > From: Joshua Ferraro <jmf@...>
> >
> There's your culprit, then. You're spending 39751 of your 40604
> seconds doing needless searches, and 853 seconds (14 minutes) doing
> the actual updates. Rip out the searches and you should get a 47-fold
> speed increase.
>
> Why are you doing the search? So far as I can see, it's just a probe
> to see whether the connection is still alive. But you don't need to do
> that: just go ahead and submit the update request, you'll find out
> soon enough if the connection's dead and you can re-forge it then if
> necessary.

Here's what the connection manager looks like now:

    if (defined($context->{"Zconn"})) {
        $Zconn = $context->{"Zconn"};
        return $context->{"Zconn"};
    } else {
        $context->{"Zconn"} = &new_Zconn();
        return $context->{"Zconn"};
    }

So ... no search ... if one is defined it just returns it and if it's
not alive I assume the app will just crash (no fault tolerance built
into the script).

And here's the new benchmark for those 5000 records:

    5000 MARC records imported in 7727.84231996536 seconds

    dprofpp tmon.out
    Exporter::export_ok_tags has -1 unstacked calls in outer
    AutoLoader::AUTOLOAD has -1 unstacked calls in outer
    Exporter::Heavy::heavy_export has 12 unstacked calls in outer
    bytes::AUTOLOAD has -1 unstacked calls in outer
    Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
    POSIX::__ANON__ has 1 unstacked calls in outer
    POSIX::load_imports has 1 unstacked calls in outer
    Exporter::export has -12 unstacked calls in outer
    utf8::AUTOLOAD has -1 unstacked calls in outer
    utf8::SWASHNEW has 1 unstacked calls in outer
    Storable::thaw has 1 unstacked calls in outer
    bytes::length has 1 unstacked calls in outer
    POSIX::AUTOLOAD has -2 unstacked calls in outer
    Total Elapsed Time = 6617.861 Seconds
      User+System Time = 706.1013 Seconds
    Exclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c  Name
     21.4   151.3 817.46 103492   0.0001 0.0008  MARC::Charset::marc8_to_utf8
     18.0   127.3 416.36 126313   0.0000 0.0000  MARC::Charset::Table::get_code
     17.1   121.0 121.08 126295   0.0000 0.0000  Storable::mretrieve
     10.9   77.27  0.000 126295   0.0000 0.0000  Storable::thaw
     10.1   71.52 71.521 126313   0.0000 0.0000  SDBM_File::FETCH
     8.42   59.48 117.80 252590   0.0000 0.0000  Class::Accessor::__ANON__
     8.26   58.31 58.317 252590   0.0000 0.0000  Class::Accessor::get
     7.21   50.88 467.25 126313   0.0000 0.0000  MARC::Charset::Table::lookup_by_ma
                  1                              rc8
     6.15   43.39 97.718 126295   0.0000 0.0000  MARC::Charset::Code::char_value
     4.87   34.35 34.354 126295   0.0000 0.0000  MARC::Charset::_process_escape
     2.71   19.10 19.101 126313   0.0000 0.0000  MARC::Charset::Table::db
     2.26   15.98 30.245 728288   0.0000 0.0000  MARC::Record::field
     2.10   14.79 14.794 802346   0.0000 0.0000  MARC::Field::tag
     1.94   13.69 857.27  25241   0.0005 0.0340  MARC::File::XML::record
     1.44   10.15 11.456 714137   0.0000 0.0000  MARC::Field::subfields

So it's definitely better without the search, but there is still the
question of XML ... being able to import raw marc (which would only
take a few seconds) would be really nice ...

> (Mind you, 14 minutes still seems very slow for 5000 poxy records. I
> think there are bulk-update cache issues going on here as well.)

-- 
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf@...                |Full Demos at http://liblime.com/koha |1(888)KohaILS
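
The figures quoted in this message can be checked with a few lines of
Perl. The sketch below only redoes the arithmetic on numbers already
given in the thread (40604 s for the first run, Mike's estimate of
39751 s spent searching, 7727.84 s for the rerun, and the 32848 s over
10170 calls that the profiler attributed to connection_search_pqf); the
variable names are made up for the example.

```perl
use strict;
use warnings;

my $total_with_probe    = 40604;     # seconds, first 5000-record run
my $spent_searching     = 39751;     # seconds, Mike's figure for the probe searches
my $total_without_probe = 7727.84;   # seconds, rerun with the probe removed
my $records             = 5000;
my $search_excl_secs    = 32848;     # profiler: exclusive time in connection_search_pqf
my $search_calls        = 10170;     # profiler: number of calls

# Time left for the real work if the searches cost nothing.
my $update_time       = $total_with_probe - $spent_searching;
# Speed-up predicted from that, and the speed-up actually measured.
my $predicted_speedup = $total_with_probe / $update_time;
my $observed_speedup  = $total_with_probe / $total_without_probe;
# Average cost per record after the change, and per probe search before it.
my $secs_per_record   = $total_without_probe / $records;
my $secs_per_search   = $search_excl_secs / $search_calls;

printf "update time without searches: %d s\n",     $update_time;
printf "predicted speed-up:           %.1fx\n",    $predicted_speedup;
printf "observed speed-up:            %.2fx\n",    $observed_speedup;
printf "seconds per record (rerun):   %.2f\n",     $secs_per_record;
printf "seconds per probe search:     %.2f\n",     $secs_per_search;
```

The predicted ratio comes out a little over 47, matching Mike's
estimate, while the observed improvement is only a little over fivefold
(about 1.5 seconds per record), so something other than the probe search
is still costing time.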

Re: Import Speed

Joshua,

Done right, a first-time update of 5000 records ought to take less than
a minute, so there is definitely room for improvement.

The big question in my mind is whether the network interface as it
stands is suitable for bulk updates; we might need Adam's input on that.
The primary problem is not so much the XML as the fact that we are
updating records one at a time, which is OK if that's what you mean to
do, but it's terrible if you mean to update things in bulk.

--Seb

-- 
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853

Re: Import Speed

> Date: Thu, 02 Mar 2006 14:07:04 -0500
> From: Sebastian Hammer <quinn@...>
>
>> But -- ?? What magic does the command-line import have access to
>> that ZOOM update doesn't? Clearly it's using some kind of
>> in-memory caching to hugely reduce the frequency of disk-writes,
>> but why shouldn't that also be used when doing a ZOOM update? Isn't
>> that (part of) the purpose of delaying the "commit" call? If not,
>> then we need to add $conn->option("updateCacheSize" =>
>> 100*1024*1024)
>
> 'commit' has nearly nothing to do with it.

Right. I now see that I was envisaging commit as being (in part)
analogous to the "index" operation that you posit below.

> It would be much better if we had a new stage between the updating
> of records and the commit... something to allow us to transfer a
> large number of records (preferably more than one per operation to
> cut down on the round-trip traffic), THEN index them, THEN commit
> the changes.

Precisely.

> Then we'd be able to do remote updates as efficiently as we can do
> them locally..
>
> Something like
>
> 1. Update, update, update, update.....
> 2. Index
> 3. Commit

Or, more often:

    update, update, update ... index
    update, update, update ... index
    update, update, update ... index
    ...
    commit

> Mike, you can speak to Adam about this over lunch if you get a
> chance.. it is possible that I misrepresent what happens -- but
> this reflects my understanding.

I'll see what I can do.

 _/|_    ___________________________________________________________________
/o ) \/  Mike Taylor  <mike@...>  http://www.miketaylor.org.uk
)_v__/\  "[The cheese factory] beneath Covenant hung insubstantial,
         lambent nacreous sepulchral vitriol ..." -- Mike Lessacher.

Re: Import Speed

> Date: Thu, 2 Mar 2006 13:29:54 -0800
> From: Joshua Ferraro <jmf@...>
>
>> Why are you doing the search? So far as I can see, it's just a probe
>> to see whether the connection is still alive. But you don't need
>> to do that: just go ahead and submit the update request, you'll
>> find out soon enough if the connection's dead and you can re-forge
>> it then if necessary.
>
> Here's what the connection manager looks like now:
>
>     if (defined($context->{"Zconn"})) {
>         $Zconn = $context->{"Zconn"};
>         return $context->{"Zconn"};
>     } else {
>         $context->{"Zconn"} = &new_Zconn();
>         return $context->{"Zconn"};
>     }
>
> So ... no search ... if one is defined it just returns it and if
> it's not alive I assume the app will just crash (no fault tolerance
> built into the script).

OK. Of course, you can reintroduce robustness later by checking for
update-failure and re-forging the connection if the error-code
indicates this course of action.

> And here's the new benchmark for those 5000 records:
>
> 5000 MARC records imported in 7727.84231996536 seconds

Hmm. Well, compared with the previous truly astonishing time of 40604
seconds, that's a better than fivefold improvement, which is not a bad
start. But, still -- at more than one second a record, we still have
_plenty_ of scope for improvement here.

How busy is your disk now?

> So it's definitely better without the search, but there is still the
> question of XML ... being able to import raw marc (which would only
> take a few seconds) would be really nice ...

I agree with Seb that the XML is unlikely to be the culprit here: the
actual indexing is the only thing I can think of that would show the
pattern you see of taking longer as the database grows.

 _/|_    ___________________________________________________________________
/o ) \/  Mike Taylor  <mike@...>  http://www.miketaylor.org.uk
)_v__/\  It's hard to retro-fit correctness.
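
The "try the update, re-forge the connection on failure" approach could
be sketched roughly as follows. This is a minimal illustration, not
Koha's code: with_reconnect and its named arguments are hypothetical,
the connection and the update are passed in as code references so that
no particular ZOOM call is assumed, and it assumes that a failed update
dies (if the real code checks errcode() instead, the operation would
need to die on a non-zero code). The optional retryable callback is
where a real script would inspect the error code.

```perl
use strict;
use warnings;

# Run $arg{operation} against the cached connection. If it dies and the
# error is judged retryable, drop the cached connection, build a new one
# with $arg{connect}, and try again, up to $arg{attempts} tries in total.
sub with_reconnect {
    my (%arg) = @_;
    my $context   = $arg{context};
    my $attempts  = $arg{attempts}  || 2;
    my $retryable = $arg{retryable} || sub { 1 };   # default: retry any error

    my $error;
    for my $try (1 .. $attempts) {
        $context->{Zconn} = $arg{connect}->() unless defined $context->{Zconn};
        my $result;
        my $ok = eval { $result = $arg{operation}->($context->{Zconn}); 1 };
        return $result if $ok;
        $error = $@;
        die $error unless $retryable->($error);
        delete $context->{Zconn};    # force a fresh connection next time
    }
    die "giving up after $attempts attempts: $error";
}

# Demonstration with stand-ins: the first "connection" is dead,
# the second one works.
my $connects = 0;
my %context;
my $outcome = with_reconnect(
    context   => \%context,
    connect   => sub { return { id => ++$connects } },
    operation => sub {
        my ($conn) = @_;
        die "connection lost\n" if $conn->{id} == 1;
        return "updated via connection $conn->{id}";
    },
);
print "$outcome\n";
```

The point of the design is that no probe search is issued on the happy
path: the cost of detecting a dead connection is paid only when an
update actually fails.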

Re: Import Speed

On Fri, Mar 03, 2006 at 09:04:48AM +0000, Mike Taylor wrote:
> Hmm. Well, compared with the previous truly astonishing time of 40604
> seconds, that's a better than fivefold improvement, which is not a bad
> start. But, still -- more than one second a record, we still have
> _plenty_ of scope for improvement here.
>
> How busy is your disk now?

It's a remote machine ... do you have suggestions for a utility that
measures disc usage on the fly?

> > So it's definitely better without the search, but there is still the
> > question of XML ... being able to import raw marc (which would only
> > take a few seconds) would be really nice ...
>
> I agree with Seb that the XML is unlikely to be culprit here: the
> actual indexing is the only thing I can think of that would show the
> pattern you see of taking longer as the database grows.

OK ... but if you look back at that benchmark, the majority of our time
is now spent converting from MARC21 to MARCXML (it seems the most
processor-intensive part of this is the conversion from MARC-8 encoding
to UTF-8). So even if Zebra is quite fast indexing XML, we still have
quite a bit of overhead getting the records into XML. I suppose I should
do a test where I pre-process the records (convert from MARC to XML) and
_then_ import. Whadya think?

Cheers,

-- 
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf@...                |Full Demos at http://liblime.com/koha |1(888)KohaILS

Re: Import Speed

Joshua Ferraro wrote:

>On Fri, Mar 03, 2006 at 09:04:48AM +0000, Mike Taylor wrote:
>
>>Hmm. Well, compared with the previous truly astonishing time of 40604
>>seconds, that's a better than fivefold improvement, which is not a bad
>>start. But, still -- more than one second a record, we still have
>>_plenty_ of scope for improvement here.
>>
>>How busy is your disk now?
>
>It's a remote machine ... do you have suggestions for a utility that
>measures disc usage on the fly?
>
>>>So it's definitely better without the search, but there is still the
>>>question of XML ... being able to import raw marc (which would only
>>>take a few seconds) would be really nice ...
>>
>>I agree with Seb that the XML is unlikely to be culprit here: the
>>actual indexing is the only thing I can think of that would show the
>>pattern you see of taking longer as the database grows.
>
>OK ... but if you look back at that benchmark, the majority of our
>time is now spent converting from marc21 to MARCXML (it seems the
>most proc intensive part of this is the conversion from MARC-8
>encoding to UTF-8). So even if Zebra is quite fast indexing XML,
>we still have quite a bit of overhead getting the records into
>XML. I suppose I should do a test where I pre-process the records
>(convert from MARC to XML) and _then_ import. Whadya think?

... into enabling Zebra to import MARC directly via the network
interface. I don't know what the issues are, but it must be doable.

That being said, I find it nearly incomprehensible that a mapping from
MARC to MARCXML or MARC8 to UTF-8 should be as demanding as these
numbers indicate.

--Seb

>Cheers,

-- 
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853

Re: Re: Import Speed

Joshua Ferraro <jmf@...>
> It's a remote machine ... do you have suggestions for a utility that
> measures disc usage on the fly?

I think systat or iostat used to do that, but I've not checked recently,
nor whether it gives the right numbers to measure this. Otherwise, you
could monitor /proc/diskstats if you find the format (is it in kernel
sources?)

Hope that helps,
-- 
MJ Ray - personal email, see http://mjr.towers.org.uk/email.html
Work: http://www.ttllp.co.uk/  irc.oftc.net/slef  Jabber/SIP ask
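
If iostat is not to hand, /proc/diskstats can be sampled directly. The
sketch below is a minimal example, assuming a Linux 2.6-style kernel
where each whole-disk line carries three identifying fields followed by
at least eleven counters (the layout is described in the kernel source
tree under Documentation/iostats.txt); the function names and the
one-second interval are choices made for this example.

```perl
use strict;
use warnings;

# Parse one /proc/diskstats line: major, minor, name, then the counters.
# Returns nothing for lines with fewer fields (e.g. old-style partitions).
sub parse_diskstats_line {
    my ($line) = @_;
    my @f = split ' ', $line;
    return if @f < 14;
    return {
        name            => $f[2],
        reads           => $f[3],
        sectors_read    => $f[5],
        writes          => $f[7],
        sectors_written => $f[9],
        ms_doing_io     => $f[12],   # milliseconds the device was busy
    };
}

# Read the whole file into a hash keyed by device name.
sub snapshot {
    my %stats;
    open my $fh, '<', '/proc/diskstats'
        or die "cannot read /proc/diskstats: $!";
    while (my $line = <$fh>) {
        my $d = parse_diskstats_line($line) or next;
        $stats{ $d->{name} } = $d;
    }
    close $fh;
    return \%stats;
}

# Share of the interval during which the device had I/O in flight.
sub busy_percent {
    my ($before, $after, $interval_secs) = @_;
    my $busy_ms = $after->{ms_doing_io} - $before->{ms_doing_io};
    return 100 * $busy_ms / ($interval_secs * 1000);
}

if (-r '/proc/diskstats') {
    my $interval = 1;
    my $before   = snapshot();
    sleep $interval;
    my $after    = snapshot();
    for my $name (sort keys %$after) {
        next unless $before->{$name};
        printf "%-10s %5.1f%% busy\n", $name,
            busy_percent($before->{$name}, $after->{$name}, $interval);
    }
}
```

Run on the Zebra host while the import is in progress, a figure close to
100% for the disk holding the index files would suggest the import is
I/O-bound rather than CPU-bound.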