Import Speed


by Joshua Ferraro

Hi all,

So we've finally got records importing into Zebra, but there are some
serious performance issues, some related to MARC::Charset and some
(I suspect) to Perl-ZOOM/Zebra.

So ... let me lay out what we're doing and then show some benchmarks
I've done before I get to my questions.

CONNECTION OBJECT

All of our connections to Zebra are handled in our Context module.
My tests indicate that it maintains the same connection throughout
all the operations. The connection maintainer checks the connection
like this:

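# Reuse the cached connection if there is one, probing it first.
if (defined($context->{"Zconn"})) {
        $Zconn = $context->{"Zconn"};
        # Throwaway search, used only to check the connection is alive.
        $rs = $Zconn->search_pqf('@attr 1=4 mineral');

        if ($Zconn->errcode() != 0) {
                # The probe failed: replace the dead connection.
                $context->{"Zconn"} = &new_Zconn();
                return $context->{"Zconn"};
        }
        return $context->{"Zconn"};
} else {
        # No connection yet: open one and cache it.
        $context->{"Zconn"} = &new_Zconn();
        return $context->{"Zconn"};
}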

Of course, in the context of batch import, this gets called for
every record import (with '.xml' as the storage format) ... just
wondering if there's a better way ... see below for details.

IMPORT PROCESS

We've got a script called 'bulkmarcimport' that is able to take a
file full of MARC records and import them into Zebra as well as
some MySQL fields. It can be set to perform a 'commit' operation
after a certain number of records, but I haven't seen any performance
issues related to when the 'commit' operation happens (i.e., if you do
it once every 100 records it seems about the same speed as if you
do it once every 10000).

The process looks something like this:

Grab a MARC record with MARC::Batch
Add some local use fields, etc.
Feed it to old-koha routines to put some stuff in MySQL
Hand it to MARC::File::XML to create an XML object
        MARC::File::XML converts from MARC-8 to UTF-8 where necessary
Feed it to Zebra via perl-zoom
Perform a commit operation every so often
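
The loop described above can be sketched roughly as follows. This is
only an illustration and not the actual bulkmarcimport code: the
function name import_batch and the convert, send and commit callbacks
are placeholders standing in for the MARC::File::XML conversion, the
perl-zoom update and the Zebra commit.

```perl
use strict;
use warnings;

# Hypothetical skeleton of the import loop: every callback is a stand-in
# for the real step (conversion, Zebra update, commit).
sub import_batch {
    my (%args) = @_;
    my $records      = $args{records};        # array ref of raw records
    my $commit_every = $args{commit_every};   # e.g. 100
    my $pending      = 0;                     # records sent since last commit
    my $commits      = 0;

    for my $raw (@$records) {
        my $xml = $args{convert}->($raw);     # MARC-8 -> UTF-8, MARC -> XML
        $args{send}->($xml);                  # one update per record
        if (++$pending >= $commit_every) {
            $args{commit}->();
            $commits++;
            $pending = 0;
        }
    }
    # Commit whatever is left over after the last full batch.
    if ($pending > 0) {
        $args{commit}->();
        $commits++;
    }
    return $commits;
}

# Usage with dummy callbacks: 250 records, commit every 100.
my @sent;
my $commits = import_batch(
    records      => [ map { "record $_" } 1 .. 250 ],
    commit_every => 100,
    convert      => sub { "<record>$_[0]</record>" },
    send         => sub { push @sent, $_[0] },
    commit       => sub { },
);
print "sent ", scalar(@sent), " records in $commits commits\n";
```

Note that each record is still sent in its own update call here; the
commit interval only changes how often the commit callback runs.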

BENCHMARKS

I recently tried the import with 5000 records. The first few hundred
were rolling along at about 3-5 per second but this eventually
dropped to 1-2 per second and finally to about 1 per 2 secs. The whole
batch imported in 40604 seconds.

Running a benchmark on an import of 100 records yields the following:
sam:/home/koha/testing/cvsrepo/koha# perl -d:DProf misc/migration_tools/bulkmarcimport.pl -d -n 100 -commit 100 -file /home/jmf/koha.mrc
deleting biblios
COMMIT OPERATION SUCCESSFUL
100 MARC records imported in 59.907173871994 seconds

sam:/home/koha/testing/cvsrepo/koha# dprofpp tmon.out                          
Exporter::export_ok_tags has -1 unstacked calls in outer
AutoLoader::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export has 12 unstacked calls in outer
bytes::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
POSIX::__ANON__ has 1 unstacked calls in outer
POSIX::load_imports has 1 unstacked calls in outer
Exporter::export has -12 unstacked calls in outer
Storable::thaw has 1 unstacked calls in outer
bytes::length has 1 unstacked calls in outer
POSIX::AUTOLOAD has -2 unstacked calls in outer
Total Elapsed Time = 39.96495 Seconds
  User+System Time = 18.82495 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 19.4   3.657 20.081  25211   0.0001 0.0008  MARC::Charset::marc8_to_utf8
 17.1   3.223  9.841 264274   0.0000 0.0000  MARC::Charset::Table::get_code
 13.6   2.566  2.566 264257   0.0000 0.0000  Storable::mretrieve
 11.3   2.141  0.000 264257   0.0000 0.0000  Storable::thaw
 8.57   1.613  2.934 528514   0.0000 0.0000  Class::Accessor::__ANON__
 8.05   1.516  1.516 264274   0.0000 0.0000  SDBM_File::FETCH
 7.07   1.331  2.669 264257   0.0000 0.0000  MARC::Charset::Code::char_value
 7.02   1.321  1.321 528514   0.0000 0.0000  Class::Accessor::get
 6.80   1.281 11.123 264274   0.0000 0.0000  MARC::Charset::Table::lookup_by_marc8
 4.81   0.906  0.906 264257   0.0000 0.0000  MARC::Charset::_process_escape
 2.34   0.441  0.837  14532   0.0000 0.0001  MARC::Record::field
 2.10   0.396  0.396 264274   0.0000 0.0000  MARC::Charset::Table::db
 2.09   0.393  0.393 196731   0.0000 0.0000  MARC::Field::tag
 1.83   0.344  0.344  25714   0.0000 0.0000  MARC::File::XML::escape
 1.66   0.313 21.170    503   0.0006 0.0421  MARC::File::XML::record

So ... that's all good, mostly MARC::* to blame for speed issues
there. However, that doesn't explain the dramatic slowdown over a
period of time. So I did another benchmark, this time with 5000 records.
Here were the results:

# dprofpp tmon.out
Exporter::export_ok_tags has -1 unstacked calls in outer
AutoLoader::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export has 12 unstacked calls in outer
bytes::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
POSIX::__ANON__ has 1 unstacked calls in outer
POSIX::load_imports has 1 unstacked calls in outer
Exporter::export has -12 unstacked calls in outer
utf8::AUTOLOAD has -1 unstacked calls in outer
utf8::SWASHNEW has 1 unstacked calls in outer
Storable::thaw has 1 unstacked calls in outer
bytes::length has 1 unstacked calls in outer
POSIX::AUTOLOAD has -2 unstacked calls in outer
Total Elapsed Time = 37695.61 Seconds
  User+System Time = 33524.23 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 97.9   32848 32848.515  10170   3.2299 3.2299  Net::Z3950::ZOOM::connection_search_pqf
 0.44   147.7 788.26 103492   0.0001 0.0008  MARC::Charset::marc8_to_utf8
 0.37   124.2 400.37 126313   0.0000 0.0000  MARC::Charset::Table::get_code
 0.36   121.2 121.20 126295   0.0000 0.0000  Storable::mretrieve
 0.20   68.49 68.491 126313   0.0000 0.0000  SDBM_File::FETCH
 0.20   67.63  0.000 126295   0.0000 0.0000  Storable::thaw
 0.17   56.76 56.767 252590   0.0000 0.0000  Class::Accessor::get
 0.17   56.11 112.88 252590   0.0000 0.0000  Class::Accessor::__ANON__
 0.15   48.71 449.091 126313   0.0000 0.0000  MARC::Charset::Table::lookup_by_marc8
 0.12   41.82 95.098 126295   0.0000 0.0000  MARC::Charset::Code::char_value
 0.10   33.27 33.274 126295   0.0000 0.0000  MARC::Charset::_process_escape
 0.06   18.81 18.811 126313   0.0000 0.0000  MARC::Charset::Table::db
 0.05   17.07 32.295 728288   0.0000 0.0000  MARC::Record::field
 0.05   15.44 15.443 802346   0.0000 0.0000  MARC::Field::tag
 0.04   13.86 828.46  25241   0.0005 0.0328  MARC::File::XML::record

QUESTIONS

So I think we could gain some import speed if we were to preprocess
the set of MARC records by doing the MARC-8 to UTF-8 and fixing
position 9 in the leader _before_ we begin the import. But that still
leaves us with an import that starts out fairly fast and then starts
seriously lagging. And if I'm not mistaken, it's the connection that
is taking so much time over the long haul.

So, my questions:

Can we just process the raw MARC? Why did we choose the '.xml'
storage method in Zebra and is it a good choice? Would '.sgml' or
'.marc' be a better choice (because we could batch import directly
instead of '.xml's one-at-a-time)? Could we somehow use '.marc' for
the import and then switch to '.xml'?

Any suggestions on how to handle the connection in a more efficient
way?

Cheers,

--
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf@... |Full Demos at http://liblime.com/koha |1(888)KohaILS


_______________________________________________
Koha-zebra mailing list
Koha-zebra@...
http://lists.nongnu.org/mailman/listinfo/koha-zebra

Re: Import Speed

by Mike Taylor

Josh,

Just to let you know that Adam and I are both at the SRU meeting in
Holland at the moment, and will be until the weekend; so apologies in
advance if you don't get a useful response before next week.

 _/|_ ___________________________________________________________________
/o ) \/  Mike Taylor  <mike@...>  http://www.miketaylor.org.uk
)_v__/\  "Historically, Taunton is part of Minehead already" -- Monty
         Python's Flying Circus.



Re: Import Speed

by Sebastian Hammer

Joshua,

Importing records one at a time when first building a database, or when
doing a batch update that is a substantial percentage of the size of the
database is not a good idea. The software has no way to optimize the
layout of the index files, so for each record update, things get
shuffled around, resulting in very sluggish update performance and a
less-than-ideal layout inside the index files.

It would be highly advisable to do at least the initial import from the
command-line. I think it would make a lot of sense if this could be done
well from the protocol, but AFAIK, the extended service interface at the
moment only allows you to insert one record at a time.

>Can we just process the raw MARC? Why did we choose the '.xml'
>storage method in Zebra and is it a good choice? Would '.sgml' or
>'.marc' be a better choice (because we could batch import directly
>instead of '.xml's one-at-a-time)? Could we somehow use '.marc' for
>the import and then switch to '.xml'?
>  
>
That's a good question. You use .xml because extended services only work
with XML. It *may* be possible to ingest records from the command-line
as grs.marcxml (which reads MARC records and renders them internally as
MARCXML), then do subsequent updates as XML, doing the conversion on the
client side. I say *may*, because I haven't tried that, but I think it'd
be worth a shot and it should be easy to make the experiment:

1: Start with a sample of MARC records
2: Build the initial index like so:

% zebraidx init
% zebraidx -f 10 -n -t grs.marcxml update recordfile

(-n disables the shadow system for this update.)

This should run pleasantly fast compared to what you see now.

3: Try to update some records as MARCXML.

--Seb


--
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853



Re: Import Speed

by Mike Taylor

> Date: Thu, 02 Mar 2006 11:05:44 -0500
> From: Sebastian Hammer <quinn@...>
>
> Importing records one at a time when first building a database, or
> when doing a batch update that is a substantial percentage of the
> size of the database is not a good idea. The software has no way to
> optimize the layout of the index files, so for each record update,
> things get shuffled around, resulting in very sluggish update
> performance and a less-than-ideal layout inside the index files.

Sure, but ...

> It would be highly advisable to do at least the initial import from
> the command-line. I think it would make a lot of sense if this could
> be done well from the protocol, but AFAIK, the extended service
> interface at the moment only allows you to insert one record at a
> time.

But -- ??  What magic does the command-line import have access to that
ZOOM update doesn't?  Clearly it's using some kind of in-memory
caching to hugely reduce the frequency of disk-writes, but why
shouldn't that also be used when doing a ZOOM update?  Isn't that (part
of) the purpose of delaying the "commit" call?  If not, then we need
to add $conn->option("updateCacheSize" => 100*1024*1024);

 _/|_ ___________________________________________________________________
/o ) \/  Mike Taylor  <mike@...>  http://www.miketaylor.org.uk
)_v__/\  "If I write in C++ I probably don't use even 10% of the language,
         and in fact the other 90% I don't think I understand" -- Brian
         W. Kernighan.



Re: Import Speed

by Mike Taylor

> Date: Thu, 2 Mar 2006 07:44:22 -0800
> From: Joshua Ferraro <jmf@...>
>
> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>  97.9   32848 32848.515  10170   3.2299 3.2299  Net::Z3950::ZOOM::connection_search_pqf

!!!

There's your culprit, then.  You're spending 39751 of your 40604
seconds doing needless searches, and 853 seconds (14 minutes) doing
the actual updates.  Rip out the searches and you should get a 47-fold
speed increase.
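
For anyone who wants to check the arithmetic, here is a small sketch of
how those figures fall out. It assumes, as the estimate above does,
that the 97.9% share reported by dprofpp can be applied to the 40604
seconds of wall-clock time; the variable names are just for
illustration.

```perl
use strict;
use warnings;

# Figures taken from the import log and the dprofpp output quoted above.
my $total_secs   = 40604;    # time reported for the 5000-record import
my $search_share = 0.979;    # %Time attributed to connection_search_pqf

my $search_secs = $total_secs * $search_share;   # time spent on probe searches
my $other_secs  = $total_secs - $search_secs;    # everything else
my $speedup     = $total_secs / $other_secs;     # gain if the probes vanish

printf "searches: %.0f s, rest: %.0f s (%.0f min), speed-up: about %.1fx\n",
    $search_secs, $other_secs, $other_secs / 60, $speedup;
```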

Why are you doing the search?  So far as I can see, it's just a probe to
see whether the connection is still alive.  But you don't need to do
that: just go ahead and submit the update request, you'll find out
soon enough if the connection's dead and you can re-forge it then if
necessary.

(Mind you, 14 minutes still seems very slow for 5000 poxy records.  I
think there are bulk-update cache issues going on here as well.)

 _/|_ ___________________________________________________________________
/o ) \/  Mike Taylor  <mike@...>  http://www.miketaylor.org.uk
)_v__/\  "One generation's 'amusing dreck' is the next generation's high
         literature.  ('I think the anvil landing on the coyote's head
         symbolizes ...')" -- Bill Cameron.




Re: Import Speed

by Sebastian Hammer

Mike Taylor wrote:

>>Date: Thu, 02 Mar 2006 11:05:44 -0500
>>From: Sebastian Hammer <quinn@...>
>>
>>Importing records one at a time when first building a database, or
>>when doing a batch update that is a substantial percentage of the
>>size of the database is not a good idea. The software has no way to
>>optimize the layout of the index files, so for each record update,
>>things get shuffled around, resulting in very sluggish update
>>performance and a less-than-ideal layout inside the index files.
>>    
>>
>
>Sure, but ...
>
>  
>
>>It would be highly advisable to do at least the initial import from
>>the command-line. I think it would make a lot of sense if this could
>>be done well from the protocol, but AFAIK, the extended service
>>interface at the moment only allows you to insert one record at a
>>time.
>>    
>>
>
>But -- ??  What magic does the command-line import have access to that
>ZOOM update doesn't?  Clearly it's using some kind of in-memory
>caching to hugely reduce the frequency of disk-writes, but why
>shouldn't that also be used when doing a ZOOM update?  Isn't that (part
>of) the purpose of delaying the "commit" call?  If not, then we need
>to add $conn->option("updateCacheSize" => 100*1024*1024)
>  
>
'commit' has nearly nothing to do with it.

When you insert a new record, it is scanned, and keys are extracted from
the record according to the indexing rules found in the .abs file. These
keys are written to the disk, sorted, and then merged into the index. If
you update a thousand records at the same time, in a single operation,
then all of those keys are extracted and written into the indexes in a
single pass, which is much more efficient than doing it in a thousand
passes. When you do a first-time update of a thousand records, things
are even more efficient because the merging can be dropped entirely --
keys are written to the indexes just about as fast as the disk can eat
them with minimum seeks.

Doing multiple, single updates between commits doesn't help here. Keys
are still extracted from the records and merged into the indexes once
for each individual update operation. The shadow files merely record
and defer the physical 'write' operations, so they can be executed
later, independently of the 'read' operations. Update a thousand records
one at a time between commits and you just end up with some horribly
complex shadow files, and the system will probably have to work pretty
hard to do the commit step.
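
To make the difference concrete, here is a toy model of the argument
above. It is not Zebra's real index format or algorithm: the "index" is
just a sorted list of keys, every merge pass rewrites the whole list,
and the cost is counted as the number of keys written. It also ignores
the first-time shortcut where merging is skipped entirely. All names
and the made-up key data are assumptions for illustration.

```perl
use strict;
use warnings;

# Merge two sorted lists of keys into a new sorted list.
sub merge_sorted {
    my ($index, $keys) = @_;
    my @merged;
    my ($i, $j) = (0, 0);
    while ($i < @$index && $j < @$keys) {
        push @merged, $index->[$i] le $keys->[$j] ? $index->[$i++] : $keys->[$j++];
    }
    push @merged, @{$index}[ $i .. $#$index ], @{$keys}[ $j .. $#$keys ];
    return \@merged;
}

# Build an index from $records (array ref of array refs of keys), doing
# one merge pass per $batch_size records. Returns the index and the
# total number of keys written across all passes.
sub build_index {
    my ($records, $batch_size) = @_;
    my $index   = [];
    my $written = 0;
    my $count   = 0;
    my @pending;
    for my $rec (@$records) {
        push @pending, @$rec;
        next if ++$count % $batch_size;      # batch not full yet
        $index = merge_sorted($index, [ sort @pending ]);
        $written += @$index;                 # the whole index is rewritten
        @pending = ();
    }
    if (@pending) {                          # final partial batch
        $index = merge_sorted($index, [ sort @pending ]);
        $written += @$index;
    }
    return ($index, $written);
}

# 500 made-up records with 10 keys each.
my @records;
for my $r (1 .. 500) {
    push @records, [ map { sprintf "key%05d", ($r * 7919 + $_ * 104729) % 100000 } 1 .. 10 ];
}

my ($one_by_one, $cost_single) = build_index(\@records, 1);
my ($in_bulk,    $cost_bulk)   = build_index(\@records, 500);
print "one record per pass: $cost_single keys written\n";
print "single pass:         $cost_bulk keys written\n";
```

Both runs end up with the same index; only the amount of rewriting
differs, and in this model it grows with the square of the number of
records when they are merged one at a time.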

It would be much better if we had a new stage between the updating of
records and the commit... something to allow us to transfer a large
number of records (preferably more than one per operation to cut down on
the round-trip traffic), THEN index them, THEN commit the changes. Then
we'd be able to do remote updates as efficiently as we can do them locally.

Something like

1. Update, update, update, update.....
2. Index
3. Commit

Etc.

The problem, as far as I can tell, is that you can only transfer records
one at a time in the present extended services system, and they're
extracted and indexed one at a time. This is a fine way to update,
well, one record at a time, but it's just about the worst way possible
to update 1000 records at a time.

It shouldn't be hard to implement either -- all the hard work has
already been done, the logic just needs to be implemented differently.

Mike, you can speak to Adam about this over lunch if you get a chance.
It is possible that I misrepresent what happens -- but this reflects my
understanding.

--Seb


--
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853



Re: Import Speed

by Joshua Ferraro

On Thu, Mar 02, 2006 at 04:40:16PM +0000, Mike Taylor wrote:

> > Date: Thu, 2 Mar 2006 07:44:22 -0800
> > From: Joshua Ferraro <jmf@...>
> >
> There's your culprit, then.  You're spending 39751 of your 40604
> seconds doing needless searches, and 853 seconds (14 minutes) doing
> the actual updates.  Rip out the searches and you should get a 47-fold
> speed increase.
>
> Why are you doing the search?  So far as I can see, it's just a probe to
> see whether the connection is still alive.  But you don't need to do
> that: just go ahead and submit the update request, you'll find out
> soon enough if the connection's dead and you can re-forge it then if
> necessary.

Here's what the connection manager looks like now:

        if (defined($context->{"Zconn"})) {
                $Zconn = $context->{"Zconn"};
                return $context->{"Zconn"};
        } else {
                $context->{"Zconn"} = &new_Zconn();
                return $context->{"Zconn"};
        }

So ... no search ... if one is defined it just returns it and if
it's not alive I assume the app will just crash (no fault tolerance
built into the script).

And here's the new benchmark for those 5000 records:

5000 MARC records imported in 7727.84231996536 seconds

# dprofpp tmon.out
Exporter::export_ok_tags has -1 unstacked calls in outer
AutoLoader::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export has 12 unstacked calls in outer
bytes::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export_ok_tags has 1 unstacked calls in outer
POSIX::__ANON__ has 1 unstacked calls in outer
POSIX::load_imports has 1 unstacked calls in outer
Exporter::export has -12 unstacked calls in outer
utf8::AUTOLOAD has -1 unstacked calls in outer
utf8::SWASHNEW has 1 unstacked calls in outer
Storable::thaw has 1 unstacked calls in outer
bytes::length has 1 unstacked calls in outer
POSIX::AUTOLOAD has -2 unstacked calls in outer
Total Elapsed Time = 6617.861 Seconds
  User+System Time = 706.1013 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 21.4   151.3 817.46 103492   0.0001 0.0008  MARC::Charset::marc8_to_utf8
 18.0   127.3 416.36 126313   0.0000 0.0000  MARC::Charset::Table::get_code
 17.1   121.0 121.08 126295   0.0000 0.0000  Storable::mretrieve
 10.9   77.27  0.000 126295   0.0000 0.0000  Storable::thaw
 10.1   71.52 71.521 126313   0.0000 0.0000  SDBM_File::FETCH
 8.42   59.48 117.80 252590   0.0000 0.0000  Class::Accessor::__ANON__
 8.26   58.31 58.317 252590   0.0000 0.0000  Class::Accessor::get
 7.21   50.88 467.251 126313   0.0000 0.0000  MARC::Charset::Table::lookup_by_marc8
 6.15   43.39 97.718 126295   0.0000 0.0000  MARC::Charset::Code::char_value
 4.87   34.35 34.354 126295   0.0000 0.0000  MARC::Charset::_process_escape
 2.71   19.10 19.101 126313   0.0000 0.0000  MARC::Charset::Table::db
 2.26   15.98 30.245 728288   0.0000 0.0000  MARC::Record::field
 2.10   14.79 14.794 802346   0.0000 0.0000  MARC::Field::tag
 1.94   13.69 857.27  25241   0.0005 0.0340  MARC::File::XML::record
 1.44   10.15 11.456 714137   0.0000 0.0000  MARC::Field::subfields

So it's definitely better without the search, but there is still
the question of XML ... being able to import raw marc (which would
only take a few seconds) would be really nice ...

> (Mind you, 14 minutes still seems very slow for 5000 poxy records.  I
> think there are bulk-update cache issues going on here as well.)

--
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf@... |Full Demos at http://liblime.com/koha |1(888)KohaILS


Re: Import Speed

by Sebastian Hammer

Joshua,

Done right, a first-time update of 5000 records ought to take less than
a minute, so there is definitely room for improvement.

The big question in my mind is whether the network interface as it
stands is suitable for bulk updates; we might need Adam's input on
that. The primary problem is not so much the XML as the fact that we are
updating records one at a time, which is OK if that's what you mean to
do, but it's terrible if you mean to update things in bulk.

--Seb


--
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853



Re: Import Speed

by Mike Taylor

> Date: Thu, 02 Mar 2006 14:07:04 -0500
> From: Sebastian Hammer <quinn@...>
>
>> But -- ??  What magic does the command-line import have access to
>> that ZOOM update doesn't?  Clearly it's using some kind of
>> in-memory caching to hugely reduce the frequency of disk-writes,
>> but why shouldn't that also be used when doing a ZOOM update?  Isn't
>> that (part of) the purpose of delaying the "commit" call?  If not,
>> then we need to add $conn->option("updateCacheSize" =>
>> 100*1024*1024)
>
> 'commit' has nearly nothing to do with it.

Right.  I now see that I was envisaging commit as being (in part)
analogous to the "index" operation that you posit below.

> It would be much better if we had a new stage between the updating
> of records and the commit... something to allow us to transfer a
> large number of records (preferably more than one per operation to
> cut down on the round-trip traffic), THEN index them, THEN commit
> the changes.

Precisely.

> Then we'd be able to do remote updates as efficiently as we can do
> them locally..
>
> Something like
>
> 1. Update, update, update, update.....
> 2. Index
> 3. Commit

Or, more often:

        update, update, update ...
        index
        update, update, update ...
        index
        update, update, update ...
        index
        ...
        commit
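
Seen from the client, that sequence might look something like the
sketch below. To be clear, nothing like this exists in ZOOM or Zebra
today: staged_import and the update, index and commit callbacks are
hypothetical names for the proposed stages, used only to show the order
of operations.

```perl
use strict;
use warnings;

# Hypothetical client-side driver for the proposed three-stage protocol.
# The three callbacks are stand-ins; no such "index" operation exists yet.
sub staged_import {
    my ($records, $index_every, $ops) = @_;
    my $queued = 0;
    for my $rec (@$records) {
        $ops->{update}->($rec);               # transfer only, no indexing yet
        if (++$queued == $index_every) {
            $ops->{index}->();                # index the whole queued batch
            $queued = 0;
        }
    }
    $ops->{index}->() if $queued;             # index the final partial batch
    $ops->{commit}->();                       # one commit at the very end
    return;
}

# Record the order of operations for 7 records, indexing every 3.
my @log;
staged_import(
    [ 1 .. 7 ], 3,
    {
        update => sub { push @log, 'update' },
        index  => sub { push @log, 'index' },
        commit => sub { push @log, 'commit' },
    },
);
print join(' ', @log), "\n";
```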

> Mike, you can speak to Adam about this over lunch if you get a
> chance..  it is possible that I misrepresent what happens -- but
> this reflects my understanding.

I'll see what I can do.

 _/|_ ___________________________________________________________________
/o ) \/  Mike Taylor  <mike@...>  http://www.miketaylor.org.uk
)_v__/\  "[The cheese factory] beneath Covenant hung insubstantial,
         lambent nacreous sepulchral vitriol ..." -- Mike Lessacher.



Re: Import Speed

by Mike Taylor

> Date: Thu, 2 Mar 2006 13:29:54 -0800
> From: Joshua Ferraro <jmf@...>
>
>> Why are you doing the search?  So far as I can see, it's just a probe
>> to see whether the connection is still alive.  But you don't need
>> to do that: just go ahead and submit the update request, you'll
>> find out soon enough if the connection's dead and you can re-forge
>> it then if necessary.
>
> Here's what the connection manager looks like now:
>
>         if (defined($context->{"Zconn"})) {
>                 $Zconn = $context->{"Zconn"};
>                 return $context->{"Zconn"};
>         } else {
>                 $context->{"Zconn"} = &new_Zconn();
>                 return $context->{"Zconn"};
>         }
>
> So ... no search ... if one is defined it just returns it and if
> it's not alive I assume the app will just crash (no fault tolerance
> built into the script).

OK.  Of course, you can reintroduce robustness later by checking for
update-failure and re-forging the connection if the error-code
indicates this course of action.
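
A minimal sketch of that idea, with some loud assumptions: the
$connect and $update arguments are caller-supplied code refs standing
in for new_Zconn() and the real update call, update_with_retry is an
invented name, and for brevity this version reconnects on any failure
rather than inspecting the error code as real code should.

```perl
use strict;
use warnings;

# Try the update on the cached connection; if it fails, assume the
# connection died, re-forge it and retry once. $connect and $update are
# placeholders for the real connection factory and update call.
sub update_with_retry {
    my ($state, $connect, $update, $record) = @_;
    $state->{conn} = $connect->() unless defined $state->{conn};

    my $ok = eval { $update->($state->{conn}, $record); 1 };
    return 1 if $ok;

    # Real code would check the error code here before reconnecting.
    $state->{conn} = $connect->();
    $update->($state->{conn}, $record);      # a second failure propagates
    return 1;
}

# Demo with a fake connection that fails on its first use.
my $connections = 0;
my $connect = sub { return { id => ++$connections } };
my $update  = sub {
    my ($conn, $record) = @_;
    die "connection lost\n" if $conn->{id} == 1;
    return "stored $record via connection $conn->{id}";
};

my %state;
update_with_retry(\%state, $connect, $update, "record A");
update_with_retry(\%state, $connect, $update, "record B");
print "connections opened: $connections\n";
```

The point of the design is that the healthy path costs nothing extra:
no probe search is sent, and the reconnect logic only runs after an
update has actually failed.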

> And here's the new benchmark for those 5000 records:
>
> 5000 MARC records imported in 7727.84231996536 seconds

Hmm.  Well, compared with the previous truly astonishing time of 40604
seconds, that's a better than fivefold improvement, which is not a bad
start.  But, still -- more than one second a record, we still have
_plenty_ of scope for improvement here.

How busy is your disk now?

> So it's definitely better without the search, but there is still the
> question of XML ... being able to import raw marc (which would only
> take a few seconds) would be really nice ...

I agree with Seb that the XML is unlikely to be the culprit here: the
actual indexing is the only thing I can think of that would show the
pattern you see of taking longer as the database grows.

 _/|_ ___________________________________________________________________
/o ) \/  Mike Taylor  <mike@...>  http://www.miketaylor.org.uk
)_v__/\  It's hard to retro-fit correctness.



Re: Import Speed

by Joshua Ferraro

On Fri, Mar 03, 2006 at 09:04:48AM +0000, Mike Taylor wrote:
> Hmm.  Well, compared with the previous truly astonishing time of 40604
> seconds, that's a better than fivefold improvement, which is not a bad
> start.  But, still -- more than one second a record, we still have
> _plenty_ of scope for improvement here.
>
> How busy is your disk now?

It's a remote machine ... do you have suggestions for a utility that
measures disc usage on the fly?

> > So it's definitely better without the search, but there is still the
> > question of XML ... being able to import raw marc (which would only
> > take a few seconds) would be really nice ...
>
> I agree with Seb that the XML is unlikely to be the culprit here: the
> actual indexing is the only thing I can think of that would show the
> pattern you see of taking longer as the database grows.

OK ... but if you look back at that benchmark, the majority of our
time is now spent converting from MARC21 to MARCXML (it seems the
most processor-intensive part of this is the conversion from MARC-8
encoding to UTF-8). So even if Zebra is quite fast indexing XML,
we still have quite a bit of overhead getting the records into
XML. I suppose I should do a test where I pre-process the records
(convert from MARC to XML) and _then_ import. Whadya think?

Cheers,

--
Joshua Ferraro               VENDOR SERVICES FOR OPEN-SOURCE SOFTWARE
President, Technology       migration, training, maintenance, support
LibLime                                Featuring Koha Open-Source ILS
jmf@... |Full Demos at http://liblime.com/koha |1(888)KohaILS


Re: Import Speed

by Sebastian Hammer

Joshua Ferraro wrote:

>On Fri, Mar 03, 2006 at 09:04:48AM +0000, Mike Taylor wrote:
>  
>
>>Hmm.  Well, compared with the previous truly astonishing time of 40604
>>seconds, that's a better than fivefold improvement, which is not a bad
>>start.  But, still -- more than one second a record, we still have
>>_plenty_ of scope for improvement here.
>>
>>How busy is your disk now?
>>    
>>
>It's a remote machine ... do you have suggestions for a utility that
>measures disc usage on the fly?
>
>  
>
>>>So it's definitely better without the search, but there is still the
>>>question of XML ... being able to import raw marc (which would only
>>>take a few seconds) would be really nice ...
>>>      
>>>
>>I agree with Seb that the XML is unlikely to be the culprit here: the
>>actual indexing is the only thing I can think of that would show the
>>pattern you see of taking longer as the database grows.
>>    
>>
>OK ... but if you look back at that benchmark, the majority of our
>time is now spent converting from marc21 to MARCXML (it seems the
>most proc intensive part of this is the conversion from MARC-8
>encoding to UTF-8). So even if Zebra is quite fast indexing XML,
>we still have quite a bit of overhead getting the records into
>XML. I suppose I should do a test where I pre-process the records
>(convert from MARC to XML) and _then_ import. Whadya think?
>  
>
If that really is the case, we should probably look more aggressively
into enabling Zebra to import MARC directly via the network interface. I
don't know what the issues are, but it must be doable.

That being said, I find it nearly incomprehensible that a mapping from
MARC to MARCXML or MARC8 to UTF-8 should be as demanding as these
numbers indicate.

--Seb


--
Sebastian Hammer, Index Data
quinn@...   www.indexdata.com
Ph: (603) 209-6853



Re: Re: Import Speed

by MJ Ray

Joshua Ferraro <jmf@...>
> It's a remote machine ... do you have suggestions for a utility that
> measures disc usage on the fly?

I think systat or iostat used to do that, but I've not checked today
or whether it gives the right numbers to measure this. Otherwise,
you could monitor /proc/diskstats if you find the format (is it in
kernel sources?)
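
In case it helps, here is a rough sketch of reading that format. The
column names follow the first 14 fields described in the kernel's
iostats documentation; the two sample lines are made up, and on a real
machine you would read /proc/diskstats twice, some seconds apart,
instead. The function and field names are my own.

```perl
use strict;
use warnings;

# Names for the first 14 columns of a /proc/diskstats line. Newer kernels
# append more columns, which this sketch simply ignores.
my @FIELDS = qw(
    major minor device
    reads_completed reads_merged sectors_read ms_reading
    writes_completed writes_merged sectors_written ms_writing
    ios_in_progress ms_doing_io weighted_ms_doing_io
);

# Parse diskstats text into a hash of per-device hashes.
sub parse_diskstats {
    my ($text) = @_;
    my %stats;
    for my $line (split /\n/, $text) {
        my @cols = split ' ', $line;
        next if @cols < @FIELDS;              # skip short or empty lines
        my %row;
        @row{@FIELDS} = @cols[ 0 .. $#FIELDS ];
        $stats{ $row{device} } = \%row;
    }
    return \%stats;
}

# Two made-up samples of the same device, taken one second apart.
my $before = parse_diskstats("   8       0 sda 1000 10 80000 500 2000 20 160000 900 0 700 1400\n");
my $after  = parse_diskstats("   8       0 sda 1005 10 80400 510 2300 25 184000 1500 3 1500 2100\n");

my $busy_ms = $after->{sda}{ms_doing_io}      - $before->{sda}{ms_doing_io};
my $writes  = $after->{sda}{writes_completed} - $before->{sda}{writes_completed};
printf "sda: %d writes, busy for %d ms of the last 1000 ms\n", $writes, $busy_ms;
```

The difference in the "time spent doing I/O" column between two samples
is what indicates how busy the disk was over the interval.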

Hope that helps,
--
MJ Ray - personal email, see http://mjr.towers.org.uk/email.html
Work: http://www.ttllp.co.uk/  irc.oftc.net/slef  Jabber/SIP ask


