EmbeddedSolrServer marshalling and GC optimizations

by aklochkov

Hi all

We use a patched version of Solr in one of our projects, and I want to share
the optimizations we made.

We use EmbeddedSolrServer in a somewhat unusual way: we always retrieve the
whole result set instead of paging. I know this may be wrong, but it works for
our requirements. We have a small index (~50k documents) and we need to apply
complex post-processing logic to the whole result set. We noticed several
bottlenecks in Solr that made performance quite bad, so we used a profiler
and eliminated some of them. One by one:

1. EmbeddedSolrServer uses BinaryResponseWriter to marshal and unmarshal the
response through a byte array, converting Documents to SolrDocuments and
DocLists to SolrDocumentLists. There is an issue for this in the Solr JIRA
(SOLR-797); I'm attaching my patch to it.

2. The profiler showed that a lot of time is spent on GC because Solr
creates new SolrDocument instances for every document it retrieves, for
every query. We solved this by patching BinaryResponseWriter (and later
InplaceResponseBuilder, which the previous patch introduced) so that it uses
a custom SolrCache to cache SolrDocument instances.

3. We noticed that a lot of CPU cycles are spent copying values from one
Map to another (from Lucene Document to SolrDocument instances) when
creating new SolrDocument instances. So we created a class,
SolrDocumentWrapper, which doesn't use its own Map instance but works as a
wrapper around the given one, avoiding unnecessary memory usage and data
copying.
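
To make items 2 and 3 concrete, here is a minimal sketch of the two ideas.
It is not the actual patch and does not use any Solr classes: DocumentView
and DocumentViewCache are hypothetical names, and it assumes the stored
fields of a document are already available as a Map. The view wraps that Map
without copying it, and the cache reuses one view per internal document id
instead of allocating a new one for every query.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.IntFunction;

/** Hypothetical stand-in for a wrapper document: a read-only view, no copy. */
final class DocumentView extends AbstractMap<String, Object> {
    private final Map<String, Object> backing; // shared with the caller, never copied

    DocumentView(Map<String, Object> backing) {
        this.backing = backing;
    }

    // Delegate lookups directly so they do not go through entrySet().
    @Override public Object get(Object key) { return backing.get(key); }
    @Override public boolean containsKey(Object key) { return backing.containsKey(key); }
    @Override public int size() { return backing.size(); }

    @Override public Set<Map.Entry<String, Object>> entrySet() {
        // Read-only: cached views are shared between queries.
        return Collections.unmodifiableMap(backing).entrySet();
    }
}

/** Hypothetical LRU cache of views, keyed by internal document id. */
final class DocumentViewCache {
    private final Map<Integer, DocumentView> lru;
    private int loads; // number of cache misses, kept for illustration

    DocumentViewCache(final int maxSize) {
        // accessOrder = true makes iteration order least-recently-used first.
        this.lru = new LinkedHashMap<Integer, DocumentView>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, DocumentView> eldest) {
                return size() > maxSize;
            }
        };
    }

    /** Returns the cached view, or builds one from the loader on a miss. */
    DocumentView get(int docId, IntFunction<Map<String, Object>> loader) {
        DocumentView view = lru.get(docId);
        if (view == null) {
            loads++;
            view = new DocumentView(loader.apply(docId));
            lru.put(docId, view);
        }
        return view;
    }

    int loads() { return loads; }
}
```

The sketch is not thread-safe and ignores invalidation. Internal document
ids are only meaningful for one index searcher, so a real cache would have
to be discarded or re-populated whenever the searcher is reopened.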

These changes improved our performance considerably. We got rid of the load
on the GC and of the IO load created by reading the index.

What do you think, guys? Does it make sense to include all of this in
Solr?

--
Andrew Klochkov
Senior Software Engineer,
Grid Dynamics

Re: EmbeddedSolrServer marshalling and GC optimizations

by ryantxu

> 2. Profiler showed that lot of time is spent on GC due to the fact that Solr
> creates new SolrDocument instances for every document
> it retrieves for every query. We solved this by patching
> BinaryResponseWriter (and later InplaceResponseBuilder created by the
> previous patch) so it uses custom SolrCache to cache SolrDocument instances.

Would this work in a non-embedded mode?


> 3. We noticed that a lot of CPU cycles are spent on copying values from one
> Map to another (from Lucene Document to SolrDocument instances) when
> creating new SolrDocument instances. So we created class SolrDocumentWrapper
> which doesn't use own Map instance but works as a wrapper around the given
> one, avoiding unnecessary memory usage and data copying.
>
> These changes improved our performance very much. We got rid of load on GC,
> and IO load created by reading the index.
>
> What do you think, guys? Does it make sense to include all this stuff into
> Solr?

Sounds good. In the EmbeddedSolr design, I think we were mostly
thinking of the 'standard' use case, where only the first 20-100 results
are converted to SolrDocument. Any improvement that makes this work better
is welcome!

Do you want to create an issue for 2 & 3?  If the changes you have  
made generally improve EmbeddedSolrServer and do not hurt anything  
else, it would be great to get this into core...

thanks
ryan


Re: EmbeddedSolrServer marshalling and GC optimizations

by hossman


: > 2. Profiler showed that lot of time is spent on GC due to the fact that Solr
: > creates new SolrDocument instances for every document
: > it retrieves for every query. We solved this by patching
: > BinaryResponseWriter (and later InplaceResponseBuilder created by the
: > previous patch) so it uses custom SolrCache to cache SolrDocument instances.

: Would this work in a non-embedded mode?

I'm wondering if we should just change the documentCache to deal with
SolrDocument objects instead of Document objects.


-Hoss