problems with large Lucene index

Hello,

I am using Lucene via Hibernate Search, but the following problem also occurs in Luke. I'd appreciate any suggestions for solving it.

I have a Lucene index (27 GB in size) that indexes a database table of 286 million rows. While Lucene was able to build this index just fine (albeit very slowly), using the index has proved to be impossible. Any search conducted on it, either from my Hibernate Search query or by placing the query into Luke, gives:

    java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.index.MultiReader.norms(MultiReader.java:271)
        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
        at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:230)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
        ...

The queries are simple, of the form:

    (+value:church +marcField:245 +subField:a)

which in this example should only return a few thousand results.

The interpreter is already running with the maximum heap space allowed for the Java executable on Windows XP (java -Xms1200m -Xmx1200m).

The Lucene index was created using the following Hibernate Search annotations:

    @Column
    @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
    @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
    private Integer marcField;

    @Column (length = 2)
    @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
    @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
    private String subField;

    @Column(length = 2)
    @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
    @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
    private String indicator1;

    @Column(length = 2)
    @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
    @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
    private String indicator2;

    @Column(length = 10000)
    @Field(index=org.hibernate.search.annotations.Index.TOKENIZED, store=Store.NO)
    private String value;

    @Column
    @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
    @Field(index=org.hibernate.search.annotations.Index.NO_NORMS, store=Store.NO)
    private Integer recordId;

So all of the fields have NO_NORMS except for "value", which contains description text that needs to be tokenised.

Is there any way around this? Does Lucene really have such a low limit for how much data it can search (and I consider 286 million documents to be pretty small beer - we were hoping to index a table of over a billion rows)? Or is there something I'm missing?

Thanks.
|
|
Re: problems with large Lucene index

Lucene is trying to allocate the contiguous norms array for your index, which should be ~273 MB (= 286,000,000 bytes / 1024 / 1024, at one byte per document), when it hits the OOM.

Is your search sorting by field value? (That would also consume memory.) Or is it just the default (by relevance) sort?

The only other biggish consumer of memory should be the deleted docs, but that's a BitVector (one bit per document), so it should need ~34 MB RAM.

Can you run a memory profiler to see what else is consuming RAM?

Mike
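
The estimates in Mike's reply can be reproduced with a few lines of arithmetic. The following is only a sketch: the class and method names are made up for illustration, and it assumes one norm byte per document for each field that keeps norms and one bit per document for the deleted-docs bit vector, as described in the reply.

```java
// Back-of-the-envelope heap estimate for searching a large Lucene index.
// Assumptions (taken from the reply above, not measured): one norm byte per
// document per field indexed with norms, one bit per document for deletions.
final class IndexMemoryEstimate {

    static final double MIB = 1024.0 * 1024.0;

    /** Bytes for the norms arrays: one byte per document per field with norms. */
    static long normsBytes(long maxDoc, int fieldsWithNorms) {
        return maxDoc * fieldsWithNorms;
    }

    /** Bytes for a deleted-docs bit vector: one bit per document, rounded up. */
    static long deletedDocsBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    public static void main(String[] args) {
        long maxDoc = 286_000_000L; // row count reported in the original post
        int fieldsWithNorms = 1;    // only "value" is indexed with norms

        System.out.printf("norms:        %.1f MiB%n",
                normsBytes(maxDoc, fieldsWithNorms) / MIB);
        System.out.printf("deleted docs: %.1f MiB%n",
                deletedDocsBytes(maxDoc) / MIB);
    }
}
```

Both figures come out well under the 1200 MB heap, which is why the reply asks what else might be consuming memory.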
|
|
Re: problems with large Lucene index

Thanks Michael,

There is no sorting on the result (adding a sort causes OOM well before the point it runs out for the default).

There are no deleted docs - the index was created from a set of docs and no adds or deletes have taken place.

Memory isn't being consumed elsewhere in the system. It all comes down to the Lucene call via Hibernate Search. We decided to split our huge index into a set of several smaller indexes. Like the original single index, each smaller index has one field which is tokenized and the other fields have NO_NORMS set.

The following, explicitly specifying just one index, works fine:

    org.hibernate.search.FullTextQuery fullTextQuery =
        fullTextSession.createFullTextQuery( outerLuceneQuery, MarcText2.class );

But as soon as we start adding further indexes:

    org.hibernate.search.FullTextQuery fullTextQuery =
        fullTextSession.createFullTextQuery( outerLuceneQuery, MarcText2.class, MarcText8.class );

we start running into OOM.

In our case the MarcText2 index has a total disk size of 5 GB (with 57589069 documents / 75491779 terms) and MarcText8 has a total size of 6.46 GB (with 79339982 documents / 104943977 terms).

Adding all 8 indexes (the same as our original single index), either by explicitly naming them or just with:

    org.hibernate.search.FullTextQuery fullTextQuery =
        fullTextSession.createFullTextQuery( outerLuceneQuery );

results in it becoming completely unusable.

One thing I am not sure about is that in Luke it tells me for an index (neither of the indexes mentioned above) that was created with NO_NORMS set on all the fields:

    "Index functionality: lock-less, single norms, shared doc store, checksum, del count, omitTf"

Is this correct? I am not sure what it means by "single norms" - I would have expected it to say "no norms".

Any further ideas on where to go from here? Your estimate of what is loaded into memory suggests that we shouldn't really be anywhere near running out of memory with indexes of this size!

As I said in my OP, Luke also gets a heap error on searching our original single large index, which makes me wonder if it is a problem with the construction of the index.
|
|
Re: problems with large Lucene index

lucene@... wrote:
> One thing I am not sure about is that in Luke it tells me for an index
> (neither of the indexes mentioned above) that was created with NO_NORMS
> set on all the fields:
>
> "Index functionality: lock-less, single norms, shared doc store,
> checksum, del count, omitTf"
>
> Is this correct? I am not sure what it means by "single norms" - I
> would have expected it to say "no norms".

This is just expert-level information about the capabilities of the index format; it doesn't say anything about the actual flags on fields.

> Any further ideas on where to go from here? Your estimate of what is
> loaded into memory suggests that we shouldn't really be anywhere near
> running out of memory with these size indexes!
>
> As I said in my OP, Luke also gets a heap error on searching our
> original single large index which makes me wonder if it is a problem
> with the construction of the index.

In the open index dialog in Luke, set the "Custom term infos divisor" to a value higher than 1 and try to open the index again. If this still doesn't work, make a copy of the index and open the copy in Luke, but try the option "Don't open IndexReader", and then run CheckIndex from the menu. PLEASE DO THIS ON A COPY OF THE INDEX.

Oh, and of course you could start by increasing the heap size when running Luke, but I think that's obvious.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
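
To see why a larger term infos divisor reduces memory, it helps to count how many terms the reader holds in RAM. Lucene of this era keeps only every Nth term of the term dictionary in memory (the term index interval), and a divisor of D loads only every Dth of those entries. The sketch below is rough arithmetic only: the class name is hypothetical, and the interval of 128 is an assumption (Lucene's default), since the thread does not say what interval the index was written with.

```java
// Rough count of term-index entries held in memory by an index reader.
// Assumption: the index was written with a term index interval of 128
// (Lucene's default); the thread does not confirm this.
final class TermIndexEstimate {

    /** Ceiling division for non-negative values. */
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    /**
     * Approximate number of terms loaded into RAM: every indexInterval-th
     * term is written to the term index, and only every divisor-th of
     * those entries is loaded.
     */
    static long termsInMemory(long totalTerms, int indexInterval, int divisor) {
        return ceilDiv(ceilDiv(totalTerms, indexInterval), divisor);
    }

    public static void main(String[] args) {
        long terms = 104_943_977L; // term count reported for MarcText8
        int interval = 128;        // assumed default interval

        for (int divisor : new int[] {1, 2, 4, 8}) {
            System.out.println("divisor " + divisor + ": ~"
                    + termsInMemory(terms, interval, divisor) + " terms in memory");
        }
    }
}
```

The trade-off is slower term lookups, because the reader has to scan further through the on-disk term dictionary after each in-memory index hit.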
|
|
Re: problems with large Lucene index

At this point, I'd recommend running with a memory profiler, e.g. YourKit, and posting the resulting output.

With norms only on one field, no deletions, and no field sorting, I can't see why you're running out of memory.

If you take Hibernate out of the picture, and simply open an IndexSearcher on the underlying index, do you still hit OOM?

Can you post the output of CheckIndex? You can run it from the command line:

    java org.apache.lucene.index.CheckIndex <pathToIndex>

(Without -fix, CheckIndex will make no changes to the index, but it's best to do this on a copy of the index to be supremely safe.)

Mike
|
|
Re: problems with large Lucene index

Thanks for the advice.

I haven't got around to profiling the code. Instead, I took your advice and knocked Hibernate out of the equation with a small stand-alone program that calls Lucene directly. I then wrote a similar stand-alone program using Hibernate Search to do the same thing.

On a small index both work fine:

    E:\>java -Xmx1200M -classpath .;lucene-core.jar lucky
    hits = 29410

    E:\hibtest>java -Xmx1200m -classpath .;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar hibtest
    size = 29410

Trying it on our huge index works for the straight Lucene version:

    E:\>java -Xmx1200M -classpath .;lucene-core.jar lucky
    hits = 320500

but fails for the Hibernate version:

    E:\hibtest>java -Xmx1200m -classpath .;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar hibtest
    Exception in thread "main" java.lang.OutOfMemoryError
        at java.io.RandomAccessFile.readBytes(Native Method)
        at java.io.RandomAccessFile.read(Unknown Source)
        at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:596)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
        at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:92)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:907)
        at org.apache.lucene.index.MultiSegmentReader.norms(MultiSegmentReader.java:352)
        at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
        at org.apache.lucene.search.Searcher.search(Searcher.java:136)
        at org.hibernate.search.query.QueryHits.updateTopDocs(QueryHits.java:100)
        at org.hibernate.search.query.QueryHits.<init>(QueryHits.java:61)
        at org.hibernate.search.query.FullTextQueryImpl.getQueryHits(FullTextQueryImpl.java:354)
        at org.hibernate.search.query.FullTextQueryImpl.getResultSize(FullTextQueryImpl.java:741)
        at hibtest.main(hibtest.java:45)

    E:\hibtest>

I am not sure why this is occurring. Any ideas? I am calling IndexSearcher.search() and so is Hibernate. Is Hibernate Search telling Lucene to try and read the entire index into memory?

Code for the Lucene version is:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class lucky {
        public static void main(String[] args) {
            try {
                // Single-term query against the tokenized "value" field
                Term term = new Term("value", "church");
                Query query = new TermQuery(term);
                IndexSearcher searcher = new IndexSearcher("E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText");
                Hits hits = searcher.search(query);
                System.out.println("hits = " + hits.length());
                searcher.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

and for the Hibernate Search version:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.hibernate.Session;
    import org.hibernate.search.FullTextSession;
    import org.hibernate.search.Search;

    public class hibtest {
        public static void main(String[] args) {
            hibtest mgr = new hibtest();
            Session session = HibernateUtil.getSessionFactory().getCurrentSession();
            session.beginTransaction();
            FullTextSession fullTextSession = Search.getFullTextSession(session);
            // Same single-term query as the stand-alone Lucene version
            TermQuery luceneQuery = new TermQuery(new Term("value", "church"));
            org.hibernate.search.FullTextQuery fullTextQuery =
                fullTextSession.createFullTextQuery( luceneQuery, MarcText.class );
            long resultSize = fullTextQuery.getResultSize(); // this is line 45
            System.out.println("size = " + resultSize);
            session.getTransaction().commit();
            HibernateUtil.getSessionFactory().close();
        }
    }
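
Short of a full profiler run, one cheap way to compare the two programs is to log heap usage just before the search call in each. The helper below is only a sketch with hypothetical names; it uses the standard java.lang.Runtime counters, which give a coarse picture (they include garbage that has not yet been collected) and are no substitute for the profiler output requested earlier in the thread.

```java
// Coarse heap snapshot using java.lang.Runtime. The class name and the
// places where describe() would be called are illustrative assumptions.
final class HeapSnapshot {

    private static final long MIB = 1024L * 1024L;

    /** Heap currently in use, including garbage not yet collected. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Upper bound the JVM will try to use for the heap (roughly -Xmx). */
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** One-line summary suitable for printing before and after a search. */
    static String describe(String label) {
        return String.format("%s: used %d MiB of max %d MiB",
                label, usedBytes() / MIB, maxBytes() / MIB);
    }

    public static void main(String[] args) {
        System.out.println(describe("start"));

        // Stand-in for a large allocation such as a norms array.
        byte[] standIn = new byte[64 * 1024 * 1024];
        System.out.println(describe("after allocating " + standIn.length / MIB + " MiB"));
    }
}
```

Calling `HeapSnapshot.describe(...)` immediately before `searcher.search(query)` in the Lucene program and before `fullTextQuery.getResultSize()` in the Hibernate program would show how much of the 1200 MB heap is already taken when the norms array is requested.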
Re: problems with large Lucene index

Unfortunately, I'm not familiar with exactly what Hibernate Search does with the Lucene APIs. It must be doing something beyond what your standalone Lucene test case does.

Maybe ask this question on the Hibernate list?

Mike
Re: problems with large Lucene index

I had a look at the exception and see that Hibernate is using a function called updateTopDocs to run the search. This function uses a filter, which may add to the memory requirements. Could you also create a filter in the standalone Lucene version? Something like this before the search:

QueryFilter queryfilt = new QueryFilter(query);

and then pass the filter to the search.

However, with such a huge index I think you must find a way to get more memory for the Java heap.

Hibernate QueryHits:
http://viewvc.jboss.org/cgi-bin/viewvc.cgi/hibernate/search/trunk/src/java/org/hibernate/search/query/QueryHits.java?view=markup&pathrev=15603

--
Jokin
Re: problems with large Lucene index

Thanks Mike and Jokin for your comments on the memory problem. I have submitted the query to the Hibernate Search list, although I haven't seen a response yet.

In the meantime I did my own investigating in the code (I'd rather have avoided this!). I'm seeing results that don't make any sense, and maybe someone here with more experience of Lucene and of the way the JVM allocates memory can shed light on what are, to me, quite illogical observations.

As you may recall, I had a stand-alone Lucene search and a Hibernate Search version. Looking in the HS code did not shed any light on the issue. I took my stand-alone Lucene code, put it in a method, and replaced the search in the HS class (the constructor of QueryHits.java) with a call to my method. Bear in mind this method is the same code as posted in my earlier message: it sets up the Lucene search from scratch (i.e. no data structures created by HS were used). So, effectively, I was calling my stand-alone code after any setup done by Hibernate and any memory it may have allocated (which turned out to be a few MB).

I get OOM! Printing the free memory at this point shows bags of memory left. Indeed, the same free memory (+/- a few MB) as the stand-alone Lucene version!

I then instrumented the Lucene method where the OOM is occurring (FSDirectory.readInternal()). I cannot understand the results I am seeing. Below is a snippet of the output of each, with the code around FSDirectory line 598 as follows:

    ...
    do {
      long tot = Runtime.getRuntime().totalMemory();
      long free = Runtime.getRuntime().freeMemory();
      System.out.println("LUCENE: offset="+offset+" total="+total+" len-total="+(len-total)+" free mem="+free+" used ="+(tot-free));
      int i = file.read(b, offset+total, len-total);
    ...

The stand-alone version:

...
LUCENE: offset=0 total=0 len-total=401 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=883 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
LUCENE: offset=0 total=0 len-total=209000000 free mem=631122912 used =617534496
LUCENE: offset=209000000 total=0 len-total=20900000 free mem=631122912 used =617534496
LUCENE: offset=229900000 total=0 len-total=20900000 free mem=631122912 used =617534496
LUCENE: offset=250800000 total=0 len-total=20900000 free mem=631122912 used =617534496
...
completes successfully!

The method called via Hibernate Search:

...
LUCENE: offset=0 total=0 len-total=401 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=883 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
LUCENE: offset=0 total=0 len-total=209000000 free mem=636731528 used =622346104
Exception in thread "main" java.lang.OutOfMemoryError
        at java.io.RandomAccessFile.readBytes(Native Method)
        at java.io.RandomAccessFile.read(Unknown Source)
        at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:599)
...
fails with exception!

Note that the HS version has slightly more free memory because I ran it with -Xms1210M, as opposed to -Xms1200M for the stand-alone version, to offset any memory used by HS when it starts up.

As you can see, these are identical for all practical purposes. So what gives?

I'm stumped, so any suggestions appreciated.

Thanks.
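For reference, the instrumented loop above passes the whole remaining length (209000000 bytes in the failing run) to a single file.read() call. The following is a minimal sketch of bounding each read to a fixed chunk size instead. It rests on an assumption that is not confirmed anywhere in this thread: that the failure depends on the size of one native read rather than on Java heap usage (the exception is raised inside the native RandomAccessFile.readBytes and carries no "Java heap space" message). The class name ChunkedRead, the method readFully and the 16 MB CHUNK_SIZE are hypothetical and are not part of Lucene.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

public class ChunkedRead {
    // Hypothetical upper bound for a single read call; the value is an assumption.
    static final int CHUNK_SIZE = 16 * 1024 * 1024;

    /**
     * Fills b[offset .. offset+len) from the current file position,
     * asking the file for at most chunkSize bytes per read call.
     */
    static void readFully(RandomAccessFile file, byte[] b, int offset, int len, int chunkSize)
            throws IOException {
        int total = 0;
        while (total < len) {
            int toRead = Math.min(chunkSize, len - total);
            int n = file.read(b, offset + total, toRead);
            if (n == -1) {
                throw new IOException("read past EOF");
            }
            total += n;
        }
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a small temporary file to show the loop in use.
        File tmp = File.createTempFile("chunked", ".bin");
        tmp.deleteOnExit();
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        try (RandomAccessFile out = new RandomAccessFile(tmp, "rw")) {
            out.write(data);
        }
        byte[] back = new byte[data.length];
        try (RandomAccessFile in = new RandomAccessFile(tmp, "r")) {
            readFully(in, back, 0, back.length, 64); // tiny chunk size for the demo
        }
        System.out.println("round trip ok: " + Arrays.equals(data, back));
    }
}
```

Note that Runtime.freeMemory() reports only the Java heap, so the "free mem" column in the output above would not reflect pressure on memory allocated outside the heap.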
Re: problems with large Lucene indextry running with verbose gc. That will give you more details about what is
happening.

Even better, run with jconsole on the side so that you get really detailed information on memory pools.

On Thu, Mar 12, 2009 at 7:30 AM, <lucene@...> wrote:

> Thanks Mike and Jokin for your comments on the memory problem. I have submitted the query to the Hibernate Search list although I haven't seen a response yet.
>
> In the meantime I did my own investigating in the code (I'd rather have avoided this!). I'm seeing results that don't make any sense, and maybe someone here with more experience of Lucene and the way memory is allocated by the JVM may shed light on what, to me, are quite illogical observations.
>
> As you may recall I had a stand-alone Lucene search and a Hibernate Search version. Looking in the HS code did not shed any light on the issue. I took my stand-alone Lucene code, put it in a method, and replaced the search in the HS class (the constructor of QueryHits.java) with the call to my method. Bear in mind this method is the same code as posted in my earlier message - it sets up the Lucene search from scratch (i.e., no data structures created by HS were used). So, effectively I was calling my stand-alone code after any setup done by Hibernate and any memory it may have allocated (which turned out to be a few Mb).
>
> I get OOM! Printing the free memory at this point shows bags of memory left. Indeed, the same free memory (+/- a few Mb) as the stand-alone Lucene version!
>
> I then instrumented the Lucene method where the OOM is occurring (FSDirectory.readInternal()). I cannot understand the results I am seeing. Below is a snippet of the output of each, with the code around FSDirectory line 598 as follows:
>
>   ...
>   do {
>     long tot = Runtime.getRuntime().totalMemory();
>     long free = Runtime.getRuntime().freeMemory();
>     System.out.println("LUCENE: offset=" + offset + " total=" + total
>         + " len-total=" + (len - total) + " free mem=" + free + " used =" + (tot - free));
>     int i = file.read(b, offset + total, len - total);
>   ...
>
> The stand-alone version:
>
>   ...
>   LUCENE: offset=0 total=0 len-total=401 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=883 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>   LUCENE: offset=0 total=0 len-total=209000000 free mem=631122912 used =617534496
>   LUCENE: offset=209000000 total=0 len-total=20900000 free mem=631122912 used =617534496
>   LUCENE: offset=229900000 total=0 len-total=20900000 free mem=631122912 used =617534496
>   LUCENE: offset=250800000 total=0 len-total=20900000 free mem=631122912 used =617534496
>   ...
>   completes successfully!
>
> The method called via Hibernate Search:
>
>   ...
>   LUCENE: offset=0 total=0 len-total=401 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=883 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>   LUCENE: offset=0 total=0 len-total=209000000 free mem=636731528 used =622346104
>   Exception in thread "main" java.lang.OutOfMemoryError
>     at java.io.RandomAccessFile.readBytes(Native Method)
>     at java.io.RandomAccessFile.read(Unknown Source)
>     at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:599)
>   ...
>   fails with exception!
>
> Note that the HS version has slightly more free memory because I ran it with -Xms1210M as opposed to -Xms1200M for the stand-alone, to offset any memory used by HS when it starts up.
>
> As you can see, these are identical for all practical purposes. So what gives?
>
> I'm stumped, so any suggestions appreciated.
>
> Thanks.
>
> Quoting Michael McCandless <lucene@...>:
>
>> Unfortunately, I'm not familiar with exactly what Hibernate search does with the Lucene APIs.
>>
>> It must be doing something beyond what your standalone Lucene test case does.
>>
>> Maybe ask this question on the Hibernate list?
>>
>> Mike

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)
|
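For anyone who wants the memory-pool view suggested above without attaching jconsole, here is a minimal sketch that prints similar figures from inside the program using the standard `java.lang.management` API. The class name `MemoryPoolReport` and the `report()` helper are made up for illustration and are not part of Lucene or Hibernate Search. Note that these figures only cover pools the JVM manages; memory requested directly by native code would presumably not show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not from the thread): dumps heap and memory-pool usage.
final class MemoryPoolReport {

    // Builds a plain-text report of overall heap figures and each JVM memory pool.
    static String report() {
        StringBuilder sb = new StringBuilder();
        Runtime rt = Runtime.getRuntime();
        // Same figures as the Runtime-based instrumentation quoted in the thread.
        sb.append("heap total=").append(rt.totalMemory())
          .append(" free=").append(rt.freeMemory())
          .append(" max=").append(rt.maxMemory())
          .append('\n');
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            sb.append(pool.getName())
              .append(" (").append(pool.getType()).append(')')
              .append(" used=").append(usage.getUsed())
              .append(" committed=").append(usage.getCommitted())
              .append(" max=").append(usage.getMax()) // -1 means undefined
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Calling `MemoryPoolReport.report()` just before the large read would give a per-pool breakdown to compare between the stand-alone and Hibernate Search runs. Running the JVM with `-verbose:gc` additionally logs each collection.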
|
|
Re: problems with large Lucene index (reason found)

I now know the cause of the problem. Increasing heap space actually breaks Lucene when reading large indexes.

Details on why can be found here: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546

Lucene is trying to read a huge block (about 2580Mb) at the point of failure. It allocates the required bytes in MultiSegmentReader.norms(), line 334, just fine; only when it attempts to use this array in a call to RandomAccessFile.readBytes() does it get OOM. This is caused by a bug in the native code for Java IO.

As observed in the bug report, a large heap actually causes the bug to appear. When I reduced my heap from 1200M to 1000M the exception was never generated, the code completed correctly, and it reported the correct number of search hits in the Hibernate Search version of my program.

This isn't good - I need as much memory as possible because I intend to run my search as a web service.

The work-around would be to read the file in small chunks, but I am not familiar with the Lucene code so I am unsure how that would be done in a global sense (i.e., does it really need to allocate a buffer of that size in MultiSegmentReader?).

The obvious solution (which I haven't tried yet) would be to patch the point in FSDirectory where the Java IO read occurs - looping with a small buffer for the read and then concatenating the result back into Lucene's byte array.

Thanks for the comments on this problem from people on this list.
|
|
|
Re: problems with large Lucene index (correction!)

Quoting lucene@...:

> Lucene is trying to read a huge block (about 2580Mb) at the point of

Typo! Please read "280Mb" not "2580Mb".
|
|
|
Re: problems with large Lucene index (reason found)

Gak, what a horrible bug!

It seems widespread (JRE 1.5, 1.6, on Linux & Windows OSs). And it's been open for almost 2.5 years. I just added a comment & voted for the bug.

Does it also occur on a 64 bit JRE?

If you still allocate the full array, but read several smaller chunks into it, do you still hit the bug?

Mike
|
|
|
Re: problems with large Lucene index (reason found)

Yes, I overrode the read() method in FSDirectory.FSIndexInput.Descriptor and forced it to read in 50Mb chunks and do an arraycopy() into the array created by Lucene. It now works with any heap size and doesn't get OOM.

There may be other places in the Lucene code where this could happen. At present it seems to be working fine for me on our largest (17Gb) index, but I haven't tried accessing data yet, only getting the result size, so perhaps there are other calls to read() with large buffer sizes.

As this bug does not look like it will be fixed in the near future, it might be an idea to put a fix in place in the Lucene code. I think it would be safe to read in chunks of up to 100Mb without a problem, and I don't think it will affect performance to any great degree.

It's pleasing to see that Lucene can easily handle such huge indexes, although this bug is obviously quite an impediment to doing so.

regards,
Cameron Newham
|
|
|
Re: problems with large Lucene index (reason found)

lucene@... wrote:

> Yes, I overrode the read() method in FSDirectory.FSIndexInput.Descriptor and forced it to read in 50Mb chunks and do an arraycopy() into the array created by Lucene. It now works with any heap size and doesn't get OOM.

You shouldn't need to do the extra arraycopy? RandomAccessFile can read into a particular offset/len inside the array. Does that not work?

> There may be other areas this could happen in the Lucene code (although at present it seems to be working fine for me on our largest, 17Gb, index but I haven't tried accessing data yet - only getting the result size - so perhaps there are other calls to read() with large buffer sizes).
>
> As this bug does not look like it will be fixed in the near future, it might be an idea to put in place a fix in the Lucene code. I think it would be safe to read in chunks of up to 100Mb without a problem and I don't think it will affect performance to any great degree.

I agree. Can you open a Jira issue and post a patch?

> It's pleasing to see that Lucene can easily handle such huge indexes, although this bug is obviously quite an impediment to doing so.

Yes indeed. This is one crazy bug.

Mike
|
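A minimal sketch of the chunked read discussed in this message, written against plain `java.io.RandomAccessFile` rather than Lucene's own classes. It is not the patch from the thread: the class name `ChunkedRead`, the method `readChunked`, and the `CHUNK_SIZE` constant are made up for illustration, and the 50Mb figure is simply the size mentioned above. It reads straight into the caller's array at the right offset, so no temporary buffer or arraycopy() is needed.

```java
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Hypothetical helper (not Lucene code): fills part of a large array using
// several bounded read() calls instead of one huge one.
final class ChunkedRead {

    // Assumed per-call cap; the thread mentions 50Mb as a size that worked.
    static final int CHUNK_SIZE = 50 * 1024 * 1024;

    // Reads exactly len bytes into b starting at offset, at most chunkSize per call.
    static void readChunked(RandomAccessFile file, byte[] b, int offset, int len, int chunkSize)
            throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int total = 0;
        while (total < len) {
            int toRead = Math.min(chunkSize, len - total);
            // read() may return fewer bytes than requested, so advance by the actual count.
            int n = file.read(b, offset + total, toRead);
            if (n == -1) {
                throw new EOFException("end of file after " + total + " of " + len + " bytes");
            }
            total += n;
        }
    }

    public static void main(String[] args) throws IOException {
        // Small self-check: write 1000 bytes, read them back in 64-byte chunks.
        File tmp = File.createTempFile("chunked", ".bin");
        tmp.deleteOnExit();
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
            raf.write(data);
            raf.seek(0);
            byte[] out = new byte[data.length];
            readChunked(raf, out, 0, out.length, 64);
            System.out.println("round trip ok: " + Arrays.equals(data, out));
        }
    }
}
```

In a real patch the loop would sit wherever the underlying `file.read(b, offset + total, len - total)` call is made, with `CHUNK_SIZE` passed as the cap. Whether any particular chunk size avoids the JRE bug on a given platform is something to verify by testing.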
|
|
Re: problems with large Lucene index (reason found)I've opened: https://issues.apache.org/jira/browse/LUCENE-1566 for this. Cameron, could you attach your patch to that issue? Thanks. Mike lucene@... wrote: > > Yes, I overrode the read() method in > FSDirectory.FSIndexInput.Descriptor and forced it to read in 50Mb > chunks and do an arraycopy() into the array created by Lucene. It > now works with any heap size and doesn't get OOM. > > There may be other areas this could happen in the Lucene code > (although at present it seems to be working fine for me on our > largest, 17Gb, index but I haven't tried accessing data yet - only > getting the result size - so perhaps there are other calls to read() > with large buffer sizes). > > As this bug does not look like it will be fixed in the near future, > it might be an idea to put in place a fix in the Lucene code. I > think it would be safe to read in chunks of up to 100Mb without a > problem and I don't think it will affect performance to any great > degree. > > It's pleasing to see that Lucene can easily handle such huge > indexes, although this bug is obviously quite an impediment to doing > so. > > regards, > Cameron Newham > > Quoting Michael McCandless <lucene@...>: > >> >> Gak, what a horrible bug! >> >> It seems widespread (JRE 1.5, 1.6, on Linux & Windows OSs). And it's >> been open for almost 2.5 years. I just added a comment & voted for >> the >> bug. >> >> Does it also occur on a 64 bit JRE? >> >> If you still allocate the full array, but read several smaller chunks >> into it, do you still hit the bug? >> >> Mike >> >> lucene@... wrote: >> >>> >>> I now know the cause of the problem. Increasing heap space >>> actually breaks Lucene when reading large indexes. >>> >>> Details on why can be found here: >>> >>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6478546 >>> >>> >>> Lucene is trying to read a huge block (about 2580Mb) at the point >>> of failure. 
While it allocates the required bytes in method >>> MultiSegmentReader.norms(), line 334, just fine, it is only when >>> it attempts to use this array in a call to >>> RandomAccessFile.readBytes() that it gets OOM. This is caused by >>> a bug in the native code for the Java IO. >>> >>> As observed in the bug report, large heap space actually causes >>> the bug to appear. When I reduced my heap from 1200M to 1000M >>> the exception was never generated and the code completed >>> correctly and it reported the correct number of search hits in >>> the Hibernate Search version of my program. >>> >>> This isn't good - I need as much memory as possible because I >>> intend to run my search as a web service. >>> >>> The work-around would be to read the file in small chunks, but I >>> am not familiar with the Lucene code so I am unsure how that >>> would be done in a global sense (i.e.: does it really need to >>> allocate a buffer of that size in MultiSegmentReader?) >>> >>> The obvious solution (which I haven't tried yet) would be to >>> patch the point in FSDirectory where the java IO read occurs - >>> looping with a small buffer for the read and then concatenating >>> the result back into Lucene's byte array. >>> >>> Thanks for the comments on this problem from people on this list. >>> >>> >>> >>> Quoting Ted Dunning <ted.dunning@...>: >>> >>>> try running with verbose gc. That will give you more details >>>> about what is >>>> happening. >>>> >>>> Even better, run with jconsole on the side so that you get really >>>> detailed >>>> information on memory pools. >>>> >>>> On Thu, Mar 12, 2009 at 7:30 AM, <lucene@...> wrote: >>>> >>>>> Thanks Mike and Jokin for your comments on the memory problem. I >>>>> have >>>>> submitted the query to the Hibernate Search list although I >>>>> haven't seen a >>>>> response yet. >>>>> >>>>> In the meantime I did my own investigating in the code (I'd >>>>> rather have >>>>> avoided this!). 
I'm seeing results that don't make any sense and >>>>> maybe >>>>> someone here with more experience of Lucene and the way memory >>>>> is allocated >>>>> by the JVM may shed light on, what to me, are quite illogical >>>>> observations. >>>>> >>>>> As you may recall I had a stand-alone Lucene search and a >>>>> Hibernate Search >>>>> version. Looking in the HS code did not shed any light on the >>>>> issue. I took >>>>> my stand-alone Lucene code and put it in a method and replaced >>>>> the search in >>>>> the HS class (the constructor of QueryHits.java) with the call >>>>> to my method. >>>>> Bear in mind this method is the same code as posted in my >>>>> earlier message - >>>>> it sets up the Lucene search from scratch (i.e.: no data >>>>> structures created >>>>> by HS were used). So, effectively I was calling my stand-alone >>>>> code after >>>>> any setup done by Hibernate and any memory it may have allocated >>>>> (which >>>>> turned out to be a few Mb). >>>>> >>>>> I get OOM! Printing the free memory at this point shows bags of >>>>> memory >>>>> left. Indeed, the same free memory (+/- a few Mb) as the stand- >>>>> alone Lucene >>>>> version! >>>>> >>>>> I then instrumented the Lucene method where the OOM is occuring >>>>> (FSDirectory.readInternal()). I cannot understand the results >>>>> I am seeing. >>>>> Below is a snippet of the output of each with the code around >>>>> FSDirectory >>>>> line 598 as follows: >>>>> >>>>> ... >>>>> do { >>>>> long tot = Runtime.getRuntime().totalMemory(); >>>>> long free =Runtime.getRuntime().freeMemory(); >>>>> System.out.println("LUCENE: offset="+offset+" total="+total+" >>>>> len-total="+(len-total)+" free mem="+free+" used ="+(tot- >>>>> free)); >>>>> int i = file.read(b, offset+total, len-total); >>>>> ... >>>>> >>>>> >>>>> >>>>> The stand-alone version: >>>>> >>>>> ... 
>>>>> LUCENE: offset=0 total=0 len-total=401 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=883 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=918576864 used =330080544
>>>>> LUCENE: offset=0 total=0 len-total=209000000 free mem=631122912 used =617534496
>>>>> LUCENE: offset=209000000 total=0 len-total=20900000 free mem=631122912 used =617534496
>>>>> LUCENE: offset=229900000 total=0 len-total=20900000 free mem=631122912 used =617534496
>>>>> LUCENE: offset=250800000 total=0 len-total=20900000 free mem=631122912 used =617534496
>>>>> ...
>>>>> completes successfully!
>>>>>
>>>>> The method called via Hibernate Search:
>>>>>
>>>>> ...
>>>>> LUCENE: offset=0 total=0 len-total=401 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=883 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=1024 free mem=924185480 used =334892152
>>>>> LUCENE: offset=0 total=0 len-total=209000000 free mem=636731528 used =622346104
>>>>> Exception in thread "main" java.lang.OutOfMemoryError
>>>>>   at java.io.RandomAccessFile.readBytes(Native Method)
>>>>>   at java.io.RandomAccessFile.read(Unknown Source)
>>>>>   at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:599)
>>>>> ... fails with exception!
>>>>>
>>>>> Note that the HS version has slightly more free memory because I
>>>>> ran it with -Xms1210M as opposed to -Xms1200M for the stand-alone
>>>>> to offset any memory used by HS when it starts up.
>>>>>
>>>>> As you can see, these are identical for all practical purposes.
>>>>> So what gives?
>>>>>
>>>>> I'm stumped, so any suggestions appreciated.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Quoting Michael McCandless <lucene@...>:
>>>>>
>>>>>> Unfortunately, I'm not familiar with exactly what Hibernate
>>>>>> search does with the Lucene APIs.
>>>>>>
>>>>>> It must be doing something beyond what your standalone Lucene
>>>>>> test case does.
>>>>>>
>>>>>> Maybe ask this question on the Hibernate list?
>>>>>>
>>>>>> Mike
>>>>
>>>> --
>>>> Ted Dunning, CTO
>>>> DeepDyve
>>>>
>>>> 111 West Evelyn Ave. Ste. 202
>>>> Sunnyvale, CA 94086
>>>> www.deepdyve.com
>>>> 408-773-0110 ext. 738
>>>> 858-414-0013 (m)
>>>> 408-773-0220 (fax)
|
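The work-around proposed in the post above (looping with a small buffer instead of issuing one huge read) can be sketched as follows. This is only an illustration under stated assumptions, not the patch later requested for LUCENE-1566 and not Lucene's actual code: the class name ChunkedReadSketch, the method readFully, and the 1 MB CHUNK_SIZE are all hypothetical. The loop reads straight into the destination array at increasing offsets, so no separate concatenation step is needed.

```java
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.util.Arrays;

final class ChunkedReadSketch {
    // Hypothetical cap on a single read request; the value is an assumption,
    // not something taken from the thread or from Lucene.
    static final int CHUNK_SIZE = 1 << 20;

    // Fills b[offset .. offset+len) from the file's current position,
    // never asking RandomAccessFile.read for more than chunkSize bytes at once.
    static void readFully(RandomAccessFile file, byte[] b, int offset, int len, int chunkSize)
            throws IOException {
        int total = 0;
        while (total < len) {
            int toRead = Math.min(chunkSize, len - total);
            int n = file.read(b, offset + total, toRead);
            if (n == -1) {
                throw new EOFException("unexpected end of file after " + total + " bytes");
            }
            total += n;
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a temporary file, then read it back in chunks and compare.
        File tmp = File.createTempFile("chunked-read", ".bin");
        tmp.deleteOnExit();
        byte[] expected = new byte[5_000_000];
        Arrays.fill(expected, (byte) 7);
        Files.write(tmp.toPath(), expected);

        byte[] actual = new byte[expected.length];
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
            readFully(raf, actual, 0, actual.length, CHUNK_SIZE);
        }
        System.out.println("contents match: " + Arrays.equals(expected, actual));
    }
}
```

The destination array is still allocated at full size; only the size of each individual read request changes, which is the part the poster identifies as triggering the OutOfMemoryError.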
|
|
Re: problems with large Lucene index (reason found)
Yes patch would be very welcome.
TY
|
|
|
Re: problems with large Lucene index (reason found)
+1 for this patch.
Is Cameron still out there?

On Tue, Jun 16, 2009 at 1:47 PM, agatone <zoran.zoki@...> wrote:
>
> Yes patch would be very welcome.
> TY
>
> Michael McCandless-2 wrote:
> >
> > I've opened:
> >
> > https://issues.apache.org/jira/browse/LUCENE-1566
> >
> > for this. Cameron, could you attach your patch to that issue? Thanks.
> >
> > Mike
> >
> > ...
>
> --
> View this message in context:
> http://www.nabble.com/problems-with-large-Lucene-index-tp22347854p24062530.html
> Sent from the Lucene - General mailing list archive at Nabble.com.

--
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
|
|
|
Re: problems with large Lucene index (reason found)
Ted (or anyone), do you want to work out a patch?
I wonder if NIOFSDir (which uses java.nio.*'s positional reads), or, MMapDir are affected by this sun bug...

Mike

On Tue, Jun 16, 2009 at 5:31 PM, Ted Dunning<ted.dunning@...> wrote:
> +1 for this patch.
>
> Is Cameron still out there?
>
> On Tue, Jun 16, 2009 at 1:47 PM, agatone <zoran.zoki@...> wrote:
>
>> Yes patch would be very welcome.
>> TY
>>
>> Michael McCandless-2 wrote:
>> >
>> > I've opened:
>> >
>> > https://issues.apache.org/jira/browse/LUCENE-1566
>> >
>> > for this. Cameron, could you attach your patch to that issue? Thanks.
>> >
>> > Mike
>> >
>> > ...
>>
>> --
>> View this message in context:
>> http://www.nabble.com/problems-with-large-Lucene-index-tp22347854p24062530.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
|
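For readers unfamiliar with the "positional reads" mentioned above, here is a rough sketch of what a java.nio positional read looks like using FileChannel.read(ByteBuffer, long). It does not answer whether this code path is affected by the same bug (the thread leaves that open), and it is not taken from NIOFSDirectory; the class name PositionalReadSketch, the method readAt, and the chunk size are hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class PositionalReadSketch {
    // Fills dest with bytes starting at the given absolute file position.
    // Each read passes an explicit position, so the channel's own
    // position is never used or modified.
    static void readAt(FileChannel channel, long position, byte[] dest, int chunkSize)
            throws IOException {
        int total = 0;
        while (total < dest.length) {
            int toRead = Math.min(chunkSize, dest.length - total);
            // Wrap only the slice of dest that this iteration should fill.
            ByteBuffer slice = ByteBuffer.wrap(dest, total, toRead);
            int n = channel.read(slice, position + total);
            if (n == -1) {
                throw new EOFException("unexpected end of file after " + total + " bytes");
            }
            total += n;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("positional-read", ".bin");
        tmp.toFile().deleteOnExit();
        byte[] data = new byte[1_000_000];
        Arrays.fill(data, (byte) 3);
        Files.write(tmp, data);

        byte[] dest = new byte[200_000];
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            // 64 KB per request is an arbitrary, hypothetical chunk size.
            readAt(ch, 100_000L, dest, 64 * 1024);
        }
        System.out.println("first byte read: " + dest[0]);
    }
}
```

Because the position is an argument to each call, several threads can read from the same channel without seeking.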
|
|
Re: problems with large Lucene index (reason found)
I would love to, but I am totally swamped right now.
Whether NIOFSDir would help is a very interesting question.

On Tue, Jun 23, 2009 at 1:51 PM, Michael McCandless <lucene@...> wrote:
> Ted (or anyone), do you want to work out a patch?
>
> I wonder if NIOFSDir (which uses java.nio.*'s positional reads), or,
> MMapDir are affected by this sun bug...
>
|