Search indexes - magnolia 4.1.1

Hello,

I found an article saying that the Lucene engine corrupts search indexes if an invalid PDF file is uploaded to Magnolia's DMS:
http://wiki.magnolia-cms.com/display/WIKI/Corrupted+Search+Index

Does anyone know whether that problem is solved in 4.1.1? Can I force a search index rebuild from the Magnolia GUI or via startup options?

Right now my DMS module does not find any document by its contents; only the short description is searched. I assume I have to recreate the indexes from scratch somehow.

Regards,
Denis
|
|
Re: Search indexes - magnolia 4.1.1

Hi Denis,

there is nothing new about this issue. If you need to rebuild your Lucene indexes:

* stop your application server
* delete all ../repositories/magnolia/workspaces/*/index folders
* during the next startup of your server the indexes will be recreated

-
Best regards,

Zdenek Skodik
Magnolia International Ltd.

Magnolia® - Simple Open-Source Content Management

On Wed, 2009-10-14 at 08:50 -0400, Denis Demichev wrote:
> Can I enforce search index rebuild procedure from Magnolia GUI / Startup
> options somehow?

----------------------------------------------------------------
For list details see
http://www.magnolia-cms.com/home/community/mailing-lists.html
To unsubscribe, E-mail to: <user-list-unsubscribe@...>
----------------------------------------------------------------
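The delete step above can also be scripted. The following is only a minimal sketch, assuming the layout Zdenek describes (one directory per workspace, each with an `index` subfolder); the class name `IndexFolderCleaner` and its methods are hypothetical and are not part of Magnolia or Jackrabbit. It must only be run while the application server is stopped.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: removes <workspace>/index folders so they are rebuilt on the next startup. */
final class IndexFolderCleaner {

    /** Deletes the "index" directory of every workspace below workspacesRoot and returns the deleted paths. */
    static List<Path> deleteIndexFolders(Path workspacesRoot) throws IOException {
        List<Path> deleted = new ArrayList<>();
        try (DirectoryStream<Path> workspaces = Files.newDirectoryStream(workspacesRoot)) {
            for (Path workspace : workspaces) {
                Path index = workspace.resolve("index");
                if (Files.isDirectory(index)) {
                    deleteRecursively(index);
                    deleted.add(index);
                }
            }
        }
        return deleted;
    }

    private static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: IndexFolderCleaner <path to repositories/magnolia/workspaces>");
            return;
        }
        for (Path p : deleteIndexFolders(Paths.get(args[0]))) {
            System.out.println("deleted " + p);
        }
    }
}
```

Only the `index` folders are touched; the other contents of each workspace directory are left alone.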
|
|
Re: Search indexes - magnolia 4.1.1

Zdenek Skodik wrote:
> If you need to rebuild your Lucene indexes:
>
> * stop your application server
> * delete all ../repositories/magnolia/workspaces/*/index folders
> * during startup of your server the indexes will be recreated

Hi Zdenek,
in this case, I think that if the PDFs are stored at the DB level, every index rebuild will end with an exception, won't it?
Correct me if I am wrong...

Matteo
|
|
Re: Re: Search indexes - magnolia 4.1.1

Hi Matteo,

yep, in order to index PDF files you first need to parse them to extract the text that you want to index.

-
Best regards,

Zdenek Skodik
Magnolia International Ltd.

On Thu, 2009-10-15 at 13:12 +0200, Matteo Pelucco wrote:
> in this case, I think that if PDF are stored on DB level, each index
> rebuild phase will be end with exception, isn't it?
> Correct me if I am wrong...
|
|
Re: Re: Search indexes - magnolia 4.1.1

That is not what Matteo asked. As he correctly pointed out, the presence of a corrupted PDF will cause the indexing to fail, which in turn would cause Magnolia to fail at startup. You then have two options:

* Remove the PDF text extractor so that PDFs are not indexed. Once the indexes have been created, remove the corrupted PDF file and redo the indexing, this time with the PDF text extractor enabled.
* If you have EE, comment out the <SearchIndexer/> section in the workspace.xml of the affected workspace, use MagnoliaTools to remove the affected node, and then uncomment the section again.

Jan

On Thu, 2009-10-15 at 13:35 +0200, Zdenek Skodik wrote:
> yep, in order to index PDF files you need
> to first parse them to extract text that you
> want to index from them.
|
|
Re: Re: Search indexes - magnolia 4.1.1

Hello All,

Thank you for that update - I tried to delete the Lucene indexes in workspace_name/index and they were rebuilt.

As for the PDF extractor, it looks like it won't throw any PDF parser related exception. Here's an excerpt from org.apache.jackrabbit.extractor.PdfTextExtractor:

    } catch (Exception e) {
        // it may happen that PDFParser throws a runtime
        // exception when parsing certain pdf documents
        logger.warn("Failed to extract PDF text content", e);
        return new StringReader("");
    } finally {
        stream.close();
    }

That's what Jackrabbit 1.5 has at least. 1.6 is the same.

Does anyone know when exactly incoming documents are parsed? Right after upload or maybe during the activation procedure?

Thanks!

Regards,
Denis

On Thu, Oct 15, 2009 at 7:49 AM, Jan Haderka <jan.haderka@...> wrote:
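To make the consequence of that catch block concrete, here is a small self-contained sketch. It is not Jackrabbit code: `parse` is a hypothetical stand-in for the real PDF parser, and the class name is invented. It only illustrates that a parser failure ends up as empty text plus a log line, so the document is stored but cannot be found by its contents.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

/** Illustration only: the log-and-return-empty pattern from the excerpt above. */
final class SwallowingExtractorDemo {

    /** Hypothetical stand-in for a PDF parser: rejects input that does not start with "%PDF". */
    static String parse(InputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = in.read(head);
        if (n < 4 || head[0] != '%' || head[1] != 'P' || head[2] != 'D' || head[3] != 'F') {
            throw new IllegalStateException("not a PDF");
        }
        return "extracted text";
    }

    /** Same shape as the quoted catch block: any failure is logged and becomes empty text. */
    static Reader extractText(InputStream stream) throws IOException {
        try {
            return new StringReader(parse(stream));
        } catch (Exception e) {
            System.err.println("Failed to extract PDF text content: " + e);
            return new StringReader("");
        } finally {
            stream.close();
        }
    }

    /** Reads a Reader to the end and returns its contents. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

With this pattern the only trace of a broken document is the warning in the log, which is worth checking when PDFs are silently missing from search results.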
|
|
|
Re: Search indexes - magnolia 4.1.1

Denis Demichev wrote:
> Does anyone know when exactly incoming documents are parsed? Right after
> upload or maybe during activation procedure?

I think it is the PersistenceManager implementation class that decides when to index the storage, isn't it? I'm not sure, but it should certainly NOT happen at activation time; otherwise queries on the author instance would not return content that has not been activated yet. In any case, it is Jackrabbit that decides when to index, whereas activation is a Magnolia action.

M.
|
|
Re: Re: Search indexes - magnolia 4.1.1

Hello,

Matteo, thank you for the clarification.

Actually I'm asking because my Magnolia 4.1.1 doesn't return any search result even when I know that a specific term is inside a specific document (a .txt file uploaded to the DMS). It looks like my DMS is not indexed at all. However, I can search through website pages, i.e. the WEBSITE workspace is indexed, I guess.

I tried deleting the index folder in the default, DMS and website workspaces on both the Author and Public instances, but this didn't help.

Could STK affect this behavior of Magnolia, or did I just not configure it properly?

Regards,
Denis

On Thu, Oct 15, 2009 at 8:58 AM, Matteo Pelucco <matteo.pelucco@...> wrote:
|
|
|
Re: Search indexes - magnolia 4.1.1

Denis Demichev wrote:
> Actually I'm asking because my Magnolia 4.1.1 doesn't return any search
> result even if I know that some specific term is inside that specific
> document (.txt file uploaded to DMS).

This is probably caused by index corruption. Are you sure that the indexes exist and that they are not broken?

> It looks like my DMS is not indexed at all. However, I can search through
> website pages, i.e. WEBSITE workspace is indexed, I guess.

OK, so it is only related to the DMS. Stop Tomcat and delete all files inside the repo/magnolia/workspaces/dms/index folder. Leave the index folders of the other workspaces as they are; that will save you time. Then restart Tomcat and let Lucene scan your DB to reconstruct the indexes (you can monitor the file creation). Afterwards you should be able to use the query manager to successfully execute this query:

SELECT * FROM nt:base

(select nt:base as the result node type and DMS as the workspace).

> I tried to delete index folder in default, DMS, website workspaces on
> both Author and Public instances but this didn't help.

But when you restarted Tomcat, did you see any exception after the index rebuilding phase?

> Could STK affect this Magnolia's behavior or I just didn't configure it
> properly?

STK is a templating module. It is not directly related to JCR / Jackrabbit.

HTH,
matteo
|
|
Re: Re: Search indexes - magnolia 4.1.1

Hello Matteo,

Thank you for your quick response.

>> You should be able to use query manager and to successfully execute this query:
>> SELECT * FROM nt:base

I ran it against the DMS successfully: 244 nodes returned in 734 ms.

>> But when you restarted tomcat, did you see any exception after the indexes rebuilding phase?

No, I don't see any exception while Tomcat is booting.

After some time playing with search (activating, making website references, etc.) I am able to find .txt files and .doc (Word 97-03) files. Unfortunately, no luck with PDF. As the majority of the documents that STK puts into the DMS are PDFs, that could be the reason why I couldn't search documents. Still, I'm not sure when exactly Magnolia will index this or that document in the DMS.

Regards,
Denis

On Thu, Oct 15, 2009 at 10:31 AM, Matteo Pelucco <matteo.pelucco@...> wrote:
|
|
|
Re: Search indexes - magnolia 4.1.1

Denis Demichev wrote:
> Thank you for your quick response.

Magnolia gives me one T-shirt for each message I write. I have a shop by now :-)

> >> You should be able to use query manager and to successfully execute
> >> this query:
> >> SELECT * FROM nt:base
>
> I tried to run it against DMS successfully: 244 nodes returned in 734ms

OK, this is proof that the DMS is indexed. Now try to delete ..workspaces/dms/index/* from the filesystem. At the next startup you should see a message like 'loading DMS workspace' (if SearchIndexer is configured correctly for that workspace in workspace.xml) and the PDFs will be indexed again. I would like to force a re-index to be sure that no exception was thrown during an earlier index building phase.

> Unfortunately no luck with PDF.
> As STK has majority of PDF documents in
> DMS that could be the reason why I couldn't search documents.

Sorry, I missed something: how can you say that STK is related to PDF? STK, afaik, is a "framework" that helps to build pages and has nothing to do with JCR / Lucene indexes, does it? Or do you mean the new asset management shipped with Magnolia?

> Still I'm
> not sure when exactly Magnolia will index this or that document in DMS.

It should be at save time, but I'm not 100% sure.

Sorry, I don't have much experience with PDF indexing, but are you sure that your PDFs are indexable? You can try to wrap PDFIndexer and log something, but that is not a quick debugging option... :-(

matteo
|
|
Re: Re: Search indexes - magnolia 4.1.1

> > Still I'm not sure when exactly Magnolia will index this or that
> > document in DMS.
>
> It should be at save time, but I'm not 100% sure.

Oh, so there are things you actually do not know? Tss, we will have to revisit this one-t-shirt-per-mail policy *LOL*

The indexing happens after the content is saved into the repository. Whether it happens immediately or not depends on the SearchIndexer configuration in workspace.xml. By default Magnolia is configured to use 3 indexing threads:

<param name="extractorPoolSize" value="3" />

So the indexing happens asynchronously, shortly after the save (I believe once the volatileIdleTime period is over). There is also a timeout for the extractor in case it does not finish with a document in the given time, and a maximum backlog size that defines how many documents can be waiting for extraction. This is relatively well documented on the Jackrabbit website, where you can look up all the parameters and their meaning.

To finish off: if you set extractorPoolSize to 0, extraction happens on the main thread and the save operation does not finish until the extraction is done as well.

HTH,
Jan
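The difference between a pool size of 3 and a pool size of 0 can be illustrated with plain java.util.concurrent. This is a simplified model written for this thread, not Jackrabbit's implementation: the class `ExtractionScheduler` is hypothetical, and the extractor timeout and backlog limit that Jan mentions are deliberately left out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Simplified model of "extractorPoolSize": 0 means extract on the saving thread. */
final class ExtractionScheduler implements AutoCloseable {

    // null stands for pool size 0, i.e. no background threads at all
    private final ExecutorService pool;

    ExtractionScheduler(int extractorPoolSize) {
        this.pool = extractorPoolSize > 0
                ? Executors.newFixedThreadPool(extractorPoolSize)
                : null;
    }

    /** Schedules a text extraction job; with pool size 0 this call blocks until the job is done. */
    <T> Future<T> submit(Callable<T> job) {
        if (pool != null) {
            return pool.submit(job); // asynchronous: the caller returns immediately
        }
        CompletableFuture<T> done = new CompletableFuture<>();
        try {
            done.complete(job.call()); // synchronous: runs on the calling thread
        } catch (Exception e) {
            done.completeExceptionally(e);
        }
        return done;
    }

    @Override
    public void close() {
        if (pool != null) {
            pool.shutdown();
        }
    }
}
```

In this model, a slow extraction with pool size 0 delays the save itself, while with a pool the save returns early and the text only becomes searchable once the background job has finished.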
|
|
Re: Search indexes - magnolia 4.1.1

Jan Haderka wrote:
> Oh so there are things you actually do not know? Tss we will have to
> revisit this one t-shirt per mail policy *LOL*

OK, if you insist, 2 per mail are enough... ;-)

> The indexing happens after saving the content into repository.

As I thought, but there is so much to know!!!
Thanks for your explanation, Jan.

Matteo
|
|
Re: Re: Search indexes - magnolia 4.1.1

Hello All,

Matteo wrote:
>> Sorry, I missed something, how can you say that STK is related to PDF?

STK ships a bunch of sample files in the DMS, and the majority of them are PDFs. I still cannot index PDFs even if I delete the Lucene indexes.

However, while indexing an RTF file I get an exception:

java.lang.IllegalArgumentException: The document is really a RTF file
    at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:114)
    at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:49)
    at org.apache.jackrabbit.extractor.MsWordTextExtractor.extractText(MsWordTextExtractor.java:64)
    at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
    at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)

It looks like org.apache.jackrabbit.extractor.MsWordTextExtractor is chosen for text extraction instead of org.apache.jackrabbit.extractor.RTFTextExtractor. In other words, the wrong file type is picked up here, at line 402 of org.apache.jackrabbit.core.query.lucene.NodeIndexer:

    InternalValue typeValue = getValue(NameConstants.JCR_MIMETYPE);

Here's the implementation of getValue:

    /**
     * Utility method that extracts the first value of the named property
     * of the current node. Returns <code>null</code> if the property does
     * not exist or contains no values.
     *
     * @param name property name
     * @return value of the named property, or <code>null</code>
     * @throws ItemStateException if the property can not be accessed
     */
    protected InternalValue getValue(Name name) throws ItemStateException {
        try {
            PropertyId id = new PropertyId(node.getNodeId(), name);
            PropertyState property = (PropertyState) stateProvider.getItemState(id);
            InternalValue[] values = property.getValues();
            if (values.length > 0) {
                return values[0];
            } else {
                return null;
            }
        } catch (NoSuchItemStateException e) {
            return null;
        }
    }

So my assumption is that the JCR node holding the RTF file carries the wrong MIME type. I'm not sure how to check this MIME value in Magnolia though. It should be "application/rtf" or "text/rtf", but not "application/vnd.ms-word" or "application/msword".

I would really appreciate any help with PDF - I don't see any exception and thus cannot investigate what exactly went wrong.

Thank you!

Regards,
Denis

On Fri, Oct 16, 2009 at 2:40 AM, Matteo Pelucco <matteo.pelucco@...> wrote:
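One way to test the MIME type assumption outside the repository is to compare the declared type with the first bytes of the file. The sketch below is a standalone illustration with invented names (`ContentSniffer`, `isRtfMislabelled`) and uses no Magnolia or Jackrabbit API. It relies only on two well-known signatures: RTF files begin with `{\rtf`, and binary Word 97-2003 files begin with the OLE2 compound document header.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: detects an RTF file that was stored with a non-RTF MIME type. */
final class ContentSniffer {

    enum Kind { RTF, OLE2, UNKNOWN }

    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // Header of OLE2 compound documents, the container used by binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Classifies a file by its leading bytes. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, RTF_MAGIC)) {
            return Kind.RTF;
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        return Kind.UNKNOWN;
    }

    /** True if the bytes are RTF but the declared MIME type is neither application/rtf nor text/rtf. */
    static boolean isRtfMislabelled(String declaredMimeType, byte[] head) {
        boolean declaredAsRtf = "application/rtf".equals(declaredMimeType)
                || "text/rtf".equals(declaredMimeType);
        return sniff(head) == Kind.RTF && !declaredAsRtf;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An RTF file saved with a .doc extension would be flagged by `isRtfMislabelled("application/msword", bytes)`, which matches the "The document is really a RTF file" message in the stack trace.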
|
|
|
Re: Re: Search indexes - magnolia 4.1.1

Hello All,

Just after I sent this message I came across the following line:

WARN org.apache.jackrabbit.core.query.lucene.TextExtractorJob 16.10.2009 08:52:31 -- Exception while indexing binary property: java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider

This line appeared after a PDF file was added to the system. Unfortunately I don't have the full exception stack trace, as it was truncated. It looks like I'm missing some jar - probably the http://bouncycastle.org/ Crypto API.

Regards,
Denis

On Fri, Oct 16, 2009 at 8:50 AM, Denis Demichev <demichev@...> wrote:
> Hello All,
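A quick way to confirm a suspicion like this is to ask the class loader directly. The helper below is a minimal sketch with a hypothetical class name; the only name taken from the log line is `org.bouncycastle.jce.provider.BouncyCastleProvider`. Note that the answer depends on which class loader runs the check, so to be meaningful it would have to be executed inside the Magnolia web application rather than as a standalone program.

```java
/** Illustration only: reports whether a class can be loaded by this class's class loader. */
final class ClasspathCheck {

    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only look the class up, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.bouncycastle.jce.provider.BouncyCastleProvider";
        System.out.println(name + " available: " + isClassAvailable(name));
    }
}
```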