Tika trouble


Tika trouble

by Markus Jelsma - Buyways B.V.

List,


I somehow fail to index certain PDF files using the
ExtractingRequestHandler in Solr 1.4 with the default solrconfig.xml but
a modified schema. I have a very simple schema for this case using only
an ID field, a timestamp field and two dynamic fields, ignored_* and
attr_*, both indexed, stored and multivalued strings. They are
multivalued simply because some HTML files fail when storing multiple
hyperlinks.

I have posted multiple files to
http://.../update/extract?literal.id=doc1 including:
1. the whitepaper at
http://www.lucidimagination.com/whitepaper/whats-new-in-lucene-2-9?sc=AP
2. the HTML file of the front page of http://nu.nl/
3. another PDF at
http://www.google.nl/url?sa=t&source=web&ct=res&cd=1&ved=0CAcQFjAA&url=http%3A%2F%2Fcsl.stanford.edu%2F~christos%2Fpublications%2F2007.cmp_mapreduce.hpca.pdf&rct=j&q=2007.cmp_mapreduce.hpca.pdf&ei=PPz7SpiiOM6l4QbZjKjRAw&usg=AFQjCNHs-olxbUQrGCXpNMHfcZvY8aMk8A
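
For reference, an upload like the ones listed above can be sketched in Python. This is only a minimal sketch: the base URL `http://localhost:8983/solr` is a hypothetical stand-in for the elided host, the `commit=true` parameter and the explicit `application/pdf` content type are assumptions about how one might send the file, and the helper name `build_extract_request` is made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(base_url, doc_id, payload, content_type="application/pdf"):
    """Build (but do not send) a POST request for Solr's /update/extract handler.

    base_url is hypothetical, e.g. "http://localhost:8983/solr".
    payload is the raw bytes of the file to index.
    """
    # literal.id sets the document ID, as in the URL used in this post;
    # commit=true is an assumption so the document becomes searchable at once.
    query = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{base_url.rstrip('/')}/update/extract?{query}"
    # Declare the real type of the file rather than a generic one.
    return Request(url, data=payload,
                   headers={"Content-Type": content_type}, method="POST")


# Usage (sending is left commented out because it needs a running Solr):
# from urllib.request import urlopen
# with open("paper.pdf", "rb") as fh:
#     urlopen(build_extract_request("http://localhost:8983/solr", "doc1", fh.read()))
```

Sending the raw file bytes with an explicit content type is one of several ways to post to the handler; a multipart form upload is another.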

For each document I ran a corresponding select/?q=*:* query:


1. No text? Should I see something?

<doc><str name="id">doc1</str>
<arr name="ignored_content_type">
<str>application/octet-stream</str>
</arr>
<arr name="ignored_stream_content_type">
<str>
text/xml; charset=UTF-8;
boundary=----------------------------cf57b4ad644d
</str>
</arr>
<arr name="ignored_stream_size">
<str>491238</str>
</arr>
<arr name="ignored_text">
<str>        </str>
</arr>
<date name="timestamp">2009-11-12T12:17:23.016Z</date>
</doc>


2. Plenty of data, this seems to be ok

<doc>
<str name="id">doc1</str>
<arr name="ignored_content_type">
<str>application/xhtml+xml</str>
</arr>
<arr name="ignored_links">
<str>http://www.nu.nl/</str>
<str>http://www.nu.nl/</str>
<str>http://www.nu.nl/algemeen/</str>
<str>http://www.nu.nl/economie/</str>
....
<arr name="ignored_stream_content_type">
<str>
text/xml; charset=UTF-8;
boundary=----------------------------b6e44d087bdd
</str>
</arr>
<arr name="ignored_stream_size">
<str>36991</str>
</arr>
<arr name="ignored_text">
<str>
A LOT OF TEXT HERE
</str>
</arr>
<date name="timestamp">2009-11-12T12:19:15.415Z</date>
</doc>


3. A lot of garbage

<doc>
<str name="id">doc1</str>
<arr name="ignored_content_encoding">
<str>windows-1252</str>
</arr>
<arr name="ignored_content_language">
<str>fr</str>
</arr>
<arr name="ignored_content_type">
<str>text/plain</str>
</arr>
<arr name="ignored_language">
<str>fr</str>
</arr>
<arr name="ignored_stream_content_type">
<str>
text/xml; charset=UTF-8;
boundary=----------------------------83df0fd4d358
</str>
</arr>
<arr name="ignored_stream_size">
<str>361458</str>
</arr>
<arr name="ignored_text">
<str>
A LOT OF GARBAGE HERE including

ió½·Þp™ó 4­0›
š©xÓ ^CøùI3람š³î¨V ÚÜ¡yS4 ¹£ ² ›H 6õɨ5¤ ÅÜ磩bädÒøŸ\ �s%OîÐÙIÑYRäŠ ;4
¢9"r "—!rEôˆÌ {SìûD²à £©ïœ«{‘ínÆ N÷ô¥F»�™ ±¡Ë'ú\³=·m„Þ »ý)³Å=j¶B¢)`  Ñ
„Ï™hjCu{£É5{¢¯ç6½Ñhr¢ºÃ=J M- AqsøtÜì ÿ^Rl S?¿óšM‰—lv‘Ø›Qüãý´ þžŽ
$S;¾¦wze³Ù)qÉú§ ‰› ãqó…Ó ‰ª"U:šBÝ‘GuŠ"ë
MM±Òv �~ ‚N‹t¢ä§~Ì ÞŒS—Êòö¼ÊÄQaº¸¿7tñ ¾Áç œãØŒ58$O 3Å~�8¿L  ‡ëŽó©pk_
Ša Â=u×; (ä<¹@.œ÷ä ù° µk+ÿ PP~ ¨*ݤ¿Œ™¡D»   @fI$0°�Î Ù·p“Œ,Øâ  †¶v
¤v1#8¼0 ›  èð€-†šZ 6¾  ! ñb ˆbˆ¤v)LS)T X² ¬ l!@€  6E$Q
endstream
endobj
137 0
obj<</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/W/o/r/d/C/u/n/t/M/a/i/x/l/S/g/c/h/K/m/e/s/R/v/I/P/A/H/L/space/p]>>
endobj
138 0 obj<</Type/FontDescriptor/FontFile2 136 0 R/FontBBox[0 -210 942
728]/FontName/WQHWKD+TTE31911E0t00/Flags 4/MissingWidth 750/StemV
141/CapHeight 728/Ascent 728/Descent -210/ItalicAngle 0>>
endobj
139 0 obj<</Count 12/Kids[140 0 R 141 0 R]/Type/Pages>>
endobj
140 0 obj<</Count 6/Kids[147 0 R 1 0 R 4 0 R 7 0 R 22 0 R 25 0
R]/Type/Pages/Parent 139 0 R>>
endobj
141 0 obj<</Count 6/Kids[39 0 R 42 0 R 45 0 R 82 0 R 92 0 R 122 0
R]/Type/Pages/Parent

....

</str>
</arr>
<date name="timestamp">2009-11-12T12:21:28.306Z</date>
</doc>
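
The three results above differ mainly in what ended up in ignored_text: nothing, readable text, or what looks like raw PDF object syntax (endstream, endobj, /FontDescriptor). A minimal sketch of a check that tells these cases apart is shown below. The function name `diagnose`, the marker list and the three labels are illustrative assumptions, not part of Solr or Tika.

```python
import xml.etree.ElementTree as ET

# Tokens that belong to PDF file syntax and should never appear in extracted text.
PDF_MARKERS = ("endstream", "endobj", "/FontDescriptor")


def diagnose(doc_xml, field="ignored_text"):
    """Classify the extracted text of one <doc> element from a Solr XML response.

    Returns "empty", "raw-pdf" or "text".
    """
    doc = ET.fromstring(doc_xml)
    # Multivalued fields are rendered as <arr name="..."><str>...</str></arr>.
    values = [s.text or "" for s in doc.findall(f"./arr[@name='{field}']/str")]
    text = " ".join(values)
    if not text.strip():
        return "empty"      # like result 1: only whitespace was stored
    if any(marker in text for marker in PDF_MARKERS):
        return "raw-pdf"    # like result 3: the file bytes were indexed as text
    return "text"           # like result 2: real extracted content
```

A "raw-pdf" result together with a detected content type of text/plain, as in result 3, would be consistent with the file not having been recognised as a PDF at all.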


Any ideas? Why doesn't the whitepaper produce any text, and why is the
other PDF full of garbage? At least I'm happy that HTML works
fine.



Regards,

-  
Markus Jelsma          Buyways B.V.            
Technisch Architect    Friesestraatweg 215c    
http://www.buyways.nl  9743 AD Groningen      


Alg. 050-853 6600      KvK  01074105
Tel. 050-853 6620      Fax. 050-3118124
Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


Re: Tika trouble

by Markus Jelsma - Buyways B.V.

Does anyone have a clue?




Re: Tika trouble

by Antonio Calò

What I can say is that if you want to index a PDF, you should use a PDF
extractor, which can extract both the text content and the metadata of
the file. I suppose you have simply opened and indexed the PDF as is, so
you stored only binary data. For my application I've used PdfExtractor,
but the PDFBox project could also be used.

Antonio

2009/11/16 Markus Jelsma - Buyways B.V. <markus@...>

> Anyone has a clue?



--
Antonio Calò
------------------------------------------
Software Developer Engineer
@ Intellisemantic
Mail anton.calo@...
Tel. 011-56.90.429
------------------------------------------

Re: Tika trouble

by Markus Jelsma - Buyways B.V.

Thank you for your reply.

I had assumed Tika could also extract text content from various
document types, not only metadata. I'll use the CLI tools from
http://www.foolabs.com/xpdf/ to extract text manually.
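
A minimal sketch of that manual route, assuming the xpdf `pdftotext` tool is installed and on the PATH. The helper names and the file name `paper.pdf` are hypothetical; the extracted text would then be indexed through a normal update request rather than through the extract handler.

```python
import subprocess


def pdftotext_command(pdf_path, encoding="UTF-8"):
    """Return the pdftotext command line that writes the extracted text to stdout."""
    # "-enc" selects the output encoding; "-" as the output file means stdout.
    return ["pdftotext", "-enc", encoding, pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text as a Python string."""
    result = subprocess.run(pdftotext_command(pdf_path),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


# Usage (requires pdftotext and an existing file):
# text = extract_text("paper.pdf")
```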



On Mon, 2009-11-16 at 12:06 +0100, Antonio Calò wrote:

> What I can say is that if you want to index a PDF, you should use a PDF
> extractor, which can extract both the text content and the metadata of
> the file. I suppose you have simply opened and indexed the PDF as is, so
> you stored only binary data. For my application I've used PdfExtractor,
> but the PDFBox project could also be used.
>
> Antonio