Tika trouble

List,

I somehow fail to index certain PDF files using the ExtractingRequestHandler in Solr 1.4 with the default solrconfig.xml but a modified schema. I have a very simple schema for this case using only an ID field, a timestamp field and two dynamic fields, ignored_* and attr_*, both indexed, stored and multivalued strings. They are multivalued simply because some HTML files fail when storing multiple hyperlinks.

I have posted multiple files to http://.../update/extract?literal.id=doc1 including:

1. the whitepaper at http://www.lucidimagination.com/whitepaper/whats-new-in-lucene-2-9?sc=AP
2. the HTML file of the front page of http://nu.nl/
3. another PDF at http://www.google.nl/url?sa=t&source=web&ct=res&cd=1&ved=0CAcQFjAA&url=http%3A%2F%2Fcsl.stanford.edu%2F~christos%2Fpublications%2F2007.cmp_mapreduce.hpca.pdf&rct=j&q=2007.cmp_mapreduce.hpca.pdf&ei=PPz7SpiiOM6l4QbZjKjRAw&usg=AFQjCNHs-olxbUQrGCXpNMHfcZvY8aMk8A

For each document I have the corresponding select/?q=*:* result:

1. No text? Should I see something?

    <doc>
      <str name="id">doc1</str>
      <arr name="ignored_content_type">
        <str>application/octet-stream</str>
      </arr>
      <arr name="ignored_stream_content_type">
        <str>
          text/xml; charset=UTF-8;
          boundary=----------------------------cf57b4ad644d
        </str>
      </arr>
      <arr name="ignored_stream_size">
        <str>491238</str>
      </arr>
      <arr name="ignored_text">
        <str> </str>
      </arr>
      <date name="timestamp">2009-11-12T12:17:23.016Z</date>
    </doc>

2. Plenty of data, this seems to be OK

    <doc>
      <str name="id">doc1</str>
      <arr name="ignored_content_type">
        <str>application/xhtml+xml</str>
      </arr>
      <arr name="ignored_links">
        <str>http://www.nu.nl/</str>
        <str>http://www.nu.nl/</str>
        <str>http://www.nu.nl/algemeen/</str>
        <str>http://www.nu.nl/economie/</str>
        ....
      <arr name="ignored_stream_content_type">
        <str>
          text/xml; charset=UTF-8;
          boundary=----------------------------b6e44d087bdd
        </str>
      </arr>
      <arr name="ignored_stream_size">
        <str>36991</str>
      </arr>
      <arr name="ignored_text">
        <str>
          A LOT OF TEXT HERE
        </str>
      </arr>
      <date name="timestamp">2009-11-12T12:19:15.415Z</date>
    </doc>

3. A lot of garbage

    <doc>
      <str name="id">doc1</str>
      <arr name="ignored_content_encoding">
        <str>windows-1252</str>
      </arr>
      <arr name="ignored_content_language">
        <str>fr</str>
      </arr>
      <arr name="ignored_content_type">
        <str>text/plain</str>
      </arr>
      <arr name="ignored_language">
        <str>fr</str>
      </arr>
      <arr name="ignored_stream_content_type">
        <str>
          text/xml; charset=UTF-8;
          boundary=----------------------------83df0fd4d358
        </str>
      </arr>
      <arr name="ignored_stream_size">
        <str>361458</str>
      </arr>
      <arr name="ignored_text">
        <str>
          A LOT OF GARBAGE HERE including

          ió½·Þp™ó 40›
          š©xÓ ^CøùI3람š³î¨V ÚÜ¡yS4 ¹£ ² ›H 6õɨ5¤ ÅÜ磩bädÒøŸ\ �s%OîÐÙIÑYRäŠ ;4
          ¢9"r "—!rEôˆÌ {SìûD²à £©ïœ«{‘ínÆ N÷ô¥F»�™ ±¡Ë'ú\³=·m„Þ »ý)³Å=j¶B¢)` Ñ
          „Ï™hjCu{£É5{¢¯ç6½Ñhr¢ºÃ=J M- AqsøtÜì ÿ^Rl S?¿óšM‰—lv‘Ø›Qüãý´ þžŽ
          $S;¾¦wze³Ù)qÉú§ ‰› ãqó…Ó ‰ª"U:šBÝ‘GuŠ"ë
          MM±Òv �~ ‚N‹t¢ä§~Ì ÞŒS—Êòö¼ÊÄQaº¸¿7tñ ¾Áç œãØŒ58$O 3Å~�8¿L ‡ëŽó©pk_
          Ša Â=u×; (ä<¹@.œ÷ä ù° µk+ÿ PP~ ¨*ݤ¿Œ™¡D» @fI$0°�Î Ù·p“Œ,Øâ †¶v
          ¤v1#8¼0 › èð€-†šZ 6¾ ! ñb ˆbˆ¤v)LS)T X² ¬ l!@€ 6E$Q
          endstream
          endobj
          137 0 obj<</Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences[1/W/o/r/d/C/u/n/t/M/a/i/x/l/S/g/c/h/K/m/e/s/R/v/I/P/A/H/L/space/p]>>
          endobj
          138 0 obj<</Type/FontDescriptor/FontFile2 136 0 R/FontBBox[0 -210 942 728]/FontName/WQHWKD+TTE31911E0t00/Flags 4/MissingWidth 750/StemV 141/CapHeight 728/Ascent 728/Descent -210/ItalicAngle 0>>
          endobj
          139 0 obj<</Count 12/Kids[140 0 R 141 0 R]/Type/Pages>>
          endobj
          140 0 obj<</Count 6/Kids[147 0 R 1 0 R 4 0 R 7 0 R 22 0 R 25 0 R]/Type/Pages/Parent 139 0 R>>
          endobj
          141 0 obj<</Count 6/Kids[39 0 R 42 0 R 45 0 R 82 0 R 92 0 R 122 0 R]/Type/Pages/Parent

          ....

        </str>
      </arr>
      <date name="timestamp">2009-11-12T12:21:28.306Z</date>
    </doc>

Any ideas? Why doesn't the whitepaper produce any results, and why is the other PDF full of garbage? At least I'm happy that HTML works fine.

Regards,

-
Markus Jelsma          Buyways B.V.
Technisch Architect    Friesestraatweg 215c
http://www.buyways.nl  9743 AD Groningen

Alg. 050-853 6600      KvK 01074105
Tel. 050-853 6620      Fax. 050-3118124
Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
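One detail in the three results above may be worth checking: each one reports ignored_stream_content_type as text/xml with a multipart boundary, and the two PDFs were typed as application/octet-stream and text/plain rather than as PDF. That could mean the upload itself was labelled as XML, although the thread does not show the command that was used, so this is only a guess. As a minimal sketch, assuming a hypothetical Solr at localhost:8983 (the real host is elided above) and a hypothetical helper name, a request that sends the raw PDF bytes with an explicit PDF content type could be built in Python like this:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL: the thread elides the real host as "http://.../".
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_request(pdf_bytes: bytes, doc_id: str,
                          base_url: str = SOLR_EXTRACT_URL) -> Request:
    """Build (but do not send) a POST that uploads raw PDF bytes.

    The body is the file itself and the Content-Type header says
    application/pdf, so the stream is not labelled as text/xml.
    """
    # literal.id is the only parameter taken from the original post.
    query = urlencode({"literal.id": doc_id})
    return Request(
        url=f"{base_url}?{query}",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Whether this changes the extraction result on Solr 1.4 is untested here.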
Re: Tika trouble

Does anyone have a clue?
Re: Tika trouble

What I can say is that if you want to index a PDF, you should use a PDF extractor. A PDF extractor can extract both the text content and the metadata of a file. I suppose you have just opened the PDF and indexed it as is, so you stored binary data and nothing more. For my application I used PdfExtractor, but the PDFBox project could also be used.

Antonio

--
Antonio Calò
------------------------------------------
Software Developer Engineer
@ Intellisemantic
Mail anton.calo@...
Tel. 011-56.90.429
------------------------------------------
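The supposition that the PDF was stored as raw binary fits the sample in the first post, which contains PDF file syntax such as endstream, endobj and "137 0 obj". A rough way to flag such documents after indexing, sketched in Python with a hypothetical function name and an arbitrary threshold (a heuristic only, not part of Solr or Tika):

```python
import re

# Tokens from PDF file syntax; several of them appear in the garbage
# sample quoted in the first post of this thread.
_PDF_SYNTAX = re.compile(r"\b\d+ 0 obj\b|\bendobj\b|\bendstream\b|/Type/Pages")


def looks_like_raw_pdf(text: str, min_hits: int = 3) -> bool:
    """Return True when the text contains at least min_hits PDF syntax tokens.

    Properly extracted prose should rarely contain these tokens, so
    several hits suggest the file bytes were indexed as they are.
    """
    return len(_PDF_SYNTAX.findall(text)) >= min_hits
```

Running it over the stored text field of each indexed document would show which uploads need to be redone.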
Re: Tika trouble

Thank you for your reply.

I had assumed Tika could also extract the text content from various document types, not only the metadata. I'll use the CLI tools from http://www.foolabs.com/xpdf/ to extract the text manually.

Markus Jelsma
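For the manual route, the xpdf tools include pdftotext, which writes to standard output when the output file is given as "-". A minimal Python sketch, assuming pdftotext is installed and on the PATH (the function names are hypothetical):

```python
import subprocess
from pathlib import Path
from typing import List, Union


def pdftotext_command(pdf_path: Union[str, Path], encoding: str = "UTF-8") -> List[str]:
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    return ["pdftotext", "-enc", encoding, str(pdf_path), "-"]


def extract_text(pdf_path: Union[str, Path]) -> str:
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path), check=True, capture_output=True
    )
    # The command requests UTF-8 output; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string could then be indexed as an ordinary text field instead of uploading the PDF itself.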