<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<id>tag:old.nabble.com,2006:forum-20913</id>
	<title>Nabble - Apache Tika - Development</title>
	<updated>2009-11-27T09:16:20Z</updated>
	<link rel="self" type="application/atom+xml" href="http://old.nabble.com/Apache-Tika---Development-f20913.xml" />
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Apache-Tika---Development-f20913.html" />
	<subtitle type="html">&lt;a href=&quot;http://lucene.apache.org/tika/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;Apache Tika&lt;/a&gt; is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.</subtitle>
	
<entry>
	<id>tag:old.nabble.com,2006:post-26545306</id>
	<title>[jira] Created: (TIKA-338) Trying to use -encoding parameter always results in an exception</title>
	<published>2009-11-27T09:16:20Z</published>
	<updated>2009-11-27T09:16:20Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Trying to use -encoding parameter always results in an exception
&lt;br&gt;----------------------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: TIKA-338
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-338&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-338&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Bug
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Components: cli
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Peter Wolanin
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Fix For: 0.6
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;There is a logic error in the CLI code: -encoding can never work and always results in an exception.
&lt;br&gt;&lt;br&gt;$ java -jar tika-app/target/tika-app-0.6-SNAPSHOT.jar -encoding=UTF-8 -t test.txt 
&lt;br&gt;&lt;br&gt;Exception in thread &amp;quot;main&amp;quot; java.io.UnsupportedEncodingException: ncoding=UTF-8
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at sun.nio.cs.StreamEncoder.forOutputStreamWriter(StreamEncoder.java:42)
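&lt;br&gt;&lt;br&gt;The &amp;quot;ncoding=UTF-8&amp;quot; text in the exception is what remains of the argument once its first two characters are dropped. A minimal sketch of how that could happen, assuming (not confirmed against the Tika source) that the CLI strips a two-character prefix such as &amp;quot;-e&amp;quot; before it checks for the full option name; the class and method names are hypothetical:&lt;pre&gt;

```java
public class EncodingArgSketch {

    // Hypothetical reconstruction of the failure: dropping only two
    // characters from "-encoding=UTF-8" leaves "ncoding=UTF-8".
    static String stripTwoChars(String arg) {
        return arg.substring("-e".length());
    }

    // One possible fix: match the full option name and strip all of it.
    // Returns null when the argument is not an encoding option.
    static String parseEncoding(String arg) {
        String[] prefixes = { "--encoding=", "-encoding=" };
        for (String prefix : prefixes) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(stripTwoChars("-encoding=UTF-8"));
        System.out.println(parseEncoding("-encoding=UTF-8"));
    }
}
```

&lt;/pre&gt;Matching on the complete option name, rather than on a short prefix it happens to share, keeps the stripped length and the tested prefix in step.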
&lt;br&gt;&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-338%29-Trying-to-use--encoding-parameter-alwyas-results-in-an-exception-tp26545306p26545306.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26545139</id>
	<title>[jira] Commented: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)</title>
	<published>2009-11-27T09:02:20Z</published>
	<updated>2009-11-27T09:02:20Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12783134#action_12783134&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12783134#action_12783134&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Peter Wolanin commented on TIKA-324:
&lt;br&gt;------------------------------------
&lt;br&gt;&lt;br&gt;There is a logic bug in the committed code: -encoding= does not work and fails with exceptions like:
&lt;br&gt;&lt;br&gt;Exception in thread &amp;quot;main&amp;quot; java.io.UnsupportedEncodingException: ncoding=UTF-8
&lt;br&gt;&lt;br&gt;&lt;br&gt;Note the &amp;quot;ncoding&amp;quot;. &amp;nbsp;Opening a follow-up issue.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)
&lt;br&gt;&amp;gt; --------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-324
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-324&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-324&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: cli
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.3, 0.4, 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: Mac OS 10.5, java version &amp;quot;1.6.0_15&amp;quot;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Peter Wolanin
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Jukka Zitting
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Critical
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 0.6
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: test.txt, TIKA-324-0.5.patch, TIKA-324-macosx.patch, TIKA-324.patch, TIKA-324.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; Original Estimate: 2h
&lt;br&gt;&amp;gt; &amp;nbsp;Remaining Estimate: 2h
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; When using the -t flag to tika, multi-byte content is destroyed in the output.
&lt;br&gt;&amp;gt; Example:
&lt;br&gt;&amp;gt; $ java -jar tika-app-0.4.jar -t ./test.txt
&lt;br&gt;&amp;gt; I?t?rn?ti?n?liz?ti?n
&lt;br&gt;&amp;gt; $ java -jar tika-app-0.4.jar -x ./test.txt
&lt;br&gt;&amp;gt; &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;html xmlns=&amp;quot;&lt;a href=&quot;http://www.w3.org/1999/xhtml&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/1999/xhtml&lt;/a&gt;&amp;quot;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;head&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;title/&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/head&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;body&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;p&amp;gt;Iñtërnâtiônàlizætiøn
&lt;br&gt;&amp;gt; &amp;lt;/p&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/body&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/html&amp;gt;
&lt;br&gt;&amp;gt; see also: &amp;nbsp;&lt;a href=&quot;http://drupal.org/node/622508#comment-2267918&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://drupal.org/node/622508#comment-2267918&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-324%29-Tika-CLI-mangles-utf-8-content-in-text-%28-t%29-mode-tp26361441p26545139.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26543717</id>
	<title>[jira] Resolved: (TIKA-324) Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)</title>
	<published>2009-11-27T07:12:20Z</published>
	<updated>2009-11-27T07:12:20Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Jukka Zitting resolved TIKA-324.
&lt;br&gt;--------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Resolution: Fixed
&lt;br&gt;&amp;nbsp; &amp;nbsp; Fix Version/s: 0.6
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Jukka Zitting
&lt;br&gt;&lt;br&gt;OK. I've committed the latest patch to trunk. The code now never uses the default platform encoding on Mac OS X, opting instead for UTF-8 as the default. People can still override the setting with an explicit --encoding argument.
&lt;br&gt;&lt;br&gt;For the CentOS case I recommend just setting the LANG environment variable correctly, as it is also used by other programs and there is no other easy way for Tika or Java to figure out which encoding should be used on that platform.
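&lt;br&gt;&lt;br&gt;A minimal sketch of that selection order, assuming the explicit argument has already been parsed; this is an illustration, not the committed code, and the class and method names are hypothetical:&lt;pre&gt;

```java
import java.nio.charset.Charset;

public class OutputEncodingSketch {

    // Picks the output encoding: an explicit argument wins, Mac OS X
    // falls back to UTF-8, and every other platform keeps the default
    // it was given (the caller passes the JVM default charset name).
    static String chooseEncoding(String explicit, String osName, String platformDefault) {
        if (explicit != null) {
            return explicit;
        }
        if (osName.startsWith("Mac OS X")) {
            return "UTF-8";
        }
        return platformDefault;
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        String platformDefault = Charset.defaultCharset().name();
        // Prints the encoding that would be used when no argument is given.
        System.out.println(chooseEncoding(null, osName, platformDefault));
    }
}
```

&lt;/pre&gt;Passing the OS name and platform default in as parameters keeps the choice easy to check without changing the environment.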
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)
&lt;br&gt;&amp;gt; --------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-324
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-324&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-324&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: cli
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.3, 0.4, 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: Mac OS 10.5, java version &amp;quot;1.6.0_15&amp;quot;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Peter Wolanin
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Jukka Zitting
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Critical
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 0.6
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: test.txt, TIKA-324-0.5.patch, TIKA-324-macosx.patch, TIKA-324.patch, TIKA-324.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; Original Estimate: 2h
&lt;br&gt;&amp;gt; &amp;nbsp;Remaining Estimate: 2h
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; When using the -t flag to tika, multi-byte content is destroyed in the output.
&lt;br&gt;&amp;gt; Example:
&lt;br&gt;&amp;gt; $ java -jar tika-app-0.4.jar -t ./test.txt
&lt;br&gt;&amp;gt; I?t?rn?ti?n?liz?ti?n
&lt;br&gt;&amp;gt; $ java -jar tika-app-0.4.jar -x ./test.txt
&lt;br&gt;&amp;gt; &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;html xmlns=&amp;quot;&lt;a href=&quot;http://www.w3.org/1999/xhtml&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/1999/xhtml&lt;/a&gt;&amp;quot;&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;head&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;title/&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/head&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;body&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;p&amp;gt;Iñtërnâtiônàlizætiøn
&lt;br&gt;&amp;gt; &amp;lt;/p&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/body&amp;gt;
&lt;br&gt;&amp;gt; &amp;lt;/html&amp;gt;
&lt;br&gt;&amp;gt; see also: &amp;nbsp;&lt;a href=&quot;http://drupal.org/node/622508#comment-2267918&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://drupal.org/node/622508#comment-2267918&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-324%29-Tika-CLI-mangles-utf-8-content-in-text-%28-t%29-mode-tp26361441p26543717.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26540444</id>
	<title>[jira] Updated: (TIKA-337) SWF parser</title>
	<published>2009-11-27T02:23:39Z</published>
	<updated>2009-11-27T02:23:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Julien Nioche updated TIKA-337:
&lt;br&gt;-------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: TIKA-337.patch
&lt;br&gt;&lt;br&gt;patch for SWF parser
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; SWF parser
&lt;br&gt;&amp;gt; ----------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-337
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-337&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-337&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: New Feature
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: parser
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Julien Nioche
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: TIKA-337.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Here is an initial implementation of a SWF Parser which uses JavaSWF and has been adapted from A. Bialecki's implementation for Nutch.
&lt;br&gt;&amp;gt; The main difference from the implementation for Nutch is that we use the latest version of JavaSWF and do not try to extract text from the actions or structured URLs. As usual, URLs can be obtained from the extracted text using ParserPostProcessor.
&lt;br&gt;&amp;gt; JavaSWF has changed quite a bit since the Nutch integration and I wanted to keep this initial port nice and simple. It should be possible to extract the URLs from the actions using JavaSWF's API; I think this is what they did in Heritrix.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-337%29-SWF-parser-tp26540419p26540444.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26540419</id>
	<title>[jira] Created: (TIKA-337) SWF parser</title>
	<published>2009-11-27T02:21:39Z</published>
	<updated>2009-11-27T02:21:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">SWF parser
&lt;br&gt;----------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: TIKA-337
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-337&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-337&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: New Feature
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Components: parser
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Julien Nioche
&lt;br&gt;&lt;br&gt;&lt;br&gt;Here is an initial implementation of a SWF Parser which uses JavaSWF and has been adapted from A. Bialecki's implementation for Nutch.
&lt;br&gt;The main difference from the implementation for Nutch is that we use the latest version of JavaSWF and do not try to extract text from the actions or structured URLs. As usual, URLs can be obtained from the extracted text using ParserPostProcessor.
&lt;br&gt;JavaSWF has changed quite a bit since the Nutch integration and I wanted to keep this initial port nice and simple. It should be possible to extract the URLs from the actions using JavaSWF's API; I think this is what they did in Heritrix.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-337%29-SWF-parser-tp26540419p26540419.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26522337</id>
	<title>[jira] Resolved: (TIKA-336) More issues with RDF mime detection</title>
	<published>2009-11-25T15:45:39Z</published>
	<updated>2009-11-25T15:45:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Chris A. Mattmann resolved TIKA-336.
&lt;br&gt;------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Resolution: Fixed
&lt;br&gt;&lt;br&gt;- fixed in r884340
&lt;br&gt;&lt;br&gt;Yuan-Fang, please test out the latest Tika trunk. I've:
&lt;br&gt;&lt;br&gt;* updated the test-difficult-rdf2.xml file to remove the &amp;lt;?xml header
&lt;br&gt;* updated tika-mimetypes.xml to detect files that start with &amp;lt;!-- as XML files (as a default magic first check). This forces xmlRoot detection to occur, which is where the specific XML subclass is detected (which is what we want), and application/rdf+xml is then properly detected. Previously, since there was no magic header for &amp;lt;!--, the initial magic check returned null and the MimeTypes detector ended up returning text/plain.
&lt;br&gt;&lt;br&gt;In the future we may want to:
&lt;br&gt;&lt;br&gt;* make xmlRoot extraction occur on text/plain documents
&lt;br&gt;* move the text/plain check to the beginning of the o.a.tika.mime.MimeTypes#getMimeType(byte[] data) function
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; More issues with RDF mime detection
&lt;br&gt;&amp;gt; -----------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-336
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-336&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-336&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: mime
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: several user environments as well as validated in Mattmann's environment.
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Chris A. Mattmann
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Chris A. Mattmann
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 0.6
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; See TIKA-309 for related discussion, but there seem to be further errors in RDF mime detection on the OWL file located here:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.w3.org/2002/07/owl#&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/2002/07/owl#&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-336%29-More-issues-with-RDF-mime-detection-tp26521684p26522337.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26521684</id>
	<title>[jira] Created: (TIKA-336) More issues with RDF mime detection</title>
	<published>2009-11-25T14:51:40Z</published>
	<updated>2009-11-25T14:51:40Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">More issues with RDF mime detection
&lt;br&gt;-----------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: TIKA-336
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-336&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-336&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Bug
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Components: mime
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 0.5
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Environment: several user environments as well as validated in Mattmann's environment.
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Chris A. Mattmann
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Assignee: Chris A. Mattmann
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Fix For: 0.6
&lt;br&gt;&lt;br&gt;&lt;br&gt;See TIKA-309 for related discussion, but there seem to be further errors in RDF mime detection on the OWL file located here:
&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://www.w3.org/2002/07/owl#&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/2002/07/owl#&lt;/a&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-336%29-More-issues-with-RDF-mime-detection-tp26521684p26521684.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26521662</id>
	<title>[jira] Commented: (TIKA-309) Mime type application/rdf+xml not correctly detected</title>
	<published>2009-11-25T14:49:39Z</published>
	<updated>2009-11-25T14:49:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782663#action_12782663&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782663#action_12782663&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Chris A. Mattmann commented on TIKA-309:
&lt;br&gt;----------------------------------------
&lt;br&gt;&lt;br&gt;Yuan-Fang:
&lt;br&gt;&lt;br&gt;I've confirmed what you mentioned. When the XML header first-line is taken out of the test-difficult-rdf2.xml (as the remote URL exists), I get this:
&lt;br&gt;&lt;br&gt;[chipotle:~/src/tika/trunk] mattmann% mvn -Dtest=MimeDetectionTest clean test
&lt;br&gt;[INFO] Scanning for projects...
&lt;br&gt;[INFO] Reactor build order: 
&lt;br&gt;[INFO] &amp;nbsp; Apache Tika parent
&lt;br&gt;[INFO] &amp;nbsp; Apache Tika core
&lt;br&gt;[INFO] &amp;nbsp; Apache Tika parsers
&lt;br&gt;[INFO] &amp;nbsp; Apache Tika application
&lt;br&gt;[INFO] &amp;nbsp; Apache Tika
&lt;br&gt;[INFO] ------------------------------------------------------------------------
&lt;br&gt;[INFO] Building Apache Tika parent
&lt;br&gt;[INFO] &amp;nbsp; &amp;nbsp;task-segment: [clean, test]
&lt;br&gt;[INFO] ------------------------------------------------------------------------
&lt;br&gt;[INFO] [clean:clean]
&lt;br&gt;[INFO] Setting property: classpath.resource.loader.class =&amp;gt; 'org.codehaus.plexus.velocity.ContextClassLoaderResourceLoader'.
&lt;br&gt;[INFO] Setting property: velocimacro.messages.on =&amp;gt; 'false'.
&lt;br&gt;[INFO] Setting property: resource.loader =&amp;gt; 'classpath'.
&lt;br&gt;[INFO] Setting property: resource.manager.logwhenfound =&amp;gt; 'false'.
&lt;br&gt;[INFO] [remote-resources:process {execution: default}]
&lt;br&gt;[INFO] ------------------------------------------------------------------------
&lt;br&gt;[INFO] Building Apache Tika core
&lt;br&gt;[INFO] &amp;nbsp; &amp;nbsp;task-segment: [clean, test]
&lt;br&gt;[INFO] ------------------------------------------------------------------------
&lt;br&gt;[INFO] [clean:clean]
&lt;br&gt;[INFO] [remote-resources:process {execution: default}]
&lt;br&gt;[INFO] [resources:resources]
&lt;br&gt;[INFO] Using 'UTF-8' encoding to copy filtered resources.
&lt;br&gt;[INFO] Copying 20 resources
&lt;br&gt;[INFO] Copying 3 resources
&lt;br&gt;[INFO] [compiler:compile]
&lt;br&gt;[INFO] Compiling 86 source files to /Users/mattmann/src/tika/trunk/tika-core/target/classes
&lt;br&gt;[INFO] [resources:testResources]
&lt;br&gt;[INFO] Using 'UTF-8' encoding to copy filtered resources.
&lt;br&gt;[INFO] Copying 24 resources
&lt;br&gt;[INFO] Copying 3 resources
&lt;br&gt;[INFO] [compiler:testCompile]
&lt;br&gt;[INFO] Compiling 19 source files to /Users/mattmann/src/tika/trunk/tika-core/target/test-classes
&lt;br&gt;[INFO] [surefire:test]
&lt;br&gt;[INFO] Surefire report directory: /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports
&lt;br&gt;&lt;br&gt;-------------------------------------------------------
&lt;br&gt;&amp;nbsp;T E S T S
&lt;br&gt;-------------------------------------------------------
&lt;br&gt;Running org.apache.tika.mime.MimeDetectionTest
&lt;br&gt;Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.568 sec &amp;lt;&amp;lt;&amp;lt; FAILURE!
&lt;br&gt;&lt;br&gt;Results :
&lt;br&gt;&lt;br&gt;Failed tests: 
&lt;br&gt;&amp;nbsp; testDetection(org.apache.tika.mime.MimeDetectionTest)
&lt;br&gt;&lt;br&gt;Tests run: 2, Failures: 1, Errors: 0, Skipped: 0
&lt;br&gt;&lt;br&gt;[INFO] ------------------------------------------------------------------------
&lt;br&gt;[ERROR] BUILD FAILURE
&lt;br&gt;[INFO] ------------------------------------------------------------------------
&lt;br&gt;[INFO] There are test failures.
&lt;br&gt;&lt;br&gt;Please refer to /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports for the individual test results.
&lt;br&gt;[INFO] ------------------------------------------------------------------------
&lt;br&gt;[INFO] For more information, run Maven with the -e switch
&lt;br&gt;[INFO] ------------------------------------------------------------------------
&lt;br&gt;[INFO] Total time: 8 seconds
&lt;br&gt;[INFO] Finished at: Wed Nov 25 14:45:52 PST 2009
&lt;br&gt;[INFO] Final Memory: 15M/31M
&lt;br&gt;[INFO] ------------------------------------------------------------------------
&lt;br&gt;[chipotle:~/src/tika/trunk] mattmann% 
&lt;br&gt;&lt;br&gt;[chipotle:~/src/tika/trunk] mattmann% more tika-core/target/surefire-reports/org.apache.tika.mime.MimeDetectionTest.txt 
&lt;br&gt;-------------------------------------------------------------------------------
&lt;br&gt;Test set: org.apache.tika.mime.MimeDetectionTest
&lt;br&gt;-------------------------------------------------------------------------------
&lt;br&gt;Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.573 sec &amp;lt;&amp;lt;&amp;lt; FAILURE!
&lt;br&gt;testDetection(org.apache.tika.mime.MimeDetectionTest) &amp;nbsp;Time elapsed: 0.44 sec &amp;nbsp;&amp;lt;&amp;lt;&amp;lt; FAILURE!
&lt;br&gt;junit.framework.ComparisonFailure: &lt;a href=&quot;http://www.w3.org/2002/07/owl#&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/2002/07/owl#&lt;/a&gt;&amp;nbsp;is not properly detected. expected:&amp;lt;application/rdf+xml&amp;gt; but was:&amp;lt;text/plain&amp;gt;
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at junit.framework.Assert.assertEquals(Assert.java:81)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.tika.mime.MimeDetectionTest.testStream(MimeDetectionTest.java:87)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.tika.mime.MimeDetectionTest.testUrl(MimeDetectionTest.java:71)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; at org.apache.tika.mime.MimeDetectionTest.testDetection(MimeDetectionTest.java:54)
&lt;br&gt;&lt;br&gt;I'm looking into this right now and will file another issue for it.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Mime type application/rdf+xml not correctly detected
&lt;br&gt;&amp;gt; ----------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-309
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-309&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-309&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: mime
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Yuan-Fang Li
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Chris A. Mattmann
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 0.5
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Mime type detector using AutoDetectParser and Metadata returns &amp;quot;application/xml&amp;quot; for the URL &lt;a href=&quot;http://www.w3.org/2002/07/owl#&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/2002/07/owl#&lt;/a&gt;, where it should be &amp;quot;application/rdf+xml&amp;quot;. The correct mime type is also suggested here: &lt;a href=&quot;http://www.w3.org/TR/owl-ref/#MIMEType&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/TR/owl-ref/#MIMEType&lt;/a&gt;.
&lt;br&gt;&amp;gt; P.S., Tika was downloaded from svn and built with Maven last week.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-309%29-Mime-type-application-rdf%2Bxml-not-correctly-detected-tp25867121p26521662.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26521404</id>
	<title>[jira] Updated: (TIKA-335) TXTParser should use incoming charset</title>
	<published>2009-11-25T14:29:39Z</published>
	<updated>2009-11-25T14:29:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Ken Krugler updated TIKA-335:
&lt;br&gt;-----------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: TIKA-335.patch
&lt;br&gt;&lt;br&gt;This patch also cleans up some generics warnings (sorry about mixing the two; I was going to open a second issue, but the changes were commingled).
&lt;br&gt;&lt;br&gt;In order to make this work, I had to modify the charset detection code to actually use the hint - weird that ICU never actually implemented this.
&lt;br&gt;&lt;br&gt;Includes a test case for an ambiguous run of text that could be UTF-8 or ISO-8859-1.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; TXTParser should use incoming charset
&lt;br&gt;&amp;gt; -------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-335
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-335&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-335&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Ken Krugler
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: TIKA-335.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-335%29-TXTParser-use-of-CharsetDetector-has-several-bugs-tp26518231p26521404.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26519230</id>
	<title>[jira] Updated: (TIKA-334) HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag</title>
	<published>2009-11-25T11:53:39Z</published>
	<updated>2009-11-25T11:53:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Ken Krugler updated TIKA-334:
&lt;br&gt;-----------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: TIKA-334.patch
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag
&lt;br&gt;&amp;gt; ----------------------------------------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-334
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-334&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-334&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Ken Krugler
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: TIKA-334.patch
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Currently the HtmlParser will just call TagSoup to parse, without specifying a charset, if no charset is passed in via metadata.
&lt;br&gt;&amp;gt; TagSoup uses the platform encoding in this case, which is often going to be wrong.
&lt;br&gt;&amp;gt; The right thing to do is to first check for a charset specified by a meta tag. If that doesn't exist, then create a CharsetDetector. If there's a charset in the incoming metadata, use that to call setDeclaredEncoding().
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-334%29-HtmlParser-should-use-CharsetDetector-whenever-no-charset-is-specified-via-meta-http-equiv-tag-tp26518128p26519230.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26518608</id>
	<title>Missing href attribute handling</title>
	<published>2009-11-25T11:12:36Z</published>
	<updated>2009-11-25T11:12:36Z</updated>
	<author>
		<name>Ken Krugler</name>
	</author>
	<content type="html">Hi Jukka,
&lt;br&gt;&lt;br&gt;Previously the HtmlParser code (or rather, HtmlHandler) would ensure
&lt;br&gt;that there was always an href attribute for an &amp;lt;a&amp;gt; tag - if it was
&lt;br&gt;missing, it would get set to &amp;quot;&amp;quot;.
&lt;br&gt;&lt;br&gt;The new code no longer does this, which caused some of my code to fail.
&lt;br&gt;&lt;br&gt;I see the new code looks like:
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;String href = atts.getValue(&amp;quot;href&amp;quot;);
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;if (href != null) {
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;xhtml.startElement(&amp;quot;a&amp;quot;, &amp;quot;href&amp;quot;, resolve(href.trim()));
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;} else {
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;String anchor = atts.getValue(&amp;quot;name&amp;quot;);
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;if (anchor != null) {
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;xhtml.startElement(&amp;quot;a&amp;quot;, &amp;quot;name&amp;quot;, anchor.trim());
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;} else {
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;xhtml.startElement(&amp;quot;a&amp;quot;);
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}
&lt;br&gt;&lt;br&gt;I'm assuming this is by design, though for XHTML the use of &amp;quot;name&amp;quot; is
&lt;br&gt;deprecated, with &amp;quot;id&amp;quot; being its replacement.
&lt;br&gt;&lt;br&gt;-- Ken
&lt;br&gt;&lt;br&gt;&lt;br&gt;--------------------------------------------
&lt;br&gt;Ken Krugler
&lt;br&gt;+1 530-210-6378
&lt;br&gt;&lt;a href=&quot;http://bixolabs.com&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://bixolabs.com&lt;/a&gt;&lt;br&gt;e l a s t i c &amp;nbsp; w e b &amp;nbsp; m i n i n g
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Missing-href-attribute-handling-tp26518608p26518608.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26518266</id>
	<title>[jira] Updated: (TIKA-335) TXTParser should use incoming charset</title>
	<published>2009-11-25T10:50:39Z</published>
	<updated>2009-11-25T10:50:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Ken Krugler updated TIKA-335:
&lt;br&gt;-----------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor &amp;nbsp;(was: Major)
&lt;br&gt;&amp;nbsp; &amp;nbsp; Description: 
&lt;br&gt;The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().
&lt;br&gt;&lt;br&gt;&lt;br&gt;&amp;nbsp; was:
&lt;br&gt;In looking at how TXTParser uses CharsetDetector, I see the following issues:
&lt;br&gt;&lt;br&gt;1. The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().
&lt;br&gt;2. The first supported charset should be used, not the last. These are returned in confidence order, from best to worst.
&lt;br&gt;3. The current code might also wind up setting a language from one result, and the charset from another.
&lt;br&gt;&lt;br&gt;So the biggest change is to bail out of the loop once a supported charset has been found. 
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement &amp;nbsp;(was: Bug)
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Summary: TXTParser should use incoming charset &amp;nbsp;(was: TXTParser use of CharsetDetector has several bugs)
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; TXTParser should use incoming charset
&lt;br&gt;&amp;gt; -------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-335
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-335&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-335&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Ken Krugler
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-335%29-TXTParser-use-of-CharsetDetector-has-several-bugs-tp26518231p26518266.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26518231</id>
	<title>[jira] Created: (TIKA-335) TXTParser use of CharsetDetector has several bugs</title>
	<published>2009-11-25T10:48:39Z</published>
	<updated>2009-11-25T10:48:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">TXTParser use of CharsetDetector has several bugs
&lt;br&gt;-------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: TIKA-335
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-335&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-335&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Bug
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 0.5
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Ken Krugler
&lt;br&gt;&lt;br&gt;&lt;br&gt;In looking at how TXTParser uses CharsetDetector, I see the following issues:
&lt;br&gt;&lt;br&gt;1. The incoming charset (if any) from metadata should be passed to CharsetDetector.setDeclaredEncoding().
&lt;br&gt;2. The first supported charset should be used, not the last. These are returned in confidence order, from best to worst.
&lt;br&gt;3. The current code might also wind up setting a language from one result, and the charset from another.
&lt;br&gt;&lt;br&gt;So the biggest change is to bail out of the loop once a supported charset has been found. 
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-335%29-TXTParser-use-of-CharsetDetector-has-several-bugs-tp26518231p26518231.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26518128</id>
	<title>[jira] Created: (TIKA-334) HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag</title>
	<published>2009-11-25T10:42:39Z</published>
	<updated>2009-11-25T10:42:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag
&lt;br&gt;----------------------------------------------------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: TIKA-334
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-334&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-334&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Improvement
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 0.5
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Ken Krugler
&lt;br&gt;&lt;br&gt;&lt;br&gt;Currently the HtmlParser will just call TagSoup to parse, without specifying a charset, if no charset is passed in via metadata.
&lt;br&gt;&lt;br&gt;TagSoup uses the platform encoding in this case, which is often going to be wrong.
&lt;br&gt;&lt;br&gt;The right thing to do is to first check for a charset specified by a meta tag. If that doesn't exist, then create a CharsetDetector. If there's a charset in the incoming metadata, use that to call setDeclaredEncoding().
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-334%29-HtmlParser-should-use-CharsetDetector-whenever-no-charset-is-specified-via-meta-http-equiv-tag-tp26518128p26518128.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26518104</id>
	<title>[jira] Updated: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents</title>
	<published>2009-11-25T10:40:39Z</published>
	<updated>2009-11-25T10:40:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Ken Krugler updated TIKA-332:
&lt;br&gt;-----------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Description: 
&lt;br&gt;Currently Tika doesn't use the charset info that's optionally present in HTML documents, via the &amp;lt;meta http-equiv=&amp;quot;Content-type&amp;quot; content=&amp;quot;text/html; charset=xxx&amp;quot;&amp;gt; tag.
&lt;br&gt;&lt;br&gt;If the mime-type is detected as being one that's handled by the HtmlParser, then the first 4-8K of text should be converted from bytes to us-ascii, and then scanned using a regex something like:
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(&amp;quot;&amp;lt;meta\\s+http-equiv\\s*=\\s*['\&amp;quot;]\\s*Content-Type['\&amp;quot;]\\s+content\\s*=\\s*['\&amp;quot;][^;]+;\\s*charset\\s*=\\s*([^'\&amp;quot;]+)\&amp;quot;&amp;quot;);
&lt;br&gt;&lt;br&gt;If a charset is detected, it should take precedence over a charset in the HTTP response headers and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.
&lt;br&gt;&lt;br&gt;In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta tag that wound up being different from the detected or HTTP response header charset, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.
&lt;br&gt;&lt;br&gt;Though the other problem is that the HtmlParser code doesn't use the CharsetDetector, which is another reason for lots of incorrect text. I'll file a separate issue about that.
&lt;br&gt;&lt;br&gt;&amp;nbsp; was:
&lt;br&gt;Currently Tika doesn't use the charset info that's optionally present in HTML documents, via the &amp;lt;meta http-equiv=&amp;quot;Content-type&amp;quot; content=&amp;quot;text/html; charset=xxx&amp;quot;&amp;gt; tag.
&lt;br&gt;&lt;br&gt;If the mime-type is detected as being one that's handled by the HtmlParser, then the first 4-8K of text should be converted from bytes to us-ascii, and then scanned using a regex something like:
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(&amp;quot;&amp;lt;meta\\s+http-equiv\\s*=\\s*['\&amp;quot;]\\s*Content-Type['\&amp;quot;]\\s+content\\s*=\\s*['\&amp;quot;][^;]+;\\s*charset\\s*=\\s*([^'\&amp;quot;]+)\&amp;quot;&amp;quot;);
&lt;br&gt;&lt;br&gt;If a charset is detected, it should take precedence over a charset in the HTTP response headers and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.
&lt;br&gt;&lt;br&gt;In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta tag that wound up being different from the detected or HTTP response header charset, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.
&lt;br&gt;&lt;br&gt;I believe one of the reasons why ICU4J doesn't do a good job in detecting the charset for HTML pages is that the first 2K+ of HTML text is often all us-ascii markup, versus real content. I'll file a separate issue about how to improve charset detection for HTML pages.
&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Use http-equiv meta tag charset info when processing HTML documents
&lt;br&gt;&amp;gt; -------------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-332
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-332&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-332&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Ken Krugler
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Critical
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Currently Tika doesn't use the charset info that's optionally present in HTML documents, via the &amp;lt;meta http-equiv=&amp;quot;Content-type&amp;quot; content=&amp;quot;text/html; charset=xxx&amp;quot;&amp;gt; tag.
&lt;br&gt;&amp;gt; If the mime-type is detected as being one that's handled by the HtmlParser, then the first 4-8K of text should be converted from bytes to us-ascii, and then scanned using a regex something like:
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(&amp;quot;&amp;lt;meta\\s+http-equiv\\s*=\\s*['\&amp;quot;]\\s*Content-Type['\&amp;quot;]\\s+content\\s*=\\s*['\&amp;quot;][^;]+;\\s*charset\\s*=\\s*([^'\&amp;quot;]+)\&amp;quot;&amp;quot;);
&lt;br&gt;&amp;gt; If a charset is detected, it should take precedence over a charset in the HTTP response headers and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.
&lt;br&gt;&amp;gt; In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta tag that wound up being different from the detected or HTTP response header charset, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.
&lt;br&gt;&amp;gt; Though the other problem is that the HtmlParser code doesn't use the CharsetDetector, which is another reason for lots of incorrect text. I'll file a separate issue about that.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-332%29-Use-http-equiv-meta-tag-charset-info-when-processing-HTML-documents-tp26517341p26518104.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26518073</id>
	<title>[jira] Commented: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents</title>
	<published>2009-11-25T10:38:39Z</published>
	<updated>2009-11-25T10:38:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;amp;focusedCommentId=12782550#action_12782550&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;amp;focusedCommentId=12782550#action_12782550&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Ken Krugler commented on TIKA-332:
&lt;br&gt;----------------------------------
&lt;br&gt;&lt;br&gt;It turns out the HtmlParser code doesn't even use the CharsetDetector support - this is only being used by the TXTParser, as far as I can tell (and incorrectly at that).
&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Use http-equiv meta tag charset info when processing HTML documents
&lt;br&gt;&amp;gt; -------------------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-332
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-332&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-332&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Ken Krugler
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Critical
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Currently Tika doesn't use the charset info that's optionally present in HTML documents, via the &amp;lt;meta http-equiv=&amp;quot;Content-type&amp;quot; content=&amp;quot;text/html; charset=xxx&amp;quot;&amp;gt; tag.
&lt;br&gt;&amp;gt; If the mime-type is detected as being one that's handled by the HtmlParser, then the first 4-8K of text should be converted from bytes to us-ascii, and then scanned using a regex something like:
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(&amp;quot;&amp;lt;meta\\s+http-equiv\\s*=\\s*['\&amp;quot;]\\s*Content-Type['\&amp;quot;]\\s+content\\s*=\\s*['\&amp;quot;][^;]+;\\s*charset\\s*=\\s*([^'\&amp;quot;]+)\&amp;quot;&amp;quot;);
&lt;br&gt;&amp;gt; If a charset is detected, it should take precedence over a charset in the HTTP response headers and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.
&lt;br&gt;&amp;gt; In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta tag that wound up being different from the detected or HTTP response header charset, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.
&lt;br&gt;&amp;gt; I believe one of the reasons why ICU4J doesn't do a good job in detecting the charset for HTML pages is that the first 2K+ of HTML text is often all us-ascii markup, versus real content. I'll file a separate issue about how to improve charset detection for HTML pages.
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-332%29-Use-http-equiv-meta-tag-charset-info-when-processing-HTML-documents-tp26517341p26518073.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26518074</id>
	<title>[jira] Closed: (TIKA-333) Improve accuracy of charset detection for HTML pages</title>
	<published>2009-11-25T10:38:39Z</published>
	<updated>2009-11-25T10:38:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Ken Krugler closed TIKA-333.
&lt;br&gt;----------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Resolution: Not A Problem
&lt;br&gt;&lt;br&gt;In actually walking the parse code, I see that the real problem is that the HtmlParser code doesn't use the CharsetDetector. If no charset is passed in, then it just calls TagSoup, which by default uses the platform encoding. See [&lt;a href=&quot;http://home.ccil.org/~cowan/XML/tagsoup/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://home.ccil.org/~cowan/XML/tagsoup/&lt;/a&gt;].
&lt;br&gt;&lt;br&gt;So I'll open another issue for the HtmlParser.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Improve accuracy of charset detection for HTML pages
&lt;br&gt;&amp;gt; ----------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-333
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-333&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-333&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Ken Krugler
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Charset detection for HTML pages doesn't work all that well currently, due to the amount of text that's HTML markup at the beginning of the document.
&lt;br&gt;&amp;gt; A simple solution would be to skip over the first 2K (assuming the document is long enough) before passing bytes to ICU4J.
&lt;br&gt;&amp;gt; A more complex solution would be to scan for title and body tags, and pass bytes found in each.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-333%29-Improve-accuracy-of-charset-detection-for-HTML-pages-tp26517413p26518074.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26517413</id>
	<title>[jira] Created: (TIKA-333) Improve accuracy of charset detection for HTML pages</title>
	<published>2009-11-25T09:54:42Z</published>
	<updated>2009-11-25T09:54:42Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Improve accuracy of charset detection for HTML pages
&lt;br&gt;----------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: TIKA-333
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-333&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-333&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Improvement
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 0.5
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Ken Krugler
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Priority: Minor
&lt;br&gt;&lt;br&gt;&lt;br&gt;Charset detection for HTML pages doesn't work all that well currently, due to the amount of text that's HTML markup at the beginning of the document.
&lt;br&gt;&lt;br&gt;A simple solution would be to skip over the first 2K (assuming the document is long enough) before passing bytes to ICU4J.
&lt;br&gt;&lt;br&gt;A more complex solution would be to scan for title and body tags, and pass bytes found in each.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-333%29-Improve-accuracy-of-charset-detection-for-HTML-pages-tp26517413p26517413.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26517341</id>
	<title>[jira] Created: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents</title>
	<published>2009-11-25T09:50:39Z</published>
	<updated>2009-11-25T09:50:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Use http-equiv meta tag charset info when processing HTML documents
&lt;br&gt;-------------------------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: TIKA-332
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-332&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-332&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Improvement
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 0.5
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Ken Krugler
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Priority: Critical
&lt;br&gt;&lt;br&gt;&lt;br&gt;Currently Tika doesn't use the charset info that's optionally present in HTML documents, via the &amp;lt;meta http-equiv=&amp;quot;Content-type&amp;quot; content=&amp;quot;text/html; charset=xxx&amp;quot;&amp;gt; tag.
&lt;br&gt;&lt;br&gt;If the mime-type is detected as being one that's handled by the HtmlParser, then the first 4-8K of text should be converted from bytes to us-ascii, and then scanned using a regex something like:
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(&amp;quot;&amp;lt;meta\\s+http-equiv\\s*=\\s*['\&amp;quot;]\\s*Content-Type['\&amp;quot;]\\s+content\\s*=\\s*['\&amp;quot;][^;]+;\\s*charset\\s*=\\s*([^'\&amp;quot;]+)['\&amp;quot;]&amp;quot;, Pattern.CASE_INSENSITIVE);
&lt;br&gt;&lt;br&gt;If a charset is detected, it should take precedence over a charset in the HTTP response headers, and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.
&lt;br&gt;&lt;br&gt;In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta tag that wound up being different from the detected or HTTP response header charset, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.
&lt;br&gt;&lt;br&gt;I believe one of the reasons why ICU4J doesn't do a good job in detecting the charset for HTML pages is that the first 2K+ of HTML text is often all us-ascii markup, versus real content. I'll file a separate issue about how to improve charset detection for HTML pages.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-332%29-Use-http-equiv-meta-tag-charset-info-when-processing-HTML-documents-tp26517341p26517341.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26501678</id>
	<title>[jira] Commented: (TIKA-331) Windings font recognition in Tika parsing + spacing issue</title>
	<published>2009-11-24T11:13:39Z</published>
	<updated>2009-11-24T11:13:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782100#action_12782100&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782100#action_12782100&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Ken Krugler commented on TIKA-331:
&lt;br&gt;----------------------------------
&lt;br&gt;&lt;br&gt;I believe this is an issue for the PDF parser (PDFBox) that Tika &amp;quot;wraps&amp;quot;.
&lt;br&gt;&lt;br&gt;Please check &lt;a href=&quot;https://issues.apache.org/jira/browse/PDFBOX&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/PDFBOX&lt;/a&gt;&amp;nbsp;to see if this is already filed, and if not, refile it there.
&lt;br&gt;&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Windings font recognition in Tika parsing + spacing issue
&lt;br&gt;&amp;gt; ---------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-331
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-331&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-331&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Wish
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: parser
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.4
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: Windows XP / Java JDK 1.6.0_15
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: MRIT64
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: Parsing_Result1.txt, Parsing_Result2.txt, test1.pdf, test2.pdf
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I have PDF files that include some characters in the Wingdings font.
&lt;br&gt;&amp;gt; The Tika parser replaces them with Unicode characters that have nothing to do with the originals and, in some cases, with alphabetic characters (which is expected given these characters' codes).
&lt;br&gt;&amp;gt; Would it be possible to improve the parsing and replace these characters with more accurate Unicode characters?
&lt;br&gt;&amp;gt; (see &lt;a href=&quot;http://www.alanwood.net/demos/wingdings.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.alanwood.net/demos/wingdings.html&lt;/a&gt;&amp;nbsp;for possible correspondences).
&lt;br&gt;&amp;gt; I will attach example files once this issue has been created (would it be possible to attach files directly when creating issues?)
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-331%29-Windings-font-recognition-in-Tika-parsing-%2B-spacing-issue-tp26501260p26501678.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26501649</id>
	<title>[jira] Commented: (TIKA-331) Windings font recognition in Tika parsing + spacing issue</title>
	<published>2009-11-24T11:11:39Z</published>
	<updated>2009-11-24T11:11:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782097#action_12782097&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12782097#action_12782097&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;MRIT64 commented on TIKA-331:
&lt;br&gt;-----------------------------
&lt;br&gt;&lt;br&gt;Spacing issue
&lt;br&gt;--------------------
&lt;br&gt;&lt;br&gt;Look at lines 10 and 11 in test2.pdf.
&lt;br&gt;Look at lines 11 and 12 in the Tika parsing result (Parsing_Result2.txt):
&lt;br&gt;&lt;br&gt;ðLocalisation des zones de livraison et de stockage
&lt;br&gt;ðLocalisation des zones dangereuses
&lt;br&gt;&lt;br&gt;There is no space between ð and Localisation (ð is Tika's translation of the Wingdings &amp;quot;Rightwards white arrow&amp;quot;).
&lt;br&gt;&lt;br&gt;If you copy and paste lines 10 and 11 of test2.pdf into a Notepad window, you get:
&lt;br&gt;&lt;br&gt;ð Localisation des zones de livraison et de stockage
&lt;br&gt;ð Localisation des zones dangereuses
&lt;br&gt;&lt;br&gt;...with a space between ð and Localisation.
&lt;br&gt;&lt;br&gt;In my case, the space missing from the Tika parsing result causes &amp;quot;ðLocalisation&amp;quot; to be treated as a single word in subsequent processing.
&lt;br&gt;&lt;br&gt;Regards
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Windings font recognition in Tika parsing + spacing issue
&lt;br&gt;&amp;gt; ---------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-331
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-331&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-331&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Wish
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: parser
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.4
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: Windows XP / Java JDK 1.6.0_15
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: MRIT64
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: Parsing_Result1.txt, Parsing_Result2.txt, test1.pdf, test2.pdf
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I have PDF files that include some characters in the Wingdings font.
&lt;br&gt;&amp;gt; The Tika parser replaces them with Unicode characters that have nothing to do with the originals and, in some cases, with alphabetic characters (which is expected given these characters' codes).
&lt;br&gt;&amp;gt; Would it be possible to improve the parsing and replace these characters with more accurate Unicode characters?
&lt;br&gt;&amp;gt; (see &lt;a href=&quot;http://www.alanwood.net/demos/wingdings.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.alanwood.net/demos/wingdings.html&lt;/a&gt;&amp;nbsp;for possible correspondences).
&lt;br&gt;&amp;gt; I will attach example files once this issue has been created (would it be possible to attach files directly when creating issues?)
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-331%29-Windings-font-recognition-in-Tika-parsing-%2B-spacing-issue-tp26501260p26501649.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26501502</id>
	<title>[jira] Updated: (TIKA-331) Windings font recognition in Tika parsing + spacing issue</title>
	<published>2009-11-24T11:00:39Z</published>
	<updated>2009-11-24T11:00:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;MRIT64 updated TIKA-331:
&lt;br&gt;------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: Parsing_Result2.txt
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; test2.pdf
&lt;br&gt;&lt;br&gt;Another example: the same Word source file converted to PDF with another tool, along with the Tika parsing result. The Wingdings characters are translated into different Unicode characters than in the previous version.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Windings font recognition in Tika parsing + spacing issue
&lt;br&gt;&amp;gt; ---------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-331
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-331&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-331&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Wish
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: parser
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.4
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: Windows XP / Java JDK 1.6.0_15
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: MRIT64
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: Parsing_Result1.txt, Parsing_Result2.txt, test1.pdf, test2.pdf
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I have PDF files that include some characters in the Wingdings font.
&lt;br&gt;&amp;gt; The Tika parser replaces them with Unicode characters that have nothing to do with the originals and, in some cases, with alphabetic characters (which is expected given these characters' codes).
&lt;br&gt;&amp;gt; Would it be possible to improve the parsing and replace these characters with more accurate Unicode characters?
&lt;br&gt;&amp;gt; (see &lt;a href=&quot;http://www.alanwood.net/demos/wingdings.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.alanwood.net/demos/wingdings.html&lt;/a&gt;&amp;nbsp;for possible correspondences).
&lt;br&gt;&amp;gt; I will attach example files once this issue has been created (would it be possible to attach files directly when creating issues?)
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-331%29-Windings-font-recognition-in-Tika-parsing-%2B-spacing-issue-tp26501260p26501502.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26501463</id>
	<title>[jira] Updated: (TIKA-331) Windings font recognition in Tika parsing + spacing issue</title>
	<published>2009-11-24T10:58:40Z</published>
	<updated>2009-11-24T10:58:40Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;MRIT64 updated TIKA-331:
&lt;br&gt;------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; Attachment: Parsing_Result1.txt
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; test1.pdf
&lt;br&gt;&lt;br&gt;test1.pdf is a PDF file that includes Wingdings characters. Some are commonly used, others less frequently.
&lt;br&gt;&lt;br&gt;Parsing_Result1.txt is the text file produced by Tika.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Windings font recognition in Tika parsing + spacing issue
&lt;br&gt;&amp;gt; ---------------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-331
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-331&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-331&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Wish
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: parser
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.4
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Environment: Windows XP / Java JDK 1.6.0_15
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: MRIT64
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Attachments: Parsing_Result1.txt, test1.pdf
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; I have PDF files that include some characters in the Wingdings font.
&lt;br&gt;&amp;gt; The Tika parser replaces them with Unicode characters that have nothing to do with the originals and, in some cases, with alphabetic characters (which is expected given these characters' codes).
&lt;br&gt;&amp;gt; Would it be possible to improve the parsing and replace these characters with more accurate Unicode characters?
&lt;br&gt;&amp;gt; (see &lt;a href=&quot;http://www.alanwood.net/demos/wingdings.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.alanwood.net/demos/wingdings.html&lt;/a&gt;&amp;nbsp;for possible correspondences).
&lt;br&gt;&amp;gt; I will attach example files once this issue has been created (would it be possible to attach files directly when creating issues?)
&lt;/div&gt;
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-331%29-Windings-font-recognition-in-Tika-parsing-%2B-spacing-issue-tp26501260p26501463.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26501260</id>
	<title>[jira] Created: (TIKA-331) Windings font recognition in Tika parsing + spacing issue</title>
	<published>2009-11-24T10:42:39Z</published>
	<updated>2009-11-24T10:42:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Windings font recognition in Tika parsing + spacing issue
&lt;br&gt;---------------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: TIKA-331
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-331&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-331&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Wish
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Components: parser
&lt;br&gt;&amp;nbsp; &amp;nbsp; Affects Versions: 0.4
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Environment: Windows XP / Java JDK 1.6.0_15
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: MRIT64
&lt;br&gt;&lt;br&gt;&lt;br&gt;I have PDF files that include some characters in the Wingdings font.
&lt;br&gt;The Tika parser replaces them with Unicode characters that have nothing to do with the originals and, in some cases, with alphabetic characters (which is expected given these characters' codes).
&lt;br&gt;Would it be possible to improve the parsing and replace these characters with more accurate Unicode characters?
&lt;br&gt;(see &lt;a href=&quot;http://www.alanwood.net/demos/wingdings.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.alanwood.net/demos/wingdings.html&lt;/a&gt;&amp;nbsp;for possible correspondences).
&lt;br&gt;&lt;br&gt;I will attach example files once this issue has been created (would it be possible to attach files directly when creating issues?)
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-331%29-Windings-font-recognition-in-Tika-parsing-%2B-spacing-issue-tp26501260p26501260.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26498269</id>
	<title>Re: [ANNOUNCE] Apache Tika 0.5 Released</title>
	<published>2009-11-24T07:45:49Z</published>
	<updated>2009-11-24T07:45:49Z</updated>
	<author>
		<name>Karl Heinz Marbaise</name>
	</author>
	<content type="html">Hi Chris,
&lt;br&gt;&lt;br&gt;About an hour ago it showed the old version; now it shows the 0.5 release with the correct &lt;a href=&quot;http://www.apache.org/dist/lucene/tika/CHANGES-0.5.txt&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dist/lucene/tika/CHANGES-0.5.txt&lt;/a&gt;&amp;nbsp;file.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; &amp;gt; &lt;a href=&quot;http://lucene.apache.org/tika/project-summary.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/project-summary.html&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;gt; shows 0.5-SNAPSHOT instead of 0.5
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Strangely enough, for me this shows 0.4:
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Build Information
&lt;br&gt;&amp;gt; Field Value 
&lt;br&gt;&amp;gt; GroupId org.apache.tika
&lt;br&gt;&amp;gt; ArtifactId tika-site
&lt;br&gt;&amp;gt; Version 0.4 
&lt;br&gt;&amp;gt; Type pom
&lt;/div&gt;Yeah, that is what I wanted to ask about: 0.4? I know that the site is a different module (POM), but shouldn't it be in sync with the rest of the package?
&lt;br&gt;&lt;br&gt;I have also found another point that seems incorrect, or at least a little confusing:
&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://lucene.apache.org/tika/source-repository.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/source-repository.html&lt;/a&gt;&lt;br&gt;&lt;br&gt;The URL for the repository is given as:
&lt;br&gt;&lt;a href=&quot;http://svn.apache.org/repos/asf/maven/pom/tags/apache-4/tika-parent/tika-site&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://svn.apache.org/repos/asf/maven/pom/tags/apache-4/tika-parent/tika-site&lt;/a&gt;&lt;br&gt;(Clicking on the Web Access link gives an error message, as expected.)
&lt;br&gt;&lt;br&gt;Hm...in my opinion it should be more or less like the following:
&lt;br&gt;&lt;a href=&quot;http://svn.apache.org/repos/asf/lucene/tika/tags/0.5/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://svn.apache.org/repos/asf/lucene/tika/tags/0.5/&lt;/a&gt;&lt;br&gt;&lt;br&gt;Kind regards
&lt;br&gt;Karl Heinz Marbaise
&lt;br&gt;-- 
&lt;br&gt;MfG
&lt;br&gt;Karl Heinz Marbaise
&lt;br&gt;-- 
&lt;br&gt;SoftwareEntwicklung Beratung Schulung &amp;nbsp; &amp;nbsp;Tel.: +49 (0) 2405 / 415 893
&lt;br&gt;Dipl.Ing.(FH) Karl Heinz Marbaise &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ICQ#: 135949029
&lt;br&gt;Hauptstrasse 177 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; USt.IdNr: DE191347579
&lt;br&gt;52146 Würselen &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://www.soebes.de&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.soebes.de&lt;/a&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-ANNOUNCE--Apache-Tika-0.5-Released-tp26466425p26498269.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26497494</id>
	<title>Re: [ANNOUNCE] Apache Tika 0.5 Released</title>
	<published>2009-11-24T07:06:00Z</published>
	<updated>2009-11-24T07:06:00Z</updated>
	<author>
		<name>Mattmann, Chris A (388J)</name>
	</author>
	<content type="html">Hi Karl,
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; the given URL (&lt;a href=&quot;http://repo1.maven.org/maven2/org/apache/tika/0.5/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://repo1.maven.org/maven2/org/apache/tika/0.5/&lt;/a&gt;) is not
&lt;br&gt;&amp;gt; OK, because it should be:
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; URL: &lt;a href=&quot;http://repo1.maven.org/maven2/org/apache/tika/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://repo1.maven.org/maven2/org/apache/tika/&lt;/a&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; URL/tika-app/0.5
&lt;br&gt;&amp;gt; URL/tika-parent/0.5
&lt;br&gt;&amp;gt; URL/tika-core/0.5
&lt;br&gt;&amp;gt; URL/tika-parsers/0.5
&lt;br&gt;&amp;gt; 
&lt;/div&gt;&lt;br&gt;Yep, sorry about that. The URL was a typo, pasted from a prior release
&lt;br&gt;announcement (the package structure for Tika has since changed).
&lt;br&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Another short note: at the moment the web site is not
&lt;br&gt;&amp;gt; up to date...
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://lucene.apache.org/tika/download.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/download.html&lt;/a&gt;&lt;br&gt;&amp;gt; shows 0.4 instead....
&lt;br&gt;&lt;br&gt;The download page shows correctly for me:
&lt;br&gt;&amp;nbsp;
&lt;br&gt;Apache Tika 0.5 is now available. See the CHANGES.txt
&lt;br&gt;&amp;lt;&lt;a href=&quot;http://www.apache.org/dist/lucene/tika/CHANGES-0.5.txt&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dist/lucene/tika/CHANGES-0.5.txt&lt;/a&gt;&amp;gt; &amp;nbsp;file for more
&lt;br&gt;information on the list of updates in this initial release.
&lt;br&gt;* apache-tika-0.5-src.zip
&lt;br&gt;&amp;lt;&lt;a href=&quot;http://www.apache.org/dyn/closer.cgi/lucene/tika/apache-tika-0.5-src.zip&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dyn/closer.cgi/lucene/tika/apache-tika-0.5-src.zip&lt;/a&gt;&amp;gt;
&lt;br&gt;(PGP &amp;lt;&lt;a href=&quot;http://www.apache.org/dist/lucene/tika/apache-tika-0.5-src.zip.asc&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dist/lucene/tika/apache-tika-0.5-src.zip.asc&lt;/a&gt;&amp;gt; )
&lt;br&gt;Apache Tika releases are available under the Apache License, Version 2.0
&lt;br&gt;&amp;lt;&lt;a href=&quot;http://www.apache.org/licenses/LICENSE-2.0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/licenses/LICENSE-2.0&lt;/a&gt;&amp;gt; . See the NOTICE.txt file
&lt;br&gt;contained in each release artifact for applicable copyright attribution
&lt;br&gt;notices.
&lt;br&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://lucene.apache.org/tika/project-summary.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/project-summary.html&lt;/a&gt;&lt;br&gt;&amp;gt; shows 0.5-SNAPSHOT instead of 0.5
&lt;br&gt;&lt;br&gt;Strangely enough, for me this shows 0.4:
&lt;br&gt;&lt;br&gt;Build Information
&lt;br&gt;Field Value 
&lt;br&gt;GroupId org.apache.tika
&lt;br&gt;ArtifactId tika-site
&lt;br&gt;Version 0.4 
&lt;br&gt;Type pom
&lt;br&gt;&lt;br&gt;I'm not sure what the problem is on this one -- these pages should be
&lt;br&gt;generated automatically every night by Jukka's site generation script after
&lt;br&gt;there are changes to the site portion of Tika SVN.
&lt;br&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Maybe it took some time to update the Tika website, so I'm a little bit too
&lt;br&gt;&amp;gt; early... ;-)
&lt;br&gt;&lt;br&gt;Nope, you are fine, thanks for the pointers, glad you are looking out!
&lt;br&gt;&lt;br&gt;Cheers,
&lt;br&gt;Chris
&lt;br&gt;&lt;br&gt;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;Chris Mattmann, Ph.D.
&lt;br&gt;Senior Computer Scientist
&lt;br&gt;NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
&lt;br&gt;Office: 171-266B, Mailstop: 171-246
&lt;br&gt;Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26497494&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;Chris.Mattmann@...&lt;/a&gt;
&lt;br&gt;WWW: &amp;nbsp; &lt;a href=&quot;http://sunset.usc.edu/~mattmann/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://sunset.usc.edu/~mattmann/&lt;/a&gt;&lt;br&gt;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;Adjunct Assistant Professor, Computer Science Department
&lt;br&gt;University of Southern California, Los Angeles, CA 90089 USA
&lt;br&gt;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-ANNOUNCE--Apache-Tika-0.5-Released-tp26466425p26497494.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26493842</id>
	<title>Re: [ANNOUNCE] Apache Tika 0.5 Released</title>
	<published>2009-11-24T02:50:20Z</published>
	<updated>2009-11-24T02:50:20Z</updated>
	<author>
		<name>Jukka Zitting</name>
	</author>
	<content type="html">Hi,
&lt;br&gt;&lt;br&gt;On Tue, Nov 24, 2009 at 11:02 AM, Karl Heinz Marbaise &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26493842&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;khmarbaise@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&amp;gt; Another short note: at the moment the website is not up to date...
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://lucene.apache.org/tika/download.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/download.html&lt;/a&gt;&lt;br&gt;&amp;gt; shows 0.4 instead....
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://lucene.apache.org/tika/project-summary.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/project-summary.html&lt;/a&gt;&lt;br&gt;&amp;gt; shows 0.5-SNAPSHOT instead of 0.5
&lt;br&gt;&lt;br&gt;Bugger, this must be a result of the recent problems with the Hudson
&lt;br&gt;server. I guess it was restored from an older backup and thus our site
&lt;br&gt;deployment scripts ended up reverting the site back to a previous
&lt;br&gt;state. :-(
&lt;br&gt;&lt;br&gt;I've just regenerated and redeployed the site.
&lt;br&gt;&lt;br&gt;BR,
&lt;br&gt;&lt;br&gt;Jukka Zitting
&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-ANNOUNCE--Apache-Tika-0.5-Released-tp26466425p26493842.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26493266</id>
	<title>Re: [ANNOUNCE] Apache Tika 0.5 Released</title>
	<published>2009-11-24T02:02:41Z</published>
	<updated>2009-11-24T02:02:41Z</updated>
	<author>
		<name>Karl Heinz Marbaise</name>
	</author>
	<content type="html">Hi there,
&lt;br&gt;&lt;br&gt;&lt;br&gt;The given URL (&lt;a href=&quot;http://repo1.maven.org/maven2/org/apache/tika/0.5/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://repo1.maven.org/maven2/org/apache/tika/0.5/&lt;/a&gt;) is not correct. It should be:
&lt;br&gt;&lt;br&gt;URL: &lt;a href=&quot;http://repo1.maven.org/maven2/org/apache/tika/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://repo1.maven.org/maven2/org/apache/tika/&lt;/a&gt;&lt;br&gt;&lt;br&gt;URL/tika-app/0.5
&lt;br&gt;URL/tika-parent/0.5
&lt;br&gt;URL/tika-core/0.5
&lt;br&gt;URL/tika-parsers/0.5
&lt;br&gt;&lt;br&gt;Another short note: at the moment the website is not up to date...
&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://lucene.apache.org/tika/download.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/download.html&lt;/a&gt;&lt;br&gt;shows 0.4 instead....
&lt;br&gt;&lt;a href=&quot;http://lucene.apache.org/tika/project-summary.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/project-summary.html&lt;/a&gt;&lt;br&gt;shows 0.5-SNAPSHOT instead of 0.5
&lt;br&gt;&lt;br&gt;Maybe it took some time to update the Tika website, so I'm a little bit too early... ;-)
&lt;br&gt;Kind regards
&lt;br&gt;Karl Heinz Marbaise
&lt;br&gt;-- 
&lt;br&gt;SoftwareEntwicklung Beratung Schulung &amp;nbsp; &amp;nbsp;Tel.: +49 (0) 2405 / 415 893
&lt;br&gt;Dipl.Ing.(FH) Karl Heinz Marbaise &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ICQ#: 135949029
&lt;br&gt;Hauptstrasse 177 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; USt.IdNr: DE191347579
&lt;br&gt;52146 Würselen &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://www.soebes.de&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.soebes.de&lt;/a&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-ANNOUNCE--Apache-Tika-0.5-Released-tp26466425p26493266.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26480378</id>
	<title>Re: [ANNOUNCE] Apache Tika 0.5 Released</title>
	<published>2009-11-23T07:39:50Z</published>
	<updated>2009-11-23T07:39:50Z</updated>
	<author>
		<name>Mattmann, Chris A (388J)</name>
	</author>
	<content type="html">Hey Steen,
&lt;br&gt;&lt;br&gt;It's already available on: &lt;a href=&quot;http://repo1.maven.org/maven2/org/apache/tika/0.5/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://repo1.maven.org/maven2/org/apache/tika/0.5/&lt;/a&gt;&amp;nbsp;and should be in the other repo shortly...
&lt;br&gt;&lt;br&gt;Cheers,
&lt;br&gt;Chris
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;On 11/23/09 1:15 AM, &amp;quot;Steen Manniche&amp;quot; &amp;lt;&lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26480378&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;stm@...&lt;/a&gt;&amp;gt; wrote:
&lt;br&gt;&lt;br&gt;On Sun, Nov 22, 2009 at 07:50:47AM -0800, Mattmann, Chris A (388J) wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; (...apologies for the cross posting...)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; The Apache Lucene project is pleased to announce the release of Apache Tika
&lt;br&gt;&amp;gt; 0.5. The release contents have been pushed out to the main Apache release
&lt;br&gt;&amp;gt; site and the m2 ibiblio sync, so the releases should be available as soon as
&lt;br&gt;&amp;gt; the mirrors get the syncs.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
&lt;br&gt;&amp;gt; extracting metadata and structured text content from various documents using
&lt;br&gt;&amp;gt; existing parser libraries.
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Apache Tika 0.5 contains a number of improvements and bug fixes. Details can
&lt;br&gt;&amp;gt; be found in the changes file:
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.apache.org/dist/lucene/tika/CHANGES-0.5.txt&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dist/lucene/tika/CHANGES-0.5.txt&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Apache Tika is available in source form from the following download page:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.apache.org/dyn/closer.cgi/lucene/tika/apache-tika-0.5-src.zip&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dyn/closer.cgi/lucene/tika/apache-tika-0.5-src.zip&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Apache Tika is also available in binary form or for use using Maven 2 from
&lt;br&gt;&amp;gt; the Central Maven Repositories:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://repo1.maven.org/maven2/org/apache/tika/0.5/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://repo1.maven.org/maven2/org/apache/tika/0.5/&lt;/a&gt;&lt;br&gt;&amp;gt; &lt;a href=&quot;http://mirrors.ibiblio.org/pub/mirrors/maven2/org/apache/tika/0.5/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mirrors.ibiblio.org/pub/mirrors/maven2/org/apache/tika/0.5/&lt;/a&gt;&lt;/div&gt;&lt;br&gt;The above link and any of the source access links on the page
&lt;br&gt;&lt;a href=&quot;http://lucene.apache.org/tika/source-repository.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/source-repository.html&lt;/a&gt;&amp;nbsp;are broken at the
&lt;br&gt;moment. Where should I point my wget at? Or should I just wait a
&lt;br&gt;while?
&lt;br&gt;&lt;br&gt;Best regards and thanks for the effort,
&lt;br&gt;Steen Manniche
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; In the initial 48 hours, the release may not be available on all mirrors.
&lt;br&gt;&amp;gt; When downloading from a mirror site, please remember to verify the downloads
&lt;br&gt;&amp;gt; using signatures found on the Apache site:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.apache.org/dist/lucene/tika/KEYS-0.5.txt&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dist/lucene/tika/KEYS-0.5.txt&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; For more information on Apache Tika, visit the project home page:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://lucene.apache.org/tika&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika&lt;/a&gt;&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; -- Chris Mattmann (on behalf of the Apache Lucene community)
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;&amp;gt; Chris Mattmann, Ph.D.
&lt;br&gt;&amp;gt; Senior Computer Scientist
&lt;br&gt;&amp;gt; NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
&lt;br&gt;&amp;gt; Office: 171-266B, Mailstop: 171-246
&lt;br&gt;&amp;gt; Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26480378&amp;i=1&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;Chris.Mattmann@...&lt;/a&gt;
&lt;br&gt;&amp;gt; WWW: &amp;nbsp; &lt;a href=&quot;http://sunset.usc.edu/~mattmann/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://sunset.usc.edu/~mattmann/&lt;/a&gt;&lt;br&gt;&amp;gt; ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;&amp;gt; Adjunct Assistant Professor, Computer Science Department
&lt;br&gt;&amp;gt; University of Southern California, Los Angeles, CA 90089 USA
&lt;br&gt;&amp;gt; ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;Chris Mattmann, Ph.D.
&lt;br&gt;Senior Computer Scientist
&lt;br&gt;NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
&lt;br&gt;Office: 171-266B, Mailstop: 171-246
&lt;br&gt;Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26480378&amp;i=2&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;Chris.Mattmann@...&lt;/a&gt;
&lt;br&gt;WWW: &amp;nbsp; &lt;a href=&quot;http://sunset.usc.edu/~mattmann/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://sunset.usc.edu/~mattmann/&lt;/a&gt;&lt;br&gt;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;Adjunct Assistant Professor, Computer Science Department
&lt;br&gt;University of Southern California, Los Angeles, CA 90089 USA
&lt;br&gt;++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-ANNOUNCE--Apache-Tika-0.5-Released-tp26466425p26480378.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26476338</id>
	<title>[jira] Resolved: (TIKA-330) Better HWP (Hangul Word Processor) detection pattern</title>
	<published>2009-11-23T03:29:39Z</published>
	<updated>2009-11-23T03:29:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;[ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&lt;/a&gt;&amp;nbsp;]
&lt;br&gt;&lt;br&gt;Jukka Zitting resolved TIKA-330.
&lt;br&gt;--------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Resolution: Fixed
&lt;br&gt;&amp;nbsp; &amp;nbsp; Fix Version/s: 0.6
&lt;br&gt;&lt;br&gt;Fixed in revision 883306.
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Better HWP (Hangul Word Processor) detection pattern
&lt;br&gt;&amp;gt; ----------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-330
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-330&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-330&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Improvement
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: mime
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Jukka Zitting
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Jukka Zitting
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 0.6
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; The current magic byte pattern we have for the HWP (Hangul Word Processor, application/x-hwp) file format also matches our test-outlook.msg test file. I looked for a better detection pattern and found one in OpenOffice.org.
&lt;br&gt;&amp;gt; The hwpfilter/source/hwpfile.cpp file suggests that all HWP files start with the signature string &amp;quot;HWP Document File V&amp;quot;, so I'll change the detection pattern accordingly.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-330%29-Better-HWP-%28Hangul-Word-Processor%29-detection-pattern-tp26476309p26476338.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26476309</id>
	<title>[jira] Created: (TIKA-330) Better HWP (Hangul Word Processor) detection pattern</title>
	<published>2009-11-23T03:27:39Z</published>
	<updated>2009-11-23T03:27:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">Better HWP (Hangul Word Processor) detection pattern
&lt;br&gt;----------------------------------------------------
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Key: TIKA-330
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-330&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-330&lt;/a&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Project: Tika
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Issue Type: Improvement
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Components: mime
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reporter: Jukka Zitting
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Assignee: Jukka Zitting
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Priority: Minor
&lt;br&gt;&lt;br&gt;&lt;br&gt;The current magic byte pattern we have for the HWP (Hangul Word Processor, application/x-hwp) file format also matches our test-outlook.msg test file. I looked for a better detection pattern and found one in OpenOffice.org.
&lt;br&gt;&lt;br&gt;The hwpfilter/source/hwpfile.cpp file suggests that all HWP files start with the signature string &amp;quot;HWP Document File V&amp;quot;, so I'll change the detection pattern accordingly.
&lt;br&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-330%29-Better-HWP-%28Hangul-Word-Processor%29-detection-pattern-tp26476309p26476309.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26474715</id>
	<title>Re: [ANNOUNCE] Apache Tika 0.5 Released</title>
	<published>2009-11-23T01:15:39Z</published>
	<updated>2009-11-23T01:15:39Z</updated>
	<author>
		<name>Steen Manniche</name>
	</author>
	<content type="html">On Sun, Nov 22, 2009 at 07:50:47AM -0800, Mattmann, Chris A (388J) wrote:
&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; (...apologies for the cross posting...)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; The Apache Lucene project is pleased to announce the release of Apache Tika
&lt;br&gt;&amp;gt; 0.5. The release contents have been pushed out to the main Apache release
&lt;br&gt;&amp;gt; site and the m2 ibiblio sync, so the releases should be available as soon as
&lt;br&gt;&amp;gt; the mirrors get the syncs.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
&lt;br&gt;&amp;gt; extracting metadata and structured text content from various documents using
&lt;br&gt;&amp;gt; existing parser libraries.
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Apache Tika 0.5 contains a number of improvements and bug fixes. Details can
&lt;br&gt;&amp;gt; be found in the changes file:
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.apache.org/dist/lucene/tika/CHANGES-0.5.txt&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dist/lucene/tika/CHANGES-0.5.txt&lt;/a&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Apache Tika is available in source form from the following download page:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.apache.org/dyn/closer.cgi/lucene/tika/apache-tika-0.5-src.zip&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dyn/closer.cgi/lucene/tika/apache-tika-0.5-src.zip&lt;/a&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; Apache Tika is also available in binary form or for use using Maven 2 from
&lt;br&gt;&amp;gt; the Central Maven Repositories:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://repo1.maven.org/maven2/org/apache/tika/0.5/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://repo1.maven.org/maven2/org/apache/tika/0.5/&lt;/a&gt;&lt;br&gt;&amp;gt; &lt;a href=&quot;http://mirrors.ibiblio.org/pub/mirrors/maven2/org/apache/tika/0.5/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://mirrors.ibiblio.org/pub/mirrors/maven2/org/apache/tika/0.5/&lt;/a&gt;&lt;/div&gt;&lt;br&gt;The above link and any of the source access links on the page
&lt;br&gt;&lt;a href=&quot;http://lucene.apache.org/tika/source-repository.html&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika/source-repository.html&lt;/a&gt;&amp;nbsp;are broken at the
&lt;br&gt;moment. Where should I point my wget at? Or should I just wait a
&lt;br&gt;while?
&lt;br&gt;&lt;br&gt;Best regards and thanks for the effort,
&lt;br&gt;Steen Manniche
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; In the initial 48 hours, the release may not be available on all mirrors.
&lt;br&gt;&amp;gt; When downloading from a mirror site, please remember to verify the downloads
&lt;br&gt;&amp;gt; using signatures found on the Apache site:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://www.apache.org/dist/lucene/tika/KEYS-0.5.txt&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.apache.org/dist/lucene/tika/KEYS-0.5.txt&lt;/a&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; For more information on Apache Tika, visit the project home page:
&lt;br&gt;&amp;gt; &lt;a href=&quot;http://lucene.apache.org/tika&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://lucene.apache.org/tika&lt;/a&gt;&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; -- Chris Mattmann (on behalf of the Apache Lucene community)
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; 
&lt;br&gt;&amp;gt; ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;&amp;gt; Chris Mattmann, Ph.D.
&lt;br&gt;&amp;gt; Senior Computer Scientist
&lt;br&gt;&amp;gt; NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
&lt;br&gt;&amp;gt; Office: 171-266B, Mailstop: 171-246
&lt;br&gt;&amp;gt; Email: &lt;a href=&quot;http://old.nabble.com/user/SendEmail.jtp?type=post&amp;post=26474715&amp;i=0&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;Chris.Mattmann@...&lt;/a&gt;
&lt;br&gt;&amp;gt; WWW: &amp;nbsp; &lt;a href=&quot;http://sunset.usc.edu/~mattmann/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://sunset.usc.edu/~mattmann/&lt;/a&gt;&lt;br&gt;&amp;gt; ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;br&gt;&amp;gt; Adjunct Assistant Professor, Computer Science Department
&lt;br&gt;&amp;gt; University of Southern California, Los Angeles, CA 90089 USA
&lt;br&gt;&amp;gt; ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
&lt;/div&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-ANNOUNCE--Apache-Tika-0.5-Released-tp26466425p26474715.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26473079</id>
	<title>[jira] Commented: (TIKA-309) Mime type application/rdf+xml not correctly detected</title>
	<published>2009-11-22T21:38:39Z</published>
	<updated>2009-11-22T21:38:39Z</updated>
	<author>
		<name>JIRA jira@apache.org</name>
	</author>
	<content type="html">&lt;br&gt;&amp;nbsp; &amp;nbsp; [ &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12781311#action_12781311&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=12781311#action_12781311&lt;/a&gt;&amp;nbsp;] 
&lt;br&gt;&lt;br&gt;Yuan-Fang Li commented on TIKA-309:
&lt;br&gt;-----------------------------------
&lt;br&gt;&lt;br&gt;Hi Chris, Jukka,
&lt;br&gt;&lt;br&gt;Yes, the Tika tests are passing for me. However, my test for one of the ontologies (&amp;quot;&lt;a href=&quot;http://www.w3.org/2002/07/owl#&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/2002/07/owl#&lt;/a&gt;&amp;quot;) is still failing, and here is why. 
&lt;br&gt;&lt;br&gt;In the test tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java, the method testUrl(String expected, String url, String file) actually tests the content of the file named &amp;quot;file&amp;quot;, with the URL serving only as a hint for detection. My test, however, opens an input stream on the actual URL and uses that to detect the MIME type. For the above URL, Tika tests against the file named &amp;quot;test-difficult-rdf2.xml&amp;quot;. The only difference I can see between this file and the actual content at the URL is the one line at the top: &amp;quot;&amp;lt;?xml version='1.0' encoding='ISO-8859-1'?&amp;gt;&amp;quot;. This line is present in the Tika test file but not in the content served at the URL.
&lt;br&gt;&lt;br&gt;So, if you remove or comment out that line from &amp;quot;test-difficult-rdf2.xml&amp;quot; and run the test with the Maven command mvn -Dtest=MimeDetectionTest test, it will fail. Alternatively, you could use the following test case to test against the real URL.
&lt;br&gt;&lt;br&gt;&amp;nbsp; &amp;nbsp; @Test
&lt;br&gt;&amp;nbsp; &amp;nbsp; public void testRDFStreamMimeType() throws IOException {
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL url = new URL(&amp;quot;&lt;a href=&quot;http://www.w3.org/2002/07/owl#&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/2002/07/owl#&lt;/a&gt;&amp;quot;);
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; final InputStream stream = new BufferedInputStream(url.openStream());
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; try {
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; MimeTypes mimeTypes = TikaConfig.getDefaultConfig().getMimeRepository();
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Metadata metadata = new Metadata();
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; String mime = mimeTypes.detect(stream, metadata).toString();
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; assertEquals(&amp;quot;application/rdf+xml&amp;quot;, mime);
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; } finally {
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; stream.close();
&lt;br&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; }
&lt;br&gt;&amp;nbsp; &amp;nbsp; }
&lt;br&gt;&lt;br&gt;Cheers
&lt;br&gt;Yuan-Fang
&lt;br&gt;&lt;div class='shrinkable-quote'&gt;&lt;br&gt;&amp;gt; Mime type application/rdf+xml not correctly detected
&lt;br&gt;&amp;gt; ----------------------------------------------------
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Key: TIKA-309
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; URL: &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-309&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/TIKA-309&lt;/a&gt;&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Project: Tika
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Issue Type: Bug
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Components: mime
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp;Affects Versions: 0.5
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Reporter: Yuan-Fang Li
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Assignee: Chris A. Mattmann
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Priority: Minor
&lt;br&gt;&amp;gt; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Fix For: 0.5
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt;
&lt;br&gt;&amp;gt; Mime type detector using AutoDetectParser and Metadata returns &amp;quot;application/xml&amp;quot; for the URL &lt;a href=&quot;http://www.w3.org/2002/07/owl#&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/2002/07/owl#&lt;/a&gt;, where it should be &amp;quot;application/rdf+xml&amp;quot;. The correct mime type is also suggested here: &lt;a href=&quot;http://www.w3.org/TR/owl-ref/#MIMEType&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/TR/owl-ref/#MIMEType&lt;/a&gt;.
&lt;br&gt;&amp;gt; P.S., Tika was downloaded from svn and built with Maven last week.
&lt;/div&gt;&lt;br&gt;-- 
&lt;br&gt;This message is automatically generated by JIRA.
&lt;br&gt;-
&lt;br&gt;You can reply to this email to add a comment to the issue online.
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/-jira--Created%3A-%28TIKA-309%29-Mime-type-application-rdf%2Bxml-not-correctly-detected-tp25867121p26473079.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26471857</id>
	<title>Hudson build is back to normal: Tika-trunk #231</title>
	<published>2009-11-22T18:16:57Z</published>
	<updated>2009-11-22T18:16:57Z</updated>
	<author>
		<name>Apache Hudson Server</name>
	</author>
	<content type="html">See &amp;lt;&lt;a href=&quot;http://hudson.zones.apache.org/hudson/job/Tika-trunk/231/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://hudson.zones.apache.org/hudson/job/Tika-trunk/231/&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Build-failed-in-Hudson%3A-Tika-trunk--226-tp26467087p26471857.html" />
</entry>

<entry>
	<id>tag:old.nabble.com,2006:post-26471055</id>
	<title>Build failed in Hudson: Tika-trunk #230</title>
	<published>2009-11-22T16:17:10Z</published>
	<updated>2009-11-22T16:17:10Z</updated>
	<author>
		<name>Apache Hudson Server</name>
	</author>
	<content type="html">See &amp;lt;&lt;a href=&quot;http://hudson.zones.apache.org/hudson/job/Tika-trunk/230/&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://hudson.zones.apache.org/hudson/job/Tika-trunk/230/&lt;/a&gt;&amp;gt;
&lt;br&gt;&lt;br&gt;------------------------------------------
&lt;br&gt;Started by user jukka
&lt;br&gt;Building remotely on minerva.apache.org (Ubuntu)
&lt;br&gt;Updating &lt;a href=&quot;http://svn.apache.org/repos/asf/lucene/tika/trunk&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://svn.apache.org/repos/asf/lucene/tika/trunk&lt;/a&gt;&lt;br&gt;At revision 883196
&lt;br&gt;no change for &lt;a href=&quot;http://svn.apache.org/repos/asf/lucene/tika/trunk&quot; target=&quot;_top&quot; rel=&quot;nofollow&quot;&gt;http://svn.apache.org/repos/asf/lucene/tika/trunk&lt;/a&gt;&amp;nbsp;since the previous build
&lt;br&gt;Parsing POMs
&lt;br&gt;Exception in thread &amp;quot;main&amp;quot; java.lang.NoClassDefFoundError: hudson/maven/agent/Main
&lt;br&gt;ERROR: Failed to launch Maven. Exit code = 1
&lt;br&gt;&lt;br&gt;</content>
	<link rel="alternate" type="text/html" href="http://old.nabble.com/Build-failed-in-Hudson%3A-Tika-trunk--226-tp26467087p26471055.html" />
</entry>

</feed>
