A .chm file can be any of the following:
- Chameleon source font
- Microsoft compiled HTML help file
- ChemDraw chemical structure
From the URLs in your email, I'm going to guess these are compiled
HTML help files.
1. You should file a Jira issue to improve the Nutch mime-type
detection, since it's currently flagging the .chm files as ChemDraw
chemical structure documents.
2. I think you'd need to create your own Nutch plugin to parse .chm
files. Supposedly there are open source tools available for reading
these files, but it's likely they are written in C#, VB, or some other
Microsoft language.
See http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help for details.
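To illustrate the kind of check that better mime-type detection could rely on: as far as I know, compiled HTML help files begin with the four ASCII bytes "ITSF", so looking at the first bytes of the content separates them from other .chm formats much more reliably than the extension does. Here's a rough sketch in plain Java. The ChmSniffer class and its looksLikeChm methods are names I made up for illustration; they are not part of Nutch or any existing library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Nutch.
final class ChmSniffer {
    // Assumption: compiled HTML help files start with the ASCII signature "ITSF".
    private static final byte[] ITSF = {'I', 'T', 'S', 'F'};

    private ChmSniffer() {}

    // Returns true if the given bytes start with the ITSF signature.
    static boolean looksLikeChm(byte[] header) {
        if (header == null || header.length < ITSF.length) {
            return false;
        }
        for (int i = 0; i < ITSF.length; i++) {
            if (header[i] != ITSF[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads only the first four bytes of the file and checks the signature.
    static boolean looksLikeChm(Path file) throws IOException {
        byte[] header = new byte[ITSF.length];
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes(byte[], int, int) is available from Java 9 onward.
            int read = in.readNBytes(header, 0, header.length);
            return read == header.length && looksLikeChm(header);
        }
    }
}
```

A parse plugin could call something like ChmSniffer.looksLikeChm(contentBytes) on the fetched bytes before trying to treat the content as compiled HTML help, and skip anything that doesn't match.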
-- Ken
--
Ken Krugler
<http://ken-blog.krugler.org>
+1 530-265-2225