
Re: Problems when index .chm files

by Ken Krugler


>Example1:
>
>Error parsing:
>http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm: 
>org.apache.nutch.parse.ParseException: parser not found for
>contentType=chemical/x-chemdraw
>url=http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
>Example2:
>
>Error parsing:
>http://localhost/mydocs/Programacion/Estructura%20de%20Datos/Uml/Project%20Management%20Methodologies%20(2004)%20(Wiley).chm: 
>org.apache.nutch.parse.ParseException: parser not found for
>contentType=chemical/x-chemdraw
>url=http://localhost/mydocs/Programacion/Estructura%20de%20Datos/Uml/Project%20Management%20Methodologies%20(2004)%20(Wiley).chm
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
>
>and other similar errors...
>
>Any solution?

A .chm file can be any of the following:

- Chameleon source font
- Microsoft compiled HTML help file
- Chemdraw chemical structure

From the URLs in your email, I'm going to guess these are compiled
HTML help files.

1. You should file a Jira issue to improve the Nutch mime-type
detection, as it's currently flagging your .chm files as Chemdraw
chemical structure documents (contentType=chemical/x-chemdraw), which
is why no parser is found.

2. I think you'd need to create your own Nutch plugin to parse chm
files. Supposedly there are open source tools available for reading
these files, but it's likely they're written in C#, VB, or some other
Microsoft language.

See http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help for details.
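
On the mime-type side (point 1), one possible approach is to look at
the file content rather than the extension. Compiled HTML help files
start with the four ASCII bytes "ITSF", so a check along these lines
could tell them apart from Chemdraw documents. This is only a rough
sketch, not how Nutch does detection: the class and method names are
made up for illustration, and using "application/vnd.ms-htmlhelp" as
the resulting type is my assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch.
final class ChmSniffer {

    // Compiled HTML help files begin with the ASCII signature "ITSF".
    private static final byte[] ITSF_MAGIC =
            "ITSF".getBytes(StandardCharsets.US_ASCII);

    // Assumed mime type for compiled HTML help; adjust to whatever
    // your parser plugin ends up being registered for.
    static final String CHM_MIME_TYPE = "application/vnd.ms-htmlhelp";

    private ChmSniffer() {
    }

    /** Returns true if the content starts with the "ITSF" signature. */
    static boolean looksLikeCompiledHtmlHelp(byte[] content) {
        if (content == null || content.length < ITSF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ITSF_MAGIC.length; i++) {
            if (content[i] != ITSF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Overrides the extension-based guess only when the content
     * clearly looks like compiled HTML help; otherwise the original
     * guess is returned unchanged.
     */
    static String guessMimeType(byte[] content, String extensionBasedType) {
        return looksLikeCompiledHtmlHelp(content)
                ? CHM_MIME_TYPE
                : extensionBasedType;
    }
}
```

With something like this, the first few bytes of a fetched .chm file
would decide the type, and a file that really is a Chemdraw document
would keep chemical/x-chemdraw.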

-- Ken
--
Ken Krugler
<http://ken-blog.krugler.org>
+1 530-265-2225
