How to Parse Rss Feed URL


How to Parse Rss Feed URL

by Saurabh Suman

Hi,
I want to parse a feed URL using Nutch. I tried to use the org.apache.nutch.parse.feed.FeedParser class. Its input is XML, so I gave it the link below.
http://timesofindia.indiatimes.com/rssfeedsdefault.cms
This URL contains all the RSS feeds for the newspaper. When I parsed it with the Rome feed parser, it gave me all the permalinks, titles, dates, etc. But the Nutch parser does not give anything.
How can I get all the permalinks, titles, and dates from this URL?

Re: How to Parse Rss Feed URL

by Doğacan Güney-3

On Wed, Jul 8, 2009 at 09:24, Saurabh Suman <saurabhsuman289@...> wrote:

> Hi,
> I want to parse a feed URL using Nutch. I tried to use the
> org.apache.nutch.parse.feed.FeedParser class. Its input is XML, so I gave
> it the link below.
> http://timesofindia.indiatimes.com/rssfeedsdefault.cms
> This URL contains all the RSS feeds for the newspaper. When I parsed it
> with the Rome feed parser, it gave me all the permalinks, titles, dates,
> etc. But the Nutch parser does not give anything.
> How can I get all the permalinks, titles, and dates from this URL?


In conf/parse-plugins.xml:

        <mimeType name="text/xml">
                <plugin id="parse-html" />
                <plugin id="parse-rss" />
                <plugin id="feed" />
        </mimeType>

The URL you mentioned has a text/xml content type. Since you probably
also have parse-html defined in your conf file, parse-html tries to parse
the feed before the "feed" plugin gets a chance.
Try moving the "feed" plugin higher, like so:

        <mimeType name="text/xml">
                <plugin id="feed" />
                <plugin id="parse-html" />
                <plugin id="parse-rss" />
        </mimeType>
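
To double-check which plugin is listed first for a content type, the order can be read back from the config. The following is only a rough sketch using the JDK's DOM parser, not anything from Nutch itself: the class name PluginOrder, the method pluginOrder, and the inline XML string (including the <parse-plugins> wrapper element) are assumptions made for illustration. It just lists the plugin ids for a mimeType in document order.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Nutch.
class PluginOrder {

    // Returns the plugin ids listed under the given mimeType, in document order.
    static List<String> pluginOrder(String xml, String mimeType) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<String> ids = new ArrayList<>();
        NodeList mimeTypes = doc.getElementsByTagName("mimeType");
        for (int i = 0; i < mimeTypes.getLength(); i++) {
            Element mt = (Element) mimeTypes.item(i);
            if (!mimeType.equals(mt.getAttribute("name"))) {
                continue; // some other content type
            }
            NodeList plugins = mt.getElementsByTagName("plugin");
            for (int j = 0; j < plugins.getLength(); j++) {
                ids.add(((Element) plugins.item(j)).getAttribute("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Assumed fragment mirroring the reordered mapping above.
        String xml =
                "<parse-plugins>"
              + "<mimeType name=\"text/xml\">"
              + "<plugin id=\"feed\" />"
              + "<plugin id=\"parse-html\" />"
              + "<plugin id=\"parse-rss\" />"
              + "</mimeType>"
              + "</parse-plugins>";

        // Prints [feed, parse-html, parse-rss]
        System.out.println(pluginOrder(xml, "text/xml"));
    }
}
```

If "feed" is not the first id printed for text/xml, another plugin is listed ahead of it for that content type.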



--
Doğacan Güney

Re: How to Parse Rss Feed URL

by Saurabh Suman

When I use org.apache.nutch.parse.rss.RSSParser, it works fine. Now I am getting the URLs. Next I want to get the content. How do I do this? Do I need to add all the URLs to the crawldb and then run the crawl command, or is there another way?
