Indexing rich documents from websites using ExtractingRequestHandler


by ahammad

Hello,

I can index rich documents, such as PDFs, that are on the filesystem. Can we use ExtractingRequestHandler to index files that are accessible on a website?

For example, there is a file that can be reached like so: http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf

How would I go about indexing that file? I tried using the following combinations. I will put the errors in brackets:

stream.file=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The filename, directory name, or volume label syntax is incorrect)
stream.file=www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot find the path specified)
stream.file=//www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The format of the specified network name is invalid)
stream.file=sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot find the path specified)
stream.file=//sub.myDomain.com/files/pdfdocs/testfile.pdf (The network path was not found)

I sort of understand why I get those errors. What are the alternative methods of doing this? I am guessing that the stream.file attribute doesn't support web addresses. Is there another attribute that does?

Re: Indexing rich documents from websites using ExtractingRequestHandler

by Glen Newton

Try putting all the PDF URLs into a file, download them with something like
'wget', then index locally.
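
A minimal sketch of the download-then-index approach, using Python's standard library in place of 'wget'. The Solr address, the /update/extract handler path, the "id" unique key field, the application/pdf content type, and all function names here are assumptions for illustration, not something confirmed in this thread.

```python
import os
import urllib.request
from urllib.parse import urlencode, urlsplit

# Assumed location of the extract handler; adjust to your own setup.
SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"


def local_name(url):
    """Derive a local file name from the last path segment of the URL."""
    return os.path.basename(urlsplit(url).path)


def download(url, dest_dir):
    """Fetch one URL into dest_dir and return the local path."""
    path = os.path.join(dest_dir, local_name(url))
    urllib.request.urlretrieve(url, path)
    return path


def extract_request(path, doc_id):
    """Build a POST that sends the file bytes to the extract handler.

    Assumes the schema's unique key field is called "id".
    """
    query = urlencode({"literal.id": doc_id, "commit": "true"})
    with open(path, "rb") as fh:
        body = fh.read()
    return urllib.request.Request(
        f"{SOLR_EXTRACT}?{query}",
        data=body,
        headers={"Content-Type": "application/pdf"},
    )


def index_urls(list_file, dest_dir):
    """Download every URL listed in list_file (one per line) and index it."""
    with open(list_file) as fh:
        urls = [line.strip() for line in fh if line.strip()]
    for url in urls:
        path = download(url, dest_dir)
        # The source URL doubles as the document id in this sketch.
        with urllib.request.urlopen(extract_request(path, url)) as resp:
            print(url, resp.status)
```

Calling index_urls("urls.txt", "downloads") would process a hypothetical urls.txt into an existing downloads directory. Committing after every document is simple but slow; for many files a single commit at the end is likely the better choice.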

Glen Newton
http://zzzoot.blogspot.com/

In reply to the message from ahammad <ahmed.hammad@...> of 2009/7/8.

Re: Indexing rich documents from websites using ExtractingRequestHandler

by Jay Hill

I haven't tried this myself, but it sounds like what you're looking for is
enabling remote streaming:
http://wiki.apache.org/solr/ContentStream#head-7179a128a2fdd5dde6b1af553ed41735402aadbf

As the link above shows, you should be able to enable remote streaming like
this:

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />

and then something like this might work:

stream.url=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf

So you use stream.url instead of stream.file.
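
As a rough illustration of the stream.url suggestion, the sketch below builds the full request URL for the extract handler. The Solr base address, the /update/extract path, the literal.id and commit parameters, and the function name are assumptions for the example, and remote streaming must already be enabled for the request to work.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_url, doc_id):
    """Return an extract-handler URL that asks Solr to fetch doc_url itself.

    Assumes the handler is mounted at /update/extract and that the
    schema's unique key field is called "id".
    """
    params = {
        "stream.url": doc_url,   # remote document for Solr to stream
        "literal.id": doc_id,    # value for the assumed "id" field
        "commit": "true",        # make the document searchable right away
    }
    # urlencode percent-encodes the nested URL so it survives as one parameter.
    return f"{solr_base.rstrip('/')}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(build_extract_url(
        "http://localhost:8983/solr",
        "http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf",
        "testfile",
    ))
```

Requesting the resulting URL (for example with a browser or urllib.request.urlopen) should make Solr download and index the PDF, provided the Solr server itself can reach that address.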

Hope this helps.

-Jay


In reply to the message from ahammad <ahmed.hammad@...> of Wed, Jul 8, 2009 at 7:40 AM.