Indexing rich documents from websites using ExtractingRequestHandler

Hello,

I can index rich documents, such as PDFs, that are on the filesystem. Can we use ExtractingRequestHandler to index files that are accessible on a website?

For example, there is a file that can be reached like so:
http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf

How would I go about indexing that file? I tried the following combinations; the error for each is in brackets:

stream.file=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The filename, directory name, or volume label syntax is incorrect)
stream.file=www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot find the path specified)
stream.file=//www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The format of the specified network name is invalid)
stream.file=sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot find the path specified)
stream.file=//sub.myDomain.com/files/pdfdocs/testfile.pdf (The network path was not found)

I sort of understand why I get those errors. What are the alternative methods of doing this? I am guessing that the stream.file attribute doesn't support web addresses. Is there another attribute that does?
Re: Indexing rich documents from websites using ExtractingRequestHandler

Try putting all the PDF URLs into a file, download them with something like 'wget', then index the files locally.

Glen Newton
http://zzzoot.blogspot.com/

2009/7/8 ahammad <ahmed.hammad@...>:
> [snip]
> --
> View this message in context: http://www.nabble.com/Indexing--rich-documents-from-websites-using-ExtractingRequestHandler-tp24392809p24392809.html
> Sent from the Solr - User mailing list archive at Nabble.com.
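
For what it's worth, the download-then-index approach could be sketched in Python roughly as follows. This is only an illustration, not a tested recipe: the Solr endpoint http://localhost:8983/solr/update/extract, the download directory, the helper names, and the use of literal.id and commit=true are all assumptions that would need adjusting to the actual setup.

```python
import os
import urllib.parse
import urllib.request

# Assumption: ExtractingRequestHandler is mapped at /update/extract on a local Solr.
SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"


def local_path_for(url, download_dir):
    """Derive the local file path from the last segment of the URL path."""
    name = os.path.basename(urllib.parse.urlparse(url).path)
    return os.path.join(download_dir, name)


def download(url, download_dir):
    """Fetch the remote file (the 'wget' step) and return where it was saved."""
    path = local_path_for(url, download_dir)
    urllib.request.urlretrieve(url, path)
    return path


def extract_request_for_file(path, doc_id):
    """Build the request URL that asks Solr to index a local file via stream.file."""
    params = {
        "stream.file": path,    # local path, as in the original question
        "literal.id": doc_id,   # assumption: the schema's unique key field is "id"
        "commit": "true",
    }
    return SOLR_EXTRACT + "?" + urllib.parse.urlencode(params)
```

In use, one would call download() for each URL in the list, then issue a GET for extract_request_for_file(saved_path, some_id), for example with urllib.request.urlopen.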
Re: Indexing rich documents from websites using ExtractingRequestHandler

I haven't tried this myself, but it sounds like what you're looking for is enabling remote streaming:
http://wiki.apache.org/solr/ContentStream#head-7179a128a2fdd5dde6b1af553ed41735402aadbf

As the link above shows, you should be able to enable remote streaming like this:

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />

and then something like this might work:

stream.url=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf

So you use stream.url instead of stream.file.

Hope this helps.

-Jay

On Wed, Jul 8, 2009 at 7:40 AM, ahammad <ahmed.hammad@...> wrote:
> [snip]
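
As a rough illustration of the stream.url suggestion, the request could be assembled like this in Python. It is a sketch under assumptions, not something confirmed in this thread: the endpoint http://localhost:8983/solr/update/extract, the helper name, and the literal.id and commit=true parameters are placeholders, and remote streaming has to be enabled in solrconfig.xml first. The point it shows is that the remote address must be URL-encoded when it is passed as a parameter value.

```python
import urllib.parse


def extract_request_for_url(solr_extract, remote_url, doc_id):
    """Build a request URL that asks Solr to fetch and index a remote document."""
    params = {
        "stream.url": remote_url,  # Solr itself fetches this address
        "literal.id": doc_id,      # assumption: the schema's unique key field is "id"
        "commit": "true",
    }
    # urlencode percent-encodes the ':' and '/' characters inside remote_url.
    return solr_extract + "?" + urllib.parse.urlencode(params)


request_url = extract_request_for_url(
    "http://localhost:8983/solr/update/extract",  # assumed handler location
    "http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf",
    "testfile",
)
print(request_url)
```

Sending a GET to the printed URL (from a browser, curl, or urllib.request.urlopen) would then trigger the extraction, provided the Solr server can reach the website.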