using TextExtractor

using TextExtractor

by Bart Van den Abeele

Hi,

I would like to use the TextExtractor classes to extract text from several files, concatenate that text and then put it into a part of a document.  I have a document that links to several other documents, and those documents all have a part type that contains a file.  I want to find the linking document by searching on the text that is in those files.  That is why I think I have to concatenate the text from the files and put it into a part on the first document.  Can you explain how I can reuse the TextExtractor?  It seems that I have to supply a configuration along with the interface.  Where can I find this?  Is there a better way to do this?

--
Kind regards,
Bart Van den Abeele
Email: bvda@...
Helpdesk : +32 (09) 389 0560
Personal : +32 (09) 389 0564

**** DISCLAIMER ****
http://www.schaubroeck.be/maildisclaimer.htm


_______________________________________________
daisy community mailing list
Professional Daisy support: http://outerthought.org/en/services/daisy/support.html
mail to: daisy@...
list information: http://lists.cocoondev.org/mailman/listinfo/daisy

Re: using TextExtractor

by Karel Vervaeke

I don't know what configuration you are referring to.

The TextExtractor interface has only two methods:
    List<String> getMimeTypes();
    String getText(InputStream is) throws Exception;
which seem pretty clear to me.
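
For readers who want to see the shape of such a class, here is a minimal sketch of an implementation of the two methods listed above. The TextExtractor interface is re-declared locally so the example compiles on its own; PlainTextExtractor and the demo class are hypothetical names, the UTF-8 encoding is an assumption, and the registration step that Daisy performs is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

// Local stand-in mirroring the two methods quoted in the thread; the real
// interface lives in Daisy's text extraction module.
interface TextExtractor {
    List<String> getMimeTypes();
    String getText(InputStream is) throws Exception;
}

// Hypothetical extractor for plain text, assuming the input is UTF-8.
class PlainTextExtractor implements TextExtractor {
    @Override
    public List<String> getMimeTypes() {
        return Collections.singletonList("text/plain");
    }

    @Override
    public String getText(InputStream is) throws Exception {
        // Read the whole stream into memory, then decode it.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = is.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}

class PlainTextExtractorDemo {
    public static void main(String[] args) throws Exception {
        TextExtractor extractor = new PlainTextExtractor();
        InputStream in = new ByteArrayInputStream(
                "hello daisy".getBytes(StandardCharsets.UTF_8));
        System.out.println(extractor.getMimeTypes() + " -> " + extractor.getText(in));
    }
}
```

Calling getText directly like this needs no configuration; only hooking the extractor into the repository server does.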

The built-in implementations use AbstractTextExtractor which takes
care of registering the text extractor.

Have a look at the code and spring configuration files under
http://svn.daisycms.org/viewsvn/daisy/trunk/daisy/services/textextraction/impl/src/

HTH,
Karel

On Mon, Aug 24, 2009 at 10:22 AM, Bart Van den Abeele <bvda@...> wrote:

> Hi,
>
> I would like to use the TextExtractor classes to extract text from several
> files, concatenate that text and then put it into a part of a document.  I
> have a document that links to several other documents, and those documents
> all have a part type that contains a file.  I want to find the linking
> document by searching on the text that is in those files.  That is why I
> think I have to concatenate the text from the files and put it into a part
> on the first document.  Can you explain how I can reuse the TextExtractor?
> It seems that I have to supply a configuration along with the interface.
> Where can I find this?  Is there a better way to do this?

setting layoutType=plain

by chriscoder

Hi,

In my skin's document_to_html.xsl I would like to generate 'daisyid'.html?layoutType=plain
for specific links.  These links point to documents that will appear in a popup box,
where I want no left-nav and no header/footer.

I see in the docs that "link transformation happens after the document styling".

Is there a way to add a parameter such as ?layoutType=plain to a link?

thanks,

-chris

Re: using TextExtractor

by Bart Van den Abeele

Hi,

I got it to work.  Thanks!

I see that there is no support for OpenOffice spreadsheets (.ods, vnd.sun.xml.calc) or OpenOffice presentations (.odp).  Will this be added?

At the moment it works on the client side, but it is way too heavy.  I would like it to run in the background on the repository server.  Any ideas on how I could do this?  The server already extracts the text, but I have a different case: I would like to aggregate the text of several documents and put it on another document that refers to those documents.

Example: I have a Finance document f1 which refers to Attachment documents a1 (.odt) and a2 (.ppt).  I want the text of a1 and a2 concatenated and then set on document f1.  The goal is to find document f1 when I search on text that is in a1 or a2.
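
The aggregation step for the f1/a1/a2 case can be sketched roughly as follows. This is an illustration under stated assumptions, not Daisy code: the TextExtractor interface is re-declared locally from the two methods quoted in this thread, and Attachment, AttachmentTextAggregator and Utf8TextExtractor are hypothetical names. Fetching the attachment parts from the repository and saving the combined text on f1 are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local stand-in for the two-method interface quoted in the thread.
interface TextExtractor {
    List<String> getMimeTypes();
    String getText(InputStream is) throws Exception;
}

// Hypothetical holder for one attached file: its MIME type and its content.
class Attachment {
    final String mimeType;
    final InputStream data;

    Attachment(String mimeType, InputStream data) {
        this.mimeType = mimeType;
        this.data = data;
    }
}

// Hypothetical helper that concatenates the extracted text of several files.
class AttachmentTextAggregator {
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    // Index the extractor under every MIME type it declares.
    void register(TextExtractor extractor) {
        for (String mimeType : extractor.getMimeTypes()) {
            byMimeType.put(mimeType, extractor);
        }
    }

    // Extract each attachment in order and join the results with newlines.
    String aggregate(List<Attachment> attachments) throws Exception {
        StringBuilder text = new StringBuilder();
        for (Attachment attachment : attachments) {
            TextExtractor extractor = byMimeType.get(attachment.mimeType);
            if (extractor == null) {
                continue; // no extractor for this MIME type: skip the file
            }
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(extractor.getText(attachment.data));
        }
        return text.toString();
    }
}

// Trivial extractor used only to make the demo runnable; assumes UTF-8 text.
class Utf8TextExtractor implements TextExtractor {
    public List<String> getMimeTypes() {
        return Collections.singletonList("text/plain");
    }

    public String getText(InputStream is) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = is.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}

class AggregatorDemo {
    static InputStream stream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        AttachmentTextAggregator aggregator = new AttachmentTextAggregator();
        aggregator.register(new Utf8TextExtractor());
        String combined = aggregator.aggregate(Arrays.asList(
                new Attachment("text/plain", stream("text of a1")),
                new Attachment("text/plain", stream("text of a2"))));
        // In the scenario above, this string would be stored in a part on f1.
        System.out.println(combined);
    }
}
```

Files whose MIME type has no registered extractor are skipped silently here; a real job would probably want to log them.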

Re: using TextExtractor

by A Rocha Webmaster

On Tue, 2009-08-25 at 16:02 +0200, Bart Van den Abeele wrote:
> I have a Finance document f1 which refers to Attachment documents a1 (.odt) and a2 (.ppt).  I want the text of a1 and a2 concatenated and then set on document f1.  The goal is to find document f1 when I search on text that is in a1 or a2.

First off (from an information science point of view), you need to determine whether a1 and a2 will ONLY be part of f1, and never part of any other document. If this is not the case, then of course you have no way of doing that.

Then from a Daisy perspective, you could create a special SubAttachment document type with a special field called ParentDocument which is a link to the parent document. In the case of a1 and a2, ParentDocument would be the ID of f1. Then in /<yourskin>/document-styling/html/SubAttachment.xsl instead of displaying the current document, you retrieve the doc with ID ParentDocument and display it.

I don't know of a very smart, Daisy-like way of doing that, but one way would be to use a client-side HTTP redirect.

HTH
Júlio.

Re: using TextExtractor

by Karel Vervaeke

On Tue, Aug 25, 2009 at 4:02 PM, Bart Van den Abeele<bvda@...> wrote:
> Hi,
>
> I got it to work.  thx!
>
> I see that there is no support for OpenOffice spreadsheets (.ods,
> vnd.sun.xml.calc) or OpenOffice presentations (.odp).  Will this be added?

Nothing of the sort is scheduled.  Feel free to create a Jira issue
(and if possible, provide a patch).
AFAICT, the built-in OpenOfficeTextExtractor could work for ods files
without changes, so it would merely be a question of registering the
additional mimetypes.  (It needs to be tested to make sure though.)
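
To illustrate why one extractor might cover text, spreadsheet and presentation files alike, here is a rough, self-contained sketch. It is not the built-in OpenOfficeTextExtractor, whose internals are not shown in this thread. It assumes the file is an OpenDocument zip package whose content.xml holds the text in text:p and text:h elements, and it lists the registered OpenDocument MIME types. The TextExtractor interface is re-declared locally, and the class names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Local stand-in for the two-method interface quoted in the thread.
interface TextExtractor {
    List<String> getMimeTypes();
    String getText(InputStream is) throws Exception;
}

// Hypothetical extractor: one implementation, several MIME types.
class OpenDocumentTextExtractor implements TextExtractor {
    private static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    public List<String> getMimeTypes() {
        return Arrays.asList(
                "application/vnd.oasis.opendocument.text",          // .odt
                "application/vnd.oasis.opendocument.spreadsheet",   // .ods
                "application/vnd.oasis.opendocument.presentation"); // .odp
    }

    public String getText(InputStream is) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(is)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    return textOf(zip);
                }
            }
        }
        return ""; // no content.xml found in the package
    }

    // Collect character data, ending each paragraph or heading with a newline.
    private static String textOf(InputStream xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (TEXT_NS.equals(uri) && ("p".equals(localName) || "h".equals(localName))) {
                    text.append('\n');
                }
            }
        });
        return text.toString().trim();
    }
}

class OpenDocumentExtractorDemo {
    // Builds a tiny zip with only a content.xml entry, for demonstration.
    static byte[] sampleDocument() throws IOException {
        String xml = "<office:document-content"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
                + " xmlns:text=\"" + "urn:oasis:names:tc:opendocument:xmlns:text:1.0" + "\">"
                + "<office:body><text:p>Q3 budget</text:p><text:p>Q4 forecast</text:p>"
                + "</office:body></office:document-content>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        TextExtractor extractor = new OpenDocumentTextExtractor();
        System.out.println(extractor.getText(new ByteArrayInputStream(sampleDocument())));
    }
}
```

As noted above, whether real .ods and .odp files extract cleanly this way would still need to be tested.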

I don't have any suggestions for the link-fulltext search problem unfortunately.

Regards,
Karel

Re: using TextExtractor

by Bart Van den Abeele

Perhaps you could use http://lucene.apache.org/tika/ to extract text instead of your own implementation.  Did you evaluate this package?

Re: using TextExtractor

by Karel Vervaeke

I haven't heard of the project before.  Daisy predates Tika, so we
couldn't have considered it at the time the text extractors were
written.


On Wed, Aug 26, 2009 at 11:04 AM, Bart Van den Abeele <bvda@...> wrote:

> Perhaps you could use http://lucene.apache.org/tika/ to extract text instead
> of your own implementation.  Did you evaluate this package?