
Re: Plugin extracting text from docs

by Jonas Eckerman


Matus UHLAR - fantomas wrote:

>> Ah. I didn't see that option. That's nice. I'm now using pdftotext  
>> instead of pdftohtml here as well. :-)

> I've been thinking about it. pdftohtml could provide interesting
> information, such as colour information, that could lead to better spam
> detection. Any experience with this?

You're right. It should be useful to extract to HTML when possible, and
then use Mail::SpamAssassin::HTML to get and then set properties, just
like the rendered method of Mail::SpamAssassin::Message::Node does.

The nice way to do this would IMHO be to make it possible for a plugin
to call the "rendered" method of Mail::SpamAssassin::Message::Node
passing type and extracted data as parameters.

Something like this (completely untested, and watch for wraps):
---8<---
--- Node.pm Thu Jun 12 17:40:48 2008
+++ Node-new.pm Mon Jul 13 17:22:20 2009
@@ -411,16 +411,17 @@
  =cut

  sub rendered {
-  my ($self) = @_;
+  my ($self, $type, $text) = @_;

-  if (!exists $self->{rendered}) {
+  if ((defined($type) && defined($text)) || !exists $self->{rendered}) {
      # We only know how to render text/plain and text/html ...
      # Note: for bug 4843, make sure to skip text/calendar parts
      # we also want to skip things like text/x-vcard
      # text/x-aol is ignored here, but looks like text/html ...
+    $type = $self->{'type'} unless (defined($type));
      return(undef,undef) unless ( $self->{'type'} =~ /^text\/(?:plain|html)$/i );

-    my $text = $self->_normalize($self->decode(), $self->{charset});
+    $text = $self->_normalize($self->decode(), $self->{charset}) unless (defined($text));
      my $raw = length($text);

      # render text/html always, or any other text|text/plain part as text/html
---8<---

This way, AFAICT, any extracted (or generated) HTML should be treated
the same way a normal text/html part is, making it available to HTML
eval tests, for example.
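
To illustrate the control flow the patch aims for, here is a small
stand-alone Perl sketch. DemoNode is a made-up stand-in class, not
Mail::SpamAssassin::Message::Node, and the tag stripping merely stands in
for real HTML rendering. One assumption about intent: the sketch checks
the effective $type, whereas the patch as posted still checks
$self->{'type'}.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a message node (NOT the real SpamAssassin class).
package DemoNode;

sub new {
    my ($class, %args) = @_;
    return bless { type => $args{type}, body => $args{body} }, $class;
}

# Render when nothing is cached yet, or when the caller supplies both a
# type and the text to render (e.g. HTML extracted from a PDF).
sub rendered {
    my ($self, $type, $text) = @_;

    if ((defined $type && defined $text) || !exists $self->{rendered}) {
        $type = $self->{type} unless defined $type;

        # Assumption: test the effective type, not only the node's own type.
        return (undef, undef) unless $type =~ m{^text/(?:plain|html)$}i;

        $text = $self->{body} unless defined $text;

        if (lc $type eq 'text/html') {
            # Placeholder for real HTML rendering: just strip the tags.
            (my $plain = $text) =~ s/<[^>]*>//g;
            $self->{rendered} = $plain;
        }
        else {
            $self->{rendered} = $text;
        }
        $self->{rendered_type} = lc $type;
    }

    return ($self->{rendered_type}, $self->{rendered});
}

package main;

my $pdf = DemoNode->new(type => 'application/pdf', body => '%PDF-1.4 ...');

my ($t1, $r1) = $pdf->rendered();                          # not a text part
my ($t2, $r2) = $pdf->rendered('text/html', '<b>Buy</b> now');
my ($t3, $r3) = $pdf->rendered();                          # served from cache
```

In the sketch, the first call yields nothing because the part is not
text, the second call stores the supplied HTML as the rendered content,
and the third call returns that cached result.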

Otherwise my plugin could of course use Mail::SpamAssassin::HTML itself.
Unfortunately, Mail::SpamAssassin::Message::Node has no nice methods for
setting the separate relevant properties, so either the set_rendered
method needs to be expanded or complemented to allow this anyway, or my
plugin will have to set the relevant properties directly (which makes it
depend on Mail::SpamAssassin::Message::Node not being changed too much).
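
Setting the properties directly could look roughly like this. It is a
sketch under assumptions: render_extracted_html is a hypothetical helper
name, and the hash keys are the ones rendered() appears to set for
text/html parts in the 3.2-series Node.pm, so they should be checked
against the installed version. The parser is passed in, so the helper
works with any object offering parse(), get_rendered_text() and
get_results(), such as a Mail::SpamAssassin::HTML instance.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SpamAssassin. Stores the rendering of
# extracted HTML on a message node by writing the node's hash keys directly.
# Assumption: these key names match what rendered() sets in Node.pm.
sub render_extracted_html {
    my ($node, $html_text, $parser) = @_;

    # Nothing to render: leave the node untouched.
    return unless defined $html_text && length $html_text;

    $parser->parse($html_text);

    $node->{rendered_type}      = 'text/html';
    $node->{rendered}           = $parser->get_rendered_text();
    $node->{visible_rendered}   = $parser->get_rendered_text(invisible => 0);
    $node->{invisible_rendered} = $parser->get_rendered_text(invisible => 1);
    $node->{html_results}       = $parser->get_results();

    return $node->{rendered};
}
```

A plugin would then call something like
render_extracted_html($part, $html, Mail::SpamAssassin::HTML->new())
for each part it has converted. Because this reaches into the node's
internals, it would break if those keys are renamed.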

I guess I could do the hack version now, and then update it if/when
Mail::SpamAssassin::Message::Node is updated to support this in a nice
way. :-)

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/
