Matus UHLAR - fantomas wrote:
>> Ah. I didn't see that option. That's nice. I'm now using pdftotext
>> instead of pdftohtml here as well. :-)
> I've been thinking about it. pdftohtml could provide interesting
> information, such as colour information, that could lead to better spam
> detection. Any experiences with this?
You're right. It should be useful to extract to HTML when possible, and
then use Mail::SpamAssassin::HTML to get and then set properties just
like the rendered method of Mail::SpamAssassin::Message::Node does.
The nice way to do this would IMHO be to make it possible for a plugin
to call the "rendered" method of Mail::SpamAssassin::Message::Node
passing type and extracted data as parameters.
Something like this (completely untested, and watch for wraps):
---8<---
--- Node.pm Thu Jun 12 17:40:48 2008
+++ Node-new.pm Mon Jul 13 17:22:20 2009
@@ -411,16 +411,17 @@
=cut
sub rendered {
- my ($self) = @_;
+ my ($self, $type, $text) = @_;
- if (!exists $self->{rendered}) {
+ if ((defined($type) && defined($text)) || !exists $self->{rendered}) {
# We only know how to render text/plain and text/html ...
# Note: for bug 4843, make sure to skip text/calendar parts
# we also want to skip things like text/x-vcard
# text/x-aol is ignored here, but looks like text/html ...
+ $type = $self->{'type'} unless (defined($type));
- return(undef,undef) unless ( $self->{'type'} =~ /^text\/(?:plain|html)$/i );
+ return(undef,undef) unless ( $type =~ /^text\/(?:plain|html)$/i );
- my $text = $self->_normalize($self->decode(), $self->{charset});
+ $text = $self->_normalize($self->decode(), $self->{charset}) unless (defined($text));
my $raw = length($text);
# render text/html always, or any other text|text/plain part as text/html
---8<---
This way, AFAICT, any extracted (or generated) HTML should be treated
the same way a normal text/html part is, which makes it available to
HTML eval tests, for example.
Otherwise my plugin could of course use Mail::SpamAssassin::HTML itself.
Unfortunately Mail::SpamAssassin::Message::Node has no nice methods for
setting the separate relevant properties though, so either the
set_rendered method needs to be expanded or complemented to allow this
anyway, or my plugin will have to set the relevant properties directly
(which makes it depend on Mail::SpamAssassin::Message::Node not being
changed too much).
I guess I could do the hack version now, and then update it if/when
Mail::SpamAssassin::Message::Node is updated to support this in a nice
way. :-)
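A rough sketch of what the hack version might look like, equally untested against a live SpamAssassin. The helper name set_rendered_html and its optional third argument are made up for illustration, and the property names are assumed to be the ones rendered() sets internally, so check them against your Node.pm:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SpamAssassin: fill in the properties
# that rendered() would normally set for a text/html part, using HTML
# produced by an external extractor such as pdftohtml.
sub set_rendered_html {
    my ($node, $html_text, $html) = @_;

    # Default to SpamAssassin's own HTML parser. It is loaded lazily so a
    # caller can pass in another object with the same methods instead.
    unless (defined $html) {
        require Mail::SpamAssassin::HTML;
        $html = Mail::SpamAssassin::HTML->new();
    }
    $html->parse($html_text);

    # ASSUMPTION: these keys mirror what rendered() stores on the node.
    # Setting them directly is exactly the dependency on Node internals
    # mentioned above.
    $node->{rendered_type}      = 'text/html';
    $node->{rendered}           = $html->get_rendered_text();
    $node->{visible_rendered}   = $html->get_rendered_text(invisible => 0);
    $node->{invisible_rendered} = $html->get_rendered_text(invisible => 1);
    $node->{html_results}       = $html->get_results();

    return ($node->{rendered_type}, $node->{rendered});
}
```

In a plugin, $node would be the Mail::SpamAssassin::Message::Node of the attachment and $html_text the extractor output. Any extra statistics that rendered() derives after parsing are left out here, so eval tests relying on them would not see them.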
Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/ http://www.frukt.org/ http://whatever.frukt.org/