Plugin extracting text from docs


Plugin extracting text from docs

by Benny Pedersen


Just tested this plugin here. All I can say is that it pulls the viagra out of doc and RTF files :)

Well done. The only problem I had was that unrtf needs to have ${file} in the example cf to work; everything else works.

--
xpoint


Re: Plugin extracting text from docs

by Jonas Eckerman

Benny Pedersen wrote:

> Just tested this plugin here. All I can say is that it pulls the viagra out of doc and RTF files :)

I just saw it extract a 419 from a Word doc so that it was caught by
Bayes and a bunch of rules (it would actually have slipped past our
filter otherwise). :-)

 > well done

Thanks.

> The only problem I had was that unrtf needs to have ${file} in the example cf to work; everything else works.

Odd. I don't need ${file} for unrtf here.

I'm using unrtf 0.21.0. Are you using an older version?

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

by Benny Pedersen


On Thu, July 2, 2009 15:50, Jonas Eckerman wrote:
> Benny Pedersen wrote:
>> Just tested this plugin here. All I can say is that it pulls the viagra out of doc and RTF files :)
> I just saw it extract a 419 from a Word doc so that it was caught by
> Bayes and a bunch of rules (it would actually have slipped past our
> filter otherwise). :-)

Well, it's a good generic plugin for extending scanning when content tries to avoid SA hits. I bet this is kids playing with
Msftedit to make viagra RTF files.

As far as I can tell pdftohtml is not in Gentoo, but is pdf2html maybe the same thing?

>> The only problem I had was that unrtf needs to have ${file} in the example cf to work; everything else works.
> Odd. I don't need ${file} for unrtf here.
> I'm using unrtf 0.21.0. Are you using an older version?

0.20.5, the latest unstable version on Gentoo, unless I bump it myself.

One thing I need to know is how to control the temp file path. I can't find where it is set.

--
xpoint


RE: Plugin extracting text from docs

by Rosenbaum, Larry M.

> And, please tell me of problems.
 
> As far as I can tell pdftohtml is not in Gentoo, but is pdf2html maybe the same thing?

It appears that "pdftohtml" is only available as a Windows executable (on Sourceforge).  I need something that will run on Solaris.

Re: Plugin extracting text from docs

by Jonas Eckerman

Benny Pedersen wrote:

> As far as I can tell pdftohtml is not in Gentoo, but is pdf2html maybe the same thing?

I wouldn't know since I haven't got any Gentoo machines.

The "pdftohtml" I'm using is installed from FreeBSD ports.
It can be downloaded from
<http://pdftohtml.sourceforge.net/>

>>> The only problem I had was that unrtf needs to have ${file} in the example cf to work; everything else works.

>> I'm using unrtf 0.21.0. Are you using an older version?

> 0.20.5, the latest unstable version on Gentoo, unless I bump it myself.

Ah. Then I guess reading from stdin is a new feature in 0.21.
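
A minimal sketch of how the two unrtf setups differ, assuming the plugin replaces a ${file} placeholder with the path of a temp file and otherwise pipes the attachment to the tool's stdin. The helper plan_command, the command strings and the temp path are hypothetical and are not the plugin's actual code or config syntax:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the plugin): decide how a tool receives its
# input, based on whether the configured command mentions ${file}.
sub plan_command {
    my ($template, $tmpfile) = @_;
    if (index($template, '${file}') >= 0) {
        # Older tools that cannot read stdin get the temp file path instead.
        (my $cmd = $template) =~ s/\$\{file\}/$tmpfile/g;
        return { command => $cmd, input => 'file' };
    }
    # No placeholder: the data would be written to the tool's stdin.
    return { command => $template, input => 'stdin' };
}

my $old = plan_command('unrtf --text ${file}', '/tmp/part.rtf');
my $new = plan_command('unrtf --text',         '/tmp/part.rtf');
print "$old->{input}: $old->{command}\n";
print "$new->{input}: $new->{command}\n";
```

With unrtf 0.20.5 the first form would be the one to configure; the second only works if the installed unrtf really does read from stdin.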

> One thing I need to know is how to control the temp file path. I can't find where it is set.

I'm using Mail::SpamAssassin::Util::secure_tmpfile, so it's SA that
controls the path to the temp files. I don't know if you can set that in
SA's config or not.
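
One thing that may be worth trying, as a sketch only: Perl's File::Spec->tmpdir consults the TMPDIR environment variable on Unix-like systems. Whether secure_tmpfile takes its directory from File::Spec is an assumption here and should be checked in Mail/SpamAssassin/Util.pm. The snippet below uses only core modules and shows the File::Spec behaviour, not SA itself:

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Create a scratch directory to stand in for the wanted temp location.
my $scratch = tempdir(CLEANUP => 1);

my $seen;
{
    # Point TMPDIR at the scratch directory for this block only.
    local $ENV{TMPDIR} = $scratch;
    # File::Spec picks the first usable directory, trying TMPDIR first.
    $seen = File::Spec->tmpdir;
}
print "temp files would go under: $seen\n";
```

If that assumption holds, setting TMPDIR in the environment that starts SA would move the temp files. Note that Perl's taint mode can make File::Spec ignore a tainted TMPDIR.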

Regards
/Jonas


RE: Plugin extracting text from docs

by Martin Gregorie-2

On Thu, 2009-07-02 at 14:15 -0400, Rosenbaum, Larry M. wrote:
> > And, please tell me of problems.
>  
> > As far as I can tell pdftohtml is not in Gentoo, but is pdf2html maybe the same thing?
>
> It appears that "pdftohtml" is only available as a Windows executable
> (on Sourceforge).  I need something that will run on Solaris.
>
Fedora 10 uses pdftotext, which can output raw text or simple HTML.

Wikipedia says it's open source and part of Xpdf, the code base that the
Poppler library is derived from.

It seems to be available from Foo labs: http://www.foolabs.com
 

Martin



Re: Plugin extracting text from docs

by Jonas Eckerman

Rosenbaum, Larry M. wrote:

> It appears that "pdftohtml" is only available as a Windows executable (on Sourceforge).

If you want a precompiled executable it seems Windows is the only
platform, but AFAICS the source code is also available at
http://sourceforge.net/projects/pdftohtml/files/

 > I need something that will run on Solaris.

I've no idea whether it compiles on Solaris or not, but since I installed
it from ports on FreeBSD I do know that it compiles on at least one
Unix-like OS and doesn't require Windows.

Regards
/Jonas

RE: Plugin extracting text from docs

by Rosenbaum, Larry M.

> From: Jonas Eckerman [mailto:jonas_lists@...]
>
> Rosenbaum, Larry M. wrote:
>
> > It appears that "pdftohtml" is only available as a Windows executable
> (on Sourceforge).
>
> If you want a precompiled executable it seems Windows is the only
> platform, but AFAICS the source code is also available at
> http://sourceforge.net/projects/pdftohtml/files/

I have found the Xpdf package, which pdftohtml is based on, has a pdftotext command line utility.  If you build it with the "--without-x" option, you get just the command line utilities without the X-windows stuff, which eliminates the need to install a bunch of font software.

Re: Plugin extracting text from docs

by Jonas Eckerman

Rosenbaum, Larry M. wrote:

> I have found the Xpdf package [...] has a pdftotext command line utility.
 > If you build it with the "--without-x" option,

Ah. I didn't see that option. That's nice. I'm now using pdftotext
instead of pdftohtml here as well. :-)
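
Since the PDF tool that is actually installed differs between systems (pdftohtml from FreeBSD ports, pdftotext from Xpdf elsewhere), a setup script could probe for whichever one is present. This is only a sketch using core Perl modules; the helper first_available is hypothetical and not part of the plugin:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full path of the first candidate tool
# found in PATH, or undef if none of them is installed.
sub first_available {
    my @candidates = @_;
    for my $name (@candidates) {
        for my $dir (File::Spec->path) {
            my $full = File::Spec->catfile($dir, $name);
            # Accept only regular files that are executable.
            return $full if -f $full && -x _;
        }
    }
    return;
}

# Prefer pdftotext, fall back to pdftohtml.
my $tool = first_available(qw(pdftotext pdftohtml));
print defined $tool ? "using $tool\n" : "no PDF extractor found\n";
```

The order of the candidate list encodes the preference, so swapping the two names would prefer HTML output where both tools exist.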

And I've just uploaded a new version of the ExtractText plugin with a
few changes.

Also, it's now included on my SA page at
<http://whatever.frukt.org/spamassassin.text.shtml> as well as the
CustomPlugin page in the SA wiki.

(For those having problems downloading from the above server, the zip
archive should be automagically mirrored to
<http://mmm.truls.org/m/ExtractText.zip>.)

Regards
/Jonas

Re: Plugin extracting text from docs

by Matus UHLAR - fantomas

On 10.07.09 16:48, Jonas Eckerman wrote:
> Rosenbaum, Larry M. wrote:
>
>> I have found the Xpdf package [...] has a pdftotext command line utility.
> > If you build it with the "--without-x" option,
>
> Ah. I didn't see that option. That's nice. I'm now using pdftotext  
> instead of pdftohtml here as well. :-)

I've been thinking about it. pdftohtml could provide interesting
information, such as colour information, that could lead to better spam
detection. Any experiences with this?

--
Matus UHLAR - fantomas, uhlar@... ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Eagles may soar, but weasels don't get sucked into jet engines.

Re: Plugin extracting text from docs

by Jonas Eckerman

Matus UHLAR - fantomas wrote:

>> Ah. I didn't see that option. That's nice. I'm now using pdftotext  
>> instead of pdftohtml here as well. :-)

> I've been thinking about it. pdftohtml could provide interesting
> information, such as colour information, that could lead to better spam
> detection. Any experiences with this?

You're right. It should be useful to extract to HTML when possible, and
then use Mail::SpamAssassin::HTML to get and then set properties just
like the rendered method of Mail::SpamAssassin::Message::Node does.

The nice way to do this would IMHO be to make it possible for a plugin
to call the "rendered" method of Mail::SpamAssassin::Message::Node
passing type and extracted data as parameters.

Something like this (completely untested, and watch for wraps):
---8<---
--- Node.pm Thu Jun 12 17:40:48 2008
+++ Node-new.pm Mon Jul 13 17:22:20 2009
@@ -411,16 +411,17 @@
 =cut

 sub rendered {
-  my ($self) = @_;
+  my ($self, $type, $text) = @_;

-  if (!exists $self->{rendered}) {
+  if ((defined($type) && defined($text)) || !exists $self->{rendered}) {
     # We only know how to render text/plain and text/html ...
     # Note: for bug 4843, make sure to skip text/calendar parts
     # we also want to skip things like text/x-vcard
     # text/x-aol is ignored here, but looks like text/html ...
-    return(undef,undef) unless ( $self->{'type'} =~ /^text\/(?:plain|html)$/i );
+    $type = $self->{'type'} unless (defined($type));
+    return(undef,undef) unless ( $type =~ /^text\/(?:plain|html)$/i );

-    my $text = $self->_normalize($self->decode(), $self->{charset});
+    $text = $self->_normalize($self->decode(), $self->{charset}) unless (defined($text));
     my $raw = length($text);

     # render text/html always, or any other text|text/plain part as text/html
---8<---

This way, AFAICT, any extracted (or generated) HTML should be treated
the same way a normal text/html is. Making it available to HTML eval
tests for example.

Otherwise my plugin could of course use Mail::SpamAssassin::HTML itself.
Unfortunately Mail::SpamAssassin::Message::Node has no nice methods for
setting the separate relevant properties though, so either the
set_rendered method needs to be expanded or complemented to allow this
anyway, or my plugin will have to directly set the relevant properties
(which makes it depend on Mail::SpamAssassin::Message::Node not being
changed too much).

I guess I could do the hack version now, and then update it if/when
Mail::SpamAssassin::Message::Node is updated to support this in a nice
way. :-)

Regards
/Jonas

Re: Plugin extracting text from docs

by Jonas Eckerman

Matus UHLAR - fantomas wrote:

> I've been thinking about it. pdftohtml could provide interesting
> information, such as colour information, that could lead to better spam
> detection. Any experiences with this?

I've been thinking a bit more about this.

My current plan is to download the trunk version of SA from SVN to a
development system and put a decent way for plugins to ask SA to render
the "extracted" HTML into visible, invisible, meta, etc.

Once done and somewhat tested I'll see what the devs think about my patch.

It shouldn't be hard at all, it's a small change to
Mail::SpamAssassin::Message::Node, but I never seem to have as much time
as I need for even half of my work and projects... :-/

If the patch is accepted, my ExtractText plugin will use the opened-up
functionality if it's there. If it's not, any extracted HTML will be
added using set_rendered as it does now.
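
The "use it if it's there" part could be done with Perl's can() method, roughly as sketched here. The two Node classes are stand-ins written for this example, and render_extracted is a made-up name for whatever method a patch might add; none of this is SpamAssassin code:

```perl
use strict;
use warnings;

# Stand-in classes for illustration only; neither is SpamAssassin code.
{
    package Node::Stock;
    sub new { return bless {}, shift }
    sub set_rendered {
        my ($self, $text) = @_;
        $self->{rendered} = $text;
        return;
    }
}
{
    package Node::Patched;
    our @ISA = ('Node::Stock');
    # Hypothetical name for a method that a patched node might offer.
    sub render_extracted {
        my ($self, $type, $data) = @_;
        $self->{rendered}      = $data;
        $self->{rendered_type} = $type;
        return;
    }
}

# Use the richer method when the node offers it, else fall back.
sub add_extracted_html {
    my ($node, $html) = @_;
    if ($node->can('render_extracted')) {
        $node->render_extracted('text/html', $html);
        return 'render_extracted';
    }
    $node->set_rendered($html);
    return 'set_rendered';
}

for my $class (qw(Node::Stock Node::Patched)) {
    print "$class: ", add_extracted_html($class->new, '<p>hi</p>'), "\n";
}
```

Checking for the method at run time means the same plugin file would work on both patched and unpatched installations.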

/Jonas