Plugin extracting text from docs

Just tested this plugin here. All I can say is that it roots viagra out of doc and RTF files :) Well done. The only problem I had was that unrtf needs to have ${file} in the example cf to work; everything else works.

-- xpoint
|
|
Re: Plugin extracting text from docs

Benny Pedersen wrote:

> just tested this plugin here, all i can say it roots viagra out of docs rtf files :)

I just saw it extract a 419 from a Word doc so that it was caught by Bayes and a bunch of rules (it would actually have slipped past our filter otherwise). :-)

> well done

Thanks.

> only problem i had was that unrtf needs to have ${file} in the example cf to work, all else works

Odd. I don't need ${file} for unrtf here. I'm using unrtf 0.21.0. Are you using an older version?

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/
|
|
Re: Plugin extracting text from docs

On Thu, July 2, 2009 15:50, Jonas Eckerman wrote:
> Benny Pedersen wrote:
>> just tested this plugin here, all i can say it roots viagra out of docs rtf files :)
> I just saw it extract a 419 from a word doc so that it was caught by
> bayes and a bunch of rules (it would actually have slipped past our
> filter otherwise). :-)

Well, it's a good generic plugin for extending scanning when there is content that tries to avoid SA hits. I bet this is kids playing with Msftedit to make viagra RTF files.

pdftohtml is imho not found in Gentoo, but pdf2html is maybe the same?

>> only problem i had was that unrtf needs to have ${file} in the example cf to work, all else works
> Odd. I don't need ${file} for unrtf here.
> I'm using unrtf 0.21.0. Are you using an older version?

0.20.5 is the latest unstable on Gentoo, unless I bump it myself.

One thing I need to know is how to control the tmp file path. I can't find where this is set.

-- xpoint
|
|
RE: Plugin extracting text from docs

> And, please tell me of problems.

> pdftohtml is imho not found in gentoo, but pdf2html is maybe the same ?

It appears that "pdftohtml" is only available as a Windows executable (on Sourceforge). I need something that will run on Solaris.
|
|
Re: Plugin extracting text from docs

Benny Pedersen wrote:

> pdftohtml is imho not found in gentoo, but pdf2html is maybe the same ?

I wouldn't know, since I haven't got any Gentoo machines. The "pdftohtml" I'm using is installed from FreeBSD ports. It can be downloaded from <http://pdftohtml.sourceforge.net/>

>>> only problem i had was that unrtf needs to have ${file} in the example cf to work, all else works
>> I'm using unrtf 0.21.0. Are you using an older version?
> 0.20.5 latest unstable on gentoo, unless i self bump it

Ah. Then I guess reading from stdin is a new feature in 0.21.

> one thing i need to know is how to control the tmp file path, i cant find where this is made

I'm using Mail::SpamAssassin::Util::secure_tmpfile, so it's SA that controls the path to the temp files. I don't know whether you can set that in SA's config or not.

Regards
/Jonas
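
The difference between a tool that reads stdin and one that needs ${file}, together with the temp-file handling, can be sketched roughly as below. This is only an illustration under stated assumptions, not the plugin's code: run_extractor is a hypothetical helper, File::Temp stands in for Mail::SpamAssassin::Util::secure_tmpfile, the way ${file} is substituted is guessed from the example cf mentioned in the thread, and a Unix-like system is assumed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of ExtractText or SpamAssassin).
# $template is a whitespace-separated command line that may contain the
# placeholder ${file}; $data is the raw attachment content.
sub run_extractor {
    my ($template, $data) = @_;

    # Always spool the data to a temp file. TMPDIR => 1 places it in
    # File::Spec->tmpdir, which consults $ENV{TMPDIR} on Unix.
    my ($fh, $path) = tempfile('extract.XXXXXX', TMPDIR => 1, UNLINK => 1);
    binmode $fh;
    print {$fh} $data;
    close $fh or die "close $path: $!";

    my @cmd       = split ' ', $template;
    my $uses_file = grep { /\$\{file\}/ } @cmd;
    s/\$\{file\}/$path/g for @cmd;

    # Fork, with the child's STDOUT connected to $out in the parent.
    my $pid = open(my $out, '-|');
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: if the command line has no ${file}, feed the data
        # to the tool on stdin instead.
        if (!$uses_file) {
            open(STDIN, '<', $path) or die "cannot reopen stdin: $!";
        }
        exec { $cmd[0] } @cmd or die "cannot exec $cmd[0]: $!";
    }

    # Parent: slurp everything the tool printed.
    my $text = do { local $/; <$out> };
    close $out;
    return $text;
}
```

With a tool that only accepts a file argument (as unrtf 0.20.5 apparently does), the template would contain ${file}; with one that reads stdin, the placeholder can simply be left out.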
|
|
RE: Plugin extracting text from docs

On Thu, 2009-07-02 at 14:15 -0400, Rosenbaum, Larry M. wrote:
> > And, please tell me of problems.
>
> > pdftohtml is imho not found in gentoo, but pdf2html is maybe the same ?
>
> It appears that "pdftohtml" is only available as a Windows executable
> (on Sourceforge). I need something that will run on Solaris.

Fedora 10 uses pdftotext, which can output raw text or simple HTML. Wikipedia says it's open source and related to the Poppler library, which is behind xpdf. It seems to be available from Foo Labs: http://www.foolabs.com

Martin
|
|
Re: Plugin extracting text from docs

Rosenbaum, Larry M. wrote:

> It appears that "pdftohtml" is only available as a Windows executable (on Sourceforge).

If you want a precompiled executable it seems Windows is the only platform, but AFAICS the source code is also available at http://sourceforge.net/projects/pdftohtml/files/

> I need something that will run on Solaris.

I've no idea whether it compiles on Solaris or not, but since I installed it from ports on FreeBSD I do know that it compiles on at least one Unix-like OS and doesn't require Windows.

Regards
/Jonas
|
|
RE: Plugin extracting text from docs

> From: Jonas Eckerman [mailto:jonas_lists@...]
>
> Rosenbaum, Larry M. wrote:
>
> > It appears that "pdftohtml" is only available as a Windows executable
> > (on Sourceforge).
>
> If you want a precompiled executable it seems Windows is the only
> platform, but AFAICS the source code is also available at
> http://sourceforge.net/projects/pdftohtml/files/

I have found that the Xpdf package, which pdftohtml is based on, has a pdftotext command line utility. If you build it with the "--without-x" option, you get just the command line utilities without the X-windows stuff, which eliminates the need to install a bunch of font software.
|
|
Re: Plugin extracting text from docs

Rosenbaum, Larry M. wrote:

> I have found the Xpdf package [...] has a pdftotext command line utility.
> If you build it with the "--without-x" option,

Ah. I didn't see that option. That's nice. I'm now using pdftotext instead of pdftohtml here as well. :-)

And I've just uploaded a new version of the ExtractText plugin with a few changes. Also, it's now included on my SA page at <http://whatever.frukt.org/spamassassin.text.shtml> as well as the CustomPlugin page in the SA wiki.

(For those having problems downloading from the above server, the zip archive should be automagically mirrored to <http://mmm.truls.org/m/ExtractText.zip>.)

Regards
/Jonas
|
|
Re: Plugin extracting text from docs

On 10.07.09 16:48, Jonas Eckerman wrote:
> Rosenbaum, Larry M. wrote:
>
>> I have found the Xpdf package [...] has a pdftotext command line utility.
>> If you build it with the "--without-x" option,
>
> Ah. I didn't see that option. That's nice. I'm now using pdftotext
> instead of pdftohtml here as well. :-)

I've been thinking about it. pdftohtml could provide interesting information, such as colour information, that could lead to better spam detection. Any experiences with this?

--
Matus UHLAR - fantomas, uhlar@... ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Eagles may soar, but weasels don't get sucked into jet engines.
|
|
Re: Plugin extracting text from docs

Matus UHLAR - fantomas wrote:

>> Ah. I didn't see that option. That's nice. I'm now using pdftotext
>> instead of pdftohtml here as well. :-)

> I've been thinking about it. The pdftohtml could provide interesting
> information like colour information that could lead to better spam
> detection. Any experiences with this?

You're right. It should be useful to extract to HTML when possible, and then use Mail::SpamAssassin::HTML to get and then set properties, just like the rendered method of Mail::SpamAssassin::Message::Node does.

The nice way to do this would IMHO be to make it possible for a plugin to call the "rendered" method of Mail::SpamAssassin::Message::Node, passing type and extracted data as parameters. Something like this (completely untested, and watch for wraps):

---8<---
--- Node.pm     Thu Jun 12 17:40:48 2008
+++ Node-new.pm Mon Jul 13 17:22:20 2009
@@ -411,16 +411,17 @@
 =cut
 
 sub rendered {
-  my ($self) = @_;
+  my ($self, $type, $text) = @_;
 
-  if (!exists $self->{rendered}) {
+  if ((defined($type) && defined($text)) || !exists $self->{rendered}) {
     # We only know how to render text/plain and text/html ...
     # Note: for bug 4843, make sure to skip text/calendar parts
     # we also want to skip things like text/x-vcard
     # text/x-aol is ignored here, but looks like text/html ...
+    $type = $self->{'type'} unless (defined($type));
     return(undef,undef) unless ( $self->{'type'} =~ /^text\/(?:plain|html)$/i );
 
-    my $text = $self->_normalize($self->decode(), $self->{charset});
+    $text = $self->_normalize($self->decode(), $self->{charset}) unless (defined($text));
     my $raw = length($text);
 
     # render text/html always, or any other text|text/plain part as text/html
---8<---

This way, AFAICT, any extracted (or generated) HTML should be treated the same way a normal text/html part is, making it available to HTML eval tests, for example.

Otherwise my plugin could of course use Mail::SpamAssassin::HTML itself. Unfortunately, Mail::SpamAssassin::Message::Node has no nice methods for setting the separate relevant properties, so either the set_rendered method needs to be expanded or complemented to allow this anyway, or my plugin will have to set the relevant properties directly (which makes it depend on Mail::SpamAssassin::Message::Node not being changed too much).

I guess I could do the hack version now, and then update it if/when Mail::SpamAssassin::Message::Node is updated to support this in a nice way. :-)

Regards
/Jonas
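
To make the calling convention proposed in the patch easier to follow, here is a toy, self-contained sketch. It is not SpamAssassin code: Sketch::Node is a made-up stand-in for Mail::SpamAssassin::Message::Node, the "rendering" is a crude tag strip rather than Mail::SpamAssassin::HTML, and testing the effective $type (instead of the node's own type) is an assumption about the intent of the patch.

```perl
use strict;
use warnings;

# Toy stand-in for Mail::SpamAssassin::Message::Node. All names here are
# illustrative only.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    return bless { type => $args{type}, raw => $args{raw} }, $class;
}

sub decode { return $_[0]->{raw} }

# rendered()              : render the node's own content, cached
# rendered($type, $text)  : render caller-supplied (extracted) content
sub rendered {
    my ($self, $type, $text) = @_;
    my $override = defined($type) && defined($text);

    if ($override || !exists $self->{rendered}) {
        $type = $self->{type} unless defined $type;
        # Only text/plain and text/html can be rendered.
        return (undef, undef) unless $type =~ m{^text/(?:plain|html)$}i;

        $text = $self->decode() unless defined $text;
        if (lc($type) eq 'text/html') {
            $text =~ s/<[^>]*>//g;    # crude tag stripping, not a real parser
        }
        $self->{rendered_type} = lc $type;
        $self->{rendered}      = $text;
    }
    return ($self->{rendered_type}, $self->{rendered});
}

package main;

my $pdf    = Sketch::Node->new(type => 'application/pdf', raw => '%PDF-1.4 ...');
my @before = $pdf->rendered();    # the PDF part itself is not renderable
my @after  = $pdf->rendered('text/html', '<p>Buy <b>now</b></p>');
my @cached = $pdf->rendered();    # later callers see the stored result
```

The point of the design is the last line: once a plugin has pushed extracted HTML through rendered(), any later caller that asks the node for its rendered text gets the extracted content without knowing a plugin was involved.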
|
|
Re: Plugin extracting text from docs

Matus UHLAR - fantomas wrote:

> I've been thinking about it. The pdftohtml could provide interesting
> information like colour information that could lead to better spam
> detection. Any experiences with this?

I've been thinking a bit more about this.

My current plan is to download the trunk version of SA from SVN to a development system and add a decent way for plugins to ask SA to render the "extracted" HTML into visible, invisible, meta, etc. Once that is done and somewhat tested, I'll see what the devs think about my patch.

It shouldn't be hard at all, since it's a small change to Mail::SpamAssassin::Message::Node, but I never seem to have as much time as I need for even half of my work and projects... :-/

If the patch is accepted, my ExtractText plugin will use the opened-up functionality if it's there. If it's not, any extracted HTML will be added using set_rendered, as it is now.

/Jonas
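
The "use it if it's there, otherwise fall back to set_rendered" plan amounts to a capability check, which could look roughly like this. Everything in the sketch is hypothetical: render_extracted is an invented name for the not-yet-written hook, both Sketch:: classes are toys, and the set_rendered($text, $type) argument order is an assumption, not taken from the thread.

```perl
use strict;
use warnings;

# Toy node that only offers set_rendered (stands in for an unpatched SA).
package Sketch::OldNode;
sub new { return bless {}, shift }
sub set_rendered {
    my ($self, $text, $type) = @_;
    $self->{rendered}      = $text;
    $self->{rendered_type} = $type if defined $type;
    return;
}

# Toy node that also offers the hypothetical new hook (a patched SA).
package Sketch::NewNode;
our @ISA = ('Sketch::OldNode');
sub render_extracted {
    my ($self, $type, $data) = @_;
    $self->{via_hook} = 1;              # marker so the path taken is visible
    $self->set_rendered($data, $type);
    return;
}

package main;

# Plugin-side logic: prefer the new hook when the node supports it.
sub add_extracted {
    my ($node, $type, $data) = @_;
    if ($node->can('render_extracted')) {
        $node->render_extracted($type, $data);
    } else {
        $node->set_rendered($data, $type);
    }
    return $node;
}

my $old = add_extracted(Sketch::OldNode->new, 'text/html', '<p>hi</p>');
my $new = add_extracted(Sketch::NewNode->new, 'text/html', '<p>hi</p>');
```

Checking with can() at run time lets one plugin version work against both patched and unpatched installations, at the cost of two code paths to keep tested.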