new spam using large images


Re: Plugin extracting text from docs

by Theo Van Dinter-2

On Thu, Jun 25, 2009 at 3:41 PM, Jonas Eckerman <jonas_lists@...> wrote:
> Matus' example was a Word document that contained a PDF (which might in turn
> contain an image). A plugin that knows how to read Word documents could
> extract the text of the Word document and then use "set_rendered" to make
> that available to SA. It cannot currently extract the PDF and make it
> available to any plugins that know how to read PDFs, though.

My view would be that if someone is going to try making things as
convoluted as that, a) we've won because no one is going to go
through the trouble of opening that doc, and b) the convolution is a
fingerprint that you could write a rule for, and then you don't care
what the content actually is.  For example, you'd render something
like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
and they'd all be different tokens.
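
As a rough illustration of the chain-token idea, here is a minimal Python sketch. It is not SpamAssassin code: the (type, children) tuple layout and the chain_tokens helper are assumptions made up for the example, and a real plugin would have to build that tree from the attachments it unpacks.

```python
from typing import List, Tuple

# Hypothetical description of a nested attachment: (type, list of children).
Node = Tuple[str, list]


def chain_tokens(node: Node, prefix: str = "") -> List[str]:
    """Return one token per innermost object, naming its containment chain."""
    kind, children = node
    chain = f"{prefix}_{kind}" if prefix else kind
    if not children:
        return [chain]
    tokens: List[str] = []
    for child in children:
        tokens.extend(chain_tokens(child, chain))
    return tokens


doc = ("doc", [("pdf", [("jpg", [])])])
archive = ("zip", [("pdf", []), ("jpg", []), ("txt", [])])

print(" ".join(chain_tokens(doc)))      # doc_pdf_jpg
print(" ".join(chain_tokens(archive)))  # zip_pdf zip_jpg zip_txt
```

Each innermost object gets its own token, so a rule or Bayes can key on the nesting itself without looking at the content.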

But yes, you're right, the Message/Message::Node stuff wasn't designed
with the idea of supporting multiple independent data objects from a
single mime part.  I can see the argument for "treat embedded files
similarly to multipart", but I still lean towards mime structure only.

> For some stuff coordination would be needed, yes. But not for what I'm
> thinking of.

Why not?  If you have no coordination, you would possibly look for
images first, then pdfs, then word docs, and end up not getting
anywhere.  If it's all your plugin, you can configure the order.  If
it's not, you need coordination.  For example, as above, if
there's a zip file with a doc which has a pdf which has a jpg, and your
plugin doesn't handle zip but another one does ...
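
To make the ordering problem concrete, here is a minimal Python sketch of one possible form of coordination: a shared work queue where everything an extractor produces is offered back to every registered extractor. The extract_all function, the (type, payload) items and the toy extractors are all hypothetical; nothing here is SpamAssassin's actual plugin API.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Item = Tuple[str, object]                 # (type name, payload)
Extractor = Callable[[object], List[Item]]


def extract_all(root: Item, extractors: Dict[str, Extractor]) -> List[Item]:
    """Unpack nested items until no registered extractor can open what is left."""
    queue = deque([root])
    leaves: List[Item] = []
    while queue:
        kind, payload = queue.popleft()
        handler = extractors.get(kind)
        if handler is None:
            leaves.append((kind, payload))   # nobody can open it: keep as-is
        else:
            queue.extend(handler(payload))   # children go back on the queue
    return leaves


# Toy stand-ins for real unpackers: each payload is already a list of children.
extractors = {"zip": lambda p: p, "doc": lambda p: p, "pdf": lambda p: p}
message = ("zip", [("doc", [("pdf", [("jpg", b"...")])])])

print(extract_all(message, extractors))   # [('jpg', b'...')]
```

With the queue, the order in which handlers are registered no longer matters. If the "zip" entry is removed, the zip comes back unopened and the doc, pdf and jpg inside it are never seen, which is the situation described above.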

> The most common thing to extract apart from text will most likely be images.
> Any OCR text extractor tied into my plugin would get to see those images,
> but any OCR SA plugins run after my plugin won't. It might be good to make
> extracted images available to those, and other image handling plugins.

But yours already ran, so who cares about the others?

Seriously.

If you're expending the resources to OCR the same image in an email
multiple times ...  You clearly either have a lot of hardware or not a
lot of mail.

Re: Plugin extracting text from docs

by Jonas Eckerman

Theo Van Dinter wrote:

> the convolution is a
> fingerprint that you could write a rule for and then you don't care
> what the content actually is.  For example, you'd render something
> like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
> same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
> and they'd all be different tokens.

That's really a good idea. Put the chains of extraction in a
pseudoheader that can be tested in rules and seen as a token by bayes.

I'm putting that in the todo for the plugin.

>> The most common thing to extract apart from text will most likely be images.
>> Any OCR text extractor tied into my plugin would get to see those images,
>> but any OCR SA plugins run after my plugin won't. It might be good to make
>> extracted images available to those, and other image handling plugins.

> But yours already ran, so who cares about the others?

Because they work very differently?

An OCR plugin that adds the rendered text to the message for bayes and
text rules is very different from one that does its own scoring based
on the OCRed text.

> If you're expending the resources to OCR the same image in an email
> multiple times ...  You clearly either have a lot of hardware or not a
> lot of mail.

*I* don't use any OCR at all. We don't have the resources for that
(being a small non-profit NGO), and so far I haven't seen any need for
OCR either since we never had much image spam slip through anyway.

So I will not implement an OCR extractor for my plugin. I'll leave that
for others. This is actually one of the reasons I'd like to let existing
OCR plugins have access to any images extracted by my plugin, so that
those who already do use OCR can get a benefit from the extraction.

I'm not going to spend much time on it though. I'm happy just extracting
text. :-) And it does extract text (currently from Word, OpenXML,
OpenDocument and RTF documents). :-)

I actually hadn't even thought about this image/OCR etc stuff before
Matus suggested it.

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

by David B Funk

On Fri, 26 Jun 2009, Jonas Eckerman wrote:

> Theo Van Dinter wrote:
>
> > the convolution is a
> > fingerprint that you could write a rule for and then you don't care
> > what the content actually is.  For example, you'd render something
> > like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
> > same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
> > and they'd all be different tokens.
>
> That's really a good idea. Put the chains of extraction in a
> pseudoheader that can be tested in rules and seen as a token by bayes.
>
> I'm putting that in the todo for the plugin.

It would be a bit cumbersome but you could:
create a "pre-filter" program/milter which would parse attachments &
MIME structures, create special pseudoheaders with the analysis
results in them, insert them into the message and then pass it on
to SA. The full power of SA would then be available to attack the
exposed info in any way that you wanted and wouldn't require any
mods to SA.
If you were worried about information leakage you could create a
post-filter that would remove the pseudoheaders.
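
A minimal Python sketch of that pre-filter/post-filter pair, using only the standard library email package. The header name X-Attachment-Structure and both function names are hypothetical, the milter plumbing is left out, and the "analysis" is reduced to listing the MIME types found in the message.

```python
from email import message_from_string

HEADER = "X-Attachment-Structure"  # hypothetical pseudoheader name


def add_structure_header(raw: str) -> str:
    """Pre-filter: record the MIME types of all parts in a pseudoheader."""
    msg = message_from_string(raw)
    types = [part.get_content_type() for part in msg.walk()]
    del msg[HEADER]                 # drop any copy forged by the sender
    msg[HEADER] = " ".join(types)
    return msg.as_string()


def strip_structure_header(raw: str) -> str:
    """Post-filter: remove the pseudoheader again before delivery."""
    msg = message_from_string(raw)
    del msg[HEADER]                 # no error if the header is absent
    return msg.as_string()


raw = (
    "From: a@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--b\n"
    "Content-Type: application/pdf\n"
    "\n"
    "%PDF-1.4\n"
    "--b--\n"
)

tagged = add_structure_header(raw)
print(message_from_string(tagged)[HEADER])
```

An ordinary SA header rule could then match on the pseudoheader, and strip_structure_header would run after scanning.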


--
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

RE: Plugin extracting text from docs (was: new spam using large images)

by Rosenbaum, Larry M.

We can use antiword to render text from MSWord files, and unrtf to render text from RTF files.  What is the best tool to render text from PDF files?

(We are running Solaris 9)

L

> -----Original Message-----
> From: Jonas Eckerman [mailto:jonas_lists@...]
> Sent: Wednesday, June 24, 2009 1:34 PM
> To: users@...
> Subject: Plugin extracting text from docs (was: new spam using large
> images)
>
> Jason Haar wrote:
>
> > Speaking of image/rtf/word attachment spam; is there any work going on
> > to standardize this so that the textual output of such attachments could
> > be fed back into SA?
>
> Just as a note:
>
> I'm currently working on a modular plugin for extracting text and adding
> it to SA message parts.
>
> The plugin can use either external tools or its own simple plugin
> modules. How to extract text from parts is configurable, and based on
> mime types and file names, so new formats can be added by simply
> configuring for new external tools or creating a new plugin module.
>
> My *far* from finished module currently manages to extract text from
> Word documents (using antiword), OpenXML text documents (using a simple
> plugin) and RTF (using unrtf).
>
> I haven't tested where and how the extracted text is available to
> SpamAssassin yet (as noted, it's *far* from finished), but I am using
> the "set_rendered" method as in the example, so it should work. ;-)
>
> Regards
> /Jonas
> --
> Jonas Eckerman
> Fruktträdet & Förbundet Sveriges Dövblinda
> http://www.fsdb.org/
> http://www.frukt.org/
> http://whatever.frukt.org/

RE: Plugin extracting text from docs (was: new spam using large images)

by Giampaolo Tomassoni

> We can use antiword to render text from MSWord files, and unrtf to
> render text from RTF files.  What is the best tool to render text from
> PDF files?
>
> (We are running Solaris 9)

AFAIK, antiword offers the best tradeoff between speed and conversion quality.

The best converter I know of, even for batch use, is actually OpenOffice with its "uno" interface, but it isn't that easy to drive from Perl, since it uses some kind of Java JNDI to exchange Word files and converted text with any implementation of a conversion controller. It also tends to consume a lot of memory, since current versions keep growing their core size with each document you convert (even after you close them...).

Antiword seems more resource-conscious in this respect...

Giampaolo





Re: Plugin extracting text from docs

by Jonas Eckerman

Rosenbaum, Larry M. wrote:

> We can use antiword to render text from MSWord files, and unrtf to render text from RTF files.  What is the best tool to render text from PDF files?

I don't know what the best tool is, but I'm currently using pdftohtml in
XML mode (and then stripping the XML) in my ExtractText plugin.
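
The "stripping the XML" step could look roughly like the Python sketch below. This is not the code from the ExtractText plugin: the text_from_pdf2xml function is hypothetical, and the sample input is a hand-written approximation of pdftohtml's XML output, so the element names should be checked against what your pdftohtml version really emits.

```python
import xml.etree.ElementTree as ET


def text_from_pdf2xml(xml_source: str) -> str:
    """Collect the text of every <text> element, dropping all markup."""
    root = ET.fromstring(xml_source)
    lines = []
    for elem in root.iter("text"):
        # itertext() also picks up text inside nested tags such as <b> or <i>.
        line = "".join(elem.itertext()).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


# Hand-written sample, assumed to resemble pdftohtml's XML output.
sample = """<pdf2xml>
  <page number="1">
    <text top="10" left="10"><b>Cheap</b> watches</text>
    <text top="30" left="10">Visit example.com today</text>
  </page>
</pdf2xml>"""

print(text_from_pdf2xml(sample))
```

The resulting plain text is what would then be handed to SA as the rendered content of the part.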

(For more info about the plugin, see my post with subject "ExtractText
plugin", or download it from
<http://whatever.frukt.org/graphdefang/ExtractText.zip>).

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/

Re: Plugin extracting text from docs

by Benny Pedersen


On Wed, July 1, 2009 21:51, Jonas Eckerman wrote:

> <http://whatever.frukt.org/graphdefang/ExtractText.zip>).

I had to use wget --continue to get it downloaded. Is this a firewall limit?

The transfer stalls after 8k here, so it took multiple wget attempts to get the full zip down :(

--
xpoint


Re: Plugin extracting text from docs

by Jonas Eckerman

Benny Pedersen wrote:

>> <http://whatever.frukt.org/graphdefang/ExtractText.zip>).

I've now mirrored the file as
<http://mmm.truls.org/m/ExtractText.zip>

I hope that will work better.

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/
http://www.frukt.org/
http://whatever.frukt.org/