new spam using large images

Hi there, just a FYI

I just received this: http://pastebin.com/m54006b68

420K in size - standard configuration of SA wouldn't have even run over this message. Also the inline image is too large for FuzzyOCR to trigger - I would guess FuzzyOCR has the (screen) size limit as a mechanism to reduce FPs. Anyway, if you increase focr_max_height/focr_max_width then FuzzyOCR grabs the text out just fine - and it looks like your standard "you're a w...r!" scam.

This is the sort of thing that always worried me. As spammers don't care about the load their apps put on stolen PCs, they can just increase the size of their email formats until antispam tools start to break.

Speaking of image/rtf/word attachment spam; is there any work going on to standardize this so that the textual output of such attachments could be fed back into SA?

--
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +64 3 9635 377 Fax: +64 3 9635 417
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
|
|
Re: new spam using large images

On Fri, Jun 19, 2009 at 3:04 AM, Jason Haar<Jason.Haar@...> wrote:
> Speaking of image/rtf/word attachment spam; is there any work going on
> to standardize this so that the textual output of such attachments could
> be fed back into SA?

That functionality already exists (has for almost 3 years, actually), but as in the past (list archives) the documentation hasn't improved for it. :(

Here's my last(?) post about it which has some sample code and everything:

http://www.nabble.com/Re:-PDFText-Plugin-for-PDF-file-scoring---not-for-PDF-images-p11595641.html
|
|
Re: new spam using large images

On Fri, 2009-06-19 at 13:04 +1200, Jason Haar wrote:
> Hi there, just a FYI
>
> I just received this: http://pastebin.com/m54006b68
>
> 420K in size - standard configuration of SA wouldn't have even run over
> this message. [...]

SA would have scanned it by default just fine. The default size limit for spamc is 500 KB. No size limit imposed by spamd.

--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4"; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1: (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
|
|
Re: new spam using large images

On Fri, 19 Jun 2009, Jason Haar wrote:
> Hi there, just a FYI
> I just received this: http://pastebin.com/m54006b68
> 420K in size...

Hmmmm. Big question for developers: Does the performance 'burden' of a large e-mail come from the 'reading' of that mail into spamassassin and initial processing? Or is the 'cost' of a large message only 'paid' when SA attempts to run 'rawbody' or 'full' rules against the entire message?

I am *hoping* it is the latter, and that a parameter value can be coded within the spamassassin config (or as a command line option) that will amount to 'ignore attachments larger than...', while still allowing the headers and any text body parts to be scanned.

In particular, given the success of RBLs, it seems reasonable to have a way to process the headers from *all* messages, as long as loading the oversize message does not (for example) tie up memory merely by loading the message into spamassassin....

Yes, I already use RBLs at the MTA level for the ones I trust to be a poison pill. But I often still see spam hit multiple 'lower trust' RBLs in spamassassin, adding up to a rejection score. So it's worth figuring out a way to check larger mails if that is what spammers are going to do.

If the cost has more to do with SA reading the mail at the beginning, then perhaps we could figure out a 'subfunction' of spamassassin that would accept a command line option to only read the headers (all lines up to the first blank line) and then return a score as a result code? Obviously it could not modify the message in that case, but if the spammers are going to just make their spew over-sized, then it's something that may be needed.... and it would at least help with the rejection of mails that surpass the 'auto reject' threshold.

Thoughts?

- Charles
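The "only read the headers" idea above (all lines up to the first blank line) can be illustrated with a small Python sketch. This is not a SpamAssassin feature or API; `read_header_block`, `parse_headers_only` and the `max_header_bytes` cap are hypothetical names chosen for the illustration, and the header parsing uses Python's standard `email` package.

```python
import io
from email.parser import HeaderParser


def read_header_block(stream, max_header_bytes=64 * 1024):
    """Return the text of all lines up to the first blank line.

    The body is never read, so an oversized attachment costs nothing here.
    max_header_bytes is an arbitrary safety cap (an assumption of this sketch).
    """
    lines = []
    total = 0
    for line in stream:
        if line in ("\n", "\r\n"):      # first blank line ends the headers
            break
        total += len(line)
        if total > max_header_bytes:
            raise ValueError("header block larger than expected")
        lines.append(line)
    return "".join(lines)


def parse_headers_only(stream):
    """Parse just the header block into an email.message.Message."""
    return HeaderParser().parsestr(read_header_block(stream))


if __name__ == "__main__":
    big_mail = io.StringIO(
        "From: someone@example.com\nSubject: test\n\n" + "x" * 1_000_000
    )
    headers = parse_headers_only(big_mail)
    print(headers["Subject"])
```

A header-only pass like this could feed header checks such as RBL lookups, but, as noted above, it could not modify or score the body.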
|
|
Re: new spam using large images

On Fri, Jun 19, 2009 at 4:42 PM, Charles Gregory<cgregory@...> wrote:
> Hmmmm. Big question for developers: Does the performance 'burden' of a large
> e-mail come from the 'reading' of that mail into spamassassin and initial
> processing? Or is the 'cost' of a large message only 'paid' when SA attempts
> to run 'rawbody' or 'full' rules against the entire message?

There is very little load for reading in a message. It's all about the running of rules. Some rules "cost" more than others, "full", for example.
|
|
RE: new spam using large images

> From: felicity@... On Behalf Of Theo Van Dinter
>
> On Fri, Jun 19, 2009 at 3:04 AM, Jason Haar<Jason.Haar@...> wrote:
> > Speaking of image/rtf/word attachment spam; is there any work going on
> > to standardize this so that the textual output of such attachments could
> > be fed back into SA?
>
> That functionality already exists (has for almost 3 years, actually),
> but as in the past (list archives) the documentation hasn't improved
> for it. :(
>
> Here's my last(?) post about it which has some sample code and
> everything:
>
> http://www.nabble.com/Re:-PDFText-Plugin-for-PDF-file-scoring---not-for-PDF-images-p11595641.html

Thanks for the sample code. Once you get the $p object from $msg->find_parts(), how do you extract the contents of the message part to run it through antiword or whatever?

L
|
|
Re: new spam using large images

Once you have a part you can use the documented methods in Message::Node to access data (see "perldoc Mail::SpamAssassin::Message::Node"). You will probably want $p->decode() which returns a decoded (base64, quoted-printable) string of the part contents.

On Fri, Jun 19, 2009 at 7:00 PM, Rosenbaum, Larry M.<rosenbaumlm@...> wrote:
> Thanks for the sample code. Once you get the $p object from
> $msg->find_parts(), how do you extract the contents of the message part
> to run it through antiword or whatever?
>
> L
|
|
Re: new spam using large images

On 19 Jun, 2009, at 06:12 , Karsten Bräckelmann wrote:
> On Fri, 2009-06-19 at 13:04 +1200, Jason Haar wrote:
>> Hi there, just a FYI
>>
>> I just received this: http://pastebin.com/m54006b68
>>
>> 420K in size - standard configuration of SA wouldn't have even run
>> over this message. [...]
>
> SA would have scanned it by default just fine. The default size limit
> for spamc is 500 KB. No size limit imposed by spamd.

A 420K image will result in an email well over 500K in size.

--
Rincewind had always been happy to think of himself as a racist. The One Hundred Meters, the Mile, the Marathon -- he'd run them all.
|
|
Re: new spam using large images

On Fri, 2009-06-19 at 13:57 -0600, LuKreme wrote:
> On 19 Jun, 2009, at 06:12 , Karsten Bräckelmann wrote:
> >> I just received this: http://pastebin.com/m54006b68
> >>
> >> 420K in size - standard configuration of SA wouldn't have even run over
> >> this message. [...]
> >
> > SA would have scanned it by default just fine. The default size limit
> > for spamc is 500 KB. No size limit imposed by spamd.
>
> A 420K image will result in an email well over 500K in size.

Next time, please do your homework, Lu.

We're talking about a message that's 427,868 bytes in total. Including the base64 *encoded* image. You probably would have noticed that, if you had bothered to check any detail at all before posting.

--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4"; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1: (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
|
|
Re: new spam using large images

On 19 Jun, 2009, at 14:38 , Karsten Bräckelmann wrote:
> On Fri, 2009-06-19 at 13:57 -0600, LuKreme wrote:
>> On 19 Jun, 2009, at 06:12 , Karsten Bräckelmann wrote:
>>>> I just received this: http://pastebin.com/m54006b68
>>>>
>>>> 420K in size - standard configuration of SA wouldn't have even
>>>> run over this message. [...]
>>>
>>> SA would have scanned it by default just fine. The default size
>>> limit for spamc is 500 KB. No size limit imposed by spamd.
>>
>> A 420K image will result in an email well over 500K in size.
>
> Next time, please do your homework, Lu.
>
> We're talking about a message that's 427,868 Byte large in total.

Oh, I just looked at the numbers in the emails and figured they would be right, I did not look at the actual message. If the image was, in fact, 420K, it would have to be over 500K encoded.

> Including the base64 *encoded* image. You probably would have noticed
> that, if you would have bothered to check any detail at all before
> posting.

Yep, had I downloaded the email in question and checked its byte count I would have found the *encoded* image was 417,920 bytes (408KB), and not that the actual image was 420KB (430,080). I trusted OP to post the right information.

For the record, the image was 304KB (309,272) for an inflation rate that would have easily put any 420KB image over the 500KB threshold.

--
Advance and attack! Attack and destroy! Destroy and rejoice!
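The size argument in this exchange is plain base64 arithmetic: every 3 raw bytes become 4 encoded characters, plus line breaks. The Python sketch below works that out for the byte counts quoted in the thread. The 76-character line length and one-byte newline are assumptions about how the mail was wrapped, `base64_encoded_size` is a name made up for the example, and the limit is the thread's "500 KB" read as 500 * 1024 bytes.

```python
# Assumption: the thread's "500 KB" spamc limit taken as 500 * 1024 bytes.
SIZE_LIMIT = 500 * 1024


def base64_encoded_size(raw_bytes, line_length=76, newline_bytes=1):
    """Estimate the size of raw_bytes after base64 encoding with line wrapping.

    line_length and newline_bytes are assumptions about the mail's wrapping
    (76 characters per line, LF line endings); use newline_bytes=2 for CRLF.
    """
    chars = 4 * ((raw_bytes + 2) // 3)              # 3 bytes -> 4 characters, padded
    lines = (chars + line_length - 1) // line_length
    return chars + lines * newline_bytes


if __name__ == "__main__":
    # The image as actually sent (309,272 bytes) versus a hypothetical
    # 420 KB (430,080 byte) image.
    for raw in (309_272, 430_080):
        encoded = base64_encoded_size(raw)
        verdict = "over" if encoded > SIZE_LIMIT else "under"
        print(f"{raw} raw bytes -> {encoded} encoded bytes ({verdict} the limit)")
```

Under these assumptions the 309,272-byte image encodes to a size close to the 417,920 bytes reported above, while a real 420 KB image would exceed the limit on its own, which matches both sides of the argument.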
|
|
Plugin extracting text from docs (was: new spam using large images)

Jason Haar wrote:
> Speaking of image/rtf/word attachment spam; is there any work going on
> to standardize this so that the textual output of such attachments could
> be fed back into SA?

Just as a note:

I'm currently working on a modular plugin for extracting text and adding it to SA message parts.

The plugin can use either external tools or its own simple plugin modules. How to extract text from parts is configurable, and based on MIME types and file names, so new formats can be added by simply configuring new external tools or creating a new plugin module.

My *far* from finished module currently manages to extract text from Word documents (using antiword), OpenXML text documents (using a simple plugin) and RTF (using unrtf).

I haven't tested where and how the extracted text is available to SpamAssassin yet (as noted, it's *far* from finished), but I am using the "set_rendered" method as in the example, so it should work. ;-)

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/ http://www.frukt.org/ http://whatever.frukt.org/
|
|
Re: Plugin extracting text from docs (was: new spam using large images)

> Jason Haar wrote:
>> Speaking of image/rtf/word attachment spam; is there any work going on
>> to standardize this so that the textual output of such attachments could
>> be fed back into SA?

On 24.06.09 19:33, Jonas Eckerman wrote:
> Just as a note:
>
> I'm currently working on a modular plugin for extracting text and add it
> to SA message parts.

if possible, extract images too, so that FuzzyOCR and similar plugins would be able to look at those too.

IIRC spammers even put PDFs into .doc files to make things harder, but if you manage the above, it shouldn't be hard to extract PDFs too :) (and then extract text/images from the PDFs too)

> The plugin can use either external tools or it's own simple plugin
> modules. How to extract text from parts is configurable, and based on
> mime types and file names, so new formats can be added by simply
> configuring for new external tolls or creating a new plugin module.
>
> My *far* from finished module currently manages to extract text from
> Word documents (using antiword), OpenXML text documents (using a simple
> plugin) and RTF (using unrtf).
>
> I haven't tested where and how the extracted text is available to
> SpamAssassin yet (as noted, it's *far* from finished), but I am using
> "set_rendered" method as in the example, so it should work. ;-)

great!

--
Matus UHLAR - fantomas, uhlar@... ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
If Barbie is so popular, why do you have to buy her friends?
|
|
Re: Plugin extracting text from docs

Matus UHLAR - fantomas wrote:
>> I'm currently working on a modular plugin for extracting text and add it
>> to SA message parts.
>
> if possible, extract images too, so the fuzzyocr and similar plugins would
> be able to look at that too.

You mean extract images and add them as parts to the message?

I guess that should be doable. I know that "unrtf" can extract images from RTF files. I'll probably implement support for this, but I'll probably not implement actually doing it right away.

> IIRC spammers did even put PDF's to .doc files to make the stuff harder, but
> if you manage the above, it shouldn't be hard to extract PDF's too :)

This I don't understand. Do they put PDFs inside .doc files as if the .doc was an archive?

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/ http://www.frukt.org/ http://whatever.frukt.org/
|
|
Re: Plugin extracting text from docs

Jonas Eckerman wrote:
> You meen extract images and add them as parts to the message?
>
> I guess that should be doable. I know that "unrtf" can extract images
> from RTF files. I'll probably implement support for this, but I'll
> probably not implement actually doing it right away.

This'll probably have to wait. Browsing the POD and source of Mail::SpamAssassin::Message::Node and Mail::SpamAssassin::Message I found no obvious way of adding new parts to a message node. Especially if the node is a leaf node (I'm guessing that single-part messages only have a leaf node).

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/ http://www.frukt.org/ http://whatever.frukt.org/
|
|
Re: Plugin extracting text from docs

> Matus UHLAR - fantomas wrote:
>>> I'm currently working on a modular plugin for extracting text and add
>>> it to SA message parts.
>>
>> if possible, extract images too, so the fuzzyocr and similar plugins would
>> be able to look at that too.
>
> You meen extract images and add them as parts to the message?
>
> I guess that should be doable. I know that "unrtf" can extract images
> from RTF files. I'll probably implement support for this, but I'll
> probably not implement actually doing it right away.
>
>> IIRC spammers did even put PDF's to .doc files to make the stuff harder, but
>> if you manage the above, it shouldn't be hard to extract PDF's too :)

On 25.06.09 14:44, Jonas Eckerman wrote:
> This I don't understand. Do they put PDFs inside .doc files as if the
> ..doc was an archive?

I am not sure but I think something like that was done. What I mean is to have a generic chain of format converters, where at the end there would be a plain image or even text that could be processed by classic rules like bayes, replacetags etc.

--
Matus UHLAR - fantomas, uhlar@... ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety. -- Benjamin Franklin, 1759
|
|
Re: Plugin extracting text from docs

On Thu, Jun 25, 2009 at 11:48 AM, Matus UHLAR - fantomas<uhlar@...> wrote:
> I am not sure but I think something alike was done. What I mean is to have
> generic chain of format converters, where at the end would be plain image
> or even text, that could be processed by classic rules like bayes,
> replacetags etc.

Already exists, check recent list history for "set_rendered". :)
|
|
Re: Plugin extracting text from docs

Matus UHLAR - fantomas wrote:
>> This I don't understand. Do they put PDFs inside .doc files as if the
>> ..doc was an archive?
>
> I am not sure but I think something alike was done.

Considering that an OpenXML format is basically a zip file with XML files inside and that the actual document can contain hyperlinks, I guess it could be possible to do something like that. I don't know enough about the format to know, though.

> What I mean is to have
> generic chain of format converters, where at the end would be plain image
> or even text, that could be processed by classic rules like bayes,
> replacetags etc.

If I manage to figure out how to add new parts to a message from within the "post_message_parse" method, that should work just fine.

An extractor plugin can return a list of parts to be added to the message, and my module will keep looping through the message parts if new parts are added. So, if a Word extractor extracts a PDF and returns it, the PDF would be added as a new part, and in the next loop the PDF part will be sent to a PDF extractor if one exists. And so on.

I'm running "post_message_parse" at priority -1 so any added image parts should be available to plugins like FuzzyOCR as well as plugins running "post_message_parse" at default priority.

The missing parts are:

1: How do I add a new part to a parsed message (including a single-part one)? This is of course the main problem.

2: The actual extractor plugin that extracts whatever files are included in the Word document. Antiword only extracts text, and my extractor for OpenXML is little more than an extremely basic XML remover.

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/ http://www.frukt.org/ http://whatever.frukt.org/
|
|
Re: Plugin extracting text from docs

Theo Van Dinter wrote:
>> I am not sure but I think something alike was done. What I mean is to have
>> generic chain of format converters, where at the end would be plain image
>> or even text, that could be processed by classic rules like bayes,
>> replacetags etc.
>
> Already exists, check recent list history for "set_rendered". :)

I thought that was for text only.

In any case, any plugin looking for images, or a PDF, will most likely look at MIME type and/or file name, and then use the "decode" method to get the data, and AFAICT the "set_rendered" method doesn't have any impact on any of that.

I can't see how "set_rendered" would help in creating a functioning chain where one converter could put an arbitrary extracted object (image, PDF, whatever) where another converter could have a go at it.

Since the "set_rendered" method seems very undocumented I could of course be wrong here. In that case I hope to be verbosely corrected. :-)

/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/ http://www.frukt.org/ http://whatever.frukt.org/
|
|
Re: Plugin extracting text from docs

On Thu, Jun 25, 2009 at 1:12 PM, Jonas Eckerman<jonas_lists@...> wrote:
>> Already exists, check recent list history for "set_rendered".
>
> I though that was for text only.

It is only for text.

> In any case, any plugin looking for images, or a PDF, will most likely look
> at MIME type and/or file name, and then use the "decode" method to get the
> data, and AFAICT the "set_rendered" method doesn't have any impact on any of
> that.

Of course. There are three states for the data in a Message::Node object:

- raw: whatever the email had originally. May be encoded, etc.
- decoded: the raw content, decoded (i.e. base64 or quoted-printable). May be binary.
- rendered: the text content. If it was a text part, it's the same as decoded. If it was an HTML part, the decoded data gets "rendered" into text. If it's anything else, the rendered text is blank because nothing else is supported.

The goal with the plugin calls and set_rendered is to allow other plugins to find parts that they understand how to convert into text, and set the rendered version of the part as appropriate. So if you want to do OCR on image/*, you can do that. If you want to convert PDF/DOC/whatever to text, you can do that. I would comment that plugins should probably skip parts they want to render that already have rendered text available.

Rules, Bayes, etc, then take all the rendered parts and use them.

> I can't see how "set_rendered" would help in creating a fucntioning chain
> where one converter could put an arbitrary extracted object (image, pdf,
> whatever) where another converter could have a go at it.

Well, you wouldn't do that because there's no point. ;) (feel free to disagree with me though)

If a plugin wants to get image/* parts and do something with the contents, they can do that already. If a plugin wants to get application/octet-stream w/ filename "*.pdf" and do something with the contents, they can do that already.

If you want to have a plugin do some work on a part's contents, then store that result and let another plugin pick up and continue doing other work ... There's no official method to do that. You can store data as part of the Node object. You could potentially also write a tempfile, though you'll want to be careful to clean up the tempfile as necessary.

But what would be a use case for that? I guess something like converting a PDF to a TIFF, then OCR the TIFF? I'd probably implement that as a single plugin w/ "ocr" as a function that gets called from both the PDF and TIFF handlers. Arguably, there could be multiple people developing plugins for different types, but you'd need some coordination for the register_method_priority calls to figure out who goes in what order. (btw: I just found the register_method_priority() method. \o/)

Note: Do not try to add or remove parts in the tree. The tree is meant to represent the MIME structure of the mail, and each node relates to that specific MIME part. The tree is not meant to be a temporary data storage mechanism.

Hope this helps.
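The three data states described in this message (raw, decoded, rendered) and the advice to skip parts that already have rendered text can be pictured with a toy model. The Python sketch below is only an illustration of the idea, not the Mail::SpamAssassin::Message::Node API: the `Part` class, `render_builtin`, `render_with_converters` and the converter table are all hypothetical names.

```python
import base64
from dataclasses import dataclass


@dataclass
class Part:
    """Toy stand-in for a MIME part; not SpamAssassin's Message::Node."""
    content_type: str
    raw: bytes               # as it appeared in the mail, possibly encoded
    encoding: str = "7bit"
    rendered: str = ""       # blank unless something knows how to render it

    def decode(self) -> bytes:
        """Undo the transfer encoding; the result may be binary."""
        if self.encoding == "base64":
            return base64.b64decode(self.raw)
        return self.raw


def render_builtin(part: Part) -> None:
    """Text parts render to themselves; other types stay blank."""
    if part.content_type == "text/plain":
        part.rendered = part.decode().decode("utf-8", errors="replace")


def render_with_converters(part: Part, converters) -> None:
    """Let a converter supply rendered text, unless the part already has some."""
    if part.rendered:
        return                                   # skip already-rendered parts
    convert = converters.get(part.content_type)
    if convert is not None:
        part.rendered = convert(part.decode())


if __name__ == "__main__":
    # Hypothetical converter: pretends an "application/x-demo" blob is text.
    converters = {"application/x-demo": lambda data: data.decode("ascii").upper()}
    parts = [
        Part("text/plain", b"hello"),
        Part("application/x-demo", base64.b64encode(b"buy now"), "base64"),
    ]
    for p in parts:
        render_builtin(p)
        render_with_converters(p, converters)
        print(p.content_type, repr(p.rendered))
```

In this model, rules and Bayes would only ever look at `rendered`, which is why a converter that fills it in is enough to make attachment text scannable.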
|
|
Re: Plugin extracting text from docs

Theo Van Dinter wrote:
> I would comment that plugins should probably skip parts they want to
> render that already has rendered text available.

Ah. That's a good idea. Now I'll have to search for a nice way to check that. :-)

>> I can't see how "set_rendered" would help in creating a fucntioning chain
>> where one converter could put an arbitrary extracted object (image, pdf,
>> whatever) where another converter could have a go at it.

> If a plugin wants to get image/* parts and do something with the
> contents, they can do that already.

Not if the image/* parts are actually inside a document.

> If you want to have a plugin do some work on a part's contents, then
> store that result and let another plugin pick up and continue doing
> other work ... There's no official method to do that.

I guessed as much. This, however, is what Matus and I were talking about.

> You can store
> data as part of the Node object.

> But what would be a use case for that?

Matus' example was a Word document that contained a PDF (which might in turn contain an image).

A plugin that knows how to read Word documents could extract the text of the Word document and then use "set_rendered" to make that available to SA. It cannot currently extract the PDF and make it available to any plugins that know how to read PDFs, though.

Matus' idea about chains would be that in this example the plugin reading the Word document would store any other objects somehow. In this case a PDF. After that, any plugin that knows how to handle PDFs will get to look at the PDF and extract text and other stuff from it. In case it extracts an image, it would then store it the same way, and any image handling plugins would find it.

I really don't know how common that is. I have never seen a Word document with a PDF inside it myself. I have however seen many documents that contain images, and I think it would be a good idea to make those images available to things like FuzzyOCR and ImageInfo.

> Arguably, there could be multiple people developing plugins for
> different types, but you'd need some coordination for the
> register_method_priority calls to figure out who goes in what order.

For some stuff coordination would be needed, yes. But not for what I'm thinking of.

The text extraction plugin I'm working on (which started this) itself has simple extractor plugins. These plugins will be able to return arbitrary objects as well as text, and my plugin will check the returned objects the same way it checks the original message parts. This way, all the extractors that are tied into my plugin will be able to extract stuff from objects extracted by other extractors.

So far so good.

The most common thing to extract apart from text will most likely be images. Any OCR text extractor tied into my plugin would get to see those images, but any OCR SA plugins run after my plugin won't. It might be good to make extracted images available to those, and to other image handling plugins.

My plugin is called after the message is parsed, which is very good for a text extractor. FuzzyOCR (as an example) however works by scoring OCR output (which may well be very different from the text in the image as we see it), and therefore has to be called at a later stage. The same goes for ImageInfo. It might therefore be a good idea to make the extracted images and other objects available to scoring plugins as well.

> I just found the register_method_priority() method. \o/)

It's nice, isn't it? :-) I'm using it in my URLRedirect plugin.

> Note: Do not try to add or remove parts in the tree. The tree is
> meant to represent the mime structure of the mail, and each node
> relates to that specific mime part. The tree is not meant to be a
> temporary data storage mechanism.

Ok. That makes things both easier and less easy for me.

I know that I'll have to implement my own list of stuff to loop through when extractors return additional parts in my plugin. That's the easy part.

The difficult part is how to make extracted stuff available to other plugins in a way they understand. I see two main ways to do this:

1: Invent a new way. This would require modifications of any plugins that should check the extracted objects.

2: Add a container part somewhere that "find_parts" would find, but which is not actually a member of the message tree, and then add a simple way to add parts to that container. This would require modification of Mail::SpamAssassin::Message, but not of the plugins.

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet & Förbundet Sveriges Dövblinda
http://www.fsdb.org/ http://www.frukt.org/ http://whatever.frukt.org/
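The "own list of stuff to loop through" described in this message can be sketched as a work queue kept outside the MIME tree: each extractor returns text plus any embedded objects, and those objects go back on the queue for other extractors. The Python below is a rough illustration under that reading, not the plugin's actual code; `extract_all`, the extractor signature and the `max_objects` guard are assumptions made for the example.

```python
from collections import deque


def extract_all(parts, extractors, max_objects=100):
    """Run extractors over parts and over whatever the extractors dig out.

    parts:       iterable of (mime_type, data) tuples from the message.
    extractors:  dict mapping mime_type to a function data -> (text, children),
                 where children is a list of (mime_type, data) tuples.
    max_objects: assumed safety cap so nested documents cannot loop forever.

    Returns (texts, extracted). The extracted objects are kept in a separate
    list and never added to the message tree.
    """
    queue = deque(parts)
    texts = []
    extracted = []
    handled = 0
    while queue and handled < max_objects:
        mime_type, data = queue.popleft()
        handled += 1
        extractor = extractors.get(mime_type)
        if extractor is None:
            continue                      # nobody knows this type; leave it alone
        text, children = extractor(data)
        if text:
            texts.append(text)
        extracted.extend(children)        # remembered for image/scoring consumers
        queue.extend(children)            # and offered to the other extractors
    return texts, extracted


if __name__ == "__main__":
    # Fake extractors standing in for antiword, a PDF tool, and so on.
    extractors = {
        "application/msword": lambda d: ("doc text", [("application/pdf", b"pdf")]),
        "application/pdf": lambda d: ("pdf text", [("image/png", b"img")]),
    }
    texts, extracted = extract_all([("application/msword", b"doc")], extractors)
    print(texts)
    print([mime for mime, _ in extracted])
```

The open question raised above remains in this sketch too: `extracted` holds the images, but plugins such as FuzzyOCR would still need some agreed way to find that list.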