The "clean out spam from archives" effort is lagging

The "clean out spam from archives" effort is lagging

by Christian Perrier

As one can see on http://wiki.debian.org/DebianInstaller/SpamClean,
this effort initiated by Frans back in April is lagging.

Last 3 months of debian-boot archives have been reviewed by 3 persons
only (Frans, Giacomo Catenazzi and me) and are thus missing at least
two more people to review them so that spams are nominated...and can
later be processed in the cleaning second step.

Old archives are also missing reviews, particularly a few from 2005
and nearly all from 2004, not to mention older archives.

Please take some time to do this work. This is not that time
consuming: one month can be reviewed in about 10-15 minutes....even
less when you're used to methods for spotting spams.


Re: The "clean out spam from archives" effort is lagging

by Lee Winter

On Sun, Nov 1, 2009 at 10:02 AM, Christian Perrier <bubulle@...> wrote:
> As one can see on http://wiki.debian.org/DebianInstaller/SpamClean,
> this effort initiated by Frans back in April is lagging.
>
> Last 3 months of debian-boot archives have been reviewed by 3 persons
> only (Frans, Giacomo Catenazzi and me) and are thus missing at least
> two more people to review them so that spams are nominated...and can
> later be processed in the cleaning second step.

I did the most recent three months of 2009, but the density was pretty low.

> Old archives are also missing reviews, particularly a few from 2005
> and nearly all from 2004, not to mention older archives.

So I started at the beginning (part of 1998) and went to the end of
2002.  If I have time this week I will look at 2003-2005.

> Please take some time to do this work. This is not that time
> consuming: one month can be reviewed in about 10-15 minutes....even
> less when you're used to methods for spotting spams.

The work is pretty tedious and reviewing non-spam emails five times is
extremely inefficient.  Consider a solution that would allow one
person to scan the archive to generate a list of spam targets.  If the
other four reviewers only had to review the listed spam candidates
they would not have to waste their time reviewing non-spam.

-- Lee


--
To UNSUBSCRIBE, email to debian-boot-REQUEST@...
with a subject of "unsubscribe". Trouble? Contact listmaster@...


Re: The "clean out spam from archives" effort is lagging

by Christian Perrier

Quoting Lee Winter (lee.j.i.winter@...):

> I did the most recent three months of 2009, but the density was pretty low.

I haven't checked the wiki and  I'm not online right now, but please
take care to register this in the page.

>
> > Old archives are also missing reviews, particularly a few from 2005
> > and nearly all from 2004, not to mention older archives.
>
> So I started at the beginning (part of 1998) and went to the end of
> 2002.  If I have time this week I will look at 2003-2005.

Ditto.

> > Please take some time to do this work. This is not that time
> > consuming: one month can be reviewed in about 10-15 minutes....even
> > less when you're used to methods for spotting spams.
>
> The work is pretty tedious and reviewing non-spam emails five times is
> extremely inefficient.  Consider a solution that would allow one
> person to scan the archive to generate a list of spam targets.  If the
> other four reviewers only had to review the listed spam candidates
> they would not have to waste their time reviewing non-spam.

I'm sure the listmasters would welcome such improvements but, well, we
already have a very good tool.

Also, restricting the list to what the first person has identified
would increase the risk of missing some spams.

When I worked on the entire archive, I finally dropped the web
interface and used an alternative method:

- download the list archives as mailboxes
- pass them through my CRM114 spam filter
- open them in my MUA (mutt)
- tag spam messages (being processed by CRM114, most spams are already
identified by CRM114 markers)
- bounce them to the spam report mail address
(report-listspam@...) with the following key macro:

macro index \eL "breport-listspam@...\no\nq" "report as spam to Debian lists"

I found this much more efficient.
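
As a rough illustration of the filtering step in the workflow above, here is a minimal Python sketch that walks a downloaded mbox file and lists the messages a filter has already marked. It assumes the filter writes an "X-CRM114-Status" header whose value starts with "SPAM"; that header name and value, and the helper names is_flagged and spam_candidates, are assumptions made for the example and may not match a given CRM114 setup. It only lists candidates; tagging and bouncing would still happen in the MUA.

```python
import mailbox
from email.message import Message


def is_flagged(msg: Message, header: str = "X-CRM114-Status",
               marker: str = "SPAM") -> bool:
    """Return True if the (assumed) filter header marks the message as spam."""
    value = msg.get(header, "")
    return str(value).strip().upper().startswith(marker)


def spam_candidates(mbox_path: str):
    """Yield (Message-ID, Subject) for every flagged message in an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            if is_flagged(msg):
                yield msg.get("Message-ID", "<unknown>"), msg.get("Subject", "")
    finally:
        box.close()
```

Something like `for mid, subject in spam_candidates("debian-boot-2004-01.mbox"): print(mid, subject)` would then give a short list to check by hand (the file name is hypothetical).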

Downloading list archives as mailboxes is only accessible to Debian
developers but I can provide them to people who might need them.





Re: The "clean out spam from archives" effort is lagging

by Lee Winter

On Mon, Nov 2, 2009 at 1:01 AM, Christian Perrier <bubulle@...> wrote:
> Quoting Lee Winter (lee.j.i.winter@...):
>
>> I did the most recent three months of 2009, but the density was pretty low.
>
> I haven't checked the wiki and  I'm not online right now, but please
> take care to register this in the page.

I am a little hesitant to edit the page because I don't understand the
process and found no doc or howto.

>
>>
>> > Old archives are also missing reviews, particularly a few from 2005
>> > and nearly all from 2004, not to mention older archives.
>>
>> So I started at the beginning (part of 1998) and went to the end of
>> 2002.  If I have time this week I will look at 2003-2005.
>
> Ditto.
>
>> > Please take some time to do this work. This is not that time
>> > consuming: one month can be reviewed in about 10-15 minutes....even
>> > less when you're used to methods for spotting spams.
>>
>> The work is pretty tedious and reviewing non-spam emails five times is
>> extremely inefficient.  Consider a solution that would allow one
>> person to scan the archive to generate a list of spam targets.  If the
>> other four reviewers only had to review the listed spam candidates
>> they would not have to waste their time reviewing non-spam.
>
> I'm sure the listmasters would welcome such improvements but, well, we
> already have a very good tool.
>
> Also, restricting the list to what the first person has identified
> would increase the risk of missing some spams.
>
> When I worked on the entire archive, I finally dropped the web
> interface and used an alternative method:
>
> - download the list archives as mailboxes
> - pass them through my CRM114 spam filter
> - open them in my MUA (mutt)
> - tag spam messages (being processed by CRM114, most spams are already
> identified by CRM114 markers)
> - bounce them to the spam report mail address
> (report-listspam@...) with the following key macro:
>
> macro index \eL "breport-listspam@...\no\nq" "report as spam to Debian lists"
>
> I found this much more efficient.

Sounds like the beginning/foundation of an automation script.  If the
candidates can be found mechanically, then there is a potential
tradeoff available.  We have 11 years = 132 months; times 5 reviewers
= 660 reviewer-months.  At 10-15 min each that is 110-165 man-hours.
That's a lot of manual effort.
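
For anyone who wants to redo that estimate with other numbers, the arithmetic can be written out as a small Python sketch. The function name review_effort is made up for the example; the inputs are the figures quoted above.

```python
def review_effort(years: int, reviewers: int,
                  minutes_low: int, minutes_high: int):
    """Return (reviewer_months, hours_low, hours_high) for a full manual review."""
    reviewer_months = years * 12 * reviewers
    return (reviewer_months,
            reviewer_months * minutes_low / 60,
            reviewer_months * minutes_high / 60)


# 11 years of archives, 5 reviewers, 10-15 minutes per reviewed month
print(review_effort(11, 5, 10, 15))  # (660, 110.0, 165.0)
```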

Just how important are the last few messages that would make it
through a (purposefully loose) mechanical filter?  If the whole mess
could be 98% cleaned up with say, 5 man-hours then it would be a
tremendous efficiency improvement.

> Downloading list archives as mailboxes is only accessible to Debian
> developers but I can provide them to people who might need them.

In the '80s I spent a lot of time doing natural language processing
software, so I may be more tuned up than the typical reviewer.  But I
find it more efficient to review the author/subject/thread indices
and inspect message content only to confirm the presence of spam in a
suspect message.  So offline access to the archive would not help me.

-- Lee




Re: The "clean out spam from archives" effort is lagging

by Christian Perrier

Quoting Lee Winter (lee.j.i.winter@...):

> > I haven't checked the wiki and  I'm not online right now, but please
> > take care to register this in the page.
>
> I am a little hesitant to edit the page because I don't understand the
> process and found no doc or howto.

Well, that's a wiki, so basically enter edit mode and make the
required changes.

> > macro index \eL "breport-listspam@...\no\nq" "report as spam to Debian lists"
> >
> > I found this much more efficient.
>
> Sounds like the beginning/foundation of an automation script.  If the
> candidates can be found mechanically, then there is a potential
> tradeoff available.  We have 11 years = 132 months; times 5 reviewers
> = 660 reviewer-months.  At 10-15 min each that is 110-165 man-hours.
> That's a lot of manual effort.

We have done a big part of the effort already.

The main point is that automated recognition is not reliable enough
and manual review is still needed...


> Just how important are the last few messages that would make it
> through a (purposefully loose) mechanical filter?  If the whole mess
> could be 98% cleaned up with say, 5 man-hours then it would be a
> tremendous efficiency improvement.

If someone is considering investing some time in this,
maybe. However, I'm not sure we'll find such a volunteer.

Please also note that processing the current traffic that flows
through the list is even easier: if a few people just commit
themselves to bouncing to the reporting address every spam they find in
debian-boot while they read the list...then processing the incoming
traffic is just done on the fly.

For instance, when I registered that I "processed" October 2009, I
actually just recorded that during the entire month I bounced every
incoming spam mail on the list to the spam reporting address.



Re: The "clean out spam from archives" effort is lagging

by Lee Winter

On Mon, Nov 2, 2009 at 1:01 AM, Christian Perrier <bubulle@...> wrote:

> I haven't checked the wiki and  I'm not online right now, but please
> take care to register this in the page.

Done.




Re: The "clean out spam from archives" effort is lagging

by Holger Wansing-2

Hi,

Christian Perrier <bubulle@...> wrote:
> As one can see on http://wiki.debian.org/DebianInstaller/SpamClean,
> this effort initiated by Frans back in April is lagging.
>
> Last 3 months of debian-boot archives have been reviewed by 3 persons
> only (Frans, Giacomo Catenazzi and me) and are thus missing at least
> two more people to review them so that spams are nominated...and can
> later be processed in the cleaning second step.

October has now been reviewed by 5 people.

I will work on the other targets soon.


 
Lee: when you change the wiki page for this to add your name to a
     month, please remember to increase the number of reviewers for
     that month, too (that's the number in the second column).
     (I already did that for the entries you made until now)


--

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Created with Sylpheed 2.5.0
    under DEBIAN GNU/LINUX 5.0.0 - L e n n y
        Registered LinuxUser #311290 - http://counter.li.org/
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =




The "clean out spam from archives" effort is *no longer* lagging

by Christian Perrier

Quoting Christian Perrier (bubulle@...):
> As one can see on http://wiki.debian.org/DebianInstaller/SpamClean,
> this effort initiated by Frans back in April is lagging.

Wow.

After this mail and one week of work, I found 447 proposed spams to
review this morning, after the weekly script run (this script collects
signalled spam and, for those that have been signalled at least 5
times, it adds them to a list of spams to review).

So, now, the DDs among us have to review those messages and confirm that
they're spam (I confirmed *all* of them!). It needs at least 3 people
to do this for the messages to be really removed at the next run of
the weekly script.
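
To make the two thresholds concrete, here is a minimal Python sketch of the nomination and confirmation logic as described in this mail. It is not the listmasters' actual script: the names weekly_run and distinct_counts, the (message id, person) input format, and the choice to count distinct people rather than raw reports are all assumptions made for the illustration.

```python
from collections import defaultdict

NOMINATION_THRESHOLD = 5    # reports needed before a message is proposed for review
CONFIRMATION_THRESHOLD = 3  # DD confirmations needed before removal


def distinct_counts(pairs):
    """Map each message id to the number of distinct people who acted on it."""
    people = defaultdict(set)
    for message_id, person in pairs:
        people[message_id].add(person)
    return {mid: len(who) for mid, who in people.items()}


def weekly_run(reports, confirmations):
    """Return (to_review, to_remove) as sorted lists of message ids.

    reports and confirmations are iterables of (message_id, person) pairs
    (an assumed input format).
    """
    nominated = {m for m, n in distinct_counts(reports).items()
                 if n >= NOMINATION_THRESHOLD}
    confirmed = distinct_counts(confirmations)
    to_remove = sorted(m for m in nominated
                       if confirmed.get(m, 0) >= CONFIRMATION_THRESHOLD)
    to_review = sorted(m for m in nominated if m not in to_remove)
    return to_review, to_remove
```

Under these assumptions, a message reported by five people but confirmed by only one DD stays on the review list until two more DDs confirm it.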

I guess that Frans will do such review so it needs only another DD to
do it so that we have more than 400 spams removed from the archive
next Sunday.

Congratulations to all people who worked on this. Keep up with the
good work!





The "clean out spam from archives" effort is *no longer* lagging (UPDATE)

by Christian Perrier

Quoting Christian Perrier (bubulle@...):

> I guess that Frans will do such review so it needs only another DD to
> do it so that we have more than 400 spams removed from the archive
> next Sunday.


And that was apparently done.

This Sunday, 364 more spam mails were removed, so the total number of
removed posts is now 3986.

See statistics at the bottom of
http://wiki.debian.org/DebianInstaller/SpamClean


208 more spam "nominations" were considered and are proposed this week
to reviewers (I already did my review).

Thanks again to those people who resumed that work. We're now not that
far from being able to say "we reviewed the entire archive of
debian-boot and had NNNN spams removed".

Kudos!





"Clean out spam from archives" : November 22nd update

by Christian Perrier

This week, some more report work happened, though a little bit less
actively than last week.

As a consequence, when I ran my review step today, I "only" had 7 more
nominated posts to review. I suspect this is because a few months
listed in http://wiki.debian.org/DebianInstaller/SpamClean are missing
one or two people to process them. The bump that happened when Lee
entered the game is slowing down as we need people *other than him* to
review the list archives, now (particularly the old months).

Frans, maybe consider looking at the 2001 archives? You've been very
busy with coding these last weeks so I'm reluctant to distract you with
this...

Franklin, Giacomo, maybe?

Holger Wansing can't do more as he has processed everything or nearly
everything...


Concerning the final removal step, this week saw a great bump, which
is, as expected, the result of the effort during the week before. 253
spam mails were thus removed from the archive.

It means that the "review by a DD step" seems to be working nearly
nominally. I'm doing such reviews. I suspect that Frans is, also. And,
apparently, a third DD is doing DD reviews as well (Bastian?).

Still, there are about 60 mails that I reviewed the week before that
weren't removed. So I think that one of the other 2 DDs didn't review
everything (s)he had to review. No shame in this, of course..:-)

Again, please continue the good work. We're close to announcing that we
cleaned out the list archive entirely (or as entirely as we can do it
with semi-manual methods).


