Content Inspection - Statistical methods


by Glenn Wilkinson-2

Hello IDS folks,

I'm currently doing a mini-project involving applying machine learning
techniques to the identification of hostile network traffic. My focus
is on TCP traffic, and I'm looking at header- and content-based
inspection. I'm wrapping up my feature extraction code now, whereby
I've imported all TCP sessions from the DARPA training sets into a DB
and have tagged the hostile sessions.

My question is, does anyone have any bright ideas for useful, simple
content-analysis attributes? As it's a statistical/ML approach, I'm
trying to come up with ideas that are as generic as possible. So far
I'm calculating things like session data entropy, the most frequent
character, and counts of certain characters.
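
To make the three attributes above concrete, here is a minimal sketch in
Python. It is not the actual feature extraction code from this project:
the function name payload_features, the WATCHED_BYTES set, and the
feature key names are all assumptions made up for illustration. It
treats a session payload as raw bytes and computes Shannon entropy (in
bits per byte), the most frequent byte value, and counts of a few chosen
byte values.

```python
import math
from collections import Counter

# Hypothetical choice of "interesting" bytes: %, ', /, NUL, and 0x90.
# Which bytes are worth counting is an assumption, not from the thread.
WATCHED_BYTES = b"%'/\x00\x90"


def payload_features(payload: bytes, watched: bytes = WATCHED_BYTES) -> dict:
    """Compute simple, protocol-agnostic statistics for one session payload."""
    n = len(payload)
    counts = Counter(payload)  # byte value (0-255) -> number of occurrences

    if n == 0:
        entropy = 0.0
        most_frequent = None
    else:
        # Shannon entropy: sum over byte values of p * log2(1 / p).
        entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
        most_frequent = counts.most_common(1)[0][0]

    features = {
        "length": n,
        "entropy_bits": entropy,
        "most_frequent_byte": most_frequent,
    }
    # One count feature per watched byte value, e.g. "count_0x2f" for '/'.
    for b in watched:
        features[f"count_0x{b:02x}"] = counts.get(b, 0)
    return features


if __name__ == "__main__":
    print(payload_features(b"GET /a%20b HTTP/1.0\r\n"))
```

Entropy here ranges from 0 (a single repeated byte) to 8 (uniformly
distributed bytes), so it may need scaling alongside the raw counts
before being fed to a learner.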

I'm brand new to this field, but am really excited about this project.
Any feedback/advice would be greatly appreciated.

Thanks!
G

-----------------------------------------------------------------
Securing Your Online Data Transfer with SSL.
A guide to understanding SSL certificates, how they operate and their application. By making use of an SSL certificate on your web server, you can securely collect sensitive information online, and increase business by giving your customers confidence that their transactions are safe.
http://www.dinclinx.com/Redirect.aspx?36;5001;25;1371;0;1;946;9a80e04e1a17f194



Re: Content Inspection - Statistical methods

by Maggi Federico

On 08/Aug/2009, at 19:45, Glenn Wilkinson <glenn.wilkinson@...> wrote:

> My question is, does anyone have any bright ideas of some useful,
> simple content analysis attributes? As it's a statistical/ML approach
> I'm trying to come up with as generic as possible ideas. So far I'm
> calculating things like session data entropy, most frequent character,
> counts of certain characters.

The IDS literature is overflowing with techniques of every sort (both
deterministic and stochastic, or ML-based) for modeling "good" traffic,
any of which may inspire your project.

I don't have the exact references with me, but a quick Google Scholar
search for terms like "tcp" "anomaly" "payload", narrowed to between
2003 and 2006 (when anomaly-based NIDS were a hot topic), will turn up
the main contributions.

I feel there's even a little room for improvement on the existing
approaches.

Cheers,

-- Fede




Re: Content Inspection - Statistical methods

by Richard Bejtlich

On Sat, Aug 8, 2009 at 1:45 PM, Glenn Wilkinson <glenn.wilkinson@...> wrote:

> Hello IDS folks,
>
> I'm currently doing a mini-project involving applying machine learning
> techniques to the identification of hostile network traffic. My focus
> is on TCP traffic, and I'm looking at header and content based
> inspection. I'm wrapping up my feature extraction code now, whereby
> I've imported all TCP sessions from the DARPA training sets into a DB
> and have tagged the hostile sessions.
>
> My question is, does anyone have any bright ideas of some useful,
> simple content analysis attributes? As it's a statistical/ML approach
> I'm trying to come up with as generic as possible ideas. So far I'm
> calculating things like session data entropy, most frequent character,
> counts of certain characters.
>
> I'm brand new to this field, but am really excited about this project.
> Any feedback/advice would be greatly appreciated.
>
> Thanks!
> G
>

Hi Glenn,

How about NOT using the DARPA data sets?  Maybe something more modern?

http://taosecurity.blogspot.com/2009/08/2009-cdx-data-sets-posted.html

Sincerely,

Richard




Re: Content Inspection - Statistical methods

by Jamie Riden

2009/8/11 Richard Bejtlich <taosecurity@...>:

> On Sat, Aug 8, 2009 at 1:45 PM, Glenn
> Wilkinson<glenn.wilkinson@...> wrote:
>> Hello IDS folks,
>>
>> I'm currently doing a mini-project involving applying machine learning
>> techniques to the identification of hostile network traffic. My focus
>> is on TCP traffic, and I'm looking at header and content based
>> inspection. I'm wrapping up my feature extraction code now, whereby
>> I've imported all TCP sessions from the DARPA training sets into a DB
>> and have tagged the hostile sessions.
>>
>> My question is, does anyone have any bright ideas of some useful,
>> simple content analysis attributes? As it's a statistical/ML approach
>> I'm trying to come up with as generic as possible ideas. So far I'm
>> calculating things like session data entropy, most frequent character,
>> counts of certain characters.
>>
>> I'm brand new to this field, but am really excited about this project.
>> Any feedback/advice would be greatly appreciated.
>>
>> Thanks!
>> G
>>
>
> Hi Glenn,
>
> How about NOT using the DARPA data sets?  Maybe something more modern?
>
> http://taosecurity.blogspot.com/2009/08/2009-cdx-data-sets-posted.html

Agreed. I think I remember using the DARPA sets for some coursework in
2001. They were a bit limited in the features extracted from each
packet, and I don't think the eventual winning solution, a combination
of bagged/boosted decision trees, would work very well in the real
world. This is all from memory, so it could be absolute rubbish.

The real problem in ML seems to be finding good, accurately labelled
training data :(

cheers,
 Jamie




Re: Content Inspection - Statistical methods

by Stefano Zanero

Jamie Riden wrote:

> The real problem in ML seems to be finding good, accurately labelled
> training data :(

On the other hand, any system which needs clean (or accurately labeled)
training data will be of no use in the real world, as you will never
have clean training data on real world systems :-)

The problem is much more general, and related to evaluation of ANY IDS,
not just machine learning approaches.

Don't get me started on this once more... :p

Stefano
