Directory Crawler that looks like a user-agent



by TinaRock



Sorry if this posted more than once.

At my job we have a secure website. Every hit to the site is captured by the tracking system to the SQL Server database.

We need to create an inventory system that can look at the data and tell us about the assets on the site.

To get the appropriate data into the database, we need to use a directory crawler that can hit every asset and every item that it finds in the directory structure.

Is there such a crawler, that can appear to be a user-agent, that can crawl a secure website? Is there such a crawler in ColdFusion?

Any ideas or pointers to such a crawler would be very much appreciated.

Thanks in advance,

Jo-Anne Head


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:324328
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=17837.14401.4

Re: Directory Crawler that looks like a user-agent

by Dave Watts



> At my job we have a secure website. Every hit to the site is captured by the tracking
> system to the SQL Server database.
>
> We need to create an inventory system that can look at the data and tell us about
> the assets on the site.
>
> To get the appropriate data into the database, we need to use a directory crawler that
> can hit every asset and every item that it finds in the directory structure.
>
> Is there such a crawler, that can appear to be a user-agent, that can crawl a secure
> website? Is there such a crawler in ColdFusion?

If you want to do this using CF, you can use the CFHTTP tag as Michael
mentioned. But it's not clear to me that you're capturing the
information on each page; if you just need to make an HTTP request to
each page so that the existing logging system logs visits, there are
far easier approaches, such as wget.
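
For anyone who wants to see the "just request every page" idea spelled out, here is a minimal sketch in Python rather than CF. It sends a browser-like User-Agent header and follows plain HTML links on the same host. The USER_AGENT value, the function names, and the page limit are all assumptions for illustration, and login or cookie handling for the secure site is not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import Request, urlopen

# Hypothetical User-Agent string; use whatever your tracking system expects.
USER_AGENT = "Mozilla/5.0 (compatible; InventoryCrawler/0.1)"


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Request a page with a browser-like User-Agent header."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, limit=1000):
    """Visit every same-host page reachable through plain HTML links."""
    host = urlparse(start_url).netloc
    seen, queue = set(), [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch_page(url)
        except OSError:
            continue  # unreachable page: skip it and keep crawling
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc == host:
                queue.append(absolute)
    return seen
```

The fetch_page parameter is only there so the crawl logic can be tried against canned pages before pointing it at a real site.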

Of course, crawlers typically only follow plain ol' HTML links, so if
you have forms-driven navigation or JavaScript-driven navigation
you'll have to figure out how to get to everything.
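
One way to find out how much of a page a link-only crawler would miss is to list the forms, external scripts, and inline click handlers on it. The following is a rough sketch in Python under that assumption; the class and function names are made up for illustration, and it only reports what it finds rather than submitting forms or running scripts.

```python
from html.parser import HTMLParser


class NavigationAudit(HTMLParser):
    """Record navigation targets that are not plain <a href> links."""

    def __init__(self):
        super().__init__()
        self.forms = []     # (METHOD, action) pairs
        self.scripts = []   # external script URLs
        self.handlers = []  # inline onclick code

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            method = (attrs.get("method") or "get").upper()
            self.forms.append((method, attrs.get("action") or ""))
        elif tag == "script" and attrs.get("src"):
            self.scripts.append(attrs["src"])
        if attrs.get("onclick"):
            self.handlers.append(attrs["onclick"])


def audit(html):
    """Parse one page and return the populated NavigationAudit."""
    parser = NavigationAudit()
    parser.feed(html)
    return parser
```

Pages where any of these lists come back non-empty are the ones that need extra handling beyond following links.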

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:324332

Re: Directory Crawler that looks like a user-agent

by Michael Dinowitz



And you have to take into account navigation via JavaScript, where links
could be inside included .js files. An intelligent CFHTTP-based bot
can do it all, but you need to code it.
There is probably a non-CF-based software solution already out there to
do just this.
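
As a rough illustration of pulling links out of an included .js file, here is a heuristic sketch in Python. It only looks for quoted strings that start with "/" or "http", so URLs built by string concatenation at runtime will be missed; the pattern and the function name are assumptions, not part of any existing tool.

```python
import re

# Heuristic: a quoted string that starts with "/" or with http(s)://
URL_IN_JS = re.compile(r"""["']((?:https?://|/)[^"'\s]+)["']""")


def urls_in_javascript(source):
    """Return candidate URLs found in JavaScript source, in order, no repeats."""
    found = []
    for match in URL_IN_JS.finditer(source):
        candidate = match.group(1)
        if candidate not in found:
            found.append(candidate)
    return found
```

The candidates would then be fed back into whatever queue the bot uses for ordinary HTML links.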

On Tue, Jul 7, 2009 at 7:04 PM, Dave Watts <dwatts@...> wrote:
> Of course, crawlers typically only follow plain ol' HTML links, so if
> you have forms-driven navigation or JavaScript-driven navigation
> you'll have to figure out how to get to everything.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:324334