ColdFusion crawler that appears to be a user-agent?

by TinaRock


At my job we have a secure website. Every hit to the site is captured by the tracking system to the SQL Server database.

We need to create an inventory system that can look at the data and tell us about the assets on the site.

To get the appropriate data into the database, we need to use a directory crawler that can hit every asset and every item that it finds in the directory structure.

Is there such a crawler, that can appear to be a user-agent, that can crawl a secure website? Is there such a crawler in ColdFusion?

Any ideas or pointers to such a crawler would be very much appreciated.

Thanks in advance,

Jo-Anne Head

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:324325
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=17837.14401.4

Re: ColdFusion crawler that appears to be a user-agent?

by Michael Dinowitz


It should take you all of 2 minutes to write one using cfhttp with the
useragent attribute set to a browser string, something like "Mozilla/5.0
(Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316
Firefox/3.0.10".

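This is a fragment of a bot I wrote to go to a manga site and download
all of the images of a specific comic:
<!--- Keep the user-agent string on one line; a wrapped line would put a line break inside the value --->
<cfset useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10">
<cfset siteroot="http://www.onemanga.com">
<cfset Seriesname="Change_123">
<!--- Fetch the series index page, identifying as Firefox --->
<cfhttp url="#siteroot#/#Seriesname#/" method="get" useragent="#useragent#">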

Oh, you'll also need some regex to identify URLs and other navigation
elements to populate the 'next' cfhttp calls. I usually use something
like this:
<cfset regex=structnew()>
<!--- Keep only what sits between the content marker and the footer div --->
<cfset regex.pagetrim='^.+?<!-- Start of ''content'' container -->(.+?)<div id="footer">.+$'>
<cfset page=rereplacenocase(cfhttp.filecontent, regex.pagetrim, '\1')>

That particular regex trims out the stuff I don't need from the page
so I can run other regexes on it to get the content I do need. This
makes the job faster as there is less to parse through, especially if
you're doing replaces.
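
For the 'next' cfhttp calls, here is a rough sketch of how a crawl loop might be put together. It is not part of the bot above. The siteroot placeholder, the maxpages cap, and the simple href pattern are all assumptions for illustration, and it only follows root-relative links such as href="/some/page.cfm". It needs ColdFusion 8 or later for rematchnocase and for looping over an array.

```cfml
<!--- Sketch only: siteroot, maxpages and the href pattern are assumptions --->
<cfset useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10">
<cfset siteroot = "http://www.example.com">
<cfset maxpages = 100>

<!--- Paths still to fetch, and paths already fetched --->
<cfset queue = arraynew(1)>
<cfset arrayappend(queue, "/")>
<!--- Struct keys are case-insensitive, so /Page and /page count as one path --->
<cfset visited = structnew()>

<cfloop condition="arraylen(queue) gt 0 and structcount(visited) lt maxpages">
    <!--- Take the next path off the front of the queue --->
    <cfset path = queue[1]>
    <cfset arraydeleteat(queue, 1)>

    <cfif not structkeyexists(visited, path)>
        <cfset visited[path] = true>
        <cfset target = siteroot & path>
        <cfhttp url="#target#" method="get" useragent="#useragent#" timeout="30">

        <!--- Only parse HTML responses; images and other assets are just requested --->
        <cfif findnocase("text/html", cfhttp.mimetype)>
            <!--- Each match looks like href="/some/path" --->
            <cfset links = rematchnocase('href="/[^"]*"', cfhttp.filecontent)>
            <cfloop array="#links#" index="link">
                <!--- Drop the leading href=" (6 characters) and the closing quote --->
                <cfset arrayappend(queue, mid(link, 7, len(link) - 7))>
            </cfloop>
        </cfif>
    </cfif>
</cfloop>

<cfoutput>Requested #structcount(visited)# paths.</cfoutput>
```

The visited struct keeps the loop from requesting the same path twice, and the maxpages cap is there so a link cycle or a very large site can't run away on you. Relative links, src attributes, and query-string variations are left out here and would need their own patterns.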

On Tue, Jul 7, 2009 at 3:40 PM, tinarock@... wrote:

>
> At my job we have a secure website. Every hit to the site is captured by the tracking system to the SQL Server database.
>
> We need to create an inventory system that can look at the data and tell us about the assets on the site.
>
> To get the appropriate data into the database, we need to use a directory crawler that can hit every asset and every item that it finds in the directory structure.
>
> Is there such a crawler, that can appear to be a user-agent, that can crawl a secure website? Is there such a crawler in ColdFusion?
>
> Any ideas or pointers to such a crawler would be very much appreciated.
>
> Thanks in advance,
>
> Jo-Anne Head
>
>

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:324331