It should take you all of 2 minutes to write one using cfhttp with the
useragent attribute set to something like "Mozilla/5.0 (Windows; U;
Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10".
This is a fragment of a bot I wrote to go to a manga site and download
all of the images of a specific comic.
<cfset useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10">
<cfset siteroot="http://www.onemanga.com">
<cfset Seriesname="Change_123">
<cfhttp url="#siteroot#/#Seriesname#/" useragent="#useragent#">
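Before parsing anything it is probably worth checking that the request actually worked. Here is a rough sketch of how that might look; the result variable name "pageResult", the 30-second timeout, and the error message are my own illustrative choices, not part of the original bot.

```cfml
<!--- Sketch only: pageResult, the timeout value and the error text are assumptions. --->
<cfset useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10">
<cfset siteroot="http://www.onemanga.com">
<cfset Seriesname="Change_123">

<!--- result= stores the response in a named struct instead of the default cfhttp variable. --->
<cfhttp url="#siteroot#/#Seriesname#/"
        method="get"
        useragent="#useragent#"
        timeout="30"
        result="pageResult">

<!--- statusCode looks like "200 OK"; val() keeps only the leading number (0 if there is none). --->
<cfif val(pageResult.statusCode) eq 200>
    <cfset html=pageResult.fileContent>
<cfelse>
    <cfthrow message="Request failed: #pageResult.statusCode#">
</cfif>
```

If the site sits behind HTTP basic authentication, cfhttp also accepts username and password attributes. A form-based login would need more work than this sketch covers.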
Oh, you'll also need some regex to identify URLs and other navigation
elements to populate the 'next' cfhttp calls. I usually use something
like this:
<cfset regex=structnew()>
<cfset regex.pagetrim='^.+?<!-- Start of ''content'' container -->(.+?)<div id="footer">.+$'>
<cfset page=rereplacenocase(cfhttp.filecontent, regex.pagetrim, '\1')>
That particular regex trims out the stuff I don't need from the page
so I can run other regex on it to get the content I do need. This
makes the job faster as there is less to parse through, especially if
you're doing replaces.
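To give an idea of the "other regex" step, here is a minimal sketch that pulls link targets out of the trimmed page and feeds them to the next cfhttp calls. The sample markup, the href pattern, and the names regex.href, links and nextPage are assumptions for illustration. The real site's markup would need its own pattern. It uses rematchnocase and cfloop array=, so it assumes ColdFusion 8 or later.

```cfml
<!--- Sketch only: the sample page, pattern and variable names are assumptions. --->
<cfset useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10">
<cfset siteroot="http://www.onemanga.com">

<!--- Stand-in for the trimmed page content produced by regex.pagetrim. --->
<cfset page='<a href="/Change_123/1/">Chapter 1</a> <a href="/Change_123/2/">Chapter 2</a>'>

<cfset regex=structnew()>
<!--- Match every href="..." attribute in the page. --->
<cfset regex.href='href="[^"]+"'>
<cfset matches=rematchnocase(regex.href, page)>

<cfset links=arraynew(1)>
<cfloop array="#matches#" index="m">
    <!--- Strip the href=" prefix and the closing quote, leaving the bare path. --->
    <cfset arrayappend(links, rereplacenocase(m, '^href="([^"]+)"$', '\1'))>
</cfloop>

<!--- links now holds /Change_123/1/ and /Change_123/2/; each one drives a follow-up request. --->
<cfloop array="#links#" index="path">
    <cfhttp url="#siteroot##path#" useragent="#useragent#" result="nextPage">
    <!--- nextPage.fileContent would be trimmed and parsed the same way as the first page. --->
</cfloop>
```

For a crawl of a whole site you would also want to keep track of the URLs already visited so the loop doesn't fetch the same page twice.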
On Tue, Jul 7, 2009 at 3:40 PM, tinarock@... <tinarock@...> wrote:
>
> At my job we have a secure website. Every hit to the site is captured by the tracking system to the SQL Server database.
>
> We need to create an inventory system that can look at the data and tell us about the assets on the site.
>
> To get the appropriate data into the database, we need to use a directory crawler that can hit every asset and every item that it finds in the directory structure.
>
> Is there such a crawler, that can appear to be a user-agent, that can crawl a secure website? Is there such a crawler in ColdFusion?
>
> Any ideas or pointers to such a crawler would be very much appreciated.
>
> Thanks in advance,
>
> Jo-Anne Head
>
>
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:324331