ColdFusion crawler that appears to be a user-agent?

At my job we have a secure website. Every hit to the site is captured by the tracking system to the SQL Server database.

We need to create an inventory system that can look at the data and tell us about the assets on the site.

To get the appropriate data into the database, we need to use a directory crawler that can hit every asset and every item that it finds in the directory structure.

Is there such a crawler, that can appear to be a user-agent, that can crawl a secure website? Is there such a crawler in ColdFusion?

Any ideas or pointers to such a crawler would be very much appreciated.

Thanks in advance,

Jo-Anne Head

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Want to reach the ColdFusion community with something they want? Let them know on the House of Fusion mailing lists
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:324325
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=17837.14401.4
Re: ColdFusion crawler that appears to be a user-agent?

It should take you all of 2 minutes to write one using cfhttp with the useragent attribute set to something like "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10".

This is a fragment of a bot I wrote to go to a manga site and download all of the images of a specific comic.

    <cfset useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10">
    <cfset siteroot="http://www.onemanga.com">
    <cfset Seriesname="Change_123">
    <cfhttp url="#siteroot#/#Seriesname#/" useragent="#useragent#">

Oh, you'll also need some regex to identify URLs and other navigation elements to populate the 'next' cfhttp calls. I usually use something like this:

    <cfset regex=structnew()>
    <cfset regex.pagetrim='^.+?<!-- Start of ''content'' container -->(.+?)<div id="footer">.+$'>
    <cfset page=rereplacenocase(cfhttp.filecontent, regex.pagetrim, '\1')>

That particular regex trims out the stuff I don't need from the page so I can run other regexes on it to get the content I do need. This makes the job faster, as there is less to parse through, especially if you're doing replaces.

On Tue, Jul 7, 2009 at 3:40 PM, tinarock@... wrote:
> At my job we have a secure website. Every hit to the site is captured by the tracking system to the SQL Server database.
>
> We need to create an inventory system that can look at the data and tell us about the assets on the site.
>
> To get the appropriate data into the database, we need to use a directory crawler that can hit every asset and every item that it finds in the directory structure.
>
> Is there such a crawler, that can appear to be a user-agent, that can crawl a secure website? Is there such a crawler in ColdFusion?
>
> Any ideas or pointers to such a crawler would be very much appreciated.
>
> Thanks in advance,
>
> Jo-Anne Head

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:324331
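
The reply mentions using regex to pick out URLs that feed the 'next' cfhttp calls but does not show that step. Here is a minimal sketch of one way it might look, assuming ColdFusion 8 or later (for reMatchNoCase and cfloop's array attribute). The variable names hrefs, link, seen and queue are hypothetical and not from the original bot, and the href pattern is a simplification that only catches double-quoted, site-relative links.

```cfml
<!--- Sketch only: fetch one page, then collect site-relative links to crawl next. --->
<cfset useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10">
<cfset siteroot = "http://www.onemanga.com">

<cfhttp url="#siteroot#/" method="get" useragent="#useragent#" timeout="30">

<!--- reMatchNoCase returns an array holding every substring that matches the pattern. --->
<cfset hrefs = reMatchNoCase('href="[^"]+"', cfhttp.fileContent)>

<cfset seen = structNew()>   <!--- hypothetical: links already queued --->
<cfset queue = arrayNew(1)>  <!--- hypothetical: absolute URLs still to fetch --->

<cfloop array="#hrefs#" index="h">
    <!--- Strip the href="..." wrapper to leave just the link value. --->
    <cfset link = reReplaceNoCase(h, '^href="([^"]+)"$', '\1')>

    <!--- Keep only links that start with a single "/" so the crawl stays on this host,
          and skip anything that has already been queued. --->
    <cfif left(link, 1) eq "/" and left(link, 2) neq "//" and not structKeyExists(seen, link)>
        <cfset seen[link] = true>
        <cfset arrayAppend(queue, siteroot & link)>
    </cfif>
</cfloop>

<!--- Each entry in queue can now be passed to another cfhttp call with the same useragent. --->
<cfoutput>#arrayLen(queue)# links queued</cfoutput>
```

Tracking queued links in a struct keeps the crawler from requesting the same page twice, which matters when every hit is being logged to a database.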