[jira] Commented: (NUTCH-101) RobotRulesParser

[ https://issues.apache.org/jira/browse/NUTCH-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722014#action_12722014 ]

Ken Krugler commented on NUTCH-101:
-----------------------------------

1. Not sure if the reported problem with "Disallow:" was fixed, or never existed, but the 1.0 code base has no issues with this.

2. Not sure if the reported problem with parsing multiple agent names was fixed, or never existed, but this code:

StringTokenizer tok = new StringTokenizer(agentNames, ",");

is now

StringTokenizer agentTokenizer = new StringTokenizer(agentNames);

Which means it will break on space, tab, return, etc. (white space) but not ','

3. The reported problem with:

StringTokenizer lineParser= new StringTokenizer(content, "\n\r");

doesn't exist. StringTokenizer will break on either \n or \r, and if these occur together (e.g. DOS line endings) then it still works properly because the empty string it finds between the \n and the \r isn't returned (treated as an empty token).

You could add support for the more esoteric endings described above, but that doesn't seem very important.

4. The minor bug in main() appears to have been fixed. The code is now:

String[] robotNames= new String[argv.length - 2];

So I think this issue can be closed.

> RobotRulesParser
> ----------------
>
> Key: NUTCH-101
> URL: https://issues.apache.org/jira/browse/NUTCH-101
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.6, 0.7, 0.7.1, 0.8
> Reporter: Fuad Efendi
>
> I noticed this code in protocol-http & protocol-httpclient plugins:
> } else if ( (line.length() >= 6)
> && (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {
> However, according to the original 1994 protocol description, there is NO "Allow:" field. To allow, simply use "Disallow: ".
> http://www.robotstxt.org/wc/norobots.html
> Please, try to test with www.newegg.com/robots.txt
> - their site has this:
> User-agent: *
> Disallow:
> And Nutch does not work with New Egg, but it should!
> Sorry guys, I don't have enough time to double-ensure, could you please verify all this...
> I noticed strange discussion at nutch-agent:lucene.apache.org, it seems that we need to test ......./robots.txt
> User-agent: ia_archiver
> Disallow: /
> User-agent: Googlebot-Image
> Disallow: /
> User-agent: Nutch
> Disallow: /
> User-agent: TurnitinBot
> Disallow: /
> - everything according to standard protocol. Can you retest please whether it works with multiline? It's a standard!
> I see this in code:
> StringTokenizer tok = new StringTokenizer(agentNames, ",");
>
> Comma separated? It's not accepted standard yet...
> Sorry WebExpertsAmerica, I really didn't have any time to make any test...
> Please do not execute tests against production sites.
> Thanks!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
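
The StringTokenizer behavior described in points 2 and 3 of the comment can be checked outside Nutch with a small standalone sketch. The class name TokenizerDemo, the helper tokens(), and the sample strings are made up for illustration and are not taken from the Nutch source; only java.util.StringTokenizer itself is real.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of Nutch.
final class TokenizerDemo {

    // Collects every token from input. Passing null for delims uses
    // StringTokenizer's default delimiter set (space, tab, \n, \r, \f).
    static List<String> tokens(String input, String delims) {
        StringTokenizer tok = (delims == null)
                ? new StringTokenizer(input)
                : new StringTokenizer(input, delims);
        List<String> out = new ArrayList<>();
        while (tok.hasMoreTokens()) {
            out.add(tok.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative agent-name string, not from a real configuration.
        String agentNames = "nutch,googlebot other";

        // Default delimiters: splits on white space, the comma stays in the token.
        System.out.println(tokens(agentNames, null)); // [nutch,googlebot, other]

        // Explicit "," delimiter: splits on the comma, the space stays in the token.
        System.out.println(tokens(agentNames, ","));  // [nutch, googlebot other]

        // DOS line endings: each of \r and \n is a delimiter, and no empty
        // token is produced for the gap between them.
        String content = "User-agent: *\r\nDisallow:\r\n";
        System.out.println(tokens(content, "\n\r"));  // [User-agent: *, Disallow:]
    }
}
```

One consequence of tokenizing this way is that blank lines in robots.txt never reach the parser as separate tokens, since consecutive delimiters are collapsed.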