[jira] Created: (HADOOP-2566) need FileSystem#globStatus method


[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559307#action_12559307 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

Stat2Paths is public because I expect that users need to use it after we remove listPath etc. in release 0.17. OK, I will put it in FileUtil.

I am thinking of keeping globPath for now. After we make the changes in FSShell, we can decide whether we can live without it.

> need FileSystem#globStatus method
> ---------------------------------
>
>                 Key: HADOOP-2566
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2566
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Doug Cutting
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: globStatus.patch
>
>
> To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting performance, we must use file enumeration APIs that return FileStatus[] rather than Path[].  Currently we have FileSystem#globPaths(), but that method should be deprecated and replaced with a FileSystem#globStatus().
> We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the cache in 0.17.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559311#action_12559311 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

> If the distinction is important then perhaps the non-existing file case should return null, while the non-matching expression case should return an empty array.
Just saw this. Sounds good. I will make changes to globStatus reflecting this.



[jira] Updated: (HADOOP-2566) need FileSystem#globStatus method



     [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-2566:
----------------------------------

    Attachment: globStatus1.patch

I deprecated globPath in this patch and changed the semantics of globStatus: it returns null if the user-supplied path does not exist, and an empty array if the input path contains a glob but no path matches it.
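
The null-versus-empty-array contract described above can be illustrated with a minimal, self-contained Java sketch. It does not use the Hadoop API: the class `GlobResultDemo`, its in-memory set of paths, and the restriction to a single trailing `*` are all assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for a file system; this is not Hadoop code. */
final class GlobResultDemo {
    private final Set<String> paths = new TreeSet<>();

    GlobResultDemo(String... existing) {
        for (String p : existing) {
            paths.add(p);
        }
    }

    /**
     * Mimics the contract discussed in the thread:
     * null for a plain path that does not exist,
     * an empty array for a glob that matches nothing.
     * Only a single trailing '*' is supported in this sketch.
     */
    String[] glob(String pattern) {
        if (!pattern.endsWith("*")) {
            // No glob: the path either exists or it does not.
            return paths.contains(pattern) ? new String[] {pattern} : null;
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        List<String> matches = new ArrayList<>();
        for (String p : paths) {
            // Match only entries at the same directory level as the pattern.
            if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0) {
                matches.add(p);
            }
        }
        return matches.toArray(new String[0]);
    }

    /** Shows how a caller can tell the two "nothing found" cases apart. */
    static String describe(String[] result) {
        if (result == null) {
            return "path does not exist";
        }
        if (result.length == 0) {
            return "glob matched nothing";
        }
        return result.length + " match(es)";
    }
}
```

A caller that treats both cases the same way can simply check `result == null || result.length == 0`; the distinction matters only when a missing literal path should be reported as an error.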



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559679#action_12559679 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

Doesn't this patch essentially do ' {{arr; for (path : old_globPaths()) arr[i++] = getFileStatus(path); return arr;}} '? Is this what we wanted? I thought we wanted the other way around.

Also, this looks like a regression of HADOOP-2151, since 'hasPattern()' is only checked for the last component.



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559681#action_12559681 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

Is this a blocker for 16 feature freeze?




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559692#action_12559692 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

> Also this looks like a regression of HADOOP-2151 since 'hasPattern()' is only checked for last component.
e.g.:
with the patch: {noformat}
$ bin/hadoop fs -D fs.default.name=local -ls '/tmp/x*/xxx'
ls: Could not get listing for file:/tmp/x/xxx
$ {noformat}
trunk: {noformat}
$ bin/hadoop fs -D fs.default.name=local -ls '/tmp/x*/xxx'
$ {noformat}

Here, {{/tmp/x}} exists but not '{{/tmp/x/xxx}}'. It does not matter whether /tmp/x is a file or a directory.
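
The difference between checking only the last path component and checking every component for a glob can be sketched as follows. This is an illustration only, not the code in the patch or in trunk: the class `GlobComponentDemo`, its method names, and the assumed set of glob metacharacters (`*`, `?`, `[`, `{`) are all hypothetical.

```java
/** Hypothetical helper; not the hasPattern() implementation from Hadoop. */
final class GlobComponentDemo {

    /** True if the component contains a glob metacharacter (assumed set: * ? [ {). */
    static boolean hasGlob(String component) {
        for (char c : component.toCharArray()) {
            if (c == '*' || c == '?' || c == '[' || c == '{') {
                return true;
            }
        }
        return false;
    }

    /** Checks only the final component, as the comment above says the patch does. */
    static boolean lastComponentHasGlob(String path) {
        String[] parts = path.split("/");
        return hasGlob(parts[parts.length - 1]);
    }

    /** Checks every component, so a glob in a parent directory is also seen. */
    static boolean anyComponentHasGlob(String path) {
        for (String part : path.split("/")) {
            if (hasGlob(part)) {
                return true;
            }
        }
        return false;
    }
}
```

For `/tmp/x*/xxx` the last component `xxx` contains no glob, so a check limited to the last component treats the path as a literal one, even though the `x*` component is a pattern.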




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559723#action_12559723 ]

Doug Cutting commented on HADOOP-2566:
--------------------------------------

> Is this what we wanted? I thought we wanted other way around.

I don't think it does that in all cases, but it does still appear to call getStatus() in places.  I've not yet examined the logic to see if that's easily avoidable or not.  But it's not a fatal problem at this point.  For this release the important thing is to have globStatus() as the preferred, non-deprecated method.  Once we remove the status cache, during 0.17 development, we'll soon find out whether the globStatus() implementation needs more work to perform well without a cache, and fix that before 0.17 is released.  But that aspect shouldn't block this for 0.16, since we still have the cache in 0.16.




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559735#action_12559735 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

> Doesn't this patch essentially do ' arr; for (path : old_globPaths()) arr[i++] = getFileStatus(path); return arr; '. Is this what we wanted? I thought we wanted other way around.
No, this patch does not do what you described. Basically, it calls listStatus on the parent directory only when a component contains a glob, and it calls getFileStatus on the last component if that component has no glob.

For example 1, globStatus("/user/hairong/file*") only does listStatus("/user/hairong") and returns the statuses of the matched files/subdirectories. Previously, globPath("/user/hairong/file*") would listStatus("/user/hairong"), discard the statuses of the matched files/subdirectories, and return only paths, so the caller had to call getFileStatus again for each returned path.

For example 2, globStatus("/user/*/file") calls listStatus("/user") and then calls getFileStatus on all the paths matching /user/*/file. In that case it does what you described.
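
The two examples above can be traced with a small in-memory model that counts listing and status calls. This is a sketch of the behavior as described in the comment, not the patch itself: the class `GlobCallCountDemo`, the use of plain strings in place of FileStatus objects, and the support for `*` only are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Hypothetical in-memory file system that counts calls; not Hadoop code. */
final class GlobCallCountDemo {
    private final Set<String> paths = new TreeSet<>();
    int listCalls = 0;
    int statCalls = 0;

    /** Every file and directory must be passed explicitly, e.g. "/user" and "/user/bob". */
    GlobCallCountDemo(String... existing) {
        paths.addAll(Arrays.asList(existing));
    }

    /** Returns the direct children of dir; stands in for listStatus. */
    List<String> listStatus(String dir) {
        listCalls++;
        String prefix = dir + "/";
        List<String> children = new ArrayList<>();
        for (String p : paths) {
            if (p.startsWith(prefix) && p.indexOf('/', prefix.length()) < 0) {
                children.add(p);
            }
        }
        return children;
    }

    /** Returns the path if it exists, else null; stands in for getFileStatus. */
    String getFileStatus(String path) {
        statCalls++;
        return paths.contains(path) ? path : null;
    }

    /** Expands an absolute pattern component by component. */
    List<String> globStatus(String pattern) {
        String[] parts = pattern.substring(1).split("/");
        List<String> current = new ArrayList<>();
        current.add(""); // the root directory
        for (int i = 0; i < parts.length; i++) {
            boolean last = (i == parts.length - 1);
            List<String> next = new ArrayList<>();
            if (parts[i].contains("*")) {
                // Glob component: one listing per parent, results reused directly.
                Pattern regex = Pattern.compile(
                        Pattern.quote(parts[i]).replace("*", "\\E.*\\Q"));
                for (String dir : current) {
                    for (String child : listStatus(dir)) {
                        String name = child.substring(child.lastIndexOf('/') + 1);
                        if (regex.matcher(name).matches()) {
                            next.add(child);
                        }
                    }
                }
            } else {
                // Literal component: only the last one is verified, one status call each.
                for (String dir : current) {
                    String candidate = dir + "/" + parts[i];
                    if (!last || getFileStatus(candidate) != null) {
                        next.add(candidate);
                    }
                }
            }
            current = next;
        }
        return current;
    }
}
```

In this model, a pattern shaped like example 1 costs one listing and no status calls, while a pattern shaped like example 2 costs one listing plus one status call per directory matched by the `*` component.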



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559742#action_12559742 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

You are right, it does what I described only in example 2 above and not in the first example. When I get some more time I will think about how to avoid it in the second example as well. Something like replacing globPathLevel() with globStatusLevel().

 



[jira] Issue Comment Edited: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559742#action_12559742 ]

rangadi edited comment on HADOOP-2566 at 1/16/08 3:52 PM:
---------------------------------------------------------------

You are right, it does what I described only in example 2 above and not in the first example. When I get some more time I will think about how to avoid it in the second example as well. Something like replacing globPathLevel() with globStatusLevel(). By the way, the glob path in the second example is {{/user/*/file}}.



[jira] Issue Comment Edited: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559735#action_12559735 ]

hairong edited comment on HADOOP-2566 at 1/16/08 3:53 PM:
----------------------------------------------------------------

> Doesn't this patch essentially do ' arr; for (path : old_globPaths()) arr[i++] = getFileStatus(path); return arr; '. Is this what we wanted? I thought we wanted other way around.
No, this patch does not do what you described. Basically, it calls listStatus on the parent directory only when a component contains a glob, and it calls getFileStatus on the last component if that component has no glob.

For example 1, globStatus("/user/hairong/file*") only does listStatus("/user/hairong") and returns the statuses of the matched files/subdirectories. Previously, globPath("/user/hairong/file*") would listStatus("/user/hairong"), discard the statuses of the matched files/subdirectories, and return only paths, so the caller had to call getFileStatus again for each returned path.

For example 2, globStatus("/user/\*/file") calls listStatus("/user") and then calls getFileStatus on all the paths matching /user/*/file. It does what you described.



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559746#action_12559746 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

Isn't the behavior in example 2 what we expect? I must have misunderstood something. What do you want to avoid?



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559749#action_12559749 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

> What do you want to avoid?
Exactly. After talking to you, it looks like there isn't anything we can avoid. My initial thought was that we could avoid the final getStatus() loop. Thanks.




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559760#action_12559760 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

Thanks Raghu! Your comment relieved my mind. I don't want to have a wrong algorithm right before the feature freeze.

Regarding the regression: yes, I removed what HADOOP-2151 did because I think it is inefficient to call exists for each component when there is a glob in the path. My algorithm depends on getFileStatus throwing an exception to indicate a non-existent path. That works on dfs, but LocalFileSystem.getFileStatus returns a valid FileStatus object. I will fix this. I'd like to change the semantics of getFileStatus to return null on a non-existent path. Thanks for helping me test this feature.



[jira] Issue Comment Edited: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559760#action_12559760 ]

hairong edited comment on HADOOP-2566 at 1/16/08 4:27 PM:
----------------------------------------------------------------

Thanks Raghu! Your comment relieved my mind. I don't want to have a wrong algorithm right before the feature freeze.

Regarding the regression: yes, I removed what HADOOP-2151 did because I think it is inefficient to call exists for each component when there is a glob in the path. My algorithm depends on getFileStatus throwing an exception to indicate a non-existent path. That works on dfs, but LocalFileSystem.getFileStatus returns a valid FileStatus object on a non-existent path. I will fix this. I'd like to change the semantics of getFileStatus to return null on a non-existent path. Thanks for helping me test this feature.



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559766#action_12559766 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

hmm... by luck, there is one small side benefit from my misread of the code :). I checked with localfs only because it did not need a running dfs.




[jira] Updated: (HADOOP-2566) need FileSystem#globStatus method



     [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-2566:
----------------------------------

    Attachment: globStatus2.patch

Raghu, whether it is by luck or not, it is always good to have somebody take the extra effort to make sure that your code works correctly. :-)

This patch fixes the non-existent path problem. It introduces an incompatible change: getFileStatus now returns null for a non-existent path. In the current trunk, dfs throws a RemoteException with a "file does not exist" message and the local file system returns a valid FileStatus object.

> need FileSystem#globStatus method
> ---------------------------------
>
>                 Key: HADOOP-2566
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2566
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Doug Cutting
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: globStatus.patch, globStatus1.patch, globStatus2.patch
>
>
> To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting performance, we must use file enumeration APIs that return FileStatus[] rather than Path[].  Currently we have FileSystem#globPaths(), but that method should be deprecated and replaced with a FileSystem#globStatus().
> We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the cache in 0.17.



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method



    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559781#action_12559781 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

On second thought, I see a lot of calls like getFileStatus().isDir in our code base. Making getFileStatus return null would break those calls. How about having getFileStatus throw a FileNotFoundException when the path does not exist?
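
To make the trade-off concrete, here is a minimal sketch of the two behaviors. It does not use the Hadoop classes: Status, GetFileStatusSketch and getFileStatusOrNull are hypothetical stand-ins invented for illustration, and the in-memory map only imitates a namespace. The point is that a chained call on the null-returning variant fails with a NullPointerException that says nothing about the path, while the throwing variant reports the missing path directly.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

public class GetFileStatusSketch {
    /** Hypothetical stand-in for FileStatus; only isDir() is modeled. */
    static final class Status {
        private final boolean dir;
        Status(boolean dir) { this.dir = dir; }
        boolean isDir() { return dir; }
    }

    /** Toy in-memory namespace (an assumption, not the real file system). */
    private final Map<String, Status> entries = new HashMap<>();

    GetFileStatusSketch() {
        entries.put("/data", new Status(true));
        entries.put("/data/part-0", new Status(false));
    }

    /** Variant A: return null when the path does not exist. */
    Status getFileStatusOrNull(String path) {
        return entries.get(path);
    }

    /** Variant B: throw FileNotFoundException when the path does not exist. */
    Status getFileStatus(String path) throws FileNotFoundException {
        Status status = entries.get(path);
        if (status == null) {
            throw new FileNotFoundException("File does not exist: " + path);
        }
        return status;
    }

    public static void main(String[] args) {
        GetFileStatusSketch fs = new GetFileStatusSketch();

        // Variant A: the common chained call dereferences null.
        try {
            fs.getFileStatusOrNull("/missing").isDir();
        } catch (NullPointerException e) {
            System.out.println("variant A: NullPointerException on chained call");
        }

        // Variant B: the same call shape fails with an error naming the path.
        try {
            fs.getFileStatus("/missing").isDir();
        } catch (FileNotFoundException e) {
            System.out.println("variant B: " + e.getMessage());
        }
    }
}
```

With variant B, existing callers of getFileStatus().isDir keep compiling wherever they already handle IOException, since FileNotFoundException is a subclass of it.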

> need FileSystem#globStatus method
> ---------------------------------
>
>                 Key: HADOOP-2566
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2566
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Doug Cutting
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: globStatus.patch, globStatus1.patch, globStatus2.patch
>
>
> To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting performance, we must use file enumeration APIs that return FileStatus[] rather than Path[].  Currently we have FileSystem#globPaths(), but that method should be deprecated and replaced with a FileSystem#globStatus().
> We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the cache in 0.17.



[jira] Updated: (HADOOP-2566) need FileSystem#globStatus method



     [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-2566:
----------------------------------

    Attachment: globStatus3.patch

> need FileSystem#globStatus method
> ---------------------------------
>
>                 Key: HADOOP-2566
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2566
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Doug Cutting
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: globStatus.patch, globStatus1.patch, globStatus2.patch, globStatus3.patch
>
>
> To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting performance, we must use file enumeration APIs that return FileStatus[] rather than Path[].  Currently we have FileSystem#globPaths(), but that method should be deprecated and replaced with a FileSystem#globStatus().
> We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the cache in 0.17.



[jira] Updated: (HADOOP-2566) need FileSystem#globStatus method



     [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-2566:
----------------------------------

    Attachment: globStatus4.patch

> need FileSystem#globStatus method
> ---------------------------------
>
>                 Key: HADOOP-2566
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2566
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Doug Cutting
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>         Attachments: globStatus.patch, globStatus1.patch, globStatus2.patch, globStatus3.patch, globStatus4.patch
>
>
> To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting performance, we must use file enumeration APIs that return FileStatus[] rather than Path[].  Currently we have FileSystem#globPaths(), but that method should be deprecated and replaced with a FileSystem#globStatus().
> We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the cache in 0.17.

