[jira] Created: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org

need FileSystem#globStatus method
---------------------------------

                 Key: HADOOP-2566
                 URL: https://issues.apache.org/jira/browse/HADOOP-2566
             Project: Hadoop
          Issue Type: Improvement
          Components: fs
            Reporter: Doug Cutting
             Fix For: 0.16.0


To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting performance, we must use file enumeration APIs that return FileStatus[] rather than Path[].  Currently we have FileSystem#globPaths(), but that method should be deprecated and replaced with a FileSystem#globStatus().

We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the cache in 0.17.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting reassigned HADOOP-2566:
------------------------------------

    Assignee: Hairong Kuang



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558067#action_12558067 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

Did you mean that we need FileStatus[] listStatus rather than Path[] listPaths?



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558071#action_12558071 ]

Doug Cutting commented on HADOOP-2566:
--------------------------------------

No, we need 'FileStatus[] globStatus(Path pattern)' instead of 'Path[] globPaths(Path pattern)'.



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558078#action_12558078 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

I do not see why we need globStatus. globPaths() is essentially pattern matching. If the provided path does not contain any pattern, the given path is returned without talking to the namenode.



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558081#action_12558081 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

globStatus would certainly be useful, since globPaths() is used in many places where we really want globStatus(). globStatus is much more efficient in those cases, since we often do {{for(path : globPaths(pattern)) { stat = listStatus(path) ... }}}.

I am not sure if globPaths() can go away. One difference I see is that globPaths("/non/existent/path/withoutglob") returns the path as given, without any filesystem interaction (as expected). But globStatus("/non/existent/path/withoutglob") will ask the filesystem and will return null (or an array with zero entries).
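
The cost difference described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: GlobCostSketch, Status, roundTrips, and the helper methods are hypothetical stand-ins rather than Hadoop classes, no FileStatus cache is assumed, and the glob is restricted to the form "<dir>/<prefix>*".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a filesystem client; not Hadoop's real API.
class GlobCostSketch {
    // Minimal status record: a path plus its length.
    static final class Status {
        final String path;
        final long len;
        Status(String path, long len) { this.path = path; this.len = len; }
    }

    private final List<Status> files = new ArrayList<>();
    int roundTrips = 0; // simulated calls to the namenode

    void add(String path, long len) { files.add(new Status(path, len)); }

    private static String parentOf(String path) {
        return path.substring(0, path.lastIndexOf('/'));
    }

    // One round trip: list a directory and return full status objects.
    List<Status> listStatus(String dir) {
        roundTrips++;
        List<Status> out = new ArrayList<>();
        for (Status s : files) {
            if (parentOf(s.path).equals(dir)) out.add(s);
        }
        return out;
    }

    // One round trip per call: look up a single path (no cache assumed).
    Status getStatus(String path) {
        roundTrips++;
        for (Status s : files) {
            if (s.path.equals(path)) return s;
        }
        return null;
    }

    // Glob limited to "<dir>/<prefix>*" to keep the sketch short.
    List<Status> globStatus(String pattern) {
        String dir = parentOf(pattern);
        String prefix = pattern.substring(dir.length() + 1, pattern.length() - 1);
        List<Status> out = new ArrayList<>();
        for (Status s : listStatus(dir)) {
            if (s.path.substring(dir.length() + 1).startsWith(prefix)) out.add(s);
        }
        return out;
    }

    // Path-only variant: same listing, with the status information thrown away.
    List<String> globPaths(String pattern) {
        List<String> out = new ArrayList<>();
        for (Status s : globStatus(pattern)) out.add(s.path);
        return out;
    }

    // Pattern 1: glob for paths, then ask for each status again.
    static int tripsViaGlobPaths(GlobCostSketch fs, String pattern) {
        fs.roundTrips = 0;
        for (String p : fs.globPaths(pattern)) fs.getStatus(p);
        return fs.roundTrips;
    }

    // Pattern 2: the glob already returns the status objects.
    static int tripsViaGlobStatus(GlobCostSketch fs, String pattern) {
        fs.roundTrips = 0;
        fs.globStatus(pattern);
        return fs.roundTrips;
    }

    static GlobCostSketch sample() {
        GlobCostSketch fs = new GlobCostSketch();
        fs.add("/data/part-0", 10);
        fs.add("/data/part-1", 20);
        fs.add("/data/part-2", 30);
        fs.add("/data/_logs", 5);
        return fs;
    }

    public static void main(String[] args) {
        GlobCostSketch fs = sample();
        System.out.println("globPaths + getStatus: " + tripsViaGlobPaths(fs, "/data/part-*"));
        System.out.println("globStatus only:       " + tripsViaGlobStatus(fs, "/data/part-*"));
    }
}
```

With three matching files in this model, the first pattern makes 4 simulated round trips (one listing plus one lookup per match) and the second makes 1.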




[jira] Issue Comment Edited: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558081#action_12558081 ]

rangadi edited comment on HADOOP-2566 at 1/11/08 11:36 AM:
----------------------------------------------------------------

globStatus would certainly be useful, since globPaths() is used in many places where we really want globStatus(). globStatus is much more efficient in those cases, since we often do '{{for(path : globPaths(pattern)) { stat = listStatus(path) ... }}}'.

I am not sure if globPaths() can go away. One difference I see is that globPaths("/non/existent/path/withoutglob") returns the path as given, without any filesystem interaction (as expected). But globStatus("/non/existent/path/withoutglob") will ask the filesystem and will return null (or an array with zero entries).





[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558082#action_12558082 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

Also, this would not duplicate code. {{globPaths()}} would just be implemented with {{globStatus()}} (when there is a glob in the path).




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558088#action_12558088 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

globPaths() is intended to return all paths that match the given glob. It is not intended to do '{{for(path : globPaths(pattern)) { stat = listStatus(path) ... }}}'. The feature that you want is listing all the paths that match the glob.




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558093#action_12558093 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

> 'for(path : globPaths(pattern)) { stat = listStatus(path) ... }'
FsShell.setReplication() is an example of this pattern of use (essentially).

I agree that globStatus() may not replace all uses of globPaths().



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558129#action_12558129 ]

Doug Cutting commented on HADOOP-2566:
--------------------------------------

Globbing is implemented on top of listPaths() which is implemented on top of listStatus().  The primitive globbing API should not throw away that status information.  It should keep it so that glob clients which need it do not have to call getStatus() for each file that matches.  Currently the cache of FileStatus hides the cost of these getStatus() calls, but that cache will break things once files and their status can change.  So we need globStatus() before we can remove the cache.

FileInputFormat, for example, uses globPaths() to list files matching the input specification, then it uses getStatus() on each matching path when building splits.  This must change to call globStatus() before the cache is removed.

Long-term, globPaths() and listPaths() may still be useful as utility methods implemented in terms of globStatus() and listStatus(). But since most current users of these will be broken performance-wise once the cache is removed, we should deprecate them now, to give fair warning and to strongly encourage folks to stop using them before that cache is removed.




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558179#action_12558179 ]

Hairong Kuang commented on HADOOP-2566:
---------------------------------------

I am still not comfortable with this change:

1. Some shell commands, like delete, copy, and rename, use globPaths() but don't need FileStatus.
2. globPaths() does not always call listPaths() for every directory. For example, globPaths("/user/*/data") only needs to call listPaths("/user"). Returning FileStatus[] requires listPaths() on each user xx's home directory /user/xx and on /user/xx/data. This is a lot of overhead.



[jira] Issue Comment Edited: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558179#action_12558179 ]

hairong edited comment on HADOOP-2566 at 1/11/08 5:01 PM:
----------------------------------------------------------------

I am still not comfortable with this change:

1. Some shell commands, like delete, copy, and rename, use globPaths() but don't need FileStatus.
2. globPaths() does not always call listPaths() for every directory. For example, globPaths("/user/*/data") only needs to call listPaths("/user"). Returning FileStatus[] requires additional listPaths() calls on each user xx's home directory /user/xx and on the root /. This is a lot of overhead.




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558742#action_12558742 ]

Doug Cutting commented on HADOOP-2566:
--------------------------------------



> For example, globPath("/user/*/data") needs only to listPath("/user").

But listPaths() is not a primitive, it is a utility method defined in terms of listStatus().  So this example is calling listStatus("/user") and then stripping the list of FileStatus objects down to a list of Path objects.  We should remove that stripping, or at least make it optional.  To make it optional, the primitive glob operation should be globStatus, and globPaths() should become a utility method defined in terms of globStatus().

> Some of shell commands like delete, copy, and rename use globPath but don't need FileStatus.

These actually all do need the FileStatus.  They need to find out whether each file is a directory or not, to find out when to recurse.  Copy also needs other attributes so that they can be set on the copy too.  So we'll end up needing to rework these.
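
A rough sketch of why a recursive command wants the status objects: the directory flag carried by each entry decides whether to recurse, so no extra per-path lookup is needed. The types below (RecurseSketch, Status, listCalls, sampleMatches) are hypothetical and only model the idea; they are not Hadoop's FileSystem or FileStatus.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only.
class RecurseSketch {
    static final class Status {
        final String path;
        final boolean isDir;
        final List<Status> children = new ArrayList<>();
        Status(String path, boolean isDir) { this.path = path; this.isDir = isDir; }
        Status add(Status child) { children.add(child); return this; }
    }

    int listCalls = 0; // simulated directory listings sent to the filesystem

    // One listing per directory; each entry already says whether it is a directory.
    List<Status> listStatus(Status dir) {
        listCalls++;
        return dir.children;
    }

    // Collect every plain file under the glob matches, recursing into directories.
    List<String> collectFiles(List<Status> matches) {
        List<String> out = new ArrayList<>();
        for (Status s : matches) {
            if (s.isDir) {
                out.addAll(collectFiles(listStatus(s)));
            } else {
                out.add(s.path);
            }
        }
        return out;
    }

    // Pretend result of a glob such as /user/*/data: one directory, one plain file.
    static List<Status> sampleMatches() {
        Status dir = new Status("/user/a/data", true)
            .add(new Status("/user/a/data/f1", false))
            .add(new Status("/user/a/data/sub", true)
                .add(new Status("/user/a/data/sub/f2", false)));
        List<Status> matches = new ArrayList<>();
        matches.add(dir);
        matches.add(new Status("/user/b/data", false));
        return matches;
    }

    public static void main(String[] args) {
        RecurseSketch fs = new RecurseSketch();
        System.out.println(fs.collectFiles(sampleMatches()));
        System.out.println("directory listings: " + fs.listCalls);
    }
}
```

In this model the only filesystem calls are one listing per directory visited. With a path-only glob, each match would first need a separate status lookup just to learn whether it is a directory.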

We will not remove globPaths() in this release, so these commands do not need to change right now.  But before we can remove the cache we need to examine every place that calls globPaths() to check whether it must be converted to use globStatus().  That's why we're deprecating globPaths(): to force folks to do this.  Then, in 0.17, we can remove the cache from trunk and start identifying all the problems.  But we want users who upgrade to 0.17 to be forewarned, and to have an API that supports cache-free use before we remove the cache, so that they can upgrade to 0.17 more smoothly.




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559274#action_12559274 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

Assuming globPaths() goes away... Should a user of globStatus() be able to distinguish between a non-existent path and a glob that does not match any files? If yes, how? Many Hadoop shell commands treat these two differently (I think that is consistent with normal shell behavior).




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559275#action_12559275 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

Also, until globPaths() is removed, it's probably better if its behavior (or contract?) does not change.



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559284#action_12559284 ]

Doug Cutting commented on HADOOP-2566:
--------------------------------------

> Should a user of globStatus() be able to distinguish between a non-existent path and a glob that does not match any files?

I'm not sure I completely understand the distinction.  In one case are you passing a path without any meta characters but that does not exist, and in the other one with metacharacters but that matches no files?

In any case it should probably handle this the same way globPaths() does.  If the distinction is important then perhaps the non-existing file case should return null, while the non-matching expression case should return an empty array.
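
One way the proposed null-versus-empty-array contract might look, as a sketch only. GlobContractSketch and describe are hypothetical names, the "status" is reduced to a plain path string for brevity, only a trailing '*' is treated as a glob, and the messages are made up rather than taken from the Hadoop shell.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the proposed contract; not Hadoop's implementation.
class GlobContractSketch {
    private final List<String> existing;

    GlobContractSketch(String... paths) { existing = Arrays.asList(paths); }

    // Returns null when a literal (glob-free) path does not exist,
    // and an empty array when a glob pattern matches nothing.
    String[] globStatus(String pattern) {
        if (!pattern.endsWith("*")) {
            return existing.contains(pattern) ? new String[] { pattern } : null;
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        List<String> out = new ArrayList<>();
        for (String p : existing) {
            if (p.startsWith(prefix)) out.add(p);
        }
        return out.toArray(new String[0]);
    }

    // A caller can now report the two failure cases differently.
    static String describe(String arg, String[] result) {
        if (result == null) return arg + ": no such file";
        if (result.length == 0) return arg + ": pattern matched nothing";
        return arg + ": " + result.length + " match(es)";
    }

    public static void main(String[] args) {
        GlobContractSketch fs = new GlobContractSketch("/tmp/exists");
        for (String arg : new String[] { "/tmp/nonexistent", "/tmp/nonexistent*", "/tmp/exists" }) {
            System.out.println(describe(arg, fs.globStatus(arg)));
        }
    }
}
```

Callers must check for null before iterating over the result, which is the price of encoding the distinction in the return value.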




[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559293#action_12559293 ]

Raghu Angadi commented on HADOOP-2566:
--------------------------------------

> I'm not sure I completely understand the distinction. In one case are you passing a path without any meta characters but that does not exist, and in the other one with metacharacters but that matches no files?

Yes. For example, the following two commands write different content to stderr:
- {{bin/hadoop fs -cat '/tmp/nonexistent*' /tmp/exists}}
- {{bin/hadoop fs -cat '/tmp/nonexistent' /tmp/exists}}

This is the current behavior.

> If the distinction is important then perhaps the non-existing file case should return null, while the non-matching expression case should return an empty array.
Sounds good. globPaths() can use this to keep the current behavior unchanged.



[jira] Updated: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


     [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-2566:
----------------------------------

    Attachment: globStatus.patch

My first attempt at this issue. Comments welcome!



[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method

by JIRA jira@apache.org


    [ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559299#action_12559299 ]

Doug Cutting commented on HADOOP-2566:
--------------------------------------

A few comments:
- should stat2paths be a public method on FileSystem?  I'd prefer it were either private or perhaps on FileUtil.
- globPaths() isn't deprecated.  Do we think we'll keep this, or should it be deprecated?  It is handy in some cases, but, on the other hand, we'd like to force folks to examine their uses of it, since in most cases performance will become abysmal once the FileStatus cache is removed, and we don't want to surprise folks with that.  Thoughts?

