Hadoop 0.19.1


by Nigel Daley

Folks,

Some Hadoop deployments have upgraded to 0.19.0.  Clearly, the 0.19  
branch has issues and a 0.19.1 release is needed.

Quality issues in the changes made for the file append feature have  
prevented some from deploying Hadoop 0.19.  One of these changes  
(sync) has now been "fixed" by reducing its semantics in Hadoop 0.18.3  
(HADOOP-4997).  This was necessary to stabilize the 0.18 branch.

I would like to propose that we apply this same "fix" to sync in  
0.19.1 and 0.20.0.  Since append requires the full semantics of sync,  
I propose we also disable append (perhaps throw  
UnsupportedOperationException from API?).  Yes, this would  
unfortunately be an incompatible change between 0.19.0 and 0.19.1.  We  
can then take the time needed to fix append properly in 0.21.0.
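The Python sketch below is a rough illustration of what "disable append" would mean for callers. It assumes a toy in-memory filesystem. `GuardedFileSystem` and its `append_enabled` switch are hypothetical names and are not part of Hadoop's API. Python's `NotImplementedError` stands in for the Java `UnsupportedOperationException` suggested above.

```python
class GuardedFileSystem:
    """Toy in-memory filesystem (hypothetical) whose append() can be switched off."""

    def __init__(self, append_enabled: bool = False) -> None:
        self._files = {}  # path -> bytearray
        # Hypothetical switch: off by default, mirroring the proposal to disable append.
        self.append_enabled = append_enabled

    def create(self, path: str, data: bytes = b"") -> None:
        # create() always works; it replaces any existing file at this path.
        self._files[path] = bytearray(data)

    def append(self, path: str, data: bytes) -> None:
        if not self.append_enabled:
            # Fail loudly instead of attempting a write that might damage existing data.
            # (Java code would throw UnsupportedOperationException here.)
            raise NotImplementedError("append is disabled in this release")
        if path not in self._files:
            raise FileNotFoundError(path)
        self._files[path].extend(data)

    def read(self, path: str) -> bytes:
        return bytes(self._files[path])
```

The sketch also shows why the change would be incompatible. Code that appended successfully against 0.19.0 would now get an exception, and it would need a fallback such as writing a new file.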

I will call a vote for 0.19.1 and 0.20.0 when blockers are fixed.

Nigel

Re: Hadoop 0.19.1

by Steve Loughran

Nigel Daley wrote:

> Folks,
>
> Some Hadoop deployments have upgraded to 0.19.0.  Clearly, the 0.19
> branch has issues and a 0.19.1 release is needed.
>
> Quality issues in the changes made for the file append feature have
> prevented some from deploying Hadoop 0.19.  One of these changes (sync)
> has now been "fixed" by reducing its semantics in Hadoop 0.18.3
> (HADOOP-4997).  This was necessary to stabilize the 0.18 branch.
>
> I would like to propose that we apply this same "fix" to sync in 0.19.1
> and 0.20.0.  Since append requires the full semantics of sync, I propose
> we also disable append (perhaps throw UnsupportedOperationException from
> API?).  Yes, this would unfortunately be an incompatible change between
> 0.19.0 and 0.19.1.  We can then take the time needed to fix append
> properly in 0.21.0.

I can see some people being unhappy about this, but given a choice
between having the filesystem work or not, hopefully they will see the
merits of the change. And I am +1 to taking time to fix things; fast
fixes often create new problems.

Re: Hadoop 0.19.1

by stack

The below proposes pushing a working append out 6 months or so.

Can we have pointers to what is meant by 'quality issues' below, so we
can make a more informed vote, or is it just the HDFS issue posted against
0.19.1?  Is that also where we would find what is involved in making append
work in 0.19.1?

Thanks,
St.Ack



On Fri, Jan 30, 2009 at 2:36 AM, Steve Loughran <stevel@...> wrote:

> Nigel Daley wrote:
>
>> Folks,
>>
>> Some Hadoop deployments have upgraded to 0.19.0.  Clearly, the 0.19 branch
>> has issues and a 0.19.1 release is needed.
>>
>> Quality issues in the changes made for the file append feature have
>> prevented some from deploying Hadoop 0.19.  One of these changes (sync) has
>> now been "fixed" by reducing its semantics in Hadoop 0.18.3 (HADOOP-4997).
>>  This was necessary to stabilize the 0.18 branch.
>>
>> I would like to propose that we apply this same "fix" to sync in 0.19.1
>> and 0.20.0.  Since append requires the full semantics of sync, I propose we
>> also disable append (perhaps throw UnsupportedOperationException from API?).
>>  Yes, this would unfortunately be an incompatible change between 0.19.0 and
>> 0.19.1.  We can then take the time needed to fix append properly in 0.21.0.
>>
>
> I can see some people being unhappy about this, but given a choice
> between having the filesystem work or not, hopefully they will see the
> merits of the change. And I am +1 to taking time to fix things; fast fixes
> often create new problems.
>

RE: Hadoop 0.19.1

by Jim Kellerman (POWERSET)

Are you proposing disabling both append and sync?

While HBase does not use the full semantics of append, we
do depend on sync.

There is a patch for trunk (HADOOP-4379) that we are testing
on the 0.19 branch, 0.20 branch and trunk.

-1 if sync does not work until 0.21.1; recovery from server
crashes depends entirely on sync working.

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)


> -----Original Message-----
> From: Nigel Daley [mailto:ndaley@...]
> Sent: Thursday, January 29, 2009 4:16 PM
> To: core-dev@...
> Subject: Hadoop 0.19.1
>
> Folks,
>
> Some Hadoop deployments have upgraded to 0.19.0.  Clearly, the 0.19
> branch has issues and a 0.19.1 release is needed.
>
> Quality issues in the changes made for the file append feature have
> prevented some from deploying Hadoop 0.19.  One of these changes
> (sync) has now been "fixed" by reducing its semantics in Hadoop 0.18.3
> (HADOOP-4997).  This was necessary to stabilize the 0.18 branch.
>
> I would like to propose that we apply this same "fix" to sync in
> 0.19.1 and 0.20.0.  Since append requires the full semantics of sync,
> I propose we also disable append (perhaps throw
> UnsupportedOperationException from API?).  Yes, this would
> unfortunately be an incompatible change between 0.19.0 and 0.19.1.  We
> can then take the time needed to fix append properly in 0.21.0.
>
> I will call a vote for 0.19.1 and 0.20.0 when blockers are fixed.
>
> Nigel


Re: Hadoop 0.19.1

by Raghu Angadi

stack wrote:
> The below proposes pushing a working append out 6 months or so.
>
> Can we have pointers to what is meant by 'quality issues' below, so we
> can make a more informed vote, or is it just the HDFS issue posted
> against 0.19.1?

> Is that also where we would find what is involved in making append
> work in 0.19.1?

If one knew what is needed to fix it properly, it would be easy. But over
the last couple of months, there have been many fixes (some of these JIRAs
are listed in one of Konstantin's comments, linked from HADOOP-4663). The
discussions are still bringing up more cases where the implementation or
algorithm should change. These are improvements for sure, but I doubt I
would be ready to call it 'completely fixed'. It needs time and a lot of
testing in large clusters.

Personally, I am +1 for getting these into the 0.19 branch. Most importantly,
even clusters and applications not using append or sync were affected;
that's why extra caution is needed.

My 2 cents. I hope this does not digress too much from the main topic.

Raghu.



Re: Hadoop 0.19.1

by Raghu Angadi

Raghu Angadi wrote:
>> Is that also where we would find what is involved in making append
>> work in 0.19.1?
>
> If one knew what is needed to fix it properly, it would be easy. But over
> the last couple of months, there have been many fixes (some of these JIRAs
> are listed in one of Konstantin's comments, linked from HADOOP-4663).

This is the Konstantin comment I referred to (it is also linked from
HADOOP-4663, but is harder to find):

https://issues.apache.org/jira/browse/HADOOP-5027#action_12668136

Raghu.



Re: Hadoop 0.19.1

by Konstantin Shvachko

Raghu, thanks for providing the link.

Jim> Are you proposing disabling both append and sync?

Jim, this statement is probably too strong.
sync is not disabled per se; you will be able to use it, although
its full semantics are not guaranteed in some failure scenarios.
See more here:
https://issues.apache.org/jira/browse/HADOOP-4663#action_12661802

We will really have to disable append (throw UnsupportedOperationException),
because otherwise the current implementation may lead to loss of previously existing data.

I agree with Nigel that there is a need for an urgent 0.19.1 release,
because a lot of bugs have been fixed since 0.18.2 and 0.19.0.
The system is now stable on our clusters with 0.18.3, and the same fixes went into 0.19.1.

If we rush to fix the bugs in append (listed in my comment), we
risk destabilizing the system again, and this is my main concern.

Formally we should not release until a feature is fixed, but I think it
is better to let people use a stable release with limited functionality
than to have full functionality with a risk of data loss.

Hope this will work for everybody.
--Konstantin



Re: Hadoop 0.19.1

by Doug Judd

Hi Konstantin,

We are also heavy users of fsync().  I've been working with Dhruba on
HADOOP-4379 <https://issues.apache.org/jira/browse/HADOOP-4379>.  His most
recent patch appears to work for Jim's situation.  However, there are still
a couple of problems that need to be resolved before we can start heavily
using it:

1. After application crash/restart, the file length (as returned by
getFileStatus) is incorrect since the length at the namenode is stale.
Ideally getFileStatus() would return the accurate file length by fetching
the size of the last block from the primary datanode.  If that's not
feasible, there should be some other way to obtain the actual file length.

2. When an application comes up after a crash, it seems to hang for about 60
seconds waiting for lease recovery.  Our database cannot go offline for a
whole minute doing nothing.  In our case, when we come up after a crash and
try to re-open the log file, we know for certain that we are the exclusive
owner of that file.  There should be a way to tell the system to forcibly
take over the lease and recover immediately.
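Doug's first point describes how an accurate length could be computed. The name-node is trusted for blocks that are already complete, and a datanode is asked for the current size of the last block. Below is a minimal Python sketch of that calculation. The `BlockInfo` type and the caller-supplied `query_datanode` function are hypothetical and are not Hadoop APIs. The sketch also assumes that every block except the last is complete.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BlockInfo:
    """Hypothetical block metadata as recorded by the name-node."""
    block_id: int
    recorded_length: int  # may be stale for the last, still-open block


def accurate_length(blocks: List[BlockInfo],
                    query_datanode: Callable[[int], int]) -> int:
    """Sum completed blocks from name-node metadata, plus the live size of the last block."""
    if not blocks:
        return 0
    # Assumption: every block except the last is complete, so its recorded length is final.
    completed = sum(b.recorded_length for b in blocks[:-1])
    # For the last block, ask a datanode how many bytes it actually holds right now.
    last_live = query_datanode(blocks[-1].block_id)
    return completed + last_live
```

The stale value Doug describes corresponds to summing `recorded_length` over every block, including the last one. For blocks of 64 and 10 recorded bytes, that gives 74, whereas a datanode reporting 37 live bytes for the last block gives 101.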

What do you recommend?  Is there any way we could get these two issues fixed
for 0.19.1, or should I file issues for them and get them on the schedule
for 0.19.2?

- Doug

On Mon, Feb 2, 2009 at 11:59 AM, Konstantin Shvachko <shv@...> wrote:

> Raghu, thanks for providing the link.
>
> Jim> Are you proposing disabling both append and sync?
>
> Jim, this statement is probably too strong.
> sync is not disabled per se; you will be able to use it, although
> its full semantics are not guaranteed in some failure scenarios.
> See more here:
> https://issues.apache.org/jira/browse/HADOOP-4663#action_12661802
>
> We will really have to disable append (throw UnsupportedOperationException),
> because otherwise the current implementation may lead to loss of previously
> existing data.
>
> I agree with Nigel that there is a need for an urgent 0.19.1 release,
> because a lot of bugs have been fixed since 0.18.2 and 0.19.0.
> The system is now stable on our clusters with 0.18.3, and the same fixes
> went into 0.19.1.
>
> If we rush to fix the bugs in append (listed in my comment), we
> risk destabilizing the system again, and this is my main concern.
>
> Formally we should not release until a feature is fixed, but I think it
> is better to let people use a stable release with limited functionality
> than to have full functionality with a risk of data loss.
>
> Hope this will work for everybody.
> --Konstantin

Re: Hadoop 0.19.1

by owen.omalley

On Feb 2, 2009, at 12:51 PM, Doug Judd wrote:

> What do you recommend?  Is there any way we could get these two issues
> fixed for 0.19.1, or should I file issues for them and get them on the
> schedule for 0.19.2?

Given the outstanding problems and general level of uncertainty, I'd  
favor releasing a 0.19.1 with the equivalent of the 0.18.3 disable on  
fsync and append. Let's get them fixed in 0.20 first and then we can  
debate whether the rewards of pushing them back into an 0.19.2 would  
make sense. I'm pretty uncomfortable at the moment with how the entire  
functional complex seems to cause a continuous stream of problems.

-- Owen

RE: Hadoop 0.19.1

by Jim Kellerman (POWERSET)

HBase could live with getting sync() in 0.20.0 (and preferably 0.19.2),
but sync() in 0.20.0 is an absolute blocker.

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)


> -----Original Message-----
> From: Owen O'Malley [mailto:owen.omalley@...] On Behalf Of Owen
> O'Malley
> Sent: Monday, February 02, 2009 1:51 PM
> To: core-dev@...
> Subject: Re: Hadoop 0.19.1
>
> On Feb 2, 2009, at 12:51 PM, Doug Judd wrote:
>
> > What do you recommend?  Is there any way we could get these two
> > issues fixed for 0.19.1, or should I file issues for them and get
> > them on the schedule for 0.19.2?
>
> Given the outstanding problems and general level of uncertainty, I'd
> favor releasing a 0.19.1 with the equivalent of the 0.18.3 disable on
> fsync and append. Let's get them fixed in 0.20 first and then we can
> debate whether the rewards of pushing them back into an 0.19.2 would
> make sense. I'm pretty uncomfortable at the moment with how the entire
> functional complex seems to cause a continuous stream of problems.
>
> -- Owen


Re: Hadoop 0.19.1

by Doug Judd

Sounds good.  I would much rather wait and have fsync() done correctly in
0.20 than get some sort of hacked version in 0.19.  I'll create a couple of
issues and mark them for 0.20.  Thanks.

- Doug


Re: Hadoop 0.19.1

by Konstantin Shvachko

>  What do you recommend?

In general: there may be people/organizations who will not accept reduced
functionality in exchange for stability, which is understandable.
I would propose creating a separate (unofficial, experimental) branch that
would track changes like HADOOP-4379. The branch may later either die when the
mainstream is fixed or be merged with trunk if the changes prove to be stable.

>1. the file length (as returned by getFileStatus) is incorrect

Maybe the following workaround will be useful.
When you read from a file, always try to read more data than the length reported
by the name-node. How much more? The size of one block would be enough, or
even up to the next (ceiling) block boundary.
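The workaround comes down to simple arithmetic on the block size. The following Python sketch illustrates it. It assumes a readable binary stream, with `io.BytesIO` standing in for an HDFS input stream. `read_limit` and `read_past_reported_length` are hypothetical helpers. The limit is the first block boundary strictly beyond the reported length, so a length that is already a multiple of the block size still gets one extra block of slack. Reading stops at that limit or at end of stream, whichever comes first.

```python
import io
from typing import BinaryIO


def read_limit(reported_length: int, block_size: int) -> int:
    """Return the first block boundary strictly greater than reported_length."""
    if reported_length < 0 or block_size <= 0:
        raise ValueError("reported_length must be >= 0 and block_size > 0")
    return (reported_length // block_size + 1) * block_size


def read_past_reported_length(stream: BinaryIO, reported_length: int,
                              block_size: int, chunk_size: int = 4096) -> bytes:
    """Read from the stream's current position up to read_limit(...) bytes,
    stopping early if the stream runs out of data."""
    limit = read_limit(reported_length, block_size)
    parts = []
    total = 0
    while total < limit:
        buf = stream.read(min(chunk_size, limit - total))
        if not buf:  # end of stream: no data beyond this point yet
            break
        parts.append(buf)
        total += len(buf)
    return b"".join(parts)


if __name__ == "__main__":
    # Stale length 100, block size 64: read up to offset 128 or end of data.
    data = read_past_reported_length(io.BytesIO(b"x" * 150), 100, 64)
    print(len(data))  # 128
```

Whether a real HDFS client returns bytes past the name-node's stale length depends on the client and version. The sketch only captures the read-limit logic.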

>2. When an application comes up after a crash, it seems to hang for about 60

I don't have enough context on that, sorry.

Thanks,
--Konstantin


Re: Hadoop 0.19.1

by Doug Judd

Comments inline ...

On Mon, Feb 2, 2009 at 4:23 PM, Konstantin Shvachko <shv@...> wrote:

> > What do you recommend?
>
> In general: there may be people/organizations who will not accept reduced
> functionality in exchange for stability, which is understandable.
> I would propose creating a separate (unofficial, experimental) branch that
> would track changes like HADOOP-4379. The branch may later either die when
> the mainstream is fixed or be merged with trunk if the changes prove to be
> stable.


Sure, that sounds reasonable.  One thing I would caution against is spending
a lot of time doing incremental patchwork on something that needs a
ground-up overhaul.  I would much rather wait a couple of months longer and
get software that is based on a well thought out design that is
fundamentally sound.  Ultimately that will be the fastest path to stability.


>
> >1. the file length (as returned by getFileStatus) is incorrect
>
> Maybe the following workaround will be useful.
> When you read from a file, always try to read more data than the length
> reported by the name-node. How much more? The size of one block would be
> enough, or even up to the next (ceiling) block boundary.


I could certainly implement a workaround; however, from an API standpoint,
the filesystem (IMHO) should always give you a way to obtain the real length
of the file.  The semantics of the current getFileStatus() make it difficult
to reason about the state of your filesystem.  It basically returns a
"possibly stale" version of the length.  I would prefer to wait for an
implementation that gives an accurate answer and spend my time and energy
helping to test that one, rather than spending a bunch of time implementing
a workaround for the current version.

> >2. When an application comes up after a crash, it seems to hang for about 60
>
> I don't have enough context on that, sorry.


I spoke too soon on this.  HDFS was hanging on lease recovery because I
was opening the file in append mode to force lease recovery (at Dhruba's
suggestion) so that it would update the NameNode with the proper length.
If I had a method of obtaining the accurate length of the file, I wouldn't
need to do this.  Hence, I didn't bother filing an issue on this.

- Doug



Re: Hadoop 0.19.1

by Sanjay Radia


On Feb 2, 2009, at 1:51 PM, Owen O'Malley wrote:

> On Feb 2, 2009, at 12:51 PM, Doug Judd wrote:
>
> > What do you recommend?  Is there any way we could get these two
> > issues fixed for 0.19.1, or should I file issues for them and get
> > them on the schedule for 0.19.2?
>
> Given the outstanding problems and general level of uncertainty, I'd
> favor releasing a 0.19.1 with the equivalent of the 0.18.3 disable on
> fsync and append. Let's get them fixed in 0.20 first and then we can
> debate whether the rewards of pushing them back into an 0.19.2 would
> make sense. I'm pretty uncomfortable at the moment with how the entire
> functional complex seems to cause a continuous stream of problems.
>
> -- Owen
>
+1
Append/sync code has been very problematic, and bug fixes have not
always fixed things completely.
Hence getting out a quick 0.19.1 release that deals with the data loss
(i.e., 0.18.3 functionality) is important.

sanjay

Re: Hadoop 0.19.1

by Sanjay Radia


On Feb 2, 2009, at 6:18 PM, Doug Judd wrote:

> Comments inline ...
>
> On Mon, Feb 2, 2009 at 4:23 PM, Konstantin Shvachko <shv@yahoo-inc.com> wrote:
>
> > > What do you recommend?
> >
> > In general: there may be people/organizations who will not accept
> > reduced functionality in exchange for stability, which is understandable.
> > I would propose creating a separate (unofficial, experimental) branch
> > that would track changes like HADOOP-4379. The branch may later either
> > die when the mainstream is fixed or be merged with trunk if the changes
> > prove to be stable.
>
>
> Sure, that sounds reasonable.  One thing I would caution against is
> spending a lot of time doing incremental patchwork on something that needs
> a ground-up overhaul.  I would much rather wait a couple of months longer
> and get software that is based on a well thought out design that is
> fundamentally sound.  Ultimately that will be the fastest path to stability.
>

Agree.


sanjay

Re: Hadoop 0.19.1

by Sanjay Radia


On Feb 2, 2009, at 4:23 PM, Konstantin Shvachko wrote:

> > What do you recommend?
>
> In general: there may be people/organizations who will not accept reduced
> functionality in exchange for stability, which is understandable.
> I would propose creating a separate (unofficial, experimental) branch that
> would track changes like HADOOP-4379. The branch may later either die when
> the mainstream is fixed or be merged with trunk if the changes prove to be
> stable.


This is a very interesting suggestion.
Many in the team have come to the conclusion that complex projects
like append should be done on a separate branch in the first place and
integrated with trunk when the project is stable.

sanjay

>
>
>  >1. the file length (as returned by getFileStatus) is incorrect
>
> May be the following work around will be useful.
> If you read from a file you always try to read more data than the  
> length reported
> by the name-node. How much more? The size of one block would be  
> enough, or
> even to the next (ceiling) block boundary.
>
>  >2. When an application comes up after a crash, it seems to hang  
> for about 60
>
> Don't have enough context on that, sorry.
>
> Thanks,
> --Konstantin
>
> Doug Judd wrote:
> > Sounds good.  I would much rather wait and have fsync() done  
> correctly in
> > 0.20 than get some sort of hacked version in 0.19.  I'll create a  
> couple of
> > issues and mark them for 0.20  Thanks.
> >
> > - Doug
> >
> > On Mon, Feb 2, 2009 at 1:51 PM, Owen O'Malley <omalley@...>  
> wrote:
> >
> >> On Feb 2, 2009, at 12:51 PM, Doug Judd wrote:
> >>
> >>  What do you recommend?  Is there anyway we could get these two  
> issues
> >>> fixed
> >>> for 0.19.1, or should I file issues for them and get them on the  
> schedule
> >>> for 0.19.2?
> >>>
> >> Given the outstanding problems and general level of uncertainty, I'd favor
> >> releasing a 0.19.1 with the equivalent of the 0.18.3 disable on fsync and
> >> append. Let's get them fixed in 0.20 first and then we can debate whether
> >> the rewards of pushing them back into an 0.19.2 would make sense. I'm pretty
> >> uncomfortable at the moment with how the entire functional complex seems to
> >> cause a continuous stream of problems.
> >>
> >> -- Owen
> >>
> >
>
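
Konstantin's workaround for the under-reported file length can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the thread or from Hadoop: it uses only java.io, and the reported length and block size are passed in by the caller (in Hadoop they would presumably come from the file's FileStatus via getLen() and getBlockSize()). The class and method names are hypothetical. The reader stops at end of stream or at the next block boundary strictly past the reported length, so a length that already sits on a boundary still gets a full block of slack.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical sketch of the "read past the reported length" workaround.
 * Assumption: reportedLength and blockSize are supplied by the caller;
 * no Hadoop API is used here.
 */
public final class ReadPastReportedLength {

    /** Returns the next block boundary strictly beyond reportedLength. */
    static long readLimit(long reportedLength, long blockSize) {
        if (reportedLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("invalid length or block size");
        }
        return (reportedLength / blockSize + 1) * blockSize;
    }

    /** Reads until EOF or until limit bytes have been read, whichever comes first. */
    static byte[] readUpTo(InputStream in, long limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        long remaining = limit;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                break; // EOF: the real end of data may lie before the limit
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        long blockSize = 64;
        // Simulated file: 120 bytes written, but the name-node reports only 100.
        byte[] actual = new byte[120];
        long reported = 100;
        long limit = readLimit(reported, blockSize);
        byte[] data = readUpTo(new ByteArrayInputStream(actual), limit);
        System.out.println("limit=" + limit + " read=" + data.length);
    }
}
```

Running main prints limit=128 read=120: the reader picks up the 20 bytes the name-node had not yet reported and stops at the real end of the data. Using reportedLength + blockSize as the bound would also satisfy the "one block" rule, at the cost of a slightly larger read limit.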


Re: Hadoop 0.19.1

by Steve Loughran

Sanjay Radia wrote:

>
> On Feb 2, 2009, at 4:23 PM, Konstantin Shvachko wrote:
>
>>
>>  >  What do you recommend?
>>
>> In general. There may be people/organizations, which will not compromise
>> on the reduced functionality in favor of the stability, this is
>> understandable.
>> I would propose to create a separate (unofficial experimental) branch,
>> which would track changes like HADOOP-4379. The branch may later either
>> die when the main stream is fixed or be merged with the trunk if the
>> changes proved to be stable.
>>
>
>
> This is a very interesting suggestion.
> Many in the team have come to the conclusion that complex projects like
> append should be done on a separate branch in the first place and
> integrated with trunk when the project is stable.
>

There's a lot to be said for branching; I'm also looking at git so I can
do my service lifecycle stuff under SCM properly.

But the cost of merging can be high. I'd estimate one morning a week is
spent updating my local SVN checkout and then checking that everything
still works. If Hudson could test both the branches and any merged
branches, life would be better.

The other problem is incompatible branches: the more branches you have
live, the higher the merge cost.

That said, Git promises wonderful things, and we ought to be able to set
up Apache support for Git for people wanting to do their own branches;
SVN would still be the official SCM tool.

Re: Hadoop 0.19.1

by stack-3

On Tue, Feb 3, 2009 at 7:02 PM, Sanjay Radia <sradia@...> wrote:

>
> Many in the team  have come to the conclusion that complex projects like
> append should be done on a separate branch in the first place and integrated
> with trunk when the project is stable.
>

How do we determine when a feature on the branch is 'stable'?  Is there a
test suite to run beyond Hudson's running of unit tests?

St.Ack

Re: Hadoop 0.19.1

by stack-3

On Thu, Jan 29, 2009 at 4:16 PM, Nigel Daley <ndaley@...> wrote:
>
> I would like to propose that we apply this same "fix" to sync in 0.19.1 and
> 0.20.0.  Since append requires the full semantics of sync, I propose we also
> disable append (perhaps throw UnsupportedOperationException from API?).
>  Yes, this would unfortunately be an incompatible change between 0.19.0 and
> 0.19.1.  We can then take the time needed to fix append properly in 0.21.0.
>


+1 on a 0.19.1 release with a 0.18.3-like solution removing some of the sync
semantics.  A stable HDFS takes precedence.

Please count us hbasistas in when testing of append is needed.
Applications like HBase are plain broken without it (you may have heard us
mention this fact once or twice in the past -- smile).

Good stuff,
St.Ack
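
To make the incompatibility in Nigel's quoted proposal concrete, here is a minimal sketch of an API method disabled by throwing UnsupportedOperationException. AppendDisabledFileSystem, its append(String) method, and the example path are hypothetical stand-ins, not Hadoop's FileSystem API or the change that was actually committed.

```java
import java.io.OutputStream;

/** Hypothetical stand-in for a file system client; not Hadoop's real API. */
public class AppendDisabledFileSystem {

    /** Append is switched off in this release until it can be fixed properly. */
    public OutputStream append(String path) {
        throw new UnsupportedOperationException(
            "append is disabled in this release: " + path);
    }

    public static void main(String[] args) {
        AppendDisabledFileSystem fs = new AppendDisabledFileSystem();
        try {
            fs.append("/logs/edits");
            System.out.println("append succeeded");
        } catch (UnsupportedOperationException e) {
            // Callers written against a release where append worked
            // only discover the change here, at run time.
            System.out.println("append unsupported: " + e.getMessage());
        }
    }
}
```

Because UnsupportedOperationException is unchecked, existing callers still compile against the changed method; they fail only when they actually call append, which is what makes this a behavioral incompatibility between point releases.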

Re: Hadoop 0.19.1

by Sanjay Radia


On Feb 4, 2009, at 3:38 AM, Steve Loughran wrote:

> Sanjay Radia wrote:
> >
> > On Feb 2, 2009, at 4:23 PM, Konstantin Shvachko wrote:
> >
> >>
> >>  >  What do you recommend?
> >>
> >> In general. There may be people/organizations, which will not compromise
> >> on the reduced functionality in favor of the stability, this is
> >> understandable.
> >> I would propose to create a separate (unofficial experimental) branch,
> >> which would track changes like HADOOP-4379. The branch may later either
> >> die when the main stream is fixed or be merged with the trunk if the
> >> changes proved to be stable.
> >>
> >
> >
> > This is a very interesting suggestion.
> > Many in the team have come to the conclusion that complex projects like
> > append should be done on a separate branch in the first place and
> > integrated with trunk when the project is stable.
> >
>
> There's a lot to be said for branching; I'm also looking at git so I can
> do my service lifecycle stuff under SCM properly.
>
> but the cost of merging can be high. I'd estimate 1 morning/week is
> spent updating my local SVN and then seeing that everything still works.
> If hudson could both test the branches and test any merged branches,
> life would be better
>

I agree on the cost of merging.
When a project is branched, after a while one can spend as much as 30% of
cycles merging in changes.
But when a system is used in production to store data, we cannot afford
to have users lose their data.
The team at Yahoo had to scramble to recover the lost data and put in
several emergency patches to deal with the append code.

I am all for extending Hudson testing to branches, but Hudson testing,
while helpful, will not be sufficient for big projects because Hudson
does not have a comprehensive set of tests. Each new release is tested
significantly beyond the Hudson tests.

For me the lesson is that large complex projects should be branched.
(This is how commercial software products are engineered.)
There will be increased cost to the project team, but overall, the
community will have more solid releases and the total cost to the
community of delivering the technology will be smaller.

sanjay

>
>
> The other problem is incompatible branches: the more branches you have
> live, the higher the merge cost.
>
> That said, Git promises wonderful things, and we ought to be able to set
> up Apache support for git for people wanting to do their own branches
> -svn would still be the official SCM tool
>
