Platform reliability with Hadoop

by Jeff Eastman

I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen
machines in our CUBiT array for the last month. During this time I have
experienced two major data corruption losses on relatively small amounts
of data (<50 GB) that make me wonder about the suitability of this
platform for hosting Hadoop. CUBiT is one of our products for managing a
pool of development servers, allowing developers to check out machines,
install various OS profiles on them and monitor their utilization via
the web. With most machines reporting very low utilization it seemed a
natural place to run Hadoop in the background. I have an NFS-mounted
account on all of the machines and have installed Hadoop there. The DFS
is stored in /tmp on each box. The developers who own the machines
occasionally reboot and reprofile them, but this occurs infrequently and
does not clobber /tmp. Hadoop is designed to deal with slave failures of
this nature, though this platform may well be an acid test.

My initial cloud was configured for a replication factor of 3 and I have
increased that now to 4 in hopes of improving data reliability in the
face of these more-prevalent slave outages. Ted Dunning has suggested
aggressive rebalancing in his recent posts and I have done this by
increasing replication to 5 (from 3) and then dropping it to 4. Are
there other rebalancing or configuration techniques that might improve
my data reliability? Or, is this platform just too unstable to be a good
fit for Hadoop?

Jeff


Re: Platform reliability with Hadoop

by lohit.vijayarenu

>The DFS is stored in /tmp on each box.
> The developers who own the machines occasionally reboot and reprofile them

Won't you lose your blocks after a reboot, since /tmp gets cleaned up? Could this be the reason you see data corruption?
A good idea is to configure the DFS to live anywhere other than /tmp.

Thanks,
Lohit
----- Original Message ----
From: Jeff Eastman <jeastman@...>
To: hadoop-user@...
Sent: Wednesday, January 16, 2008 9:32:41 AM
Subject: Platform reliability with Hadoop



Re: Platform reliability with Hadoop

by Jason Venner-2

The /tmp default has caught us once or twice too. Now we put the files
elsewhere.
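
For readers wondering how the data ends up in /tmp without anyone asking for it, here is a sketch of the relevant stock defaults. It assumes the hadoop-default.xml of releases from this era; the values are recalled rather than copied from this cluster, so check them against your own copy of the file.

```xml
<?xml version="1.0"?>
<!-- Sketch of the assumed stock defaults (not taken from this cluster).
     The DFS directories hang off hadoop.tmp.dir, which lives under /tmp. -->
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>${hadoop.tmp.dir}/dfs/data</value>
</property>

</configuration>
```

If these defaults hold, any machine that clears /tmp on reboot or reprofiling would discard both the name table and the block storage kept there.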


RE: Platform reliability with Hadoop

by Jeff Eastman

Thanks, I will try a safer place for the DFS.
Jeff

-----Original Message-----
From: Jason Venner [mailto:jason@...]
Sent: Wednesday, January 16, 2008 10:04 AM
To: hadoop-user@...
Subject: Re: Platform reliability with Hadoop


RE: Platform reliability with Hadoop

by Jeff Eastman

I am almost operational again but something in my configuration is still
not quite right. Here's what I did:

- I created a directory /u1/cloud-data on every machine's local disk
- I created a new user 'hadoop' who owns cloud-data
- I used that directory to replace the hadoop.tmp.dir entries for:
  - mapred.system.dir
  - mapred.local.dir
  - dfs.name.dir
  - dfs.data.dir
- The other tmp.dir config entries are unchanged
- The hadoop_install directory is NFS mounted on all machines
- My name node is on cu027 and my job tracker is on cu063
- I launched the dfs and mapred processes as 'hadoop'
- I uploaded my data to the dfs as user 'jeastman'
- The files are visible in /users/jeastman when I ls as 'jeastman'
- When I submit a job as 'jeastman' that used to run, it runs but cannot
  locate any input data, so it quits immediately with this in the Map
  Completion Graph display:

  XML Parsing Error: no element found
  Location: http://cu063.cubit.sp.collab.net:50030/taskgraph?type=map&jobid=job_200801182307_0003
  Line Number 1, Column 1:

I've attached my site.xml file.
Jeff

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!-- global properties -->

<property>
  <name>mapred.system.dir</name>
  <value>/u1/cloud-data/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/u1/cloud-data/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
  <description>The port that the MapReduce job tracker info webserver runs at.
  </description>
</property>

<property>
  <name>dfs.secondary.info.port</name>
  <value>50090</value>
  <description>The base number for the Secondary namenode info port.
  </description>
</property>

<property>
  <name>dfs.datanode.port</name>
  <value>50010</value>
  <description>The port number that the dfs datanode server uses as a starting
               point to look for a free port to listen on.
  </description>
</property>

<property>
  <name>dfs.info.port</name>
  <value>50070</value>
  <description>The base port number for the dfs namenode web ui.
  </description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<!-- file system properties -->

<property>
   <name>fs.default.name</name>
   <value>hdfs://cu027.cubit.sp.collab.net:54310</value>
   <description>The name of the default file system.  A URI whose
   scheme and authority determine the FileSystem implementation.  The
   uri's scheme determines the config property (fs.SCHEME.impl) naming
   the FileSystem implementation class.  The uri's authority is used to
   determine the host, port, etc. for a filesystem.
   </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/u1/cloud-data/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node
      should store the name table.  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/u1/cloud-data/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>0</value>
  <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.
  </description>
</property>

<property>
  <name>dfs.datanode.du.pct</name>
  <value>0.50f</value>
  <description>When calculating remaining space, only use this percentage of the real available space
  </description>
</property>

<property>
   <name>dfs.replication</name>
   <value>4</value>
   <description>Default block replication.
   The actual number of replications can be specified when the file is created.
   The default is used if replication is not specified in create time.
   </description>
</property>

<!-- map/reduce properties -->

<property>
   <name>mapred.job.tracker</name>
   <value>cu063.cubit.sp.collab.net:54311</value>
   <description>The host and port that the MapReduce job tracker runs
   at.  If "local", then jobs are run in-process as a single map
   and reduce task.
   </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>31</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>11</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

</configuration>

Re: Platform reliability with Hadoop

by lohit.vijayarenu

You might want to change the hadoop.tmp.dir entry alone. Since the others are derived from it, everything should be fine.
I am wondering if hadoop.tmp.dir might be used elsewhere.
thanks,
lohit
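
A minimal sketch of that suggestion. It assumes the /u1/cloud-data directory described in the message being answered, and it assumes the stock defaults derive the DFS and MapReduce directories from ${hadoop.tmp.dir}. The hadoop-${user.name} subdirectory is a hypothetical choice; this is an illustration, not a tested configuration for this cluster.

```xml
<?xml version="1.0"?>
<configuration>

<!-- Override only the base directory. With no explicit entries for
     dfs.name.dir, dfs.data.dir, mapred.local.dir or mapred.system.dir,
     each falls back to its default, assumed here to be a subdirectory
     of ${hadoop.tmp.dir}. The path below is a hypothetical example. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/u1/cloud-data/hadoop-${user.name}</value>
</property>

</configuration>
```

For this to take effect, any explicit dfs.* and mapred.* directory entries would have to be removed from the site file, since values set there override the defaults.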

----- Original Message ----
From: Jeff Eastman <jeastman@...>
To: hadoop-user@...
Sent: Sunday, January 20, 2008 11:05:28 AM
Subject: RE: Platform reliability with Hadoop



RE: Platform reliability with Hadoop

by Jeff Eastman

Is it really that simple?

The Wiki page GettingStartedWithHadoop recommends setting dfs.name.dir,
dfs.data.dir, dfs.client.buffer.dir and mapred.local.dir to
"appropriate" values (without giving an example). Should these be fixed
(XX) or variable (XX-${user.name}) values? The FAQ page recommends
setting the mapred.system.dir to a fixed value (e.g.
/hadoop/mapred/system), so I chose fixed values too.

- dfs.name.dir - /u1/cloud-data
- dfs.data.dir - /u1/cloud-data
- mapred.system.dir - /u1/cloud-data
- mapred.local.dir - /u1/cloud-data

I did not overwrite the dfs.client.buffer.dir (Determines where on the
local filesystem an DFS client should store its blocks before it sends
them to the datanode) because my 'jeastman' client could not put data
into the dfs with it set to the fixed value.
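
To make the fixed-versus-variable distinction concrete, here is a small sketch. The property names are the ones discussed above. The dfs.client.buffer.dir value is assumed to match the stock default, and it is shown only to illustrate ${user.name} expansion, not as a verified fix for the upload problem.

```xml
<?xml version="1.0"?>
<configuration>

<!-- Fixed value: the same path no matter which user starts the process. -->
<property>
  <name>dfs.data.dir</name>
  <value>/u1/cloud-data/dfs/data</value>
</property>

<!-- Variable value: ${user.name} is expanded per process, so 'hadoop'
     and 'jeastman' would each get a directory of their own. -->
<property>
  <name>dfs.client.buffer.dir</name>
  <value>/tmp/hadoop-${user.name}/dfs/tmp</value>
</property>

</configuration>
```

A fixed path suits directories that only the daemons write to, while a per-user path avoids one user needing write access to a directory owned by another.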

There are 4 other settings that use the ${hadoop.tmp.dir}, and these
seem appropriately tmp-ish. I did not redefine them:

- fs.trash.root - The trash directory, used by FsShell's 'rm' command.
- fs.checkpoint.dir - Determines where on the local filesystem the DFS
secondary name node should store the temporary images and edits to
merge.
- fs.s3.buffer.dir - Determines where on the local filesystem the S3
filesystem should store its blocks before it sends them to S3 or after
it retrieves them from S3.
- mapred.temp.dir - A shared directory for temporary files.

Jeff

-----Original Message-----
From: lohit.vijayarenu@... [mailto:lohit.vijayarenu@...]
Sent: Sunday, January 20, 2008 11:44 AM
To: hadoop-user@...
Subject: Re: Platform reliability with Hadoop

-----Inline Attachment Follows-----

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!--- global properties -->

<property>
  <name>mapred.system.dir</name>
  <value>/u1/cloud-data/mapred/system</value>
  <description>The shared directory where MapReduce stores control
 files.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/u1/cloud-data/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
  <description>The port that the MapReduce job tracker info webserver runs at.
  </description>
</property>

<property>
  <name>dfs.secondary.info.port</name>
  <value>50090</value>
  <description>The base number for the Secondary namenode info port.
  </description>
</property>

<property>
  <name>dfs.datanode.port</name>
  <value>50010</value>
  <description>The port number that the dfs datanode server uses as a
  starting point to look for a free port to listen on.
  </description>
</property>

<property>
  <name>dfs.info.port</name>
  <value>50070</value>
  <description>The base port number for the dfs namenode web ui.
  </description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<!-- file system properties -->

<property>
   <name>fs.default.name</name>
   <value>hdfs://cu027.cubit.sp.collab.net:54310</value>
   <description>The name of the default file system.  A URI whose
   scheme and authority determine the FileSystem implementation.  The
   uri's scheme determines the config property (fs.SCHEME.impl) naming
   the FileSystem implementation class.  The uri's authority is used to
   determine the host, port, etc. for a filesystem.
   </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/u1/cloud-data/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node
  should store the name table.  If this is a comma-delimited list
  of directories then the name table is replicated in all of the
  directories, for redundancy.
  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/u1/cloud-data/dfs/data</value>
  <description>Determines where on the local filesystem a DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>0</value>
  <description>Reserved space in bytes per volume. Always leave this
  much space free for non dfs use.
  </description>
</property>

<property>
  <name>dfs.datanode.du.pct</name>
  <value>0.50f</value>
  <description>When calculating remaining space, only use this
  percentage of the real available space.
  </description>
</property>

<property>
   <name>dfs.replication</name>
   <value>4</value>
   <description>Default block replication.
   The actual number of replications can be specified when the file is created.
   The default is used if replication is not specified at create time.
   </description>
</property>

<!-- map/reduce properties -->

<property>
   <name>mapred.job.tracker</name>
   <value>cu063.cubit.sp.collab.net:54311</value>
   <description>The host and port that the MapReduce job tracker runs
   at.  If "local", then jobs are run in-process as a single map
   and reduce task.
   </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>31</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>11</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

</configuration>





RE: Platform reliability with Hadoop

by Jeff Eastman

I should add, when I run my job it indicates it could not find the input
files:

[jeastman@cu027 jeastman]$ $HADOOP_INSTALL/bin/hadoop jar ~/access0.jar com.collabnet.hadoop.access.Access0Driver ecn/access ecn-out
08/01/21 10:59:39 INFO mapred.FileInputFormat: Total input paths to process: 0
08/01/21 10:59:46 INFO mapred.JobClient: Running job: job_200801182307_0005
08/01/21 10:59:47 INFO mapred.JobClient:  map 100% reduce 100%
08/01/21 10:59:48 INFO mapred.JobClient: Job complete: job_200801182307_0005
08/01/21 10:59:49 INFO mapred.JobClient: Counters: 0

I tried using full paths for them (/users/jeastman/...) but that throws
'input path does not exist' errors.

Jeff

-----Original Message-----
From: Jeff Eastman [mailto:jeastman@...]
Sent: Monday, January 21, 2008 11:15 AM
To: hadoop-user@...
Subject: RE: Platform reliability with Hadoop

Is it really that simple?

The Wiki page GettingStartedWithHadoop recommends setting dfs.name.dir,
dfs.data.dir, dfs.client.buffer.dir and mapred.local.dir to
"appropriate" values (without giving an example). Should these be fixed
(XX) or variable (XX-${user.name}) values? The FAQ page recommends
setting the mapred.system.dir to a fixed value (e.g.
/hadoop/mapred/system), so I chose fixed values too.

- dfs.name.dir - /u1/cloud-data
- dfs.data.dir - /u1/cloud-data
- mapred.system.dir - /u1/cloud-data
- mapred.local.dir - /u1/cloud-data

I did not override the dfs.client.buffer.dir (Determines where on the
local filesystem a DFS client should store its blocks before it sends
them to the datanode) because my 'jeastman' client could not put data
into the dfs with it set to the fixed value.

There are 4 other settings that use the ${hadoop.tmp.dir}, and these
seem appropriately tmp-ish. I did not redefine them:

- fs.trash.root - The trash directory, used by FsShell's 'rm' command.
- fs.checkpoint.dir - Determines where on the local filesystem the DFS
secondary name node should store the temporary images and edits to
merge.
- fs.s3.buffer.dir - Determines where on the local filesystem the S3
filesystem should store its blocks before it sends them to S3 or after
it retrieves them from S3.
- mapred.temp.dir - A shared directory for temporary files.

Jeff

-----Original Message-----
From: lohit.vijayarenu@... [mailto:lohit.vijayarenu@...]
Sent: Sunday, January 20, 2008 11:44 AM
To: hadoop-user@...
Subject: Re: Platform reliability with Hadoop

you might want to change hadoop.tmp.dir entry alone. since others are
derived out of this, everything should be fine.
i am wondering if hadoop.tmp.dir might be used elsewhere
thanks,
lohit

----- Original Message ----
From: Jeff Eastman <jeastman@...>
To: hadoop-user@...
Sent: Sunday, January 20, 2008 11:05:28 AM
Subject: RE: Platform reliability with Hadoop


I am almost operational again but something in my configuration is still
not quite right. Here's what I did:

- I created a directory /u1/cloud-data on every machine's local disk
- I created a new user 'hadoop' who owns cloud-data
- I used that directory to replace the hadoop.tmp.dir entries for:
  - mapred.system.dir
  - mapred.local.dir
  - dfs.name.dir
  - dfs.data.dir
- The other tmp.dir config entries are unchanged
- The hadoop_install directory is NFS mounted on all machines
- My name node is on cu027 and my job tracker is on cu063
- I launched the dfs and mapred processes as 'hadoop'
- I uploaded my data to the dfs as user 'jeastman'
- The files are visible in /users/jeastman when I ls as 'jeastman'
- When I submit a job as 'jeastman' that used to run, it runs but cannot
locate any input data so it quits immediately with this in the Map
Completion Graph display:
XML Parsing Error: no element found
Location:
http://cu063.cubit.sp.collab.net:50030/taskgraph?type=map&jobid=job_200801182307_0003
Line Number 1, Column 1:

I've attached my site.xml file.
Jeff
-----Original Message-----
From: Jason Venner [mailto:jason@...]
Sent: Wednesday, January 16, 2008 10:04 AM
To: hadoop-user@...
Subject: Re: Platform reliability with Hadoop

The /tmp default has caught us once or twice too. Now we put the files
elsewhere.

lohit.vijayarenu@... wrote:
>> The DFS is stored in /tmp on each box.
>> The developers who own the machines occasionally reboot and reprofile them
>
> Won't you lose your blocks after reboot since /tmp gets cleaned up?
> Could this be the reason you see data corruption?
>
> Good idea is to configure DFS to be any place other than /tmp
>
> Thanks,
> Lohit







RE: Platform reliability with Hadoop

by Jeff Eastman

It's alive! Just in case any others follow this thread, I ended up
overriding only the following hadoop-default.xml entries:

Physical location of DFS on local disks:
- dfs.name.dir = /u1/cloud-data
- dfs.data.dir = /u1/cloud-data

DFS location of mapred files:
- mapred.system.dir = /hadoop/mapred/system
- mapred.temp.dir = /hadoop/mapred/temp

Each user gets their own /users/username directory in the DFS and jobs
submitted by each user use their own user directories. Now to find a
bigger problem to solve...

Jeff
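
For anyone following along, the overrides listed above might look roughly like this in hadoop-site.xml. This is only a sketch, not the file actually used: the message gives /u1/cloud-data for both DFS directories, and the dfs/name and dfs/data subdirectories shown here are an assumption borrowed from the site file attached earlier in the thread. Adjust the paths for your own machines.

```xml
<?xml version="1.0"?>
<configuration>

<!-- Physical location of DFS on each machine's local disk (not /tmp). -->
<!-- The dfs/name and dfs/data subdirectories are an assumption. -->
<property>
  <name>dfs.name.dir</name>
  <value>/u1/cloud-data/dfs/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/u1/cloud-data/dfs/data</value>
</property>

<!-- Locations inside the DFS for MapReduce control and temporary files. -->
<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/hadoop/mapred/temp</value>
</property>

</configuration>
```

Site-specific entries such as fs.default.name and mapred.job.tracker from the attached file would still need to sit alongside these four.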


