does anyone have idea on how to run multiple sequential jobs with bash script

Hello folks:

I am running several Hadoop applications on HDFS. To save the effort of issuing the same set of commands every time, I am trying to use a bash script to run the applications sequentially. To let each job finish before proceeding to the next one, I am using wait in the script, like below.

    sh bin/start-all.sh
    wait
    echo cluster start
    (bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D test.randomwrite.bytes_per_map=107374182 rand)
    wait
    bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter -D test.randomtextwrite.total_bytes=107374182 rand-text
    bin/stop-all.sh
    echo finished hdfs randomwriter experiment

However, it always gives the error below. Does anyone have a better idea on how to run multiple sequential jobs with a bash script?

HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell

org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker still initializing
        at org.apache.hadoop.mapred.JobTracker.ensureRunning(JobTracker.java:1722)
        at org.apache.hadoop.mapred.JobTracker.getNewJobId(JobTracker.java:1730)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

        at org.apache.hadoop.ipc.Client.call(Client.java:557)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
        at $Proxy1.getNewJobId(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy1.getNewJobId(Unknown Source)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:696)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:276)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:287)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)

Re: does anyone have idea on how to run multiple sequential jobs with bash script

I'm interested in the same thing -- is there a recommended way to batch Hadoop jobs together?

On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang <richardtechzh@...> wrote:
> Does anyone have better idea
> on how to run the multiple sequential jobs with bash script?

--
hustlin, hustlin, everyday I'm hustlin

Re: does anyone have idea on how to run multiple sequential jobs with bash script

wait and sleep are not what you are looking for. You can use 'nohup' to run a job in the background and have its output piped to a file.

On Tue, Jun 10, 2008 at 5:48 PM, Meng Mao <mengmao@...> wrote:
> I'm interested in the same thing -- is there a recommended way to batch
> Hadoop jobs together?
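
The nohup suggestion above can be sketched roughly as follows. This is only an illustration under stated assumptions: run_jobs.sh and run_jobs.log are placeholder names, and the echo lines stand in for the real "bin/hadoop jar ..." commands. Commands in a bash script already run one after another in the foreground, so no wait is needed between them; wait only applies to background children of the calling shell.

```shell
#!/usr/bin/env bash
# Sketch only: run_jobs.sh and run_jobs.log are placeholder names, and the
# echo commands stand in for the real "bin/hadoop jar ..." invocations.
set -eu

workdir=$(mktemp -d)

# The job script: foreground commands run strictly one after another.
cat > "$workdir/run_jobs.sh" <<'EOF'
#!/usr/bin/env bash
set -e                      # stop at the first job that fails
echo "job 1 finished"       # placeholder for the first hadoop job
echo "job 2 finished"       # placeholder for the second hadoop job
EOF
chmod +x "$workdir/run_jobs.sh"

# nohup keeps the script running after logout; all output goes to a log file.
nohup "$workdir/run_jobs.sh" < /dev/null > "$workdir/run_jobs.log" 2>&1 &
job_pid=$!

# wait is valid here because the script is a background child of this shell.
wait "$job_pid"
cat "$workdir/run_jobs.log"
```

In day-to-day use the final wait would usually be dropped, so that the terminal is free while the jobs run and the log file can be inspected later.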

Re: does anyone have idea on how to run multiple sequential jobs with bash script

You have another problem in that Hadoop is still initialising -- this will cause subsequent jobs to fail.

I've not yet migrated to 17.0 (I still use 16.3), but all my jobs are done from nohuped scripts. If you really want to check on the running status and busy wait, you can look at the jobtracker log and poll it for when everything is finished.

My turn to ask a question in the next post ..

Miles

2008/6/10 Richard Zhang <richardtechzh@...>:
> HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell
>
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker
> still
> initializing

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
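
One way to picture the polling idea above is a small retry loop. This is a sketch, not something from the thread: retry_until_ok and fake_probe are made-up names, and the probe here only imitates a JobTracker that needs a few attempts before it is ready. A real probe (for example a grep over the jobtracker log) would have to be chosen for the actual cluster.

```shell
#!/usr/bin/env bash
# Sketch only: retry_until_ok and fake_probe are illustrative names,
# not part of Hadoop.
set -u

# retry_until_ok MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_TRIES attempts have been made.
retry_until_ok() {
  local max_tries=$1 delay=$2 attempt
  shift 2
  for ((attempt = 1; attempt <= max_tries; attempt++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Stand-in probe: fails twice, then succeeds, imitating a JobTracker that
# needs a moment to finish initializing.
tries=0
fake_probe() {
  tries=$((tries + 1))
  [ "$tries" -ge 3 ]
}

if retry_until_ok 5 0 fake_probe; then
  echo "ready after $tries attempts"
else
  echo "gave up after $tries attempts" >&2
fi
```

With a real probe, the delay would be a few seconds rather than 0, and the first job would be submitted only after retry_until_ok returns successfully.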

Re: does anyone have idea on how to run multiple sequential jobs with bash script

I am not totally sure if I understand the problem that you face, but we do the following in version 0.16.4 (where the hod shell is deprecated).

a) Use shell scripts to echo commands into a runme.hod script.

b) An example of a runme.hod script is:

    hadoop jar /grid/0/hadoop/current/hadoop-streaming.jar -input xxx -mapper "./mr_merge_mapper_bin" -output xxx -reducer "./mr_merge_reducer_bin --num_feats 24" -file ../m45scripts/mr_merge_reducer_bin -file ../m45scripts/mr_merge_mapper_bin

   In this runme.hod you can include many such calls, thereby running jobs in sequence.

c) chmod +x runme.hod

d) hod script -d $working_dir -n $machines -s $abs_path_to_runme --hod.script-wait-time=$wait_time

I usually set wait_time to 20 (seconds), which is enough to deal with the initializing problem.

Hope this helped...
Ashish

On Tue, Jun 10, 2008 at 6:10 PM, Miles Osborne <miles@...> wrote:
> You have another problem in that Hadoop is still initialising --this will
> cause subsequent jobs to fail.
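
Steps a) and c) above might look something like the sketch below. It is only an illustration: a temporary file stands in for runme.hod, and the output directory names (out-...) are placeholders, while the example jar and program names are the ones from the original post. The hod invocation itself is left out because it needs a real HOD setup.

```shell
#!/usr/bin/env bash
# Sketch only: the temp file stands in for runme.hod and the output
# directory names are placeholders.
set -eu

script=$(mktemp)

# Step a) echo one hadoop command per job into the script, in run order.
: > "$script"
for job in randomwriter randomtextwriter; do
  echo "hadoop jar hadoop-0.17.0-examples.jar $job out-$job" >> "$script"
done

# Step c) make it executable; it would then be passed to "hod script ... -s".
chmod +x "$script"
cat "$script"
```

Because the generated file is just a list of foreground commands, the jobs run in the order in which they were echoed.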

Re: does anyone have idea on how to run multiple sequential jobs with bash script

Depending on the nature of your jobs, Cascading has a built-in topological scheduler. It will schedule all your work as its dependencies are satisfied, dependencies being source data and inter-job intermediate data.

http://www.cascading.org

The first catch is that you will still need bash to start/stop your cluster and to start the Cascading job (per your example).

The second catch is that you currently must use the Cascading API (or the Groovy API) to assemble your data processing flows. Hopefully in the next couple of weeks we will have a means to support custom/raw Hadoop jobs as members of a set of dependent jobs.

This feature is being delayed by our adding support for stream assertions: the ability to validate data during runtime but have the assertions 'planned' out of the process flow on demand, i.e. for production runs. And for stream traps: built-in support for siphoning off bad data into side files so long-running (or low-fidelity) jobs can continue running without losing any data.

You can read more about these features here:

http://groups.google.com/group/cascading-user

ckw

On Jun 10, 2008, at 2:48 PM, Meng Mao wrote:
> I'm interested in the same thing -- is there a recommended way to batch
> Hadoop jobs together?

--
Chris K Wensel
chris@...
http://chris.wensel.net/
http://www.cascading.org/
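
To illustrate what "topological" ordering of dependent jobs means, here is a small sketch using the standard tsort utility. This is not Cascading's API and says nothing about how Cascading is implemented; the job names are hypothetical and echo stands in for launching a job.

```shell
#!/usr/bin/env bash
# Sketch only: job names are hypothetical; echo stands in for real job launches.
set -eu

# Each line reads "prerequisite dependent": the left job must finish first.
order=$(tsort <<'EOF'
generate clean
clean join
join report
EOF
)

# Run the jobs one at a time in dependency order.
for job in $order; do
  echo "running $job"
done
```

A scheduler like the one described above works from the same kind of dependency information, but derives it from the data each job reads and writes rather than from a hand-written list.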

Re: does anyone have idea on how to run multiple sequential jobs with bash script

Just a quick plug for Cascading. Our team uses Cascading quite a bit and has found it to be a simpler way to write map-reduce jobs. The guys using it find it very helpful.

On Wed, Jun 11, 2008 at 1:31 PM, Chris K Wensel <chris@...> wrote:
> Depending on the nature of your jobs, Cascading has built in a topological
> scheduler. It will schedule all your work as their dependencies are
> satisfied. Dependencies being source data and inter-job intermediate data.
>
> http://www.cascading.org

--
ted

RE: does anyone have idea on how to run multiple sequential jobs with bash script

Ted,

I find Cascading very similar to Pig. Would you care to comment here? If map-reduce programmers are to go to the next level (scripting/query language), which way should they go?

Thanks
Haijun

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@...]
Sent: Wednesday, June 11, 2008 2:16 PM
To: core-user@...
Subject: Re: does anyone have idea on how to run multiple sequential jobs with bash script

Just a quick plug for Cascading. Our team uses cascading quite a bit and found it to be a simpler way to write map reduce jobs. The guys using it find it very helpful.
|
|
Re: does anyone have idea on how to run multiple sequential jobs with bash script

On Jun 10, 2008, at 2:48 PM, Meng Mao wrote:

> I'm interested in the same thing -- is there a recommended way to batch
> Hadoop jobs together?

Hadoop Map-Reduce JobControl:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control
and
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#JobControl

Arun

> On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang <richardtechzh@...> wrote:
>
>> Hello folks:
>> I am running several hadoop applications on hdfs. To save the efforts in
>> issuing the set of commands every time, I am trying to use bash script to
>> run the several applications sequentially. To let the job finishes before
>> it is proceeding to the next job, I am using wait in the script like below.
>>
>> sh bin/start-all.sh
>> wait
>> echo cluster start
>> (bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D
>> test.randomwrite.bytes_per_map=107374182 rand)
>> wait
>> bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter -D
>> test.randomtextwrite.total_bytes=107374182 rand-text
>> bin/stop-all.sh
>> echo finished hdfs randomwriter experiment
>>
>> However, it always give the error like below. Does anyone have better idea
>> on how to run the multiple sequential jobs with bash script?
>>
>> HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell
>>
>> org.apache.hadoop.ipc.RemoteException:
>> org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker still initializing
>>   at org.apache.hadoop.mapred.JobTracker.ensureRunning(JobTracker.java:1722)
>>   at org.apache.hadoop.mapred.JobTracker.getNewJobId(JobTracker.java:1730)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>>
>>   at org.apache.hadoop.ipc.Client.call(Client.java:557)
>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>>   at $Proxy1.getNewJobId(Unknown Source)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>>   at $Proxy1.getNewJobId(Unknown Source)
>>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:696)
>>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
>>   at org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:276)
>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>   at org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:287)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>   at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>   at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>   at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
>
> --
> hustlin, hustlin, everyday I'm hustlin
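
A note on the quoted script itself, for anyone who wants to stay with plain bash: a command started in the foreground already blocks until it exits, so a simple sequence of commands runs one after another without `wait`. `wait` only applies to background jobs started with `&`, and "wait: pid 10 is not a child of this shell" is what bash prints when `wait` is given a number that is not the PID of one of its children (for example `wait 10`, possibly meant as a 10-second pause; that is a guess, since line 39 of the script is not shown). The "Job tracker still initializing" exception suggests the first job was submitted before the JobTracker was ready, which would fit start-all.sh returning before the daemons finish starting. Below is a minimal sketch of one way to handle that, not a tested recipe. It assumes that `bin/hadoop jar` exits non-zero when the submission fails and that resubmitting randomwriter is harmless. `retry` and `run_experiment` are hypothetical helper names, and the 10 attempts and 15-second delay are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: run Hadoop steps one after another. Foreground commands block
# until they exit, so no `wait` is needed between them.

# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Hypothetical driver using the commands from the quoted script.
# Each step must succeed before the next one starts.
run_experiment() {
  sh bin/start-all.sh || return 1
  echo "cluster start"
  # Assumption: a submission made while the JobTracker is still
  # initializing exits non-zero, so retrying it acts as the "wait".
  retry 10 15 bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter \
    -D test.randomwrite.bytes_per_map=107374182 rand || return 1
  bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter \
    -D test.randomtextwrite.total_bytes=107374182 rand-text || return 1
  bin/stop-all.sh
  echo "finished hdfs randomwriter experiment"
}
```

Calling `run_experiment` as the last line of the script runs the whole sequence and stops at the first step that fails. For jobs that depend on each other's output, the JobControl class linked above is the in-Hadoop alternative.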
|
|
Re: does anyone have idea on how to run multiple sequential jobs with bash script

Pig is much more ambitious than cascading. Because of the ambitions, simple
things got overlooked. For instance, something as simple as computing a
file name to load is not possible in pig, nor is it possible to write
functions in pig. You can hook to Java functions (for some things), but you
can't really write programs in pig. On the other hand, pig may eventually
provide really incredible capabilities including program rewriting and
optimization that would be incredibly hard to write directly in Java.

The point of cascading was simply to make life easier for a normal
Java/map-reduce programmer. It provides an abstraction for gluing together
several map-reduce programs and for doing a few common things like joins.
Because you are still writing Java (or Groovy) code, you have all of the
functionality you always had. But, this same benefit costs you the future
in terms of what optimizations are likely to ever be possible.

The summary for us (especially 4-6 months ago when we were deciding) is that
cascading is good enough to use now and pig will probably be more useful
later.

On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao <haijun@...> wrote:

> I find cascading very similar to pig, do you care to provide your comment
> here? If map reduce programmers are to go to the next level (scripting/query
> language), which way to go?
|
|
RE: does anyone have idea on how to run multiple sequential jobs with bash script

Thanks for sharing. We need to expose the Hadoop cluster to 'casual' users
for ad-hoc queries. I find it difficult to ask them to write map-reduce
programs, so Pig Latin comes in very handy in this case. However, for
continuous production data processing, hadoop+cascading sounds like a good
option.

Haijun

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@...]
Sent: Wednesday, June 11, 2008 5:01 PM
To: core-user@...
Subject: Re: does anyone have idea on how to run multiple sequential jobs with bash script

Pig is much more ambitious than cascading. Because of the ambitions, simple
things got overlooked. For instance, something as simple as computing a
file name to load is not possible in pig, nor is it possible to write
functions in pig. You can hook to Java functions (for some things), but you
can't really write programs in pig. On the other hand, pig may eventually
provide really incredible capabilities including program rewriting and
optimization that would be incredibly hard to write directly in Java.

The point of cascading was simply to make life easier for a normal
Java/map-reduce programmer. It provides an abstraction for gluing together
several map-reduce programs and for doing a few common things like joins.
Because you are still writing Java (or Groovy) code, you have all of the
functionality you always had. But, this same benefit costs you the future
in terms of what optimizations are likely to ever be possible.

The summary for us (especially 4-6 months ago when we were deciding) is that
cascading is good enough to use now and pig will probably be more useful
later.

On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao <haijun@...> wrote:

> I find cascading very similar to pig, do you care to provide your comment
> here? If map reduce programmers are to go to the next level (scripting/query
> language), which way to go?
|
|
Re: does anyone have idea on how to run multiple sequential jobs with bash script

Thanks Ted..

Couple quick comments.

At one level Cascading is a MapReduce query planner, just like PIG. The
difference is that the Cascading API is for public consumption and fully
extensible, whereas in PIG you typically interact with the PigLatin syntax.

Consequently, with Cascading, you can layer your own syntax on top of the
API. Currently there is Groovy support (Groovy is used to assemble the work;
it does not run on the mappers or reducers). I hear rumors about Jython
elsewhere.

A couple of Groovy examples (note these are obviously trivial; the DSL can
absorb tremendous complexity if need be)...
http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/wordcount.groovy
http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/widefinder.groovy

Since Cascading is in part a 'planner', it actually builds internally a new
representation from what the developer assembled and renders out the
necessary map/reduce jobs (and transparently links them) at runtime. As
Hadoop evolves, the planner will incorporate the new features and leverage
them transparently. Plus there are opportunities for identifying patterns
and applying different strategies (hypothetically map side vs reduce side
joins, for one). It is also conceivable (but untried) that different
planners can exist to target different systems other than Hadoop (making
your code/libraries portable). Much of this is true for PIG as well.
http://www.cascading.org/documentation/overview.html

Also, Cascading will at some point provide a PIG adapter, allowing PigLatin
queries to participate in a larger Cascading 'Cascade' (the topological
scheduler). Cascading is great with integration, connecting things outside
Hadoop with stuff to be done inside Hadoop. And PIG looks like a great way
to concisely represent a complex solution and execute it. There isn't any
reason they can't work together (it has always been the intention).

The takeaway is that with Cascading and PIG, users do not think in
MapReduce. With PIG, you think in PigLatin. With Cascading, you can use the
pipe/filter based API, or use your favorite scripting language and build a
DSL for your problem domain.

Many companies have done similar things internally, but they tend to be
nothing more than a scriptable way to write a map/reduce job and glue them
together. You still think in MapReduce, which in my opinion doesn't scale
well.

My (biased) recommendation is this.

Build out your application in Cascading. If part of the problem is best
represented in PIG, no worries: use PIG, and feed and clean up after PIG
with Cascading. And if you see a solvable bottleneck, and we can't convince
the planner to recognize the pattern and plan better, replace that piece of
the process with a custom MapReduce job (or more).

Solve your problem first, then optimize the solution, if need be.

ckw

On Jun 11, 2008, at 5:00 PM, Ted Dunning wrote:

> Pig is much more ambitious than cascading. Because of the ambitions, simple
> things got overlooked. For instance, something as simple as computing a
> file name to load is not possible in pig, nor is it possible to write
> functions in pig. You can hook to Java functions (for some things), but you
> can't really write programs in pig. On the other hand, pig may eventually
> provide really incredible capabilities including program rewriting and
> optimization that would be incredibly hard to write directly in Java.
>
> The point of cascading was simply to make life easier for a normal
> Java/map-reduce programmer. It provides an abstraction for gluing together
> several map-reduce programs and for doing a few common things like joins.
> Because you are still writing Java (or Groovy) code, you have all of the
> functionality you always had. But, this same benefit costs you the future
> in terms of what optimizations are likely to ever be possible.
>
> The summary for us (especially 4-6 months ago when we were deciding) is that
> cascading is good enough to use now and pig will probably be more useful
> later.
>
> On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao <haijun@...> wrote:
>
>> I find cascading very similar to pig, do you care to provide your comment
>> here? If map reduce programmers are to go to the next level
>> (scripting/query language), which way to go?

--
Chris K Wensel
chris@...
http://chris.wensel.net/
http://www.cascading.org/
|
|
Re: does anyone have idea on how to run multiple sequential jobs with bash script

> However, for continuous production data processing, hadoop+cascading
> sounds like a good option.

This will be especially true with stream assertions and traps (as mentioned
previously, and available in trunk). <grin>

I've written workloads for clients that render down to ~60 unique Hadoop
map/reduce jobs, all inter-related, from ~10 unique units of work
(internally lots of joins, sorts and math). I can't imagine having written
them by hand.

ckw

--
Chris K Wensel
chris@...
http://chris.wensel.net/
http://www.cascading.org/