09 Mar 2009 00:21
TAGS: bash dev parallel
Suppose you have a nice script that does its job pretty well, but you have figured out that running certain parts of it in parallel would speed things up.
This can be the case when you send a bunch of files to an Internet service that is generally fast, but whose connection setup is quite slow, so uploading 100 files one after another makes the script wait 100 times just to quickly upload each file.
Another situation is when you have a multi-core machine, say with eight processing units, but your script uses only one of them while it has a bunch of files to compile or to process in some CPU-expensive manner.
We'll use only Bash to smartly parallelize the tasks and speed up the slow part of your script.
First of all you need to know how many jobs you want to run in parallel (if you have 8 cores and a CPU-expensive part of the script, having more than 8 jobs does not help; a number between 4 and 8 will probably work best in this case).
#!/bin/bash
PROC_NUM=4
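If you would rather not hard-code the value, you can derive it from the machine. A small sketch, assuming the nproc utility from GNU coreutils, with getconf as a fallback (both are commonly available on Linux):
#!/bin/bash
# use as many jobs as there are online CPUs (nproc is from GNU coreutils;
# getconf _NPROCESSORS_ONLN is a common fallback on systems without it)
PROC_NUM=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)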
Generally, we'll ensure that no more than PROC_NUM processes are forked into the background at once. If there are already PROC_NUM processes running in the background, we'll wait a (fraction of a) second and check again before starting another task.
#!/bin/bash
PROC_NUM=4

function run_task() {
    # task to run
    # can be more than one line
    # can take parameters $1, $2, ...
    :   # no-op placeholder; replace it with the real work
}

function run_parallel() {
    # wait while PROC_NUM or more jobs are already running in the background
    while [ `jobs | grep Running | wc -l` -ge $PROC_NUM ]; do
        sleep 0.25
    done
    run_task "$@" &
}
run_task "$@" passes all the parameters passed to run_parallel to run_task. You can use "$@" in run_task to pass all the parameters to external command! The "$@" is the best choice when you have spaces, dollars and other special characters in parameters. It doesn't transform anything, it's completely safe (probably the only short way to pass all the parameters).
There are only two things left: invoking run_parallel and synchronizing the tasks, because you need to know when ALL the tasks have ended, right?
#!/bin/bash
PROC_NUM=4

function run_task() {
    # task to run
    # can be more than one line
    # can take parameters $1, $2, ...
    :   # no-op placeholder; replace it with the real work
}

function run_parallel() {
    # wait while PROC_NUM or more jobs are already running in the background
    while [ `jobs | grep Running | wc -l` -ge $PROC_NUM ]; do
        sleep 0.25
    done
    run_task "$@" &
}

function end_parallel() {
    # wait until no background jobs are left
    while [ `jobs | grep Running | wc -l` -gt 0 ]; do
        sleep 0.25
    done
}

# script content
cd /some/where/you/want

# now the parallel operations, for example in some while loop;
# reading from a process substitution instead of piping find into the loop
# keeps the loop in the current shell, so end_parallel can still see the jobs
while read -r file; do
    run_parallel "$file"
done < <(find)

# now you want to continue when ALL parallel tasks ended
end_parallel

# the linear script code again
cd /some/where/else
make something
That's all! There is, however, a different approach to this:
#!/bin/bash

function parallel() {
    local PROC_NUM="$1"
    local SLEEP_TIME="$2"
    shift; shift
    while [ `jobs | grep Running | wc -l` -ge $PROC_NUM ]; do
        sleep $SLEEP_TIME
    done
    "$@" &
}
This function acts as a wrapper around a non-parallel command and runs it in the background, ensuring that no more than PROC_NUM processes run at once. If there are already PROC_NUM processes running in the background, the wrapper waits SLEEP_TIME seconds and then re-checks the number of background jobs.
Invoking:
parallel PROC_NUM SLEEP_TIME /usr/bin/some-command arguments ...
so
parallel 4 0.5 ls -R /tmp
means: run ls -R /tmp in the background if no more than 3 processes are already running in the background; otherwise wait 0.5 seconds and try again. Then run ls -R /tmp if no more than 3 processes are already running in the background; otherwise wait 0.5 seconds and try again. Then run ls -R /tmp if …
Quite nice, isn't it?
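For completeness, here is a rough sketch of driving the wrapper from a loop (gzip and the directory are placeholders, and the parallel function from above is assumed to be defined in the same script); the final wait is the Bash builtin that blocks until all background jobs have finished:
#!/bin/bash
# compress every file under /some/dir, at most 4 jobs at once
while read -r file; do
    parallel 4 0.5 gzip "$file"
done < <(find /some/dir -type f)
# block here until every background job has finished
wait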
Thank you for this hint. Parallel is very nice (present in the Ubuntu repositories, though not in Debian's).
Piotr Gabryjeluk
visit my blog
xargs (the GNU version) supports parallelism; xjobs (from Solaris but portable) does a better job of isolating output.
Before you choose between GNU Parallel, xjobs, and xargs you may want to read about the differences.
http://www.gnu.org/software/parallel/man.html#differences_between_xargs_and_gnu_parallel
http://www.gnu.org/software/parallel/man.html#differences_between_xjobs_and_gnu_parallel
A new intro video shows more examples: http://www.youtube.com/watch?v=OpaiGYxkSuQ
Check it out, I just posted another take on the issue. I don't use polling:
http://johannes.jakeapp.com/blog/category/fun-with-linux/201104/parallel-task-execution-in-bash
After those helpful comments, my favorite method to run things in parallel now is xargs :-).
Piotr Gabryjeluk
visit my blog
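(For reference, the xargs approach could look roughly like this, assuming GNU xargs, whose -P option sets the number of parallel processes; gzip and the directory are placeholders:
# compress files with at most 4 gzip processes at once
find /some/dir -type f -print0 | xargs -0 -P 4 -n 1 gzip
)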