Running things in parallel in BASH

09 Mar 2009 00:21

Suppose you have a nice script that does its job well, but you have figured out that running certain parts of it in parallel would speed things up.

This can be the case when you send a bunch of files to an Internet service that is generally fast, but whose connection setup is quite slow, so uploading 100 files one after another makes the script sit through 100 connection sequences just to quickly upload each file.

Another situation could be a multi-core machine: you have, say, eight processing units but use only one in your script, and you have a bunch of files to compile or to process in some CPU-expensive manner.

We'll use only BASH to smartly parallelize the tasks and speed up the slow part of your script.

First of all, you need to decide how many jobs you want to run in parallel (if you have 8 cores and a CPU-expensive part of the script, having more than 8 jobs does not help; a number between 4 and 8 will probably do best in this case).

#!/bin/bash

PROC_NUM=4

Generally, we'll ensure that no more than PROC_NUM processes are forked into the background at once. If there are already PROC_NUM processes running in the background, we'll wait a (fraction of a) second and check again.

#!/bin/bash

PROC_NUM=4

function run_task() {
    # task to run
    # can be more than one line
    # can take parameters $1, $2, ...
    :   # no-op placeholder; a Bash function body must contain at least one command
}

function run_parallel() {
    # wait while PROC_NUM or more jobs are running in the background
    while [ "$(jobs | grep -c Running)" -ge "$PROC_NUM" ]; do
        sleep 0.25
    done

    run_task "$@" &
}

run_task "$@" passes all the parameters given to run_parallel on to run_task. You can use "$@" inside run_task to pass all the parameters to an external command, too. "$@" is the best choice when parameters contain spaces, dollar signs and other special characters: it doesn't transform anything, so it's completely safe (and probably the only short way to pass all the parameters along).
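A minimal sketch of the difference (the function names here are made up for the demonstration; the quoted form preserves each parameter exactly, while unquoted $* re-splits on whitespace):

```shell
#!/bin/bash
# Demonstrates why "$@" is the safe way to forward parameters.

show_args() {
    printf 'got %d argument(s):' "$#"
    printf ' [%s]' "$@"
    echo
}

pass_quoted()   { show_args "$@"; }  # forwards each parameter intact
pass_unquoted() { show_args $*;   }  # word-splits the parameters again

pass_quoted   "two words" 'a $dollar'   # → got 2 argument(s): [two words] [a $dollar]
pass_unquoted "two words" 'a $dollar'   # → got 4 argument(s): [two] [words] [a] [$dollar]
```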

There are only two things left: invoking the run_parallel and synchronizing the tasks — you need to know when ALL the tasks ended, right?

#!/bin/bash

PROC_NUM=4

function run_task() {
    # task to run
    # can be more than one line
    # can take parameters $1, $2, ...
    :   # no-op placeholder; a Bash function body must contain at least one command
}

function run_parallel() {
    while [ "$(jobs | grep -c Running)" -ge "$PROC_NUM" ]; do
        sleep 0.25
    done

    run_task "$@" &
}

function end_parallel() {
    # the builtin `wait` would also do the job; this polls the same way as above
    while [ "$(jobs | grep -c Running)" -gt 0 ]; do
        sleep 0.25
    done
}

# script content

cd /some/where/you/want

# now the parallel operations
# for example in some while

# note: `find | while ...` would run the loop in a subshell, so the
# parent shell's `jobs` would not see the background tasks; feeding
# the loop from process substitution keeps it in the current shell
while read -r file; do
    run_parallel "$file"
done < <(find .)

# now you want to continue when ALL parallel tasks ended

end_parallel

# the linear script code again

cd /some/where/else
make something
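To make the skeleton concrete, here is a minimal self-contained sketch: run_task is filled in with gzip (an assumption chosen just for this demonstration, as are the temporary directory and file names), compressing eight small files at most four at a time:

```shell
#!/bin/bash
# Hypothetical workload for the framework above: parallel gzip.

PROC_NUM=4

run_task() {
    gzip -f "$1"    # the real work: compress one file
}

run_parallel() {
    while [ "$(jobs | grep -c Running)" -ge "$PROC_NUM" ]; do
        sleep 0.25
    done
    run_task "$@" &
}

end_parallel() {
    while [ "$(jobs | grep -c Running)" -gt 0 ]; do
        sleep 0.25
    done
}

# set up some demo files in a temporary directory
dir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
    echo "content $i" > "$dir/file$i.txt"
done

# compress them, at most PROC_NUM at a time
while read -r file; do
    run_parallel "$file"
done < <(find "$dir" -name '*.txt')

end_parallel    # block until every background gzip has finished
```

After end_parallel returns, every file in the demo directory has been replaced by its .gz counterpart.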

That's all! There is, though, a different approach to this:

#!/bin/bash

function parallel() {
    local PROC_NUM="$1"
    local SLEEP_TIME="$2"
    shift 2
    while [ "$(jobs | grep -c Running)" -ge "$PROC_NUM" ]; do
        sleep "$SLEEP_TIME"
    done
    "$@" &
}

This function acts as a wrapper around a non-parallel command and runs it in the background, ensuring that no more than PROC_NUM processes run at once. If there are already PROC_NUM processes running in the background, the wrapper waits SLEEP_TIME seconds before re-checking the number of background jobs.

Invoking:

parallel PROC_NUM SLEEP_TIME /usr/bin/some-command arguments ...

so

parallel 4 0.5 ls -R /tmp

means: run ls -R /tmp in the background if no more than 3 processes are already running in the background. Otherwise wait 0.5 seconds and try again. Then run ls -R /tmp if no more than 3 processes are already running in the background. Otherwise wait 0.5 seconds and try again. Then run ls -R /tmp if …
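For completeness, a minimal self-contained sketch of the wrapper in action (the touch workload, file names and temporary directory are assumptions made up for this demonstration); the builtin wait at the end blocks until every background job has finished:

```shell
#!/bin/bash
# Hypothetical end-to-end use of the generic wrapper.

parallel() {
    local PROC_NUM="$1"
    local SLEEP_TIME="$2"
    shift 2
    while [ "$(jobs | grep -c Running)" -ge "$PROC_NUM" ]; do
        sleep "$SLEEP_TIME"
    done
    "$@" &
}

# set up some demo files
dir=$(mktemp -d)
for i in a b c d e f; do
    echo "data $i" > "$dir/$i.txt"
done

# mark each file as processed, at most 3 jobs at a time, polling every 0.1 s
while read -r file; do
    parallel 3 0.1 touch "$file.done"
done < <(find "$dir" -name '*.txt')

wait    # collect all remaining background jobs
```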

Quite nice, isn't it?


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License