DataStage Parallel Processing
The following figure represents one of the simplest jobs you could have: a data source,
a Transformer (conversion) stage, and the data target. The links between the
stages represent the flow of data into or out of a stage.
In a parallel job, each stage would normally (but not always) correspond to a
process. You can have multiple instances of each process to run on the available
processors in your system.
A parallel DataStage job incorporates two basic types of parallel processing —
pipeline and partitioning. Both of these methods are used at runtime by the
Information Server engine to execute the simple job shown in Figure 1-8.
To the DataStage developer, this job appears the same on the Designer
canvas, but it can be optimized through advanced properties.
Pipeline parallelism
In the following example, all stages run concurrently, even in a single-node
configuration. As data is read from the Oracle source, it is passed to the
Transformer stage for transformation and then passed on to the DB2 target.
Instead of waiting for all the source data to be read, rows are passed to the
subsequent stages as soon as the source data stream starts to produce them.
This method is called pipeline parallelism, and all three stages in our
example operate simultaneously regardless of the degree of parallelism in the
configuration file. The Information Server engine always executes jobs with
pipeline parallelism.
If you ran the example job on a system with multiple processors, the stage
reading the data would start on one processor and begin filling a pipeline with
the data it had read. The Transformer stage would start running as soon as there
was data in the pipeline, process it, and start filling another pipeline. The stage
writing the transformed data to the target database would similarly start
writing as soon as data was available. Thus all three stages are
operating simultaneously.
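The sketch below illustrates the idea outside of DataStage: three stages run as concurrent threads connected by in-memory queues that act as the pipelines, so the downstream stages start consuming rows before the upstream stage has finished producing them. The stage names and the doubling transformation are assumptions made for this sketch only; this is not DataStage or Information Server engine code.

import queue
import threading

SENTINEL = None  # marks the end of the data stream

def read_source(out_q):
    # Source stage: produce rows and push each one downstream immediately.
    for row in range(10):
        out_q.put(row)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    # Transformer stage: starts as soon as the first row arrives in its pipeline.
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(row * 2)  # stand-in for a real conversion
    out_q.put(SENTINEL)

def write_target(in_q):
    # Target stage: writes rows as soon as they become available.
    while (row := in_q.get()) is not SENTINEL:
        print("writing", row)

pipe1, pipe2 = queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=read_source, args=(pipe1,)),
    threading.Thread(target=transform, args=(pipe1, pipe2)),
    threading.Thread(target=write_target, args=(pipe2,)),
]
for stage in stages:
    stage.start()
for stage in stages:
    stage.join()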
Partition parallelism
When large volumes of data are involved, you can use the power of parallel
processing to best advantage by partitioning the data into a number of
separate sets, with each partition being handled by a separate instance of the
job stages. Partition parallelism is accomplished at runtime, rather than
through the manual partitioning that traditional systems require.
The DataStage developer needs only to specify the algorithm used to partition the
data, not the degree of parallelism or where the job will execute. Using
partition parallelism, the same job would effectively be run simultaneously by
several processors, each handling a separate subset of the total data. At the
end of the job the data partitions can be collected back together again and
written to a single data target. This is shown in the following figure.
Attention: You do not need multiple processors to run in parallel. A single
processor is capable of running multiple concurrent processes.
Figure: Partition parallelism
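As a rough illustration of the concept (again outside of DataStage, with hypothetical field names and a fixed partition count chosen for the sketch), the following Python fragment hash-partitions rows on a key, hands each partition to a separate worker process, and collects the results back together.

from multiprocessing import Pool

NUM_PARTITIONS = 4  # degree of parallelism assumed for this sketch

def process_partition(rows):
    # Each worker handles its own subset of the data independently.
    return [dict(row, last_name_upper=row["last_name"].upper()) for row in rows]

if __name__ == "__main__":
    data = [{"last_name": name} for name in ("Smith", "Jones", "Lee", "Garcia", "Chen")]

    # Split the rows into separate sets by hashing the partitioning key.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for row in data:
        partitions[hash(row["last_name"]) % NUM_PARTITIONS].append(row)

    # One worker per partition; collect the partitions back together at the end.
    with Pool(NUM_PARTITIONS) as pool:
        results = pool.map(process_partition, partitions)
    collected = [row for part in results for row in part]
    print(collected)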
Combining pipeline and partition parallelism
The Information Server engine combines pipeline and partition parallel
processing to achieve even greater performance gains. In this scenario you
would have stages processing partitioned data and filling pipelines, so that the
next stage could start working on a partition before the previous stage had finished with it.
This is shown in the following figure.
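The short sketch below (illustrative only, using the same hypothetical threading approach as before rather than anything DataStage-specific) shows both techniques at once: the data is split into partitions, and within each partition the stages run as a small pipeline, so a downstream stage begins work on a partition before the upstream stage has finished with it.

import queue
import threading

NUM_PARTITIONS = 2
SENTINEL = None

def run_pipeline(rows, results, index):
    # One pipeline per partition: transform and write run concurrently.
    pipe = queue.Queue()

    def transform():
        for row in rows:
            pipe.put(row * 2)  # stand-in transformation
        pipe.put(SENTINEL)

    def write_target():
        written = []
        while (row := pipe.get()) is not SENTINEL:
            written.append(row)
        results[index] = written

    stages = [threading.Thread(target=transform), threading.Thread(target=write_target)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()

data = list(range(10))
partitions = [data[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]
results = [None] * NUM_PARTITIONS

workers = [threading.Thread(target=run_pipeline, args=(part, results, i))
           for i, part in enumerate(partitions)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(results)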
In some circumstances you might want to re-partition your data between
stages. This could happen, for example, where you want to group the data
differently. Suppose that you have initially processed the data based on customer
last name, but now you want to process data grouped by zip code. You will
have to re-partition to ensure that all customers sharing the same zip code are in
the same group. DataStage allows you to re-partition between stages as and
when necessary. With the Information Server engine, re-partitioning happens in
memory between stages, instead of being written to disk.
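As a simple illustration of re-partitioning (again a hypothetical sketch rather than DataStage code, with made-up field names), the fragment below takes data that was partitioned by customer last name and redistributes it in memory so that all rows sharing the same zip code land in the same partition.

NUM_PARTITIONS = 3

def repartition(partitions, key):
    # Redistribute the rows across partitions using a new partitioning key.
    new_partitions = [[] for _ in range(NUM_PARTITIONS)]
    for part in partitions:
        for row in part:
            new_partitions[hash(row[key]) % NUM_PARTITIONS].append(row)
    return new_partitions

rows = [
    {"last_name": "Smith", "zip": "10001"},
    {"last_name": "Jones", "zip": "10001"},
    {"last_name": "Lee",   "zip": "94105"},
]

# First grouping: partitioned by customer last name.
by_name = [[] for _ in range(NUM_PARTITIONS)]
for row in rows:
    by_name[hash(row["last_name"]) % NUM_PARTITIONS].append(row)

# Re-partition so that rows sharing a zip code end up in the same partition.
by_zip = repartition(by_name, "zip")
print(by_zip)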