DataStage EE
Introduction to Datastage Enterprise Edition (EE)
Datastage Enterprise Edition, formerly known as Datastage PX (Parallel Extender), has recently become part of IBM InfoSphere Information Server, and its official name is now IBM InfoSphere DataStage.
In recent versions of Datastage (7.5, 8, 8.1), IBM has not released any updates to Datastage Server Edition (although it is still available in Datastage 8) and appears to be putting most of its effort into developing and enriching the Enterprise Edition of the InfoSphere product line.
Key Datastage Enterprise Edition concepts
Parallel processing
Datastage jobs are highly scalable thanks to parallel processing. The EE architecture is process-based (rather than thread-based), platform independent, and built around the concept of processing nodes. Datastage EE can execute jobs on multiple CPUs (nodes) in parallel, which means that a properly designed job can run across the resources of a single machine or take advantage of parallel platforms such as a cluster, grid, or MPP (massively parallel processing) architecture.
Partitioning and Pipelining
Partitioning means breaking a dataset into smaller sets and distributing them evenly across the partitions (nodes). Each partition of data is processed by the same operation and transformed in the same way.
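The idea can be illustrated outside Datastage with a minimal Python sketch. It assumes hash partitioning on a key column and a trivial transformation; the function names, the column names and the choice of CRC32 as the hash are illustrative assumptions, not the Datastage implementation. The partitions are processed one after another here for clarity, whereas a parallel engine would process them at the same time.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, so equal keys always
        # land in the same partition.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def transform(row):
    """The same operation is applied to every row of every partition."""
    return {**row, "amount_cents": round(row["amount"] * 100)}


# Hypothetical input: 100 rows keyed by customer_id.
rows = [{"customer_id": i, "amount": i * 1.5} for i in range(1, 101)]

partitions = hash_partition(rows, "customer_id", 4)
results = [[transform(row) for row in part] for part in partitions]

for number, part in enumerate(results):
    print(f"partition {number}: {len(part)} rows")
```

Hashing on a key keeps all rows with the same key value in one partition, which matters for operations such as joins and aggregations. How evenly the rows spread depends on the key values.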
The main benefit of partitioning is near-linear scalability. For instance, once the data is evenly distributed, a 4-CPU server can process it roughly four times faster than a single-CPU machine, provided the job is not limited by I/O or by an uneven (skewed) distribution of rows.
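Why even distribution matters can be shown with a rough back-of-the-envelope model. The sketch below assumes that processing time is proportional to the number of rows and ignores startup, I/O and repartitioning overhead; under those assumptions a partitioned job finishes when its largest partition finishes. The function name and the row counts are illustrative.

```python
def estimated_speedup(partition_sizes):
    """Estimate speedup over one CPU, assuming time is proportional to rows.

    The parallel run takes as long as its largest partition, so the
    speedup is total rows divided by the largest partition size.
    """
    largest = max(partition_sizes)
    if largest == 0:
        raise ValueError("at least one partition must contain rows")
    return sum(partition_sizes) / largest


even = [250, 250, 250, 250]    # 1000 rows spread evenly over 4 nodes
skewed = [700, 100, 100, 100]  # the same 1000 rows, badly skewed

print(estimated_speedup(even))
print(round(estimated_speedup(skewed), 2))
```

In this model the even layout reaches the full factor of four, while the skewed layout stays below 1.5 because three of the four nodes sit idle while the first one works through its 700 rows.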
Pipelining means that each part of an ETL process (Extract, Transform, Load) is executed simultaneously, not sequentially. The key concept of ETL Pipeline processing is to start the Transformation and Loading tasks while the Extraction phase is still running.
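A small Python sketch using generators shows the ordering effect of pipelining: each record flows through all three phases before the next record is extracted, so loading begins long before extraction ends. This is only an analogy. Generators interleave the phases on a single thread, whereas a parallel engine runs them as separate, concurrent processes. All names and the sample data are illustrative assumptions.

```python
def extract(source, log):
    """Yield records one at a time instead of reading everything first."""
    for record in source:
        log.append(("extract", record))
        yield record


def transform(records, log):
    """Transform each record as soon as it arrives from extract()."""
    for record in records:
        log.append(("transform", record))
        yield record * 10


def load(records, log):
    """Consume transformed records and collect them in the target."""
    target = []
    for record in records:
        log.append(("load", record))
        target.append(record)
    return target


log = []
target = load(transform(extract([1, 2, 3], log), log), log)

for step in log:
    print(step)
```

The log shows the first record being loaded before the second one is extracted. A sequential design would instead record all extract steps, then all transform steps, then all load steps, and would need to hold the intermediate results in between.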
Datastage Enterprise Edition automatically combines pipelining, partitioning and parallel processing. The concept is hidden from a Datastage programmer. The job developer only chooses a method of data partitioning and the Datastage EE engine will execute the partitioned and parallelized processes.
Differences between Datastage Enterprise Edition and Server Edition
1. The major difference between Infosphere Datastage Enterprise and Server edition is that Enterprise Edition (EE) introduces Parallel jobs. Parallel jobs support a completely new set of stages, which implement the scalable and parallel data processing mechanisms. In most cases parallel jobs and stages look similar to the Datastage Server objects, but their capabilities are very different.
In rough outline:
* Parallel jobs are executable Datastage programs, managed and controlled by the Datastage Server runtime environment
* Parallel jobs have a built-in mechanism for Pipelining, Partitioning and Parallelism. In most cases no manual intervention is needed to apply those techniques optimally.
* Parallel jobs are a lot faster in ETL tasks such as sorting, filtering and aggregating
2. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language).
OSH executes operators - instances of executable C++ classes, pre-built components representing stages used in Datastage jobs.
Server Jobs are compiled into BASIC, which is an interpreted pseudo-code. This is why parallel jobs run faster, even if processed on one CPU.
3. Datastage Enterprise Edition adds functionality to the traditional server stages, for instance record-level and column-level format properties.
4. Datastage EE also brings completely new stages that implement the parallel concept, for instance:
* Enterprise Database Connectors for Oracle, Teradata & DB2
* Development and Debug stages - Peek, Column Generator, Row Generator, Head, Tail, Sample ...
* Data set, File set, Complex flat file, Lookup File Set ...
* Join, Merge, Funnel, Copy, Modify, Remove Duplicates ...
5. When processing large data volumes, Datastage EE jobs are the right choice. For smaller data environments, however, Server jobs may simply be easier to develop, understand and manage.
When a company has both Server and Enterprise licenses, both types of jobs can be used.
6. Sequence jobs are the same in Datastage EE and Server editions.