DATASTAGE TUTORIAL, GUIDES AND TRAINING
This blog aims at providing free tutorials, guides and other study materials for IBM WebSphere DataStage.

DataStage Overview

DataStage Certification Exam 000-415 Dumps Special
[Image: scanned certification page — Checkout Pages 1-10: http://datastage-tutorials.blogspot.com/2010/04/datastage-certification-dumps-pages-1.html]

IBM InfoSphere DataStage integrates data across multiple sources and target applications, handling high volumes of data.

IBM InfoSphere DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and the IBM InfoSphere family. It uses a graphical notation to construct data integration solutions and is available in several editions, such as the Server Edition and the Enterprise Edition. Like several other IBM products (e.g. IBM WebSphere Portal from the IBM Lotus family), DataStage belongs to a different brand family than its name suggests.

[Embedded presentation: IBM InfoSphere DataStage Data Flow and Job Design — http://www.slideshare.net/divjeev/ibm-info-sphere-datastage-data-flow-and-job-design]

DataStage was conceived at VMark, a spin-off from Prime Computer that developed two notable products: the UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in the first half of 1996. Peter Weyman was VMark VP of Strategy and identified the ETL market as an opportunity.
He appointed Lee Scheffler as the architect and conceived the product brand name "Stage" to signify modularity and component-orientation. This tag was used to name DataStage and was subsequently used in the related products QualityStage, ProfileStage, MetaStage and AuditStage. Lee Scheffler presented the DataStage product overview to the board of VMark in June 1996 and it was approved for development. The product was in alpha testing in October 1996, beta testing in November, and became generally available in January 1997.

VMark acquired UniData in October 1997 and renamed itself Ardent Software. In 1999 Ardent Software was acquired by Informix, the database software vendor. In April 2001 IBM acquired Informix and took just the database business, leaving the data integration tools to be spun off as an independent software company called Ascential Software. In November 2001, Ascential Software Corp. of Westboro, Mass. acquired privately held Torrent Systems Inc. of Cambridge, Mass. for $46 million in cash, and stated a commitment to integrate Orchestrate's parallel processing capabilities directly into the DataStageXE platform. In March 2005 IBM acquired Ascential Software and made DataStage part of the WebSphere family as WebSphere DataStage. In 2006 the product was released as part of the IBM Information Server under the Information Management family, but was still known as WebSphere DataStage. In 2008 the suite was renamed InfoSphere Information Server and the product was renamed InfoSphere DataStage.

Architecture

[Figure: DataStage architecture]

DataStage integrates data on demand with a high-performance parallel framework, extended metadata management, and enterprise connectivity. It:

* Supports the collection, integration and transformation of large volumes of data, with data structures ranging from simple to highly complex.
* Offers a scalable platform that enables companies to solve large-scale business problems through high-performance processing of massive data volumes.
* Supports real-time data integration.
* Enables developers to maximize speed, flexibility and effectiveness in building, deploying, updating and managing their data integration infrastructure.
* Provides complete connectivity between any data source and any application.

What is IBM WebSphere DataStage?

• Design jobs for Extraction, Transformation, and Loading (ETL)
• An ideal tool for data integration projects, such as data warehouses, data marts, and system migrations
• Import, export, create, and manage metadata for use within jobs
• Schedule, run, and monitor jobs, all within DataStage (or from the command line — see the sketch below)
• Administer your DataStage development and execution environments
• Create batch (controlling) jobs
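Jobs can also be run and monitored outside the GUI with the dsjob command-line utility that ships with the DataStage server. A minimal sketch — the project name dstage1 and job name LoadCustomerDim are hypothetical:

    # Run the job and wait for it to finish
    dsjob -run -mode NORMAL -wait dstage1 LoadCustomerDim

    # Inspect its status and a summary of its log afterwards
    dsjob -jobinfo dstage1 LoadCustomerDim
    dsjob -logsum dstage1 LoadCustomerDim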
DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses. It provides the tools you need to build, manage, and expand them. With DataStage, you can build solutions faster and give users access to the data and reports they need.

With DataStage you can:
• Design the jobs that extract, integrate, aggregate, load, and transform the data for your data warehouse or data mart.
• Create and reuse metadata and job components.
• Run, monitor, and schedule these jobs.
• Administer your development and execution environments.

[Figure: Information Server backbone]

Linked Blogs

I am going to link every blog that links to my blog. So, if you placed a link to my blog on your blog, I will place a link to your blog too. Please leave your blog link in the comment box and I will add your blog to the list.

Blog links:
http://maybe-she-does.blogspot.com/
http://strobist.blogspot.com/

Leave a link to get into this list.

DataStage Concepts

[Figure: DataStage architecture — clients on top, engines below]

The top half of the figure displays the clients. Below them are two engines: the Server engine, which runs DataStage server jobs, and the parallel engine, which runs parallel jobs.
Our focus in this course is on parallel jobs.

The DataStage client components are:

Administrator
Administers DataStage projects and conducts housekeeping on the server.

Designer
Creates DataStage jobs that are compiled into executable programs.

Director
Used to run and monitor DataStage jobs.

The Repository is used to store DataStage objects. The Repository is shared with other applications in the suite.

DataStage Certification

Certification dump pages (exam 000-415):
Page 1 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-1.html
Page 2 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-2.html
Page 3 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-3.html
Page 4 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-4.html
Page 5 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-5.html
Page 6 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-6.html
Page 7 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-7.html
Page 8 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-8.html
Page 9 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-9.html
Page 10 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-10.html

DataStage Tutorial

Downloads (easy-share.com):
DataStage Overview — http://www.easy-share.com/1907916970/IBM InfoSphere DataStage Overview.pdf
DataStage Stages — http://www.easy-share.com/1907916978/IBM InfoSphere DataStage Stages.pdf
DataStage Parallel Jobs — http://www.easy-share.com/1907918420/Data Stage Tutorial.pdf
DataStage Basics Part-1 — http://www.easy-share.com/1908309865/DataStage V8 Basics Part-1.pdf
DataStage Basics Part-2 — http://www.easy-share.com/1908309870/DataStage V8 Basics Part-2.pdf
DataStage Advanced-1 — http://www.easy-share.com/1908309998/DataStage V8 Special Topics Part-1.pdf
DataStage Advanced-2 — http://www.easy-share.com/1908310006/DataStage V8 Special Topics Part-2.pdf
DataStage Lab Part-1 — http://www.easy-share.com/1908309969/DataStage Lab Exercises Part-1.pdf
target="_blank">DataStage Lab Part-1</a><br/></u><br /><u><a href="http://www.easy-share.com/1908309979/DataStage Lab Exercises Part-2.pdf" target="_blank">DataStage Lab Part-2</a><br/></u>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-4312471951779759942011-03-27T06:46:00.001-07:002011-03-27T07:04:37.507-07:00Datastage interview questions1)How can we achieve parallelism ?<br /><br />The degree of parallelism is achieved by configuring the <br />multiple nodes in the config file. In the config files we can specify multiple nodes.<br /><br />2)What are Stage Variables, Derivations and Constants?<br /><br /> Stage Variable - An intermediate processing variable that retains value during read and doesnt pass the value into target column.<br />Derivation - Expression that specifies value to be passed on to the target column.<br />Constant - Conditions that are either true or false that specifies flow of data with a link. <br /><br />3)Compare and Contrast ODBC and Plug-In stages? <br /><br />ODBC : a) Poor Performance.<br />b) Can be used for Variety of Databases.<br />c) Can handle Stored Procedures.<br /><br />Plug-In: a) Good Performance.<br />b) Database specific.(Only one database)<br />c) Cannot handle Stored Procedures. <br /><br />4)How to run a Shell Script within the scope of a Data stage job? <br /><br />select the EDIT tab in the toolbar-> choose job properties-> select the job parameters->choose the Before/ After routines ->select the EXCESH command<br /><br />5)How do you merge two files in DS?<br /><br />Either used Copy command as a Before-job subroutine if the metadata of the 2 files are same or created a job to concatenate the 2 files into one if the metadata is different.<br /><br />6)How can we pass parameters from one job to another job by using command line prompt?<br /><br />We can pass parameter to a job using two ways .. using dsjob- command line or from a sequencer.<br />Other way would be -<br />You configure single parameter set ( version 8.0 onwards) and use the same in both the jobs so that they share the same set of parameters.<br /><br />7)When we are extracting the flatfiles, What are the basic required validations?<br /><br />Following are some common validations performed:<br />a) Check for blank lines and remove them.<br />b) Check the number of column in each row of the file.<br />c) If there is a trailer line in the flat file containing additional information like total number of records,then a cross check is performed to check if the number of records specified in the trailer and the actual number of records are same.<br />d) Check if a column contains blank value (If it is expected to have values).<br /><br />8)How do you do Usage analysis in datastage ?<br /><br />1. If u want to know some job is a part of a sequence, then in the Manager right click the job and select Usage Analysis. It will show all the jobs dependents.<br />2. To find how many jobs are using a particular table.<br />3. To find how many jobs are using a particular routine.<br />Like this, u can find all the dependents of a particular object.<br />Its like nested. U can move forward and backward and can see all the dependents.<br /><br /><br />9)Types of Parallel Processing? <br /><br />Parallel Processing is broadly classified into 2 types.<br />a) SMP - Symmetrical Multi Processing.<br />b) MPP - Massive Parallel Processing. 
10) What do you know about MetaStage?

MetaStage is used to handle metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling. These data definitions are stored in the repository and can be accessed through MetaStage.

11) What is the difference between a hashed file and a sequential file?

Searching for a record is very fast in a hashed file: given the hash key, the record's address can be computed directly. A sequential file must be searched record by record. Duplicate records can also be removed based on the hash key in a hashed file, which is not possible in a sequential file.

12) If I add a new environment variable in Windows, how can I access it in DataStage?

You can view, add, and access environment variables from the job properties in the Designer.

13) What is the difference between the Lookup File Set stage and the Lookup stage?

The Lookup stage performs the lookup of reference data against source data, whereas the Lookup File Set stage creates the reference data set that the Lookup stage uses to perform the lookup operation.

14) What is the difference between symmetric multiprocessing and massively parallel processing?

Symmetric Multiprocessing (SMP) — some hardware resources may be shared among processors. The processors communicate via shared memory and run a single operating system.

Cluster or Massively Parallel Processing (MPP) — known as "shared nothing", in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed; the processors have their own operating systems and communicate via a high-speed network.
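Question 7's flat-file checks can be prototyped outside DataStage with ordinary shell tools. A sketch, assuming a pipe-delimited file customers.txt with twelve columns and a one-line trailer holding the record count (all hypothetical):

    FILE=customers.txt
    EXPECTED_COLS=12

    # a) Remove blank lines
    grep -v '^$' "$FILE" > "$FILE.clean"

    # b) Report rows whose column count is wrong
    awk -F'|' -v n="$EXPECTED_COLS" 'NF != n {print NR ": " NF " columns"}' "$FILE.clean"

    # c) Compare the trailer (last line) with the actual number of data rows
    awk 'END {print "data rows:", NR-1, "- trailer line:", $0}' "$FILE.clean"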
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk7J01odJ9nrHV3N32-AEvutY-dQ3SqFTYHGJ3NbRGFk-ZhsnvatPNF0QkXXXPikmK6gQG6YYiep5FghAs_enLLwmsdpWjP8lSvGrqCdV9b3ZjFI1cLGigD8vMJNaRVFbLnHMzSxsUeQJh/s400/000-415_Certs_Page_02.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463923146161887842" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-74185853252499573712010-04-24T21:10:00.000-07:002010-04-24T21:11:32.808-07:00DataStage Certifications Dumps Page-3<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZdZcdzfWz5dkWSYtcSHj57hXlYPirSXRq0K3Z2PxcZ96SrcQt3aR9J0eYrS26sB7oXrs0aY3HOuN9Mp2bssoBEaZxh0WUVcEF4UUu1jOXI410-JRFDHKFSVz6gnJYSuCbNZK10TdS9Yli/s1600/000-415_Certs_Page_03.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZdZcdzfWz5dkWSYtcSHj57hXlYPirSXRq0K3Z2PxcZ96SrcQt3aR9J0eYrS26sB7oXrs0aY3HOuN9Mp2bssoBEaZxh0WUVcEF4UUu1jOXI410-JRFDHKFSVz6gnJYSuCbNZK10TdS9Yli/s400/000-415_Certs_Page_03.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922913737939522" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-62511083370717548302010-04-24T21:09:00.002-07:002010-04-24T21:10:18.928-07:00DataStage Certifications Dumps Page-4<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjadBoKKbkYbyj0Fp2QhOLKS480S_3LfZGzEp0aWvktrKddTpDscAjD0_4eECpbQGMkbj9OEyVvtlCwKvANYB0lQQZtoNuiOwAGEZgEnA4ot1FqrD0jdhUfJ-HorCv982-gHKpi165E6A9G/s1600/000-415_Certs_Page_04.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjadBoKKbkYbyj0Fp2QhOLKS480S_3LfZGzEp0aWvktrKddTpDscAjD0_4eECpbQGMkbj9OEyVvtlCwKvANYB0lQQZtoNuiOwAGEZgEnA4ot1FqrD0jdhUfJ-HorCv982-gHKpi165E6A9G/s400/000-415_Certs_Page_04.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922588748794690" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-69794874980216657802010-04-24T21:09:00.001-07:002010-04-24T21:09:52.233-07:00DataStage Certifications Dumps Page-5<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhALxP-vM6mfHElUsTvBctbJ7d_3-hpToVf3G4XHADmUz13v5N1Wrpb4jamxw_vBG9H5k4mLeieHrhu4lCoTs_kBmmBjFP0JD1LH2MRbmYoeDnve6HC2Pvlp_uovLD-ZLHPvUvliUdqhj5z/s1600/000-415_Certs_Page_05.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhALxP-vM6mfHElUsTvBctbJ7d_3-hpToVf3G4XHADmUz13v5N1Wrpb4jamxw_vBG9H5k4mLeieHrhu4lCoTs_kBmmBjFP0JD1LH2MRbmYoeDnve6HC2Pvlp_uovLD-ZLHPvUvliUdqhj5z/s400/000-415_Certs_Page_05.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922473496162962" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-14488818458613730882010-04-24T21:08:00.002-07:002010-04-24T21:09:17.715-07:00DataStage Certifications Dumps Page-6<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSBy2gN5aCHw0KS53otnwZ6miPT4RUgqEZKar6G6wi4JtwOhMNn_YqbUQBsVS-W_HnPdrzGBQzjAIMPx_B73SoXnWSnMaDJbNaZK0gT1uPVzHDeO46pwf6yPzkD0nKxmLslwret5DWKaal/s1600/000-415_Certs_Page_06.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSBy2gN5aCHw0KS53otnwZ6miPT4RUgqEZKar6G6wi4JtwOhMNn_YqbUQBsVS-W_HnPdrzGBQzjAIMPx_B73SoXnWSnMaDJbNaZK0gT1uPVzHDeO46pwf6yPzkD0nKxmLslwret5DWKaal/s400/000-415_Certs_Page_06.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922333820745842" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-31498358860793449322010-04-24T21:08:00.001-07:002010-04-24T21:08:49.418-07:00DataStage Certifications Dumps Page-7<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhs_wvtTXgaQR1EjdVHSQxALM-TDtqo_43rkCsxBSR03MGlnuk-y-8kqR7jx86owKQSUgBkEgg9WBiZ8Xv0Vlu_edtCc0c-2gcLhsGvyoSqSDx4cLRU7vivczGfnRe9X9HSeu3QHAWL0JcD/s1600/000-415_Certs_Page_07.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhs_wvtTXgaQR1EjdVHSQxALM-TDtqo_43rkCsxBSR03MGlnuk-y-8kqR7jx86owKQSUgBkEgg9WBiZ8Xv0Vlu_edtCc0c-2gcLhsGvyoSqSDx4cLRU7vivczGfnRe9X9HSeu3QHAWL0JcD/s400/000-415_Certs_Page_07.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922168752033026" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com1tag:blogger.com,1999:blog-8389512091667925528.post-34123431179335663102010-04-24T21:07:00.000-07:002010-04-24T21:08:00.587-07:00DataStage Certifications Dumps Page-8<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibHwoWNvFegxe5nhBwcAH3e0Mi6B1lrR7U9D3gQ3y9MD8g9yVQ7h4H8s98auPGlNu7x18kNfsxRU38KgKMBIVJDZ3RKGST6NHw42zKu3ZIyC2gOaE2mSI4Tif2I1BxXFlbmEVW_oH8B9tC/s1600/000-415_Certs_Page_08.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibHwoWNvFegxe5nhBwcAH3e0Mi6B1lrR7U9D3gQ3y9MD8g9yVQ7h4H8s98auPGlNu7x18kNfsxRU38KgKMBIVJDZ3RKGST6NHw42zKu3ZIyC2gOaE2mSI4Tif2I1BxXFlbmEVW_oH8B9tC/s400/000-415_Certs_Page_08.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463921957767488946" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-29196446212429655312010-04-24T21:06:00.000-07:002010-04-24T21:07:25.747-07:00DataStage Certifications Dumps Page-9<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqWBM_NdOrLb8wT6H3UdHDFZ3LwSXv2QMb2I7rivUNDBea8tWtIun3Hkw4VgFK0onrrCle92Lq_lUeIkt8CQeSYUnJum-aGKWoSwHPFbN3n4lk0MHBu2FFWw8eOPVJhlxE5cC3k47WAtrK/s1600/000-415_Certs_Page_09.jpg"><img style="cursor:pointer; cursor:hand;width: 306px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqWBM_NdOrLb8wT6H3UdHDFZ3LwSXv2QMb2I7rivUNDBea8tWtIun3Hkw4VgFK0onrrCle92Lq_lUeIkt8CQeSYUnJum-aGKWoSwHPFbN3n4lk0MHBu2FFWw8eOPVJhlxE5cC3k47WAtrK/s400/000-415_Certs_Page_09.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463921843369323442" /></a>Tutorial 
DataStage Best Practices

This section provides an overview of recommended standard practices. The recommendations fall into the following categories:

- Standards
- Development guidelines
- Component usage
- DataStage data types
- Partitioning data
- Collecting data
- Sorting
- Stage-specific guidelines

Standards

It is important to establish and follow consistent standards in:

- Directory structures for installation and application support directories (see the sketch below).
- Naming conventions, especially for DataStage project categories, stage names, and links.

All DataStage jobs should be documented with the Short Description field, as well as Annotation fields.

It is the DataStage developer's responsibility to make personal backups of their work on their local workstation, using DataStage's DSX export capability. This can also be used for integration with source code control systems.

Note: A detailed discussion of these practices is beyond the scope of this Redbooks publication; speak to your Account Executive to engage IBM IPS Services.
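As an illustration of the first point, a standard directory structure can be scripted once and reused across projects. The layout below is one possibility, not a product default:

    # Source files, target files, intermediate work files, scripts, and logs
    # live outside the DataStage project directory
    mkdir -p /etl/proj_a/src /etl/proj_a/target /etl/proj_a/work \
             /etl/proj_a/scripts /etl/proj_a/logs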
Development guidelines

Modular development techniques should be used to maximize re-use of DataStage jobs and components:

- Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously.
- A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc.) and the directories where files are stored. To ease re-use, these standard parameters and settings should be made part of a Designer Job Parameter Set.
- Create a standard directory structure outside of the DataStage project directory for source and target files, intermediate work files, and so forth.
- Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic.
- DataStage template jobs should be created with: standard parameters, such as source and target file paths and database login properties; environment variables and their default settings; and annotation blocks.
- Job parameters should always be used for file paths, file names, and database login settings.
- Standardized error handling routines should be followed to capture errors and rejects.

Component usage

The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere DataStage Enterprise Edition:

- Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job. BASIC routines are appropriate only for job control sequences.
- Always use parallel Data Sets for intermediate storage between jobs, unless that specific data also needs to be shared with other applications.
- Use the Copy stage as a placeholder for iterative design, and to facilitate default type conversions.
- Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages.
- Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.

DataStage data types

The following guidelines apply to DataStage data types:

- Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data types. If possible, import table definitions for source databases using the Orchestrate Schema Importer (orchdbutil) utility.
- Leverage default type conversions using the Copy stage or across the Output mapping tab of other stages.

Partitioning data

In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server engine chooses the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning will generally give correct results, it might not give optimized performance. As the job developer, you have visibility into requirements, and can optimize within a job and across job flows. Given the numerous options for keyless and keyed partitioning, the following objectives form a methodology for assigning partitioning.

Objective 1
Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead.
This ensures that the processing workload is evenly balanced, minimizing overall run time.

Objective 2
The partitioning method must match the business requirements and stage functional requirements, assigning related records to the same partition if required. Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partitioning method. This includes, but is not limited to, the Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records. Note that in satisfying this objective, it might not be possible to choose a partitioning method that gives an almost equal number of rows in each partition.

Objective 3
Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or grid configurations. Re-partitioning data in a cluster or grid configuration incurs the overhead of network transport.

Objective 4
The partitioning method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance.

Using the above objectives as a guide, the following methodology can be applied:

a. Start with Auto partitioning (the default).
b. Specify Hash partitioning for stages that require groups of related records:
   - Specify only the key column(s) that are necessary for correct grouping, as long as the number of unique values is sufficient.
   - Use Modulus partitioning if the grouping is on a single integer key column.
   - Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (the Range Map can be reused).
c. If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions. This is especially useful if the input Data Set is highly skewed or sequential.
d. Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning:
   - Be mindful that Same partitioning retains the degree of parallelism of the upstream stage.
   - Within a flow, examine upstream partitioning and sort order and attempt to preserve them for downstream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).

Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.

Collecting data

Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:

1. When output order does not matter, use the Auto collector (the default).
2. Consider how the input Data Set has been sorted:
   - When the input Data Set has been sorted in parallel, use the Sort Merge collector to produce a single, globally sorted stream of rows.
   - When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector might be more efficient.
3. Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been re-partitioned or reduced.

Sorting

Apply the following methodology when sorting in an IBM InfoSphere DataStage Enterprise Edition data flow:

1. Start with a link sort.
2. Specify only the necessary key column(s).
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a link sort for options that are not available on a link sort:
   - "Restrict Memory Usage": if you want more memory available for the sort, you can only set that via the Sort stage, not on a sort link. The environment variable $APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per partition (see the sketch below).
   - Sort Key Mode, Create Cluster Key Change Column, Create Key Change Column, Output Statistics.
   - Always specify the "DataStage" Sort Utility for stand-alone Sort stages.
   - Use "Sort Key Mode = Don't Sort (Previously Sorted)" to resort a sub-grouping of a previously-sorted input Data Set.
5. Be aware of automatically-inserted sorts. Set $APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order.
6. Minimize the use of sorts within a job flow.
7. To generate a single, sequentially ordered result set, use a parallel Sort and a Sort Merge collector.
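The two environment variables mentioned above are normally set in the job's or project's environment; they are shown here as shell exports for illustration, with an arbitrary memory value:

    # Allow each partition's sort 512 MB instead of the default
    export APT_TSORT_STRESS_BLOCKSIZE=512

    # Verify, but do not establish, the required sort order at run time
    export APT_SORT_INSERTION_CHECK_ONLY=1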
Stage-specific guidelines

Transformer
Take precautions when using expressions or derivations on nullable columns within the parallel Transformer:
- Always convert nullable columns to in-band values before using them in an expression or derivation.
- Always place a reject link on a parallel Transformer to capture/audit possible rejects.

Lookup
The Lookup stage is most appropriate when the reference data is small enough to fit into available shared memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage. Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller (for example 1:100 or more) than the number of reference rows, or to exception processing.

Join
Be particularly careful to observe the nullability properties of input links to any form of outer join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.

Aggregator
Use the Hash method Aggregator only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.

Database stages
The following guidelines apply to database stages:
- Where possible, use the Connector stages or native parallel database stages for maximum performance and scalability.
- The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel stage is not available for the given source or target database.
- When using Oracle, DB2, or Informix databases, use the Orchestrate Schema Importer (orchdbutil) to properly import design metadata.
- Take care to observe the data type mappings.
- If possible, use a SQL WHERE clause to limit the number of rows sent to a DataStage job.
- Avoid the use of database stored procedures on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules natively using DataStage parallel components.

DataStage Parallel Processing

The following figure represents one of the simplest jobs you could have: a data source, a Transformer (conversion) stage, and the data target. The links between the stages represent the flow of data into or out of a stage. In a parallel job, each stage would normally (but not always) correspond to a process. You can have multiple instances of each process to run on the available processors in your system.

[Figure: a simple three-stage job — source, Transformer, target]

A parallel DataStage job incorporates two basic types of parallel processing: pipeline and partitioning. Both of these methods are used at runtime by the Information Server engine to execute the simple job shown above. To the DataStage developer, this job appears the same on the Designer canvas, but you can optimize it through advanced properties.

Pipeline parallelism
In the following example, all stages run concurrently, even in a single-node configuration. As data is read from the Oracle source, it is passed to the Transformer stage for transformation, and from there to the DB2 target. Instead of waiting for all source data to be read, as soon as the source data stream starts to produce rows, these are passed to the subsequent stages. This method is called pipeline parallelism, and all three stages in our example operate simultaneously regardless of the degree of parallelism of the configuration file.
The Information Server engine always executes jobs with pipeline parallelism. If you ran the example job on a system with multiple processors, the stage reading would start on one processor and begin filling a pipeline with the data it had read. The Transformer stage would start running as soon as there was data in the pipeline, process it, and start filling another pipeline. The stage writing the transformed data to the target database would similarly start writing as soon as there was data available. Thus all three stages operate simultaneously.

Partition parallelism
When large volumes of data are involved, you can use the power of parallel processing to your best advantage by partitioning the data into a number of separate sets, with each partition being handled by a separate instance of the job stages. Partition parallelism is accomplished at runtime, instead of through the manual process that traditional systems would require. The DataStage developer only needs to specify the algorithm to partition the data, not the degree of parallelism or where the job will execute. Using partition parallelism, the same job is effectively run simultaneously by several processors, each handling a separate subset of the total data. At the end of the job the data partitions can be collected back together again and written to a single data source, as shown in the following figure.

[Figure: partition parallelism — data split across partitions and collected at the end]

Attention: You do not need multiple processors to run in parallel. A single processor is capable of running multiple concurrent processes.

Combining pipeline and partition parallelism
The Information Server engine combines pipeline and partition parallel processing to achieve even greater performance gains. In this scenario stages process partitioned data and fill pipelines, so that the next stage can start on a partition before the previous stage has finished with it, as shown in the following figure.

[Figure: combined pipeline and partition parallelism]

In some circumstances you might want to actually re-partition your data between stages.
This could happen, for example, where you want to group data differently. Suppose that you have initially processed data based on customer last name, but now you want to process data grouped by zip code. You will have to re-partition to ensure that all customers sharing the same zip code are in the same group. DataStage allows you to re-partition between stages as and when necessary. With the Information Server engine, re-partitioning happens in memory between stages, instead of writing to disk.
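An ordinary UNIX pipeline is a fair mental model for pipeline parallelism: all processes start at once and rows flow downstream as soon as they are produced. The three scripts below are hypothetical stand-ins for stages:

    # extract, transform, and load all run concurrently, connected by pipes,
    # much as DataStage stages are connected by in-memory virtual data sets
    ./extract.sh | ./transform.sh | ./load.sh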
DataStage Jobs

An IBM InfoSphere DataStage job consists of individual stages linked together, which describe the flow of data from a data source to a data target. A stage usually has at least one data input and/or one data output. However, some stages can accept more than one data input, and output to more than one stage. Each stage has a set of predefined and editable properties that tell it how to perform or process data. Properties might include the file name for the Sequential File stage, the columns to sort, the transformations to perform, and the database table name for the DB2 stage. These properties are viewed or edited using stage editors. Stages are added to a job and linked together using the Designer.

[Figure: a selection of stages and their icons]

Stages and links can be grouped in a shared container. Instances of the shared container can then be reused in different parallel jobs. You can also define a local container within a job; this groups stages and links into a single unit, but can only be used within the job in which it is defined.

The different types of jobs have different stage types. The stages that are available in the Designer depend on the type of job that is currently open in the Designer. Parallel job stages are organized into the following groups on the Designer palette:

- General includes stages such as Container and Link.
- Data Quality includes stages such as Investigate, Standardize, Reference Match, and Survive.
- Database includes stages such as Classic Federation, DB2 UDB, DB2 UDB/Enterprise, Oracle, Sybase, SQL Server, Teradata, Distributed Transaction, and ODBC.
- Development/Debug includes stages such as Peek, Sample, Head, Tail, and Row Generator.
- File includes stages such as Complex Flat File, Data Set, Lookup File Set, and Sequential File.
- Processing includes stages such as Aggregator, Copy, FTP, Funnel, Join, Lookup, Merge, Remove Duplicates, Slowly Changing Dimension, Surrogate Key Generator, Sort, and Transformer.
- Real Time includes stages such as Web Services Transformer, WebSphere MQ, and Web Services Client.
- Restructure includes stages such as Column Export and Column Import.

DataStage Data Transformations

Data transformation and movement is the process by which source data is selected, converted, and mapped to the format required by target systems. The process manipulates data to bring it into compliance with business, domain, and integrity rules and with other data in the target environment. Transformation can take some of the following forms (a small worked example of the first follows this list):

- Aggregation: consolidating or summarizing data values into a single value. Collecting daily sales data to be aggregated to the weekly level is a common example.
- Basic conversion: ensuring that data types are correctly converted and mapped from source to target columns.
- Cleansing: resolving inconsistencies and fixing anomalies in source data.
- Derivation: transforming data from multiple sources by using a complex business rule or algorithm.
- Enrichment: combining data from internal or external sources to provide additional meaning to the data.
- Normalizing: reducing the amount of redundant and potentially duplicated data.
- Combining: bringing together data from multiple sources via parallel Lookup, Join, or Merge operations.
- Pivoting: converting records in an input stream to many records in the appropriate table in the data warehouse or data mart.
- Sorting: grouping related records and sequencing data based on data or string values.
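To make the aggregation example concrete, here is the daily-to-weekly sales rollup done with awk outside DataStage; sales.csv, with columns date, week, and amount plus a header row, is a hypothetical input:

    # Sum the amount column (3) per week key (2), skipping the header row
    awk -F',' 'NR > 1 {total[$2] += $3}
               END {for (w in total) print w "," total[w]}' sales.csv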
DataStage Main Functions

In its simplest form, IBM InfoSphere DataStage performs data transformation and movement from source systems to target systems in batch and in real time. The data sources might include indexed files, sequential files, relational databases, archives, external data sources, enterprise applications, and message queues.

DataStage manages data that arrives in real time as well as data received on a periodic or scheduled basis. It enables companies to solve large-scale business problems with high-performance processing of massive data volumes. By leveraging the parallel processing capabilities of multiprocessor hardware platforms, DataStage can scale to satisfy the demands of ever-growing data volumes, stringent real-time requirements, and ever-shrinking batch windows.

Leveraging the combined suite of IBM Information Server, DataStage can simplify the development of authoritative master data by showing where and how information is stored across source systems. DataStage can also consolidate disparate data into a single, reliable record, cleanse and standardize information, remove duplicates, and link records together across systems. This master record can be loaded into operational data stores, data warehouses, or master data applications such as IBM MDM using IBM InfoSphere DataStage.

IBM InfoSphere DataStage delivers four core capabilities:

- Connectivity to a wide range of mainframe, legacy, and enterprise applications, databases, file formats, and external information sources.
- A prebuilt library of more than 300 functions, including data validation rules and very complex transformations.
- Maximum throughput using a parallel, high-performance processing architecture.
- Enterprise-class capabilities for development, deployment, maintenance, and high availability. It leverages metadata for analysis and maintenance, and operates in batch, real time, or as a Web service.

IBM InfoSphere DataStage is an integral part of the information integration process.

DataStage Execution Flow

When you execute a job, the generated OSH and the contents of the configuration file ($APT_CONFIG_FILE) are used to compose a "score". This is similar to a SQL query optimization plan.

At runtime, IBM InfoSphere DataStage identifies the degree of parallelism and node assignments for each operator, and inserts sorts and partitioners as needed to ensure correct results. It also defines the connection topology (virtual data sets/links) between adjacent operators/stages, and inserts buffer operators to prevent deadlocks (for example, in fork-joins). It also defines the number of actual OS processes.
<br /><br />Multiple operators/stages are combined within a single OS process as appropriate, to improve performance and optimize resource requirements.<br /><br />
The job score is used to fork processes with communication interconnects for data, messages, and control. Processing begins after the job score and processes are created. Job processing ends when the last row of data is processed by the final operator, when a fatal error is encountered by any operator, or when the job is halted by DataStage Job Control or by human intervention such as a DataStage Director STOP.<br /><br />
Job scores are divided into two sections: data sets (partitioning and collecting) and operators (node/operator mapping). Both sections identify sequential or parallel processing.<br /><br />
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDPLLha1AhmIaY3Pczxpl53l51LYcQ0Cp0raf_XIdCcn4ES5MtvAIMaWhZg8SnbFf9KM6OKDdccnqwl_DfCTyoVZShoBkh_LmL3LgvffwzgnAKxzVmstgzkvEzdPT7o-xgXPm29nd8fLse/s1600-h/1.jpg"><img style="cursor:pointer; cursor:hand;width: 400px; height: 267px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDPLLha1AhmIaY3Pczxpl53l51LYcQ0Cp0raf_XIdCcn4ES5MtvAIMaWhZg8SnbFf9KM6OKDdccnqwl_DfCTyoVZShoBkh_LmL3LgvffwzgnAKxzVmstgzkvEzdPT7o-xgXPm29nd8fLse/s400/1.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5450004084647047938" /></a><br />Figure 1-6: Parallel execution flow (conductor, section leaders, and players)<br /><br />
The execution (orchestra) manages control and message flow across processes and consists of the conductor node and one or more processing nodes, as shown in Figure 1-6. Actual data flows from player to player; the conductor and section leaders are used only to control process execution through the control and message channels.<br /><br />
_ The Conductor is the initial framework process. It creates the Section Leader (SL) processes (one per node), consolidates messages to the DataStage log, and manages orderly shutdown. The conductor node hosts the start-up process, and the Conductor also communicates with the players.<br /><br />
_ A Section Leader is a process that forks player processes (one per stage) and manages up/down communications. SLs communicate only with the conductor and their player processes. For a given parallel configuration file, one section leader is started for each logical node.<br /><br />
_ Players are the actual processes associated with the stages. A player sends its stderr and stdout to the SL, establishes connections to other players for data flow, and cleans up on completion. Each player has to be able to communicate with every other player, and there are separate communication channels (pathways) for control, errors, messages, and data. The data channel does not go through the section leader or conductor, as that would limit scalability; data flows directly from upstream operator to downstream operator.<br /><br />
Note: You can direct the score to the job log by setting $APT_DUMP_SCORE. To identify the score dump, look for "main program: This step....".
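<br /><br />For reference, a score dump has roughly the following shape. This is an illustrative sketch only (the stage names, node names, and exact wording are assumptions, and the formatting varies by release); it shows the two sections described above, data sets and operators, and marks each as sequential or parallel:<br />
<pre>
main_program: This step has 1 dataset:
ds0: {op0[1p] (sequential Row_Generator_0)
      eAny=>eCollectAny
      op1[2p] (parallel Peek_1)}
It has 2 operators:
op0[1p] {(sequential Row_Generator_0)
    on nodes (
      node1[op0,p0]
    )}
op1[2p] {(parallel Peek_1)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
It runs 3 processes on 2 nodes.
</pre>
Here op0 runs as a single sequential process, while op1 runs two player processes, one on each logical node of a two-node configuration such as the one sketched earlier.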
<br /><br /><span style="font-weight:bold;">DataStage OSH Script</span><br /><br />
The IBM InfoSphere DataStage and QualityStage Designer client creates IBM InfoSphere DataStage jobs that are compiled into parallel job flows and reusable components that execute on the parallel Information Server engine. It allows you to use familiar graphical point-and-click techniques to develop job flows for extracting, cleansing, transforming, integrating, and loading data into target files, target systems, or packaged applications.<br /><br />
The Designer generates all the code: the OSH (Orchestrate SHell) script for the job, plus C++ code for any Transformer stages used. Briefly, the Designer performs the following tasks:<br />
_ Validates link requirements, mandatory stage options, transformer logic, and so on.<br />
_ Generates the OSH representation of the data flows and stages (representations of framework "operators").<br />
_ Generates transform code for each Transformer stage, which is compiled into C++ and then into corresponding native operators.<br />
_ Compiles reusable BuildOp stages, either from the Designer GUI or from the command line.<br /><br />
Here is a brief primer on OSH:<br />
_ Comment blocks introduce each operator; their order is determined by the order in which stages were added to the canvas.<br />
_ OSH uses the familiar syntax of the UNIX shell: an operator name, a schema, operator options (in "-name value" format), inputs (indicated by n<, where n is the input number), and outputs (indicated by n>, where n is the output number).<br />
_ For every operator, input and/or output data sets are numbered sequentially starting from zero.<br />
_ Virtual data sets (in-memory native representations of data links) are generated to connect operators.<br /><br />
Framework (Information Server engine) terms and DataStage terms are equivalent. The GUI frequently uses terms from both paradigms, and runtime messages use framework terminology because the framework engine is where execution occurs. The following list shows the equivalency between framework and DataStage terms:<br />
_ Schema corresponds to table definition<br />
_ Property corresponds to format<br />
_ Type corresponds to SQL type and length<br />
_ Virtual data set corresponds to link<br />
_ Record/field corresponds to row/column<br />
_ Operator corresponds to stage<br /><br />
Note: The actual execution order of operators is dictated by the input/output designators, not by their placement on the diagram. The data sets that connect the OSH operators are "virtual data sets", that is, in-memory data flows. Link names are used in data set names; it is therefore good practice to give links meaningful names.
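<br /><br />To make the primer concrete, here is a hedged sketch of the kind of OSH fragment the Designer might emit for a trivial two-stage job (a Sequential File stage feeding a Copy stage). The stage names, schema, file path, and exact layout are illustrative assumptions rather than verbatim generated code:<br />
<pre>
# Illustrative sketch of Designer-generated OSH; names and layout are assumed.
#################################################################
#### STAGE: Sequential_File_0
## Operator
import
## Operator options
-schema record ( cust_id: int32; cust_name: string[max=30]; )
-file '/data/customers.txt'
## General options
[ident('Sequential_File_0')]
## Outputs
0> 'Sequential_File_0:lnk_customers.v'
;
#################################################################
#### STAGE: Copy_1
## Operator
copy
## General options
[ident('Copy_1')]
## Inputs
0< 'Sequential_File_0:lnk_customers.v'
## Outputs
0> 'Copy_1:lnk_out.v'
;
</pre>
Note how a comment block introduces each operator, options use the "-name value" format, inputs and outputs are numbered from zero, and the virtual data set connecting the two stages (the .v name) is derived from the link name lnk_customers; this is why meaningful link names make the generated OSH, the score, and the log much easier to read.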