DATASTAGE TUTORIAL, GUIDES AND TRAINING
This blog aims at providing free tutorials, guides and other study materials for IBM WebSphere DataStage.

DataStage Overview

DataStage Certification Exam 000-415 Dumps Special
[Image: scanned certification page — Checkout Pages 1-10: http://datastage-tutorials.blogspot.com/2010/04/datastage-certification-dumps-pages-1.html]

IBM InfoSphere DataStage integrates data across multiple sources and target applications, handling high volumes of data.

IBM InfoSphere DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and the IBM InfoSphere family. It uses a graphical notation to construct data integration solutions and is available in several editions, such as the Server Edition and the Enterprise Edition. Like several other IBM products (e.g. IBM WebSphere Portal from the IBM Lotus family), DataStage belongs to a different brand family than its name suggests.

[Embedded presentation: IBM InfoSphere DataStage Data Flow and Job Design — http://www.slideshare.net/divjeev/ibm-info-sphere-datastage-data-flow-and-job-design]

DataStage was conceived at VMark, a spin-off from Prime Computer that developed two notable products: the UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in the first half of 1996. Peter Weyman was VMark VP of Strategy and identified the ETL market as an opportunity.
He appointed Lee Scheffler as the architect and conceived the product brand name "Stage" to signify modularity and component-orientation. This tag was used to name DataStage and was subsequently used in the related products QualityStage, ProfileStage, MetaStage and AuditStage. Lee Scheffler presented the DataStage product overview to the board of VMark in June 1996 and it was approved for development. The product was in alpha testing in October 1996, beta testing in November, and became generally available in January 1997.

VMark acquired UniData in October 1997 and renamed itself Ardent Software. In 1999 Ardent Software was acquired by Informix, the database software vendor. In April 2001 IBM acquired Informix and took just the database business, leaving the data integration tools to be spun off as an independent software company called Ascential Software. In November 2001, Ascential Software Corp. of Westboro, Mass. acquired privately held Torrent Systems Inc. of Cambridge, Mass. for $46 million in cash, and stated a commitment to integrate Orchestrate's parallel processing capabilities directly into the DataStageXE platform. In March 2005 IBM acquired Ascential Software and made DataStage part of the WebSphere family as WebSphere DataStage. In 2006 the product was released as part of the IBM Information Server under the Information Management family, but was still known as WebSphere DataStage. In 2008 the suite was renamed InfoSphere Information Server and the product was renamed InfoSphere DataStage.

Architecture

[Figure: DataStage architecture]

DataStage integrates data on demand with a high-performance parallel framework, extended metadata management, and enterprise connectivity. It:

* Supports the collection, integration and transformation of large volumes of data, with data structures ranging from simple to highly complex.
* Offers a scalable platform that enables companies to solve large-scale business problems through high-performance processing of massive data volumes.
* Supports real-time data integration.
* Enables developers to maximize speed, flexibility and effectiveness in building, deploying, updating and managing their data integration infrastructure.
* Provides complete connectivity between any data source and any application.

What is IBM WebSphere DataStage?

• Design jobs for Extraction, Transformation, and Loading (ETL)
• An ideal tool for data integration projects, such as data warehouses, data marts, and system migrations
• Import, export, create, and manage metadata for use within jobs
• Schedule, run, and monitor jobs, all within DataStage (or from the command line — see the sketch below)
• Administer your DataStage development and execution environments
• Create batch (controlling) jobs
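Jobs can also be run and monitored outside the GUI with the dsjob command-line utility that ships with the DataStage server. A minimal sketch — the project name dstage1 and job name LoadCustomerDim are hypothetical:

    # Run the job and wait for it to finish
    dsjob -run -mode NORMAL -wait dstage1 LoadCustomerDim

    # Inspect its status and a summary of its log afterwards
    dsjob -jobinfo dstage1 LoadCustomerDim
    dsjob -logsum dstage1 LoadCustomerDim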
DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses. It provides the tools you need to build, manage, and expand them. With DataStage, you can build solutions faster and give users access to the data and reports they need.

With DataStage you can:
• Design the jobs that extract, integrate, aggregate, load, and transform the data for your data warehouse or data mart.
• Create and reuse metadata and job components.
• Run, monitor, and schedule these jobs.
• Administer your development and execution environments.

[Figure: Information Server backbone]

Linked Blogs

I am going to link every blog that links to my blog. So, if you placed a link to my blog on your blog, I will place a link to your blog too. Please leave your blog link in the comment box and I will add your blog to the list.

Blog links:
http://maybe-she-does.blogspot.com/
http://strobist.blogspot.com/

Leave a link to get into this list.

DataStage Concepts

[Figure: DataStage architecture — clients on top, engines below]

The top half of the figure displays the clients. Below them are two engines: the Server engine, which runs DataStage server jobs, and the parallel engine, which runs parallel jobs.
Our focus in this course is on parallel jobs.

The DataStage client components are:

Administrator
Administers DataStage projects and conducts housekeeping on the server.

Designer
Creates DataStage jobs that are compiled into executable programs.

Director
Used to run and monitor DataStage jobs.

The Repository is used to store DataStage objects. The Repository is shared with other applications in the suite.

DataStage Certification

Certification dump pages (exam 000-415):
Page 1 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-1.html
Page 2 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-2.html
Page 3 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-3.html
Page 4 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-4.html
Page 5 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-5.html
Page 6 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-6.html
Page 7 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-7.html
Page 8 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-8.html
Page 9 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-9.html
Page 10 — http://datastage-tutorials.blogspot.com/2010/04/datastage-certifications-dumps-page-10.html

DataStage Tutorial

Downloads (easy-share.com):
DataStage Overview — http://www.easy-share.com/1907916970/IBM InfoSphere DataStage Overview.pdf
DataStage Stages — http://www.easy-share.com/1907916978/IBM InfoSphere DataStage Stages.pdf
DataStage Parallel Jobs — http://www.easy-share.com/1907918420/Data Stage Tutorial.pdf
DataStage Basics Part-1 — http://www.easy-share.com/1908309865/DataStage V8 Basics Part-1.pdf
DataStage Basics Part-2 — http://www.easy-share.com/1908309870/DataStage V8 Basics Part-2.pdf
DataStage Advanced-1 — http://www.easy-share.com/1908309998/DataStage V8 Special Topics Part-1.pdf
DataStage Advanced-2 — http://www.easy-share.com/1908310006/DataStage V8 Special Topics Part-2.pdf
DataStage Lab Part-1 — http://www.easy-share.com/1908309969/DataStage Lab Exercises Part-1.pdf
target="_blank">DataStage Lab Part-1</a><br/></u><br /><u><a href="http://www.easy-share.com/1908309979/DataStage Lab Exercises Part-2.pdf" target="_blank">DataStage Lab Part-2</a><br/></u>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-4312471951779759942011-03-27T06:46:00.001-07:002011-03-27T07:04:37.507-07:00Datastage interview questions1)How can we achieve parallelism ?<br /><br />The degree of parallelism is achieved by configuring the <br />multiple nodes in the config file. In the config files we can specify multiple nodes.<br /><br />2)What are Stage Variables, Derivations and Constants?<br /><br /> Stage Variable - An intermediate processing variable that retains value during read and doesnt pass the value into target column.<br />Derivation - Expression that specifies value to be passed on to the target column.<br />Constant - Conditions that are either true or false that specifies flow of data with a link. <br /><br />3)Compare and Contrast ODBC and Plug-In stages? <br /><br />ODBC : a) Poor Performance.<br />b) Can be used for Variety of Databases.<br />c) Can handle Stored Procedures.<br /><br />Plug-In: a) Good Performance.<br />b) Database specific.(Only one database)<br />c) Cannot handle Stored Procedures. <br /><br />4)How to run a Shell Script within the scope of a Data stage job? <br /><br />select the EDIT tab in the toolbar-> choose job properties-> select the job parameters->choose the Before/ After routines ->select the EXCESH command<br /><br />5)How do you merge two files in DS?<br /><br />Either used Copy command as a Before-job subroutine if the metadata of the 2 files are same or created a job to concatenate the 2 files into one if the metadata is different.<br /><br />6)How can we pass parameters from one job to another job by using command line prompt?<br /><br />We can pass parameter to a job using two ways .. using dsjob- command line or from a sequencer.<br />Other way would be -<br />You configure single parameter set ( version 8.0 onwards) and use the same in both the jobs so that they share the same set of parameters.<br /><br />7)When we are extracting the flatfiles, What are the basic required validations?<br /><br />Following are some common validations performed:<br />a) Check for blank lines and remove them.<br />b) Check the number of column in each row of the file.<br />c) If there is a trailer line in the flat file containing additional information like total number of records,then a cross check is performed to check if the number of records specified in the trailer and the actual number of records are same.<br />d) Check if a column contains blank value (If it is expected to have values).<br /><br />8)How do you do Usage analysis in datastage ?<br /><br />1. If u want to know some job is a part of a sequence, then in the Manager right click the job and select Usage Analysis. It will show all the jobs dependents.<br />2. To find how many jobs are using a particular table.<br />3. To find how many jobs are using a particular routine.<br />Like this, u can find all the dependents of a particular object.<br />Its like nested. U can move forward and backward and can see all the dependents.<br /><br /><br />9)Types of Parallel Processing? <br /><br />Parallel Processing is broadly classified into 2 types.<br />a) SMP - Symmetrical Multi Processing.<br />b) MPP - Massive Parallel Processing. 
10) What do you know about MetaStage?

MetaStage is used to handle metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling. These data definitions are stored in the repository and can be accessed through MetaStage.

11) What is the difference between a hashed file and a sequential file?

Searching for a record is very fast in a hashed file: given the hash key, the record's address can be computed directly. A sequential file must be searched record by record. Duplicate records can also be removed based on the hash key in a hashed file, which is not possible in a sequential file.

12) If I add a new environment variable in Windows, how can I access it in DataStage?

You can view, add, and access environment variables from the job properties in the Designer.

13) What is the difference between the Lookup File Set stage and the Lookup stage?

The Lookup stage performs the lookup of reference data against source data, whereas the Lookup File Set stage creates the reference data set that the Lookup stage uses to perform the lookup operation.

14) What is the difference between symmetric multiprocessing and massively parallel processing?

Symmetric Multiprocessing (SMP) — some hardware resources may be shared among processors. The processors communicate via shared memory and run a single operating system.

Cluster or Massively Parallel Processing (MPP) — known as "shared nothing", in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed; the processors have their own operating systems and communicate via a high-speed network.
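Question 7's flat-file checks can be prototyped outside DataStage with ordinary shell tools. A sketch, assuming a pipe-delimited file customers.txt with twelve columns and a one-line trailer holding the record count (all hypothetical):

    FILE=customers.txt
    EXPECTED_COLS=12

    # a) Remove blank lines
    grep -v '^$' "$FILE" > "$FILE.clean"

    # b) Report rows whose column count is wrong
    awk -F'|' -v n="$EXPECTED_COLS" 'NF != n {print NR ": " NF " columns"}' "$FILE.clean"

    # c) Compare the trailer (last line) with the actual number of data rows
    awk 'END {print "data rows:", NR-1, "- trailer line:", $0}' "$FILE.clean"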
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk7J01odJ9nrHV3N32-AEvutY-dQ3SqFTYHGJ3NbRGFk-ZhsnvatPNF0QkXXXPikmK6gQG6YYiep5FghAs_enLLwmsdpWjP8lSvGrqCdV9b3ZjFI1cLGigD8vMJNaRVFbLnHMzSxsUeQJh/s400/000-415_Certs_Page_02.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463923146161887842" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-74185853252499573712010-04-24T21:10:00.000-07:002010-04-24T21:11:32.808-07:00DataStage Certifications Dumps Page-3<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZdZcdzfWz5dkWSYtcSHj57hXlYPirSXRq0K3Z2PxcZ96SrcQt3aR9J0eYrS26sB7oXrs0aY3HOuN9Mp2bssoBEaZxh0WUVcEF4UUu1jOXI410-JRFDHKFSVz6gnJYSuCbNZK10TdS9Yli/s1600/000-415_Certs_Page_03.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZdZcdzfWz5dkWSYtcSHj57hXlYPirSXRq0K3Z2PxcZ96SrcQt3aR9J0eYrS26sB7oXrs0aY3HOuN9Mp2bssoBEaZxh0WUVcEF4UUu1jOXI410-JRFDHKFSVz6gnJYSuCbNZK10TdS9Yli/s400/000-415_Certs_Page_03.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922913737939522" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-62511083370717548302010-04-24T21:09:00.002-07:002010-04-24T21:10:18.928-07:00DataStage Certifications Dumps Page-4<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjadBoKKbkYbyj0Fp2QhOLKS480S_3LfZGzEp0aWvktrKddTpDscAjD0_4eECpbQGMkbj9OEyVvtlCwKvANYB0lQQZtoNuiOwAGEZgEnA4ot1FqrD0jdhUfJ-HorCv982-gHKpi165E6A9G/s1600/000-415_Certs_Page_04.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjadBoKKbkYbyj0Fp2QhOLKS480S_3LfZGzEp0aWvktrKddTpDscAjD0_4eECpbQGMkbj9OEyVvtlCwKvANYB0lQQZtoNuiOwAGEZgEnA4ot1FqrD0jdhUfJ-HorCv982-gHKpi165E6A9G/s400/000-415_Certs_Page_04.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922588748794690" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-69794874980216657802010-04-24T21:09:00.001-07:002010-04-24T21:09:52.233-07:00DataStage Certifications Dumps Page-5<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhALxP-vM6mfHElUsTvBctbJ7d_3-hpToVf3G4XHADmUz13v5N1Wrpb4jamxw_vBG9H5k4mLeieHrhu4lCoTs_kBmmBjFP0JD1LH2MRbmYoeDnve6HC2Pvlp_uovLD-ZLHPvUvliUdqhj5z/s1600/000-415_Certs_Page_05.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhALxP-vM6mfHElUsTvBctbJ7d_3-hpToVf3G4XHADmUz13v5N1Wrpb4jamxw_vBG9H5k4mLeieHrhu4lCoTs_kBmmBjFP0JD1LH2MRbmYoeDnve6HC2Pvlp_uovLD-ZLHPvUvliUdqhj5z/s400/000-415_Certs_Page_05.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922473496162962" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-14488818458613730882010-04-24T21:08:00.002-07:002010-04-24T21:09:17.715-07:00DataStage Certifications Dumps Page-6<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSBy2gN5aCHw0KS53otnwZ6miPT4RUgqEZKar6G6wi4JtwOhMNn_YqbUQBsVS-W_HnPdrzGBQzjAIMPx_B73SoXnWSnMaDJbNaZK0gT1uPVzHDeO46pwf6yPzkD0nKxmLslwret5DWKaal/s1600/000-415_Certs_Page_06.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSBy2gN5aCHw0KS53otnwZ6miPT4RUgqEZKar6G6wi4JtwOhMNn_YqbUQBsVS-W_HnPdrzGBQzjAIMPx_B73SoXnWSnMaDJbNaZK0gT1uPVzHDeO46pwf6yPzkD0nKxmLslwret5DWKaal/s400/000-415_Certs_Page_06.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922333820745842" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-31498358860793449322010-04-24T21:08:00.001-07:002010-04-24T21:08:49.418-07:00DataStage Certifications Dumps Page-7<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhs_wvtTXgaQR1EjdVHSQxALM-TDtqo_43rkCsxBSR03MGlnuk-y-8kqR7jx86owKQSUgBkEgg9WBiZ8Xv0Vlu_edtCc0c-2gcLhsGvyoSqSDx4cLRU7vivczGfnRe9X9HSeu3QHAWL0JcD/s1600/000-415_Certs_Page_07.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhs_wvtTXgaQR1EjdVHSQxALM-TDtqo_43rkCsxBSR03MGlnuk-y-8kqR7jx86owKQSUgBkEgg9WBiZ8Xv0Vlu_edtCc0c-2gcLhsGvyoSqSDx4cLRU7vivczGfnRe9X9HSeu3QHAWL0JcD/s400/000-415_Certs_Page_07.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463922168752033026" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com1tag:blogger.com,1999:blog-8389512091667925528.post-34123431179335663102010-04-24T21:07:00.000-07:002010-04-24T21:08:00.587-07:00DataStage Certifications Dumps Page-8<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibHwoWNvFegxe5nhBwcAH3e0Mi6B1lrR7U9D3gQ3y9MD8g9yVQ7h4H8s98auPGlNu7x18kNfsxRU38KgKMBIVJDZ3RKGST6NHw42zKu3ZIyC2gOaE2mSI4Tif2I1BxXFlbmEVW_oH8B9tC/s1600/000-415_Certs_Page_08.jpg"><img style="cursor:pointer; cursor:hand;width: 309px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibHwoWNvFegxe5nhBwcAH3e0Mi6B1lrR7U9D3gQ3y9MD8g9yVQ7h4H8s98auPGlNu7x18kNfsxRU38KgKMBIVJDZ3RKGST6NHw42zKu3ZIyC2gOaE2mSI4Tif2I1BxXFlbmEVW_oH8B9tC/s400/000-415_Certs_Page_08.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463921957767488946" /></a>Tutorial Blogshttp://www.blogger.com/profile/08779672772085427042noreply@blogger.com0tag:blogger.com,1999:blog-8389512091667925528.post-29196446212429655312010-04-24T21:06:00.000-07:002010-04-24T21:07:25.747-07:00DataStage Certifications Dumps Page-9<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqWBM_NdOrLb8wT6H3UdHDFZ3LwSXv2QMb2I7rivUNDBea8tWtIun3Hkw4VgFK0onrrCle92Lq_lUeIkt8CQeSYUnJum-aGKWoSwHPFbN3n4lk0MHBu2FFWw8eOPVJhlxE5cC3k47WAtrK/s1600/000-415_Certs_Page_09.jpg"><img style="cursor:pointer; cursor:hand;width: 306px; height: 400px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqWBM_NdOrLb8wT6H3UdHDFZ3LwSXv2QMb2I7rivUNDBea8tWtIun3Hkw4VgFK0onrrCle92Lq_lUeIkt8CQeSYUnJum-aGKWoSwHPFbN3n4lk0MHBu2FFWw8eOPVJhlxE5cC3k47WAtrK/s400/000-415_Certs_Page_09.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5463921843369323442" /></a>Tutorial 
DataStage Best Practices

This section provides an overview of recommended standard practices. The recommendations fall into the following categories:

- Standards
- Development guidelines
- Component usage
- DataStage data types
- Partitioning data
- Collecting data
- Sorting
- Stage-specific guidelines

Standards

It is important to establish and follow consistent standards in:

- Directory structures for installation and application support directories (see the sketch below).
- Naming conventions, especially for DataStage project categories, stage names, and links.

All DataStage jobs should be documented with the Short Description field, as well as Annotation fields.

It is the DataStage developer's responsibility to make personal backups of their work on their local workstation, using DataStage's DSX export capability. This can also be used for integration with source code control systems.

Note: A detailed discussion of these practices is beyond the scope of this Redbooks publication; speak to your Account Executive to engage IBM IPS Services.
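As an illustration of the first point, a standard directory structure can be scripted once and reused across projects. The layout below is one possibility, not a product default:

    # Source files, target files, intermediate work files, scripts, and logs
    # live outside the DataStage project directory
    mkdir -p /etl/proj_a/src /etl/proj_a/target /etl/proj_a/work \
             /etl/proj_a/scripts /etl/proj_a/logs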
Development guidelines

Modular development techniques should be used to maximize re-use of DataStage jobs and components:

- Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously.
- A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc.) and the directories where files are stored. To ease re-use, these standard parameters and settings should be made part of a Designer Job Parameter Set.
- Create a standard directory structure outside of the DataStage project directory for source and target files, intermediate work files, and so forth.
- Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic.
- DataStage template jobs should be created with: standard parameters, such as source and target file paths and database login properties; environment variables and their default settings; and annotation blocks.
- Job parameters should always be used for file paths, file names, and database login settings.
- Standardized error handling routines should be followed to capture errors and rejects.

Component usage

The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere DataStage Enterprise Edition:

- Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job. BASIC routines are appropriate only for job control sequences.
- Always use parallel Data Sets for intermediate storage between jobs, unless that specific data also needs to be shared with other applications.
- Use the Copy stage as a placeholder for iterative design, and to facilitate default type conversions.
- Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages.
- Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.

DataStage data types

The following guidelines apply to DataStage data types:

- Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data types. If possible, import table definitions for source databases using the Orchestrate Schema Importer (orchdbutil) utility.
- Leverage default type conversions using the Copy stage or across the Output mapping tab of other stages.

Partitioning data

In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server engine chooses the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning will generally give correct results, it might not give optimized performance. As the job developer, you have visibility into requirements, and can optimize within a job and across job flows. Given the numerous options for keyless and keyed partitioning, the following objectives form a methodology for assigning partitioning.

Objective 1
Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead.
This ensures that the processing workload is evenly balanced, minimizing overall run time.

Objective 2
The partitioning method must match the business requirements and stage functional requirements, assigning related records to the same partition if required. Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partitioning method. This includes, but is not limited to, the Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records. Note that in satisfying this objective, it might not be possible to choose a partitioning method that gives an almost equal number of rows in each partition.

Objective 3
Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or grid configurations. Re-partitioning data in a cluster or grid configuration incurs the overhead of network transport.

Objective 4
The partitioning method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance.

Using the above objectives as a guide, the following methodology can be applied:

a. Start with Auto partitioning (the default).
b. Specify Hash partitioning for stages that require groups of related records:
   - Specify only the key column(s) that are necessary for correct grouping, as long as the number of unique values is sufficient.
   - Use Modulus partitioning if the grouping is on a single integer key column.
   - Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (the Range Map can be reused).
c. If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions. This is especially useful if the input Data Set is highly skewed or sequential.
d. Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning:
   - Be mindful that Same partitioning retains the degree of parallelism of the upstream stage.
   - Within a flow, examine upstream partitioning and sort order and attempt to preserve them for downstream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).

Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.

Collecting data

Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:

1. When output order does not matter, use the Auto collector (the default).
2. Consider how the input Data Set has been sorted:
   - When the input Data Set has been sorted in parallel, use the Sort Merge collector to produce a single, globally sorted stream of rows.
   - When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector might be more efficient.
3. Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been re-partitioned or reduced.

Sorting

Apply the following methodology when sorting in an IBM InfoSphere DataStage Enterprise Edition data flow:

1. Start with a link sort.
2. Specify only the necessary key column(s).
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a link sort for options that are not available on a link sort:
   - "Restrict Memory Usage": if you want more memory available for the sort, you can only set that via the Sort stage, not on a sort link. The environment variable $APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per partition (see the sketch below).
   - Sort Key Mode, Create Cluster Key Change Column, Create Key Change Column, Output Statistics.
   - Always specify the "DataStage" Sort Utility for stand-alone Sort stages.
   - Use "Sort Key Mode = Don't Sort (Previously Sorted)" to resort a sub-grouping of a previously-sorted input Data Set.
5. Be aware of automatically-inserted sorts. Set $APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order.
6. Minimize the use of sorts within a job flow.
7. To generate a single, sequentially ordered result set, use a parallel Sort and a Sort Merge collector.
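The two environment variables mentioned above are normally set in the job's or project's environment; they are shown here as shell exports for illustration, with an arbitrary memory value:

    # Allow each partition's sort 512 MB instead of the default
    export APT_TSORT_STRESS_BLOCKSIZE=512

    # Verify, but do not establish, the required sort order at run time
    export APT_SORT_INSERTION_CHECK_ONLY=1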
Stage-specific guidelines

Transformer
Take precautions when using expressions or derivations on nullable columns within the parallel Transformer:
- Always convert nullable columns to in-band values before using them in an expression or derivation.
- Always place a reject link on a parallel Transformer to capture/audit possible rejects.

Lookup
The Lookup stage is most appropriate when the reference data is small enough to fit into available shared memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage. Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller (for example 1:100 or more) than the number of reference rows, or to exception processing.

Join
Be particularly careful to observe the nullability properties of input links to any form of outer join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.

Aggregator
Use the Hash method Aggregator only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.

Database stages
The following guidelines apply to database stages:
- Where possible, use the Connector stages or native parallel database stages for maximum performance and scalability.
- The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel stage is not available for the given source or target database.
- When using Oracle, DB2, or Informix databases, use the Orchestrate Schema Importer (orchdbutil) to properly import design metadata.
- Take care to observe the data type mappings.
- If possible, use a SQL WHERE clause to limit the number of rows sent to a DataStage job.
- Avoid the use of database stored procedures on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules natively using DataStage parallel components.

DataStage Parallel Processing

The following figure represents one of the simplest jobs you could have: a data source, a Transformer (conversion) stage, and the data target. The links between the stages represent the flow of data into or out of a stage. In a parallel job, each stage would normally (but not always) correspond to a process. You can have multiple instances of each process to run on the available processors in your system.

[Figure: a simple three-stage job — source, Transformer, target]

A parallel DataStage job incorporates two basic types of parallel processing: pipeline and partitioning. Both of these methods are used at runtime by the Information Server engine to execute the simple job shown above. To the DataStage developer, this job appears the same on the Designer canvas, but you can optimize it through advanced properties.

Pipeline parallelism
In the following example, all stages run concurrently, even in a single-node configuration. As data is read from the Oracle source, it is passed to the Transformer stage for transformation, and from there to the DB2 target. Instead of waiting for all source data to be read, as soon as the source data stream starts to produce rows, these are passed to the subsequent stages. This method is called pipeline parallelism, and all three stages in our example operate simultaneously regardless of the degree of parallelism of the configuration file.
The Information Server engine always executes jobs with pipeline parallelism. If you ran the example job on a system with multiple processors, the stage reading would start on one processor and begin filling a pipeline with the data it had read. The Transformer stage would start running as soon as there was data in the pipeline, process it, and start filling another pipeline. The stage writing the transformed data to the target database would similarly start writing as soon as there was data available. Thus all three stages operate simultaneously.

Partition parallelism
When large volumes of data are involved, you can use the power of parallel processing to your best advantage by partitioning the data into a number of separate sets, with each partition being handled by a separate instance of the job stages. Partition parallelism is accomplished at runtime, instead of through the manual process that traditional systems would require. The DataStage developer only needs to specify the algorithm to partition the data, not the degree of parallelism or where the job will execute. Using partition parallelism, the same job is effectively run simultaneously by several processors, each handling a separate subset of the total data. At the end of the job the data partitions can be collected back together again and written to a single data source, as shown in the following figure.

[Figure: partition parallelism — data split across partitions and collected at the end]

Attention: You do not need multiple processors to run in parallel. A single processor is capable of running multiple concurrent processes.

Combining pipeline and partition parallelism
The Information Server engine combines pipeline and partition parallel processing to achieve even greater performance gains. In this scenario stages process partitioned data and fill pipelines, so that the next stage can start on a partition before the previous stage has finished with it, as shown in the following figure.

[Figure: combined pipeline and partition parallelism]

In some circumstances you might want to actually re-partition your data between stages.
This could happen, for example, where you want to group data differently. Suppose that you have initially processed data based on customer last name, but now you want to process data grouped by zip code. You will have to re-partition to ensure that all customers sharing the same zip code are in the same group. DataStage allows you to re-partition between stages as and when necessary. With the Information Server engine, re-partitioning happens in memory between stages, instead of writing to disk.
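An ordinary UNIX pipeline is a fair mental model for pipeline parallelism: all processes start at once and rows flow downstream as soon as they are produced. The three scripts below are hypothetical stand-ins for stages:

    # extract, transform, and load all run concurrently, connected by pipes,
    # much as DataStage stages are connected by in-memory virtual data sets
    ./extract.sh | ./transform.sh | ./load.sh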
DataStage Jobs

An IBM InfoSphere DataStage job consists of individual stages linked together, which describe the flow of data from a data source to a data target. A stage usually has at least one data input and/or one data output. However, some stages can accept more than one data input, and output to more than one stage. Each stage has a set of predefined and editable properties that tell it how to perform or process data. Properties might include the file name for the Sequential File stage, the columns to sort, the transformations to perform, and the database table name for the DB2 stage. These properties are viewed or edited using stage editors. Stages are added to a job and linked together using the Designer.

[Figure: a selection of stages and their icons]

Stages and links can be grouped in a shared container. Instances of the shared container can then be reused in different parallel jobs. You can also define a local container within a job; this groups stages and links into a single unit, but can only be used within the job in which it is defined.

The different types of jobs have different stage types. The stages that are available in the Designer depend on the type of job that is currently open in the Designer. Parallel job stages are organized into the following groups on the Designer palette:

- General includes stages such as Container and Link.
- Data Quality includes stages such as Investigate, Standardize, Reference Match, and Survive.
- Database includes stages such as Classic Federation, DB2 UDB, DB2 UDB/Enterprise, Oracle, Sybase, SQL Server, Teradata, Distributed Transaction, and ODBC.
- Development/Debug includes stages such as Peek, Sample, Head, Tail, and Row Generator.
- File includes stages such as Complex Flat File, Data Set, Lookup File Set, and Sequential File.
- Processing includes stages such as Aggregator, Copy, FTP, Funnel, Join, Lookup, Merge, Remove Duplicates, Slowly Changing Dimension, Surrogate Key Generator, Sort, and Transformer.
- Real Time includes stages such as Web Services Transformer, WebSphere MQ, and Web Services Client.
- Restructure includes stages such as Column Export and Column Import.

DataStage Data Transformations

Data transformation and movement is the process by which source data is selected, converted, and mapped to the format required by target systems. The process manipulates data to bring it into compliance with business, domain, and integrity rules and with other data in the target environment. Transformation can take some of the following forms (a small worked example of the first follows this list):

- Aggregation: consolidating or summarizing data values into a single value. Collecting daily sales data to be aggregated to the weekly level is a common example.
- Basic conversion: ensuring that data types are correctly converted and mapped from source to target columns.
- Cleansing: resolving inconsistencies and fixing anomalies in source data.
- Derivation: transforming data from multiple sources by using a complex business rule or algorithm.
- Enrichment: combining data from internal or external sources to provide additional meaning to the data.
- Normalizing: reducing the amount of redundant and potentially duplicated data.
- Combining: bringing together data from multiple sources via parallel Lookup, Join, or Merge operations.
- Pivoting: converting records in an input stream to many records in the appropriate table in the data warehouse or data mart.
- Sorting: grouping related records and sequencing data based on data or string values.
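To make the aggregation example concrete, here is the daily-to-weekly sales rollup done with awk outside DataStage; sales.csv, with columns date, week, and amount plus a header row, is a hypothetical input:

    # Sum the amount column (3) per week key (2), skipping the header row
    awk -F',' 'NR > 1 {total[$2] += $3}
               END {for (w in total) print w "," total[w]}' sales.csv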
DataStage Main Functions

In its simplest form, IBM InfoSphere DataStage performs data transformation and movement from source systems to target systems in batch and in real time. The data sources might include indexed files, sequential files, relational databases, archives, external data sources, enterprise applications, and message queues.

DataStage manages data that arrives in real time as well as data received on a periodic or scheduled basis. It enables companies to solve large-scale business problems with high-performance processing of massive data volumes. By leveraging the parallel processing capabilities of multiprocessor hardware platforms, DataStage can scale to satisfy the demands of ever-growing data volumes, stringent real-time requirements, and ever-shrinking batch windows.

Leveraging the combined suite of IBM Information Server, DataStage can simplify the development of authoritative master data by showing where and how information is stored across source systems. DataStage can also consolidate disparate data into a single, reliable record, cleanse and standardize information, remove duplicates, and link records together across systems. This master record can be loaded into operational data stores, data warehouses, or master data applications such as IBM MDM using IBM InfoSphere DataStage.

IBM InfoSphere DataStage delivers four core capabilities:

- Connectivity to a wide range of mainframe, legacy, and enterprise applications, databases, file formats, and external information sources.
- A prebuilt library of more than 300 functions, including data validation rules and very complex transformations.
- Maximum throughput using a parallel, high-performance processing architecture.
- Enterprise-class capabilities for development, deployment, maintenance, and high availability. It leverages metadata for analysis and maintenance, and operates in batch, real time, or as a Web service.

IBM InfoSphere DataStage is an integral part of the information integration process.

DataStage Execution Flow

When you execute a job, the generated OSH and the contents of the configuration file ($APT_CONFIG_FILE) are used to compose a "score". This is similar to a SQL query optimization plan.

At runtime, IBM InfoSphere DataStage identifies the degree of parallelism and node assignments for each operator, and inserts sorts and partitioners as needed to ensure correct results. It also defines the connection topology (virtual data sets/links) between adjacent operators/stages, and inserts buffer operators to prevent deadlocks (for example, in fork-joins). It also defines the number of actual OS processes.
<br /><br />Multiple operators/stages are combined within a single OS process as appropriate, to improve performance and optimize resource requirements.<br /><br />
The job score is used to fork processes with communication interconnects for data, messages, and control. Processing begins after the job score and processes are created. Job processing ends when the last row of data is processed by the final operator, when a fatal error is encountered by any operator, or when the job is halted by DataStage Job Control or by human intervention such as a DataStage Director STOP.<br /><br />
Job scores are divided into two sections: data sets (partitioning and collecting) and operators (node/operator mapping). Both sections identify sequential or parallel processing.<br /><br />
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDPLLha1AhmIaY3Pczxpl53l51LYcQ0Cp0raf_XIdCcn4ES5MtvAIMaWhZg8SnbFf9KM6OKDdccnqwl_DfCTyoVZShoBkh_LmL3LgvffwzgnAKxzVmstgzkvEzdPT7o-xgXPm29nd8fLse/s1600-h/1.jpg"><img style="cursor:pointer; cursor:hand;width: 400px; height: 267px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDPLLha1AhmIaY3Pczxpl53l51LYcQ0Cp0raf_XIdCcn4ES5MtvAIMaWhZg8SnbFf9KM6OKDdccnqwl_DfCTyoVZShoBkh_LmL3LgvffwzgnAKxzVmstgzkvEzdPT7o-xgXPm29nd8fLse/s400/1.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5450004084647047938" /></a><br />Figure 1-6: Parallel execution flow (conductor, section leaders, and players)<br /><br />
The execution (orchestra) manages control and message flow across processes and consists of the conductor node and one or more processing nodes, as shown in Figure 1-6. Actual data flows from player to player; the conductor and section leaders are used only to control process execution through the control and message channels.<br /><br />
_ The Conductor is the initial framework process. It creates the Section Leader (SL) processes (one per node), consolidates messages to the DataStage log, and manages orderly shutdown. The conductor node hosts the start-up process, and the Conductor also communicates with the players.<br /><br />
_ A Section Leader is a process that forks player processes (one per stage) and manages up/down communications. SLs communicate only with the conductor and their player processes. For a given parallel configuration file, one section leader is started for each logical node.<br /><br />
_ Players are the actual processes associated with the stages. A player sends its stderr and stdout to the SL, establishes connections to other players for data flow, and cleans up on completion. Each player has to be able to communicate with every other player, and there are separate communication channels (pathways) for control, errors, messages, and data. The data channel does not go through the section leader or conductor, as that would limit scalability; data flows directly from upstream operator to downstream operator.<br /><br />
Note: You can direct the score to the job log by setting $APT_DUMP_SCORE. To identify the score dump, look for "main program: This step....".
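<br /><br />For reference, a score dump has roughly the following shape. This is an illustrative sketch only (the stage names, node names, and exact wording are assumptions, and the formatting varies by release); it shows the two sections described above, data sets and operators, and marks each as sequential or parallel:<br />
<pre>
main_program: This step has 1 dataset:
ds0: {op0[1p] (sequential Row_Generator_0)
      eAny=>eCollectAny
      op1[2p] (parallel Peek_1)}
It has 2 operators:
op0[1p] {(sequential Row_Generator_0)
    on nodes (
      node1[op0,p0]
    )}
op1[2p] {(parallel Peek_1)
    on nodes (
      node1[op1,p0]
      node2[op1,p1]
    )}
It runs 3 processes on 2 nodes.
</pre>
Here op0 runs as a single sequential process, while op1 runs two player processes, one on each logical node of a two-node configuration such as the one sketched earlier.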
<br /><br /><span style="font-weight:bold;">DataStage OSH Script</span><br /><br />
The IBM InfoSphere DataStage and QualityStage Designer client creates IBM InfoSphere DataStage jobs that are compiled into parallel job flows and reusable components that execute on the parallel Information Server engine. It allows you to use familiar graphical point-and-click techniques to develop job flows for extracting, cleansing, transforming, integrating, and loading data into target files, target systems, or packaged applications.<br /><br />
The Designer generates all the code: the OSH (Orchestrate SHell) script for the job, plus C++ code for any Transformer stages used. Briefly, the Designer performs the following tasks:<br />
_ Validates link requirements, mandatory stage options, transformer logic, and so on.<br />
_ Generates the OSH representation of the data flows and stages (representations of framework "operators").<br />
_ Generates transform code for each Transformer stage, which is compiled into C++ and then into corresponding native operators.<br />
_ Compiles reusable BuildOp stages, either from the Designer GUI or from the command line.<br /><br />
Here is a brief primer on OSH:<br />
_ Comment blocks introduce each operator; their order is determined by the order in which stages were added to the canvas.<br />
_ OSH uses the familiar syntax of the UNIX shell: an operator name, a schema, operator options (in "-name value" format), inputs (indicated by n<, where n is the input number), and outputs (indicated by n>, where n is the output number).<br />
_ For every operator, input and/or output data sets are numbered sequentially starting from zero.<br />
_ Virtual data sets (in-memory native representations of data links) are generated to connect operators.<br /><br />
Framework (Information Server engine) terms and DataStage terms are equivalent. The GUI frequently uses terms from both paradigms, and runtime messages use framework terminology because the framework engine is where execution occurs. The following list shows the equivalency between framework and DataStage terms:<br />
_ Schema corresponds to table definition<br />
_ Property corresponds to format<br />
_ Type corresponds to SQL type and length<br />
_ Virtual data set corresponds to link<br />
_ Record/field corresponds to row/column<br />
_ Operator corresponds to stage<br /><br />
Note: The actual execution order of operators is dictated by the input/output designators, not by their placement on the diagram. The data sets that connect the OSH operators are "virtual data sets", that is, in-memory data flows. Link names are used in data set names; it is therefore good practice to give links meaningful names.
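<br /><br />To make the primer concrete, here is a hedged sketch of the kind of OSH fragment the Designer might emit for a trivial two-stage job (a Sequential File stage feeding a Copy stage). The stage names, schema, file path, and exact layout are illustrative assumptions rather than verbatim generated code:<br />
<pre>
# Illustrative sketch of Designer-generated OSH; names and layout are assumed.
#################################################################
#### STAGE: Sequential_File_0
## Operator
import
## Operator options
-schema record ( cust_id: int32; cust_name: string[max=30]; )
-file '/data/customers.txt'
## General options
[ident('Sequential_File_0')]
## Outputs
0> 'Sequential_File_0:lnk_customers.v'
;
#################################################################
#### STAGE: Copy_1
## Operator
copy
## General options
[ident('Copy_1')]
## Inputs
0< 'Sequential_File_0:lnk_customers.v'
## Outputs
0> 'Copy_1:lnk_out.v'
;
</pre>
Note how a comment block introduces each operator, options use the "-name value" format, inputs and outputs are numbered from zero, and the virtual data set connecting the two stages (the .v name) is derived from the link name lnk_customers; this is why meaningful link names make the generated OSH, the score, and the log much easier to read.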