Always Learning

Advanced Search

Data Integration Blueprint and Modeling

Data Integration Blueprint and Modeling

Techniques for a Scalable and Sustainable Architecture

Anthony Giordano

Jan 2011, Hardback, 416 pages
ISBN13: 9780137084937
ISBN10: 0137084935
This title is no longer available.

This title cannot be purchased online
  • Print pagePrint page
  • Email this pageEmail page
  • Share

Making Data Integration Work: How to Systematically Reduce Cost, Improve Quality, and Enhance Effectiveness

Today’s enterprises are investing massive resources in data integration. Many possess thousands of point-to-point data integration applications that are costly, undocumented, and difficult to maintain. Data integration now accounts for a major part of the expense and risk of typical data warehousing and business intelligence projects--and, as businesses increasingly rely on analytics, the need for a blueprint for data integration is increasing now more than ever.

This book presents the solution: a clear, consistent approach to defining, designing, and building data integration components to reduce cost, simplify management, enhance quality, and improve effectiveness. Leading IBM data management expert Tony Giordano brings together best practices for architecture, design, and methodology, and shows how to do the disciplined work of getting data integration right.

Mr. Giordano begins with an overview of the “patterns” of data integration, showing how to build blueprints that smoothly handle both operational and analytic data integration. Next, he walks through the entire project lifecycle, explaining each phase, activity, task, and deliverable through a complete case study. Finally, he shows how to integrate data integration with other information management disciplines, from data governance to metadata. The book’s appendices bring together key principles, detailed models, and a complete data integration glossary.

Coverage includes

  • Implementing repeatable, efficient, and well-documented processes for integrating data
  • Lowering costs and improving quality by eliminating unnecessary or duplicative data integrations
  • Managing the high levels of complexity associated with integrating business and technical data
  • Using intuitive graphical design techniques for more effective process and data integration modeling
  • Building end-to-end data integration applications that bring together many complex data sources

Preface xix

Acknowledgments xxii

About the Author xxiii

Introduction: Why Is Data Integration Important? 1

Part 1 Overview of Data Integration 5

Chapter 1 Types of Data Integration 7

Data Integration Architectural Patterns 7

Enterprise Application Integration (EAI) 8

Service-Oriented Architecture (SOA) 9

Federation 12

Extract, Transform, Load (ETL) 14

Common Data Integration Functionality 15

Summary 16

End-of-Chapter Questions 16

Chapter 2 An Architecture for Data Integration 19

What Is Reference Architecture? 19

Reference Architecture for Data Integration 20

Objectives of the Data Integration Reference Architecture 21

The Data Subject Area-Based Component Design Approach 22

A Scalable Architecture 24

Purposes of the Data Integration Reference Architecture 26

The Layers of the Data Integration Architecture 26

Extract/Subscribe Processes 27

Data Integration Guiding Principle: “Read Once, Write Many” 28

Data Integration Guiding Principle: “Grab Everything” 28

Initial Staging Landing Zone 29

Data Quality Processes 31

What Is Data Quality? 31

Causes of Poor Data Quality 31

Data Quality Check Points 32

Where to Perform a Data Quality Check 32

Clean Staging Landing Zone 34

Transform Processes 35

Conforming Transform Types 35

Calculations and Splits Transform Types 35

Processing and Enrichment Transform Types 36

Target Filters Transform Types 38

Load-Ready Publish Landing Zone 39

Load/Publish Processes 40

Physical Load Architectures 41

An Overall Data Architecture 41

Summary 42

End-of-Chapter Questions 43

Chapter 3 A Design Technique: Data Integration Modeling 45

The Business Case for a New Design Process 45

Improving the Development Process 47

Leveraging Process Modeling for Data Integration 48

Overview of Data Integration Modeling 48

Modeling to the Data Integration Architecture 48

Data Integration Models within the SDLC 49

Structuring Models on the Reference Architecture 50

Conceptual Data Integration Models 51

Logical Data Integration Models 51

High-Level Logical Data Integration Model 52

Logical Extraction Data Integration Models 52

Logical Data Quality Data Integration Models 53

Logical Transform Data Integration Models 54

Logical Load Data Integration Models 55

Physical Data Integration Models 56

Converting Logical Data Integration Models to Physical Data Integration Models 56

Target-Based Data Integration Design Technique Overview 56

Physical Source System Data Integration Models 57

Physical Common Component Data Integration Models 58

Physical Subject Area Load Data Integration Models 60

Logical Versus Physical Data Integration Models 61

Tools for Developing Data Integration Models 61

Industry-Based Data Integration Models 63

Summary 64

End-of-Chapter Questions 65

Chapter 4 Case Study: Customer Loan Data Warehouse Project 67

Case Study Overview 67

Step 1: Build a Conceptual Data Integration Model 69

Step 2: Build a High-Level Logical Model Data Integration Model 70

Step 3: Build the Logical Extract DI Models 72

Confirm the Subject Area Focus from the Data Mapping Document 73

Review Whether the Existing Data Integration Environment Can Fulfill the Requirements 74

Determine the Business Extraction Rules 74

Control File Check Processing 74

Complete the Logical Extract Data Integration Models 74

Final Thoughts on Designing a Logical Extract DI Model 76

Step 4: Define a Logical Data Quality DI Model 76

Design a Logical Data Quality Data Integration Model 77

Identify Technical and Business Data Quality Criteria 77

Determine Absolute and Optional Data Quality Criteria 80

Step 5: Define the Logical Transform DI Model 81

Step 6: Define the Logical Load DI Model 85

Step 7: Determine the Physicalization Strategy 87

Step 8: Convert the Logical Extract Models into Physical Source System Extract DI Models 88

Step 9: Refine the Logical Load Models into Physical Source System Subject Area Load DI Models 90

Step 10: Package the Enterprise Business Rules into Common Component Models 92

Step 11: Sequence the Physical DI Models 94

Summary 95

Part 2 The Data Integration Systems Development Life Cycle 97

Chapter 5 Data Integration Analysis 99

Analyzing Data Integration Requirements 100

Building a Conceptual Data Integration Model 101

Key Conceptual Data Integration Modeling Task Steps 102

Why Is Source System Data Discovery So Difficult? 103

Performing Source System Data Profiling 104

Overview of Data Profiling 104

Key Source System Data Profiling Task Steps 105

Reviewing/Assessing Source Data Quality 109

Validation Checks to Assess the Data 109

Key Review/Assess Source Data Quality Task Steps 111

Performing Source\Target Data Mappings 111

Overview of Data Mapping 112

Types of Data Mapping 113

Key Source\Target Data Mapping Task Steps 115

Summary 116

End-of-Chapter Questions 116

Chapter 6 Data Integration Analysis Case Study 117

Case Study Overview 117

Envisioned Wheeler Data Warehouse Environment 118

Aggregations in a Data Warehouse Environment 120

Data Integration Analysis Phase 123

Step 1: Build a Conceptual Data Integration Model 123

Step 2: Perform Source System Data Profiling 124

Step 3: Review/Assess Source Data Quality 130

Step 4: Perform Source\Target Data Mappings 135

Summary 145

Chapter 7 Data Integration Logical Design 147

Determining High-Level Data Volumetrics 147

Extract Sizing 148

Disk Space Sizing 148

File Size Impacts Component Design 150

Key Data Integration Volumetrics Task Steps 150

Establishing a Data Integration Architecture 151

Identifying Data Quality Criteria 154

Examples of Data Quality Criteria from a Target 155

Key Data Quality Criteria Identification Task Steps 155

Creating Logical Data Integration Models 156

Key Logical Data Integration Model Task Steps 157

Defining One-Time Data Conversion Load Logical Design 163

Designing a History Conversion 164

One-Time History Data Conversion Task Steps 166

Summary 166

End-of-Chapter Questions 167

Chapter 8 Data Integration Logical Design Case Study 169

Step 1: Determine High-Level Data Volumetrics 169

Step 2: Establish the Data Integration Architecture 174

Step 3: Identify Data Quality Criteria 177

Step 4: Create Logical Data Integration Models 180

Define the High-Level Logical Data Integration Model 181

Define the Logical Extraction Data Integration Model 183

Define the Logical Data Quality Data Integration Model 187

Define Logical Transform Data Integration Model 190

Define Logical Load Data Integration Model 191

Define Logical Data Mart Data Integration Model 192

Develop the History Conversion Design 195

Summary 198

Chapter 9 Data Integration Physical Design 199

Creating Component-Based Physical Designs 200

Reviewing the Rationale for a Component-Based Design 200

Modularity Design Principles 200

Key Component-Based Physical Designs Creation Task Steps 201

Preparing the DI Development Environment 201

Key Data Integration Development Environment Preparation Task Steps 202

Creating Physical Data Integration Models 203

Point-to-Point Application Development--The Evolution of Data Integration Development 203

The High-Level Logical Data Integration Model in Physical Design 205

Design Physical Common Components Data Integration Models 206

Design Physical Source System Extract Data Integration Models 208

Design Physical Subject Area Load Data Integration Models 209

Designing Parallelism into the Data Integration Models 210

Types of Data Integration Parallel Processing 211

Other Parallel Processing Design Considerations 214

Parallel Processing Pitfalls 215

Key Parallelism Design Task Steps 216

Designing Change Data Capture 216

Append Change Data Capture Design Complexities 217

Key Change Data Capture Design Task Steps 219

Finalizing the History Conversion Design 220

From Hypothesis to Fact 220

Finalize History Data Conversion Design Task Steps 220

Defining Data Integration Operational Requirements 221

Determining a Job Schedule for the Data Integration Jobs 221

Determining a Production Support Team 222

Key Data Integration Operational Requirements Task Steps 224

Designing Data Integration Components for SOA 225

Leveraging Traditional Data Integration Processes as SOA Services 225

Appropriate Data Integration Job Types 227

Key Data Integration Design for SOA Task Steps 227

Summary 228

End-of-Chapter Questions 228

Chapter 10 Data Integration Physical Design Case Study 229

Step 1: Create Physical Data Integration Models 229

Instantiating the Logical Data Integration Models into a Data Integration Package 229

Step 2: Find Opportunities to Tune through Parallel Processing 237

Step 3: Complete Wheeler History Conversion Design 238

Step 4: Define Data Integration Operational Requirements 239

Developing a Job Schedule for Wheeler 240

The Wheeler Monthly Job Schedule 240

The Wheeler Monthly Job Flow 240

Process Step 1: Preparation for the EDW Load Processing 241

Process Step 2: Source System to Subject Area File Processing 242

Process Step 3: Subject Area Files to EDW Load Processing 245

Process Step 4: EDW-to-Product Line Profitability Data Mart Load Processing 248

Production Support Staffing 248

Summary 249

Chapter 11 Data Integration Development Cycle 251

Performing General Data Integration Development Activities 253

Data Integration Development Standards 253

Error-Handling Requirements 255

Naming Standards 255

Key General Development Task Steps 256

Prototyping a Set of Data Integration Functionality 257

The Rationale for Prototyping 257

Benefits of Prototyping 257

Prototyping Example 258

Key Data Integration Prototyping Task Steps 261

Completing/Extending Data Integration Job Code 262

Complete/Extend Common Component Data Integration Jobs 263

Complete/Extend the Source System Extract Data Integration Jobs 264

Complete/Extend the Subject Area Load Data Integration Jobs 265

Performing Data Integration Testing 266

Data Warehousing Testing Overview 267

Types of Data Warehousing Testing 268

Perform Data Warehouse Unit Testing 269

Perform Data Warehouse Integration Testing 272

Perform Data Warehouse System and Performance Testing 273

Perform Data Warehouse User Acceptance Testing 274

The Role of Configuration Management in Data Integration 275

What Is Configuration Management? 276

Data Integration Version Control 277

Data Integration Software Promotion Life Cycle 277

Summary 277

End-of-Chapter Questions 278

Chapter 12 Data Integration Development Cycle Case Study 279

Step 1: Prototype the Common Customer Key 279

Step 2: Develop User Test Cases 283

Domestic OM Source System Extract Job Unit Test Case 284

Summary 287

Part 3 Data Integration with Other Information Management Disciplines 289

Chapter 13 Data Integration and Data Governance 291

What Is Data Governance? 292

Why Is Data Governance Important? 294

Components of Data Governance 295

Foundational Data Governance Processes 295

Data Governance Organizational Structure 298

Data Stewardship Processes 304

Data Governance Functions in Data Warehousing 305

Compliance in Data Governance 309

Data Governance Change Management 310

Summary 311

End-of-Chapter Questions 311

Chapter 14 Metadata 313

What Is Metadata? 313

The Role of Metadata in Data Integration 314

Categories of Metadata 314

Business Metadata 315

Structural Metadata 315

Navigational Metadata 317

Analytic Metadata 318

Operational Metadata 319

Metadata as Part of a Reference Architecture 319

Metadata Users 320

Managing Metadata 321

The Importance of Metadata Management in Data Governance 321

Metadata Environment Current State 322

Metadata Management Plan 322

Metadata Management Life Cycle 324

Summary 327

End-of-Chapter Questions 327

Chapter 15 Data Quality 329

The Data Quality Framework 330

Key Data Quality Elements 331

The Technical Data Quality Dimension 332

The Business-Process Data Quality Dimension 333

Types of Data Quality Processes 334

The Data Quality Life Cycle 334

The Define Phase 336

Defining the Data Quality Scope 336

Identifying/Defining the Data Quality Elements 336

Developing Preventive Data Quality Processes 337

The Audit Phase 345

Developing a Data Quality Measurement Process 346

Developing Data Quality Reports 348

Auditing Data Quality by LOB or Subject Area 350

The Renovate Phase 351

Data Quality Assessment and Remediation Projects 352

Data Quality SWAT Renovation Projects 352

Data Quality Programs 353

Final Thoughts on Data Quality 353

Summary 353

End-of-Chapter Questions 354

Appendix A Exercise Answers 355

Appendix B Data Integration Guiding Principles 369

Write Once, Read Many 369

Grab Everything 369

Data Quality before Transforms 369

Transformation Componentization 370

Where to Perform Aggregations and Calculations 370

Data Integration Environment Volumetric Sizing 370

Subject Area Volumetric Sizing 370

Appendix C Glossary 371

Appendix D Case Study Models

Appendix D is an online-only appendix. Print-book readers can download the appendix at For eBook editions, the appendix is included in the book.

Index 375

Anthony Giordano is a partner in IBM’s Business Analytics and Optimization Consulting Practice and currently leads the Enterprise Information Management Service Line that focuses on data modeling, data integration, master data management, and data governance. He has more than 20 years of experience in the Information Technology field with a focus in the areas of business intelligence, data warehousing, and Information Management. In his spare time, he has taught classes in data warehousing and project management at the undergraduate and graduate levels at several local colleges and universities.