Establishing Data Changes When No Incrementals Are Available

[Table of Contents]

Introduction

Data Academy has a fantasitic capability in being able to create incremental loaders that loads only data that has changed since the last load. However, the pre-requisites of this are a change date field and a primary key on each table. This combination is not always available, and yet there is still a need to be able to identify changes in data as part of the transformation capability.

This article does not address the business of loading data incrementally, but how to use Academy's transformation capability to identify change once it's loaded.

The Principal

Change is defined quite simply as any new data, and any data that has changed since the last time we checked. So this is what we need to establish. To achieve this we are going to use the following very simple flow

After the data has completed loading, we will establish the new data that's been added, this will be achieved by comparing the loaded data with the data that was already there, so a copy of the previously loaded data will be cached.Once established, a third change Dataset will be used to receive the new data, this is cleared down before starting the process.

in the next phase, we will establish any changed data in the same way, by comparing with the cache'd version. Any data that has changed will be added to the change Dataset leaving it with the data that has been added or changed. This can then be used for whatever purpose is necessary.

At the end of the process we use the change Dataset to update the cache'd version of the original data ready for the next change cycle.

Preparing for the Process 

OK, so we're ready to get going. I used a Dataset that was already in my data warehouse, which has the following structure

CREATE TABLE [dbo].[OP_CUSTOMERS](
     [Address]      [varchar](60) NULL,
     [City]         [varchar](15) NULL,
     [CompanyName]  [varchar](40) NULL,
     [ContactName]  [varchar](30) NULL,
     [ContactTitle] [varchar](30) NULL,
     [Country]      [varchar](15) NULL,
     [CustomerID]   [char](5) NULL,
     [Fax]          [varchar](24) NULL,
     [Phone]        [varchar](24) NULL,
     [PostalCode]   [varchar](10) NULL,
     [RegionID]     [bigint] NULL
) ON [PRIMARY]

The first thing we need to do is establish a primary key, without this none of the process can work, we need some reference point and this has to be it. In the case of the OP_CUSTOMERS table this is the "CustomerID".

To establish new data, we are going to compare the Primary Keys of the incoming data with the primary keys of the cache'd data.

But wait!!! I hear you call, "What about the first time we run this..?? ie When we have no Cache'd data.."..

So first off, but as a one off, we need to do just that, so create (or re-use) a Transformation Space that we can create a single Dataset that will cache the current data position.. The Dataset will be a "Regular" one, based on the "OP_CUSTOMERS" table and contain a copy of all of the fields, the final definition looks something like this..(Don't forget to edit the CustomerID field and check the "Primary Key" check box.)

 

Run the Dataset and the cache will be prepared...

Establishing New Data

So next we need a Transformation Space in which to place the transformations we're going to create. I'm going to create one called "CustomerChangesCheck"


Having done this we need to create a "Regular" Dataset based on the "OP_CUSTOMERS" table. The target ouput table will be called "DS_CUSTOMERS_CHANGES".

Then we'll be using a "LEFT JOIN" onto the cache'd Dataset "DS_OP_CUSTOMERS" (created above)

So the join screen will look something like this before pressing the "Finish" button.


Note : I have aliased the Join Table as CustomerCache.. This is for reference purposes so in future I will know why that table has been linked in..

We can finally establish new records by filtering "WHERE CustomerCache.CustomerID IS NULL".. This needs to be entered into the "Advanced Where Clause" field. Then add all of the source fields to the output table, the final screen looks something like this...(Don't forget to edit the CustomerID field and check the "Primary Key" check box.)


Saving it and running it right now would result in no records added.. This is because we've already cache'd everything in the current source table. But we need to run it anyway, just to get the output table created.. So if you review the resulting Dataset you will notice that it's empty..

Establishing the Changes.

In my case, I was interested in tracking changes in "CompanyName", "RegionID" and "ContactName" only, but the pricncipal applies to all. We can use exactly the same Dataset as above with one or two key differences, so to save messing around, we'll use the copy function to copy the structure of the Dataset.. Clicking the "copy" button at the head of the Dataset list in the Transformation Space details screen, displays the copy Dataset dialog.. I took a copy of the above datsset, ensuring the Target table is the same, as we want both changes and additions in the same target output table.


OK so to establish changes, we going to do direct field for field comparisons, so therefore

  1. The join between OP_CUSTOMERS and DS_OP_CUSTOMERS needs to be "inner join" so that we making record for record, field for field comparisons
  2. We need to put our fields comparisons in as an "Advanced Where Clause" on the Dataset
  3. We need to make sure the output is appended to the "DS_CSUTOMERS_CHANGES" Dataset

So

1. Change the Join to Inner from Left

To do this, go to the joins tab and click the edit icon on the join, then modify the join SQL


2. Update the Advanced Where Clause

Since I need to monitor change in the CompanyName, RegionID and ContactName fields the advanced where clause will be

CustomerCache.CompanyName <> OP_CUSTOMERS.CompanyName or CustomerCache.RegionID <> OP_CUSTOMERS.RegionID or CustomerCache.ContactName <> OP_CUSTOMERS.ContactName

3. Ensure the output is Appended to the target Dataset

This is achieved by changing the "Target Option / Transitory" setting on the Dataset to "Append".

4. Add all of the fields to the output table. (Don't forget to edit the CustomerID field and check the "Primary Key" check box.), the final screen should therefore look something like this

 

There is one final step we need to take which is to Append/Update the Cache'd Dataset with the changes that we've established that are now stored in the DS_CUSTOMERS_CHANGES Dataset, so next time we run the Transformation Space we don't get the same records as last time...

To do this we need to create a "Regular" Dataset based on the DS_CUSTOMERS_CHANGES, and

  1. Set the output Dataset to be DS_OP_CUSTOMERS
  2. Set the target option to be "Append/Update"
  3. Add all fields
  4. Edit the CustomerID field and check the "Primary Key" check box

 The final Dataset definition should look something like this

 

Whilst it is now complete, again if we run the "CustomerChangesCheck" Transformation Space nothing will be added or appended to the Output Dataset..

To test that the strategy is working properly, I subsequently went into the original OP_CUSTOMERS table and added a record and changed two existing records and this was the output result

 

Then if I run the Dataset a second time

 

Showing that the updates were excluded the second time around... as we would expect.

The whole process could be enhanced by adding "Target Only" fields to the Datasets that establish the chenges to identify whether the rows are updates or deletes. This would enable any downstream processes to understand what they need to do with the changes.. The target only field would simply have "A" in it's field function for and new record and "U" for an update.