Which Target Option Should I Use?

[Table of Contents]

The first time you run your Dataset, data is appended to your output table as you would expect. But what about the second (and subsequent) runs? Should the existing data be deleted? Or should only new records be added to the table? The most important options are:

Basic Options

  • Append - Append all records every time the Dataset is run.
  • Delete With Rollback - Delete all records in the target, and then append all records from the source.
  • Truncate Without Rollback - Truncate all records from the target, and then append all records from the source. This will be faster than "Delete With Rollback" but any subsequent errors in the Transformation Space may leave your warehouse in an inconsistent state.  

Incremental Options

  • Append New* - Only new records in the source are appended to the output.
  • Append / Update* - New records are appended to the output, and existing records are updated.
  • Append / Update / Delete* - New records are appended to the output, existing records updated, and any records no longer in the source are deleted. This is functionally equivalent to the "Delete" and "Truncate" options, but will almost certainly perform differently.

In the case of Incremental builds, you can use the "Date Constraint" functionality (in the Field Details screen) to ensure that the output is restricted to only items which have changed since the last time the dataset was run.

If you are having performance issues with your transformations, then you may benefit from tweaking some of these options, but... which ones?

Unfortunately, the answer is not straightforward - it depends on the character of your data; and the best advice is to run the transformations and examine the logs to see which options work best in your circumstances. While Data Academy is responsible for issuing queries to SQL Server, it is SQL Server which optimizes and runs those instructions, so the optimal solution can vary according to hardware configuration, memory, etc. etc.

However, the following tips are worth bearing in mind.

  • You need to test your transformations with real data - not only will performance be hard to asses with small, test datasets, but the actual outcomes may vary radically.
  • Look at the logs! Pay attention to the "Affected" flag which follows many of the SQL queries issued by Data Academy - it indicates not only the number of records affected by a particular instruction, but also the time taken.
  • Try running the SQL Queries generated by Data Academy interactively in Management Studio. Some of them probably look particularly long and horrible, but it is only SQL and you don't have to actually understand everything about them to take advantage of features like the Tuning Advisor.
  • Don't waste time optimizing operations which are taking only minutes to run - focus on those items which are taking hours.
  • Throw away your preconceptions! An incremental build may feel like it ought to be a faster, but the "UPDATE" operation in particular can be expensive, and in many cases a Truncate (full re-build) will be better performing.
  • Look critically at your indexes, and make sure that your source tables are properly indexed. In the case of Incremental builds, an index on the source table should reflect the primary keys of the target table. But don't over-do it either - lots of indexes mean slower writes!
  • Break the transformation into smaller steps - try not to base an incremental build on an expensive JOIN. Instead, consider joining the source items into a transitory dataset (properly indexed) on which to base the update.
  • Try to re-factor-out multi-part keys - SQL Server is much happier working on a single-part key. If you have a number of operations based on a multi-part key (particularly incremental builds) then try to create a surrogate using an identity field.