Introduction
One of SQL Server 2008's biggest BI features being touted is Change Data Capture (CDC). CDC is basically suppose to be a built-in solution to the old-age practice in ETL solutions of identifying and using change identifiers columns in source systems. I have now spent a fair amount of time using this feature and more importantly how to leverage it inside of SSIS packages for incremental ETL solutions. My work here has been to prepare for an upcoming demonstration of CDC in SSIS. This post/Q&A is a brief summary of my findings thus far...
*CDC is being positioned as the 'design of choice' for SQL Server 2008+ OLTP database servers for exposing changed relational data for data warehousing consumption purposes.
What is Change Data Capture (CDC)?
CDC records (or captures) DML activity on designated tables. CDC works by scanning the transaction log for a designated table's 'captured columns' whose content has changed and then making those changes available for data syncronizing purposes in a relational format. As you can see this feature in entrenched in transaction log architecture and thus alot of the metadata in CDC is related around the concept of a Log Sequence Number (LSN).
So whats a LSN?
Here is the definition of a LSN per Books Online: "Every record in the Microsoft SQL Server transaction log is uniquely identified by a log sequence number (LSN). LSNs are ordered such that if LSN2 is greater than LSN1, the change described by the log record referred to by LSN2 occurred after the change described by the log record LSN. "
How do I get CDC?
CDC is a feature of SQL Server 2008 Enterprise, Developer, and Evaluation editions.
What are the target applications or consumers of the CDC technology?
ETL Solutions are the most common, however any data consuming application that requires syncronizing data could benefit from the technology.
Is CDC configurable via a UI or just TSQL?
As of this time, just TSQL.
How do you configure CDC?
1. Enable CDC for a database
1. Enables the current database via the USE statement *select is_cdc_enabled from sys.databases where [name] = 'AdventureWorks' to determine if DB is allready enabled *Also note that when you do this all of the below system objects get created in the selected database
2. Enable CDC for a given table and it's selected columns
1. Specify the captured table's schema, name, database role, capture instance name (defaults to schema_name), support net changes (bit, set it to 1 if you want both change data table-valued functions created), name of captured table's unique index, captured column list (null/default to all columns), filegroup for the change table (null/defaults to default filegroup)
3. Query Change Data via 1 of 2 built in table-valued functions created during step #2
1. For all changes (meaning a row is returned for each DML) use cdc.fn_cdc_get_all_changes_
2. For the net changes (meaning one row returned for each source row modified among 1 or more DMLs) use cdc.fn_cdc_get_net_changes_
What are all of the CDC system objects available to me?
System Tables:
o cdc.captured_columns
o cdc.change_tables
o cdc.ddl_history
o cdc.index_columns
o cdc.lsn_time_mapping
o cdc.Schema_Name_CT (change tables) *this is just the default naming convention, configurable via the enable table sysproc
DMVs:
o sys.dm_cdc_log_scan_sessions
o sys.dm_repl_traninfo
o sys.dm_cdc_errors
System Stored Procedures:
o sys.sp_cdc_enabledb
o sys.sp_cdc_disabledb
o sys.sp_cdc_cleanup_change_table
o sys.sp_cdc_disable_db_change_data_capture
o sys.sp_cdc_disable_table_change_data_capture
o sys.sp_cdc_enable_db_change_data_capture
o sys.sp_cdc_enable_table_change_data_capture
o sys.sp_cdc_get_ddl_history
o sys.sp_cdc_get_captured_columns
o sys.sp_cdc_help_change_data_capture
System Functions:
o cdc.fn_cdc_get_all_changes_
o cdc.fn_cdc_get_net_changes_
o sys.fn_cdc_decrement_lsn
o sys.fn_cdc_get_column_ordinal ( 'capture_instance' , 'column_name' )
o sys.fn_cdc_get_max_lsn
o sys.fn_cdc_get_min_lsn
o sys.fn_cdc_has_column_changed
o sys.fn_cdc_increment_lsn
o sys.fn_cdc_is_bit_set
o sys.fn_cdc_map_lsn_to_time
o sys.fn_cdc_map_time_to_lsn
Do the change tables keep growing?
No, there is an automatic cleanup process that occurs every three days (and this is configurable). For more intense environments you can leverage the manual method using the system stored procedure: sys.sp_cdc_cleanup_change_table. When you execute this system procedure you specify the low LSN and any change records occuring before this point are removed and the start_lsn is set to the low LSN you specified.
How do you leverage CDC in SSIS Packages?
Books Online in CTP5 (November) actually has a sample package in the topic 'change data capture in integration services' and I found this to be a good starting point to build from. For my CDC/SSIS demo here is the Control/Data Flow I am using:
1. Calculate Date Intervals (these will correspond to LSNs later) *also note that in both BOL and my own package we are using fixed intervals, in the real world this will be driven by a table solution which tells the SSIS package when the last successful execution occurs (starting point of next package iteration)
2. Check is any data is available in the selected date/time interval. This is important because the rest of the package will fail if no data is ready. BOL recommends performing Thead.Sleep/WAITFORs here. I am not for demo purposes but its not a bad idea.
3. Build the query via a SSIS variable *BOL states that SSIS packages cannot call the cdc.fn_cdc_getnetall functions and must use a wrapper. Whether or not we end up being forced to do this, it is a good design practice, below is my custom function that SSIS calls passing in the start/end datetime values to get the actual change data in the data flow step below.
4. Create a data flow task that executes the SSIS variable query (OLEDB source), and then splits the rows into via a conditional split based on the CDC_OPERATION column calculated in the function below.
No comments:
Post a Comment