DW and Big Data Testing Best Practices

Unit testing checklist

Some programmers are not well trained as testers. They may prefer to program, deploy the code, and move on to the next development task without a thorough unit test. A checklist helps database programmers test their code systematically before formal QA testing (a code sketch of a few of these checks follows the list).

  • Check the mapping of fields in data staging and in the data marts.
  • Check for duplication of values generated using sequence generators.
  • Check the correctness of surrogate keys that uniquely identify rows of data.
  • Check for data-type constraints of the fields present in staging and core levels.
  • Check the data loading status and error messages after ETLs (extracts, transformations, loads).
  • Look for string columns that are incorrectly left or right trimmed.
  • Make sure all tables and specified fields were loaded from source to staging.
  • Verify that not-null fields were populated.
  • Make sure no data truncation occurred in any field.
  • Make sure data types and formats are as specified during database design.
  • Make sure there are no duplicate records in target tables.
  • Make sure data transformations are correctly based on business rules.
  • Verify that numeric fields are populated with the correct precision.
  • Make sure every ETL session completed with only planned exceptions.
  • Verify all data cleansing, transformation, and error and exception handling.
  • Verify stored procedure calculations and data mappings.
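
As a concrete illustration of a few of the checks above, the sketch below runs duplicate-key and not-null assertions over a standard Python DB-API connection. The table and column names (customer_dim, customer_id, customer_name) and the warehouse.db file are placeholders for whatever the actual warehouse uses, not references to a specific system.

    import sqlite3  # any DB-API 2.0 connection can be substituted here

    def check_no_duplicates(conn, table, key_columns):
        """Fail if any combination of key_columns appears more than once."""
        keys = ", ".join(key_columns)
        sql = (f"SELECT {keys}, COUNT(*) AS cnt FROM {table} "
               f"GROUP BY {keys} HAVING COUNT(*) > 1")
        dupes = conn.execute(sql).fetchall()
        assert not dupes, f"{table}: duplicate keys found, e.g. {dupes[:5]}"

    def check_not_null(conn, table, column):
        """Fail if a mandatory column contains NULL values."""
        sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
        nulls = conn.execute(sql).fetchone()[0]
        assert nulls == 0, f"{table}.{column}: {nulls} NULL values found"

    if __name__ == "__main__":
        conn = sqlite3.connect("warehouse.db")  # placeholder connection
        check_no_duplicates(conn, "customer_dim", ["customer_id"])  # hypothetical table and key
        check_not_null(conn, "customer_dim", "customer_name")

The same pattern extends to the other row-level checks in the list, such as incorrectly trimmed strings or out-of-range surrogate keys.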

Integration testing checklist

An integration test checklist helps ensure that ETL workflows execute as scheduled and with the correct dependencies (a sketch of an automated run-status check follows the list).

  • Look for the successful execution of data-loading workflows.
  • Make sure target tables are correctly populated with all expected records, and none were rejected.
  • Verify all dependencies among data-load workflows—including source-to-staging, staging-to-operational data store (ODS), and staging-to-data marts—have been properly defined.
  • Check all ETL error and exception log messages for correctable issues.
  • Verify that data-load jobs start and end at predefined times.
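
One way to automate the run-status and scheduling checks above is to query the ETL tool's job log. The sketch below assumes a hypothetical etl_job_log table with job_name, status, and end_ts columns and a 06:00 end-of-window deadline; the job names and schema are illustrative, not those of any particular ETL tool.

    import sqlite3
    from datetime import time

    def check_workflow_runs(conn, expected_jobs, latest_allowed_end=time(6, 0)):
        """Verify each expected workflow ran, succeeded, and finished inside the load window."""
        for job in expected_jobs:
            row = conn.execute(
                "SELECT status, end_ts FROM etl_job_log "
                "WHERE job_name = ? ORDER BY end_ts DESC LIMIT 1",
                (job,),
            ).fetchone()
            assert row is not None, f"{job}: no run recorded"
            status, end_ts = row
            assert status == "SUCCESS", f"{job}: last run ended with status {status}"
            finished = time.fromisoformat(end_ts.split(" ")[1])  # end_ts as 'YYYY-MM-DD HH:MM:SS'
            assert finished <= latest_allowed_end, f"{job}: finished late at {finished}"

    if __name__ == "__main__":
        conn = sqlite3.connect("warehouse.db")  # placeholder connection
        # Order mirrors the dependencies: source-to-staging, then staging-to-ODS and marts.
        check_workflow_runs(conn, ["src_to_stg", "stg_to_ods", "stg_to_marts"])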

Performance and scalability testing checklist

As the volume of data in a warehouse grows, ETL execution times can be expected to increase, and query performance often degrades. These changes can be mitigated by having a solid technical architecture and an efficient ETL design. The aim of performance testing is to point out potential weaknesses in the ETL design, such as reading a file multiple times or creating unnecessary intermediate files. A performance and scalability testing checklist helps discover performance issues (see the timing sketch after the list).

  • Load the database with peak expected production volumes to help ensure that the volume of data can be loaded by the ETL process within the agreed-on window.
  • Compare ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues. Compare the ETL processing times component by component to pinpoint any areas of weakness.
  • Monitor the timing of the reject process, and consider how large volumes of rejected data will be handled.
  • Perform simple and multiple join queries to validate query performance on large database volumes. Work with business users to develop sample queries and acceptable performance criteria for each query.
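
Comparing loading times across volumes lends itself to a small timing harness: run the same load at increasing row counts and watch whether throughput holds. The sketch below is generic Python; load_fn is a stand-in for whatever actually launches the ETL job, and the volumes are examples only.

    import time

    def time_load(load_fn, row_count):
        """Time one load and return throughput in rows per second."""
        start = time.perf_counter()
        load_fn(row_count)  # placeholder for the real ETL invocation
        elapsed = time.perf_counter() - start
        return row_count / elapsed if elapsed else float("inf")

    def compare_volumes(load_fn, volumes=(100_000, 1_000_000, 10_000_000)):
        """A sharp drop in throughput as volume grows suggests the design does not scale linearly."""
        baseline = None
        for rows in volumes:
            rate = time_load(load_fn, rows)
            baseline = baseline or rate
            print(f"{rows:>12,} rows: {rate:,.0f} rows/s "
                  f"({rate / baseline:.0%} of baseline throughput)")

    if __name__ == "__main__":
        def fake_load(n):  # stand-in so the sketch runs on its own
            sum(range(n))
        compare_volumes(fake_load)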

Data Completeness

One of the most basic tests of data completeness is to verify that all expected data is loaded into the data warehouse. This includes validating that all records, all fields, and the full contents of each field are loaded. Strategies to consider include the following (a count-comparison sketch follows the list):

  • Comparing record counts between source data, data loaded to the warehouse and rejected records.
  • Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
  • Utilizing a data profiling tool that shows the range and value distributions of fields in a data set. This can be used during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.
  • Populating the full contents of each field to validate that no truncation occurs at any step in the process. For example, if the source data field is a string(30) make sure to test it with 30 characters.
  • Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts is too small.
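
The record-count and unique-key comparisons above can be scripted directly against the source extract and the target tables. In the sketch below the table and key names are parameters and the connections are placeholders; the only assumption is that rejected rows are counted somewhere so that source = loaded + rejected can be asserted.

    import sqlite3

    def compare_counts(src_conn, tgt_conn, src_table, tgt_table, rejected=0):
        """Source rows should equal loaded rows plus rejected rows."""
        src = src_conn.execute(f"SELECT COUNT(*) FROM {src_table}").fetchone()[0]
        tgt = tgt_conn.execute(f"SELECT COUNT(*) FROM {tgt_table}").fetchone()[0]
        assert src == tgt + rejected, (
            f"count mismatch: {src} source rows vs {tgt} loaded + {rejected} rejected")

    def compare_key_values(src_conn, tgt_conn, src_table, tgt_table, key):
        """Distinct key values missing from the target point at dropped or mangled rows."""
        src_keys = {r[0] for r in src_conn.execute(f"SELECT DISTINCT {key} FROM {src_table}")}
        tgt_keys = {r[0] for r in tgt_conn.execute(f"SELECT DISTINCT {key} FROM {tgt_table}")}
        missing = src_keys - tgt_keys
        assert not missing, f"{len(missing)} source {key} values never reached the target"

    if __name__ == "__main__":
        src = sqlite3.connect("source_extract.db")  # placeholder connections
        tgt = sqlite3.connect("warehouse.db")
        compare_counts(src, tgt, "orders_src", "orders_fact")            # hypothetical tables
        compare_key_values(src, tgt, "orders_src", "orders_fact", "order_id")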

Data Transformation

Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic. One typical method is to pick some sample records and “stare and compare” to validate the transformations manually. This can be useful but requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validations is a better long-term strategy. Here are some simple automated data movement techniques (a scenario-driven sketch follows the list):

  • Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer. This is a good requirements elicitation exercise during design and can also be used during testing.
  • Create test data that includes all scenarios. Elicit the help of an ETL developer to automate the process of populating data sets with the scenario spreadsheet to allow for flexibility because scenarios will change.
  • Utilize data profiling results to compare range and distribution of values in each field between source and target data.
  • Validate correct processing of ETL-generated fields such as surrogate keys.
  • Validate that data types in the warehouse are as specified in the design and/or the data model.
  • Set up data scenarios that test referential integrity between tables. For example, what happens when the data contains foreign key values not in the parent table?
  • Validate parent-to-child relationships in the data. Set up data scenarios that test how orphaned child records are handled.
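
A minimal sketch of the scenario-spreadsheet idea: expected results agreed on with the business customer are kept in a CSV and compared with the transformed values in the target. The file name, its columns (customer_id, expected_revenue_band), and the customer_dim query are assumptions made for illustration, not part of any real model.

    import csv
    import sqlite3

    def validate_scenarios(conn, scenario_csv="transform_scenarios.csv"):
        """Compare each expected transformation result against the loaded target value."""
        failures = []
        with open(scenario_csv, newline="") as fh:
            for row in csv.DictReader(fh):
                actual = conn.execute(
                    "SELECT revenue_band FROM customer_dim WHERE customer_id = ?",
                    (row["customer_id"],),
                ).fetchone()
                if actual is None or actual[0] != row["expected_revenue_band"]:
                    failures.append((row["customer_id"], row["expected_revenue_band"], actual))
        assert not failures, f"{len(failures)} transformation scenario(s) failed, e.g. {failures[:3]}"

    if __name__ == "__main__":
        validate_scenarios(sqlite3.connect("warehouse.db"))  # placeholder connection

Because the scenarios live in a plain file, testers or business users can extend them as rules change without touching the test code.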

Data Quality

For the purposes of this discussion, data quality is defined as “how the ETL system handles data rejection, substitution, correction and notification without modifying data.” To ensure success in testing data quality, include as many data scenarios as possible (a sketch of the reject-and-substitute pattern follows the list). Typically, data quality rules are defined during design, for example:

  • Reject the record if a certain decimal field has nonnumeric data.
  • Substitute null if a certain decimal field has nonnumeric data.
  • Validate and correct the state field if necessary based on the ZIP code.
  • Compare product code to values in a lookup table, and if there is no match load anyway but report to users.
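
The rules above follow a reject / substitute / report pattern that is straightforward to mirror in test fixtures. The field names (amount, discount, product_code) and the lookup set in the sketch below are illustrative stand-ins, not a real reference table.

    VALID_PRODUCT_CODES = {"A100", "B200", "C300"}  # stand-in for a lookup table

    def apply_quality_rules(record, rejects, notifications):
        """Return the cleansed record, or None if it must be rejected."""
        # Rule: reject the record if a mandatory decimal field is non-numeric.
        try:
            record["amount"] = float(record["amount"])
        except (TypeError, ValueError):
            rejects.append(record)
            return None
        # Rule: substitute NULL (None) when an optional decimal field is non-numeric.
        try:
            record["discount"] = float(record["discount"])
        except (TypeError, ValueError):
            record["discount"] = None
        # Rule: load unknown product codes anyway, but report them to users.
        if record["product_code"] not in VALID_PRODUCT_CODES:
            notifications.append(f"unknown product code {record['product_code']}")
        return record

    if __name__ == "__main__":
        rejects, notes = [], []
        cleansed = apply_quality_rules(
            {"amount": "12.5", "discount": "n/a", "product_code": "Z999"}, rejects, notes)
        print(cleansed, rejects, notes)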

Schema-Level Testing

  • Verify that data transformation from source to destination works as expected.
  • Verify that the expected data is added to the target system.
  • Verify that all database fields and field data are loaded without any truncation.
  • Verify that checksums or record counts match between source and target.
  • Verify that proper error logs with full details are generated for rejected data.
  • Verify the handling of NULL-value fields.
  • Verify that duplicate data is not loaded.
  • Verify data integrity in the target.
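
Several of these checks, particularly field-level data types and truncation, can be caught by comparing the deployed schema with the design specification. The sketch below uses SQLite's PRAGMA table_info to read actual column names and declared types; other databases expose the same information through information_schema. The expected schema shown is a placeholder for the real data model.

    import sqlite3

    # Placeholder for the schema agreed in the design / data model.
    EXPECTED_SCHEMA = {
        "customer_dim": {"customer_key": "INTEGER", "customer_id": "TEXT",
                         "customer_name": "TEXT", "created_date": "TEXT"},
    }

    def check_schema(conn):
        """Compare actual column names and declared types against the design spec."""
        for table, expected_cols in EXPECTED_SCHEMA.items():
            actual_cols = {row[1]: row[2]  # PRAGMA table_info rows: (cid, name, type, ...)
                           for row in conn.execute(f"PRAGMA table_info({table})")}
            assert actual_cols == expected_cols, (
                f"{table}: schema drift detected; expected {expected_cols}, got {actual_cols}")

    if __name__ == "__main__":
        check_schema(sqlite3.connect("warehouse.db"))  # placeholder connection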
