The purpose of Metadata Testing is to verify that the table definitions conform to the data model and application design specifications.

Data Type Check
Verify that the table and column data type definitions match the data model design specifications. Example: the data model column data type is NUMBER but the database column data type is STRING (or VARCHAR).

Data Length Check
Verify that the length of database columns matches the data model design specifications. Example: the data model specification for the 'firstname' column is a length of 100, but the corresponding database table column is only 80 characters long.
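The sketch below shows one way to automate data type and length checks, assuming an information_schema-compliant database and a hypothetical data_model_spec table that holds the expected metadata from the design document:

    -- Columns whose type or length differs from the data model specification (data_model_spec is hypothetical).
    SELECT s.table_name, s.column_name,
           s.expected_data_type, c.data_type,
           s.expected_length, c.character_maximum_length
    FROM   data_model_spec s
    JOIN   information_schema.columns c
      ON   c.table_name  = s.table_name
     AND   c.column_name = s.column_name
    WHERE  c.data_type <> s.expected_data_type
       OR  c.character_maximum_length <> s.expected_length;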
Index / Constraint Check
Verify that proper constraints and indexes are defined on the database tables as per the design specifications. Verify that the columns that cannot be null have the 'NOT NULL' constraint. Verify that the unique key and foreign key columns are indexed as per the requirement. Verify that the table was named according to the table naming convention. Example 1: a column was defined as 'NOT NULL' but it can be optional as per the design. Example 2: foreign key constraints were not defined on the database table, resulting in orphan records in the child table.

Metadata Naming Standards Check
Verify that the names of the database metadata such as tables, columns, and indexes follow the naming standards. Example: the naming standard for fact tables is to end with an 'F', but some of the fact table names end with 'FACT'.
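As a minimal sketch of a nullability check, again assuming an information_schema-compliant database; the table and column names are illustrative:

    -- Columns the design marks as mandatory that are still nullable in the database.
    SELECT c.table_name, c.column_name, c.is_nullable
    FROM   information_schema.columns c
    WHERE  c.table_name  = 'CUSTOMER_DIM'
      AND  c.column_name IN ('CUSTOMER_ID', 'FIRSTNAME')
      AND  c.is_nullable = 'YES';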
Metadata Check Across Environments
Compare table and column metadata across environments to ensure that changes have been migrated appropriately. Example: a new column added to the SALES fact table was not migrated from the Development to the Test environment, resulting in ETL failures.

Automate metadata testing with ETL Validator
ETL Validator comes with a Metadata Compare Wizard for automatically capturing and comparing table metadata, and for tracking changes to table metadata over a period of time. This helps ensure that the QA and development teams are aware of the changes to table metadata in both Source and Target systems.
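A cross-environment metadata comparison can be sketched in Oracle as follows, assuming a database link named test_env to the Test environment (the link name and schema are illustrative):

    -- Columns present (or defined differently) in Development but not in Test.
    SELECT table_name, column_name, data_type, data_length
    FROM   all_tab_columns
    WHERE  owner = 'SALES_DW'
    MINUS
    SELECT table_name, column_name, data_type, data_length
    FROM   all_tab_columns@test_env
    WHERE  owner = 'SALES_DW';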
Compare table metadata across environments to ensure that metadata changes have been migrated properly to the test and production environments. Compare column data types between source and target environments. Validate reference data between spreadsheets and databases, or across environments. The purpose of Data Completeness tests is to verify that all the expected data is loaded into the target from the source. Some of the tests that can be run are: compare and validate counts, aggregates (min, max, sum, avg) and actual data between the source and target.
Record Count Validation
Compare the count of records in the primary source table and the target table, and check for any rejected records. Example: a simple count of records comparison between the source and target tables.
Source Query: SELECT count(1) srccount FROM customer
Target Query: SELECT count(1) tgtcount FROM customer_dim

Column Data Profile Validation
Column or attribute level data profiling is an effective way to compare source and target data without actually comparing the entire data set. It is similar to comparing the checksum of your source and target data.

The purpose of Data Quality tests is to verify the accuracy of the data. Data profiling is used to identify data quality issues, and the ETL is designed to fix or handle these issues. However, source data keeps changing and new data quality issues may be discovered even after the ETL is in production. Automating the data quality checks in the source and target system is an important aspect of ETL execution and testing.
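A minimal sketch of the column data profile validation described above, using aggregates on an assumed numeric column annual_revenue; the source and target profiles should match:

    -- Source column profile
    SELECT count(1) AS row_cnt, count(annual_revenue) AS not_null_cnt,
           min(annual_revenue) AS min_val, max(annual_revenue) AS max_val,
           sum(annual_revenue) AS sum_val, avg(annual_revenue) AS avg_val
    FROM   customer;

    -- Target column profile (compare with the source profile)
    SELECT count(1) AS row_cnt, count(annual_revenue) AS not_null_cnt,
           min(annual_revenue) AS min_val, max(annual_revenue) AS max_val,
           sum(annual_revenue) AS sum_val, avg(annual_revenue) AS avg_val
    FROM   customer_dim;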
Duplicate Data Checks
Look for duplicate rows with the same unique key column, or a unique combination of columns, as per the business requirement. Example: the business requirement says that a combination of First Name, Last Name, Middle Name and Date of Birth should be unique.
Sample query to identify duplicates:
SELECT fstname, lstname, midname, dateofbirth, count(1)
FROM Customer
GROUP BY fstname, lstname, midname, dateofbirth
HAVING count(1) > 1

Data Validation Rules
Many database fields can contain a range of values that cannot be enumerated. However, there are reasonable constraints or rules that can be applied to detect situations where the data is clearly wrong. Instances of fields containing values that violate the defined validation rules represent a quality gap that can impact ETL processing. Example: date of birth (DOB). This is defined as the DATE datatype and can assume any valid date.
However, a DOB in the future, or more than 100 years in the past, is probably invalid. Also, the date of birth of a child should not be earlier than that of their parents.

Data Integrity Checks
This measurement addresses 'keyed' relationships of entities within a domain. The goal of these checks is to identify orphan records in the child entity with a foreign key to the parent entity.
Count of records with null foreign key values in the child table. Count of invalid foreign key values in the child table that do not have a corresponding primary key in the parent table. Example: in a data warehouse scenario, fact tables have foreign keys to the dimension tables. If an ETL process does a full refresh of the dimension tables while the fact table is not refreshed, the surrogate foreign keys in the fact table are no longer valid (a sketch of this orphan check is shown below, after the transformation overview).

Data is transformed during the ETL process so that it can be consumed by applications on the target system. Transformed data is generally important for the target systems, and hence it is important to test the transformations.
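The orphan-record check referenced above, as a minimal sketch; the sales_fact and customer_dim table and column names are assumptions:

    -- Count fact rows with a null foreign key and fact rows whose key has no matching dimension row.
    SELECT sum(CASE WHEN f.customer_key IS NULL THEN 1 ELSE 0 END) AS null_fk_cnt,
           sum(CASE WHEN f.customer_key IS NOT NULL AND d.customer_key IS NULL THEN 1 ELSE 0 END) AS orphan_fk_cnt
    FROM   sales_fact f
    LEFT JOIN customer_dim d ON f.customer_key = d.customer_key;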
There are two approaches for testing transformations – white box testing and black box testing.

Transformation testing using the White Box approach
White box testing is a testing technique that examines the program structure and derives test data from the program logic/code. For transformation testing, this involves reviewing the transformation logic in the mapping design document and the ETL code to come up with test cases. The steps to be followed are listed below:
1. Review the source-to-target mapping design document to understand the transformation design.
2. Apply transformations on the data using SQL or a procedural language such as PL/SQL to reflect the ETL transformation logic.
3. Compare the results of the transformed test data with the data in the target table.
The advantage of this approach is that the test can be rerun easily on a larger set of source data.

The goal of ETL Regression testing is to verify that the ETL produces the same output for a given input before and after a change. Any differences need to be reviewed to determine whether they are expected as per the change.

Changes to Metadata
Track changes to table metadata in the Source and Target environments.
Often, changes to source and target system metadata are not communicated to the QA and Development teams, resulting in ETL and application failures. This check is important from a regression testing standpoint. Example 1: the length of a comments column in the source database was increased, but the ETL development team was not notified. Data started getting truncated in the production data warehouse for the comments column after this change was deployed in the source system.
Example 2: one of the indexes in the data warehouse was dropped accidentally, which resulted in performance issues in reports.

Automated ETL Testing
Automating the ETL testing is the key to regression testing of the ETL, particularly in an agile development environment.
Organizing test cases into test plans (or test suites) and executing them automatically as and when needed can reduce the time and effort needed to perform the regression testing. Automating ETL testing can also eliminate any human errors while performing manual checks.

Regression testing by baselining target data
Often testers need to regression test an existing ETL mapping with a number of transformations.
It may not be practical to perform end-to-end transformation testing in such cases, given the time and resource constraints. Instead, the target data can be baselined before the ETL change and compared with the target data produced after the change. (A related check: many database fields can only contain a limited set of enumerated values, which can be validated against the allowed list.)

The ETL process is generally designed to run in a Full mode or an Incremental mode. When running in Full mode, the ETL process truncates the target tables and reloads all (or most) of the data from the source systems. Incremental ETL only loads the data that changed in the source system, using some kind of change capture mechanism to identify the changes. Incremental ETL is essential for reducing ETL run times, and it is the method most often used for updating data on a regular basis.
The purpose of Incremental ETL testing is to verify that updates on the sources are getting loaded into the target system properly. While most of the data completeness and data transformation tests are relevant for incremental ETL testing, there are a few additional tests that apply. To start with, setting up test data for updates and inserts is key for testing incremental ETL.

Duplicate Data Checks
When a source record is updated, the incremental ETL should be able to look up the existing record in the target table and update it. If not, this can result in duplicates in the target table. Example: the business requirement says that a combination of First Name, Last Name, Middle Name and Date of Birth should be unique.
Sample query to identify duplicates:
SELECT fstname, lstname, midname, dateofbirth, count(1)
FROM Customer
GROUP BY fstname, lstname, midname, dateofbirth
HAVING count(1) > 1

Compare Data Values
Verify that the changed data values in the source are reflected correctly in the target data. Typically, the records updated by an ETL process are stamped with a run ID or the date of the ETL run. This date can be used to identify the newly updated or inserted records in the target system.
Alternatively, all the records that got updated in the last few days in the source and target can be compared, based on the incremental ETL run frequency. Example: write a source query that matches the data in the target table after transformation.
Source Query: SELECT fstname || ',' || lstname FROM Customer WHERE updated_dt > sysdate-7
Target Query: SELECT fullname FROM Customer_dim WHERE updated_dt > sysdate-7

Data Denormalization Checks
Denormalization of data is quite common in a data warehouse environment. Source data is denormalized in the ETL so that report performance can be improved. However, the denormalized values can become stale if the ETL process is not designed to update them based on changes in the source data. Example: the Customer dimension in the data warehouse is denormalized to hold the latest customer address data. However, the incremental ETL for the Customer dimension was not designed to update the latest address data when the customer updates their address, because it was only designed to handle change capture on the Customer source table and not the Customer_address table.

Once the data is transformed and loaded into the target by the ETL process, it is consumed by another application or process in the target system. For data warehouse projects, the consuming application is a BI tool such as OBIEE, Business Objects, Cognos or SSRS. For a data migration project, data is extracted from a legacy application and loaded into a new application.
In a data integration project, data is being shared between two different applications, usually on a regular basis. The goal of ETL integration testing is to perform end-to-end testing of the data in the ETL process and the consuming application.

End-to-End Data Testing
Integration testing of the ETL process and the related applications involves the following steps:
1. Set up test data in the source system.
2. Execute the ETL process to load the test data into the target.
3. View or process the data in the target system.
4. Validate the data and the application functionality that uses the data.
Example: let's consider a data warehouse scenario for Case Management analytics using OBIEE as the BI tool. An executive report shows the number of Cases by Case type in OBIEE. However, during testing, when the number of cases was compared between the source, the target (data warehouse) and the OBIEE report, it was found that each of them showed different values. As part of this testing it is important to identify the key measures or data values that can be compared across the source, target and consuming application.

Automate integrated ETL testing using ETL Validator
ETL Validator comes with a Component Test Case that supports comparing an OBIEE report (logical query) with the database queries from the source and target. Using the component test case, the data in the OBIEE report can be compared with the data from the source and target databases, thus identifying issues in the ETL process as well as in the OBIEE report.
Performance of the ETL process is one of the key issues in any ETL project. Often, development environments do not have enough source data for performance testing of the ETL process. This could be because the project has just started and the source system only has a small amount of test data, or because production data contains PII which cannot be loaded into the test database without scrubbing. The ETL process can behave differently with different volumes of data. Example 1: a lookup might perform well when the data is small, but might become a bottleneck that slows down the ETL task when there is a large volume of data.
What can make it worse is that the ETL task may be running by itself for hours, causing the entire ETL process to run much longer than the expected SLA. Example 2: an incremental ETL task was updating more records than it should. When the data volumes in the target table were low it performed well, but when the data volumes increased, the updates slowed down the incremental ETL tremendously.

Performance testing of the ETL process involves the following steps:
1. Estimate the expected data volumes in each of the source tables for the ETL for the next 1-3 years.
2. Set up test data for performance testing, either by generating sample data or by making a copy of the (scrubbed) production data.
3. Execute the Full ETL process to load the test data into the target.
4. Review the run times of each individual ETL task (workflow) and the order of execution of the ETL.
5. Revisit the ETL task dependencies and reorder the ETL tasks so that the tasks run in parallel as much as possible.
6. Set up test data for the incremental ETL process with the data change volumes expected during an incremental run.
7. Execute the incremental ETL.
8. Review the ETL task load times and the order of execution of the tasks to identify bottlenecks.
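Reviewing task run times is easier when the ETL tool writes to a run log; the query below is a sketch against a hypothetical etl_task_log table (the table and its columns are assumptions):

    -- Longest-running tasks in the most recent run.
    SELECT task_name, start_time, end_time,
           (end_time - start_time) AS run_duration
    FROM   etl_task_log
    WHERE  run_id = (SELECT max(run_id) FROM etl_task_log)
    ORDER BY run_duration DESC;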
Types of Testing
Application Testing: The focus of this article is data-centric testing, so we will not discuss application testing here.
Data-Centric Testing: Data-centric testing revolves around testing the quality of the data.
The objective of data-centric testing is to ensure that valid and correct data is in the system. Following are a couple of scenarios that create the need for data-centric testing:
ETL Processes/Data Movement: when you apply ETL processes on a source database and transform and load the data into the target database.
System Migration/Upgrade: when you migrate your database from one database to another, or you upgrade an existing system on which the database is currently running.

Data-Centric Testing
Data-centric testing validates data using the following approaches:
Technical Testing: Technical testing ensures that the data is moved, copied, or loaded from the source system to the target system correctly and completely. Technical testing is performed by comparing the target data against the source data. Following is a list of functions that can be performed under technical testing:
Checksum Comparison: Data-centric testing makes use of a checksum approach to discover errors. A checksum can be computed on the source and target databases in any number of ways, such as counting the number of rows or adding up the data of a column. Later, the checksum calculated on the source database is compared against the checksum calculated on the target database. For example, a row count compares the number of rows in the target database with the number of corresponding rows in the source database. Or the target database may contain the summarized annual data for monthly salaries in the source database.
So the target database should contain the sum of the monthly salaries paid within a year, for each year.
Domain Comparison: The domain list in the target database is compared against the corresponding domain list in the source database. For example, the source system has 100 employees and the target system also has 100 employees, but that does not guarantee that the employees in the source and target domain lists are the same unless they are compared.
Business Rule Validation: For example, the values of certain fields such as salary and commission cannot be less than zero. These types of errors can occur because of data manipulation between the source and target systems. Unless you test the data for any such errors, the quality of the data is not guaranteed. For every business, the list of business requirements is different. An exhaustive list of business rules against which the data is compared ensures high quality data.
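A sketch of such a business rule check; the table and column names are assumptions, and the query should return no rows when the data is clean:

    -- Rows violating the rule that salary and commission cannot be negative.
    SELECT emp_id, salary, commission
    FROM   employee_salary
    WHERE  salary < 0
       OR  commission < 0;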
Reconciliation: Reconciliation ensures that the data in the target system is in agreement with the overall system requirements. Following are a couple of examples of how reconciliation helps in achieving high quality data:
Internal reconciliation: In this type of reconciliation, the data is compared within the system against the corresponding data set. For example, shipments should always be less than or equal to the orders.
If the shipments ever exceed the orders, then the data is invalid.
External reconciliation: In this type of reconciliation, data in the system is compared against its counterpart in other systems. For example, in a module or an application, the number of employees can never be more than the number of employees in the HR Employee database.
This is because the HR Employee database is the master database that keeps a record of all the employees. If a situation occurs where the number of employees anywhere in the system is more than in the HR Employee database, then the data is invalid.
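A minimal sketch of the external reconciliation described above, assuming a payroll_employee table and the hr_employee master table (names are illustrative); the query should return no rows:

    -- Employees present in the payroll module but missing from the HR master database.
    SELECT p.emp_id
    FROM   payroll_employee p
    LEFT JOIN hr_employee h ON h.emp_id = p.emp_id
    WHERE  h.emp_id IS NULL;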
When you run 'SELECT * FROM tab;' you will sometimes be surprised to see some tables with garbage names.
Welcome to the world of the Oracle Recycle Bin feature. Because of this feature, Oracle saves dropped tables in the recycle bin until you clear it.
1. To empty the recycle bin, use the command: PURGE RECYCLEBIN;
2. To drop a table without storing it in the recycle bin, use: DROP TABLE employee PURGE;
3. To restore a table from the recycle bin, use: FLASHBACK TABLE employee TO BEFORE DROP;
So don't forget to clean your database/schema once in a while.

Pitfalls of Type II dimension
The Type II dimension has been popularized by R. Kimball. It has become so popular that in any interview related to data warehousing, the interviewer will surely ask you to explain the concept. And chances are that if you don't know, they will laugh at your ignorance and reject you. Here's your chance to laugh at them. If you read this article, you will probably end up knowing something more than them. This is not because you will find the definition of the Type II dimension here, but for an entirely different reason. To be continued.
To clearly explain the pitfalls of the Type II dimension, let's take an example.
In the example, there are three tables: DIM_Instrument, FACT_Trade, and FACT_Settlement. Each of the tables contains data as shown below.
DIM_Instrument table: In the DIM_Instrument table, the property of instrument IBM changes from X to Y. So, to maintain the Type II dimension, a new entry is added to the table by updating the status of the current entry to obsolete (denoted by 'O') and adding a date in the TO_DT column as well. In the new entry, the TO_DT column is NULL and the status is current (denoted by 'C').
FACT_Trade table: The trade table contains information about just one trade, which was executed on April 29th, 2011. This means it was processed with InstrumentKey '1', as InstrumentKey '2' did not exist on April 29th, 2011.
FACT_Settlement table: Generally, it takes three days for a trade to settle. So the trade that was executed on 29th April 2011 got settled only on 2-MAY-2011.
Sai, I'm glad you mentioned this. If we look carefully, this article tried to bring to notice the pitfalls of Type II dimensions.
As mentioned earlier, Type II requires the primary key to change for the same identity in order to maintain history. Once the primary key changes, we can very well imagine what kind of results it can produce. As far as the solution is concerned, it can be implemented the way you want. It should not be assumed that there is no solution to the issues reported here. In fact, the solution for this has to be implemented at the data model level. Sometimes the problem and the solution are so closely woven that we prefer not to look at them separately.
Design C. Can there be a compromise?
How about using from date (time) – to date (time)? The report writer can simply provide a date (time) and straight SQL can return the value/row that was valid at that moment. However, the ETL is indeed as complex as in the A model, because while the current row will run from the current date to infinity, the previous row has to be retired from its from date to today's date minus 1. This kind of ETL coding also creates lots of testing issues, as you want to make sure that for any given date and time only one instance of the row exists (for the primary key).

ETL delta logic & de-normalization of the data model
It is a normal practice in data warehousing to de-normalize (or, as it once got auto-corrected, 'demoralize') the data model for performance. I am not going to discuss the benefits vs. issues of de-normalization, as by the time it comes to the ETL guy the fate of the model is already decided. Let's look at the model on the source side, which is perfectly normalized. Now let's look at the de-normalized model on the target side. Next, let's think about the delta logic for loading the dim_employee table. Ideally you would only check for changes in the employee table.
Then, if there are any changes after the last load date-time, you would get those rows from ref_employee, do the lookup to get the department and the designation, and load them into the target table. The issue with this delta logic is that it has not considered the effect of denormalization of the employee table on the target side. If you look carefully at the two denormalized attributes, dept_name and emp_designation_desc, the ETL process will miss any changes in the parent tables: only new or updated employees will get the new definition of department and designation, and any employee that has not been updated on the source side will still have the old dept_name and emp_designation_desc. This is wrong. The reason it is wrong is that the ETL delta logic only picked a row from the employee table when that row changed, and ignored changes in the dept and designation tables. The truth of the matter is: "For any denormalized target table, the affected rows should be re-captured from the source any time there is a change in the driving/core table, as well as when there is a change in any parent table to which the driving table refers." In this case, even if there is a change in the department or designation table, all the affected rows in the employee table should be re-processed. It might seem very simple, but ETL developers/designers/modelers always miss this point. Also, once developed, it is very difficult to catch. The next question is how you would catch the affected rows. Well, there are ways to write SQL that combines the three tables (in this case), treats them as one single entity, and then pulls rows based on any updated_dttm greater than the last ETL run. Figure out the SQL.
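One possible shape of that SQL, as a sketch; the table and column names (department, designation, updated_dttm, and the :last_etl_run_dttm bind variable) are assumptions based on the example above:

    -- Employees affected by a change in the employee table OR in any parent table it refers to.
    SELECT e.emp_id
    FROM   employee    e
    JOIN   department  d ON d.dept_id        = e.dept_id
    JOIN   designation g ON g.designation_id = e.designation_id
    WHERE  e.updated_dttm > :last_etl_run_dttm
       OR  d.updated_dttm > :last_etl_run_dttm
       OR  g.updated_dttm > :last_etl_run_dttm;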
Types of data elements and entities (tables) for ETL
It is important for an ETL developer to understand the types of tables and data in order to intelligently design ETL processes. Once the common types of objects are understood, reusable templates for ETL can be developed regardless of business logic. This will greatly improve the efficiency of an ETL developer.
1. Reference data
2. Dimensional data (master data)
3. Transactional data
4. Out triggers/mini dimensions
10. Log tables
11. Meta data tables
12. Security tables
13. Configuration tables

Simulating Oracle Sequences in Sybase & SQL Server
Programmatic control is lost when identity columns are used in Sybase and SQL Server. I do not recommend using identity columns to create surrogate keys during the ETL process; there are many more reasons for that. Oracle has the sequence feature, which is used extensively by Oracle programmers.
I have no clue why other vendors do not provide the same. This custom code has been used extensively by me and thoroughly tested: I ran multiple processes simultaneously to check for deadlocks and also made sure that the process returns different sequences to different client processes.
Notes:
1. The table should have 'ROW LEVEL LOCKING'.
2. The sequence generator process is stateless (see more details on object-oriented programming).
3. Create one row for each target table in the sequence master table. Do not try to use one sequence for multiple tables.
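The custom code itself is not reproduced here; the sketch below shows the general idea in SQL Server-style T-SQL, assuming a sequence master table named seq_master (all names are illustrative). The single UPDATE both increments and reads the counter, so concurrent callers get different values:

    -- One row per target table.
    CREATE TABLE seq_master (
        table_name VARCHAR(64) NOT NULL PRIMARY KEY,
        next_seq   INT         NOT NULL
    );
    INSERT INTO seq_master (table_name, next_seq) VALUES ('CUSTOMER_DIM', 0);

    -- Reserve the next sequence value for CUSTOMER_DIM in a single atomic statement.
    DECLARE @next_val INT;
    UPDATE seq_master
    SET    @next_val = next_seq = next_seq + 1
    WHERE  table_name = 'CUSTOMER_DIM';
    SELECT @next_val AS next_seq;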
Data staging table / area design
This could be a long topic of discussion. Following are the main issues I would like to discuss on staging table / database design.
1. Why is a staging area needed?
Unlike OLTP systems, which create their own data through a user interface, data warehouses source their data from other systems, and there is physical data movement from the source database to the data warehouse database. The staging area is primarily designed to serve as an intermediate resting place for data before it is processed and integrated into the target data warehouse. This staging area serves many purposes above and beyond the primary function:
a. The data is most consistent with the source.
It is devoid of any transformation, or has only minor format changes.
b. The staging area in a relational database can be read, scanned, and queried using SQL without the need to log into the source system or read files (text/xml/binary).
c. It is a prime location for validating data quality from the source, and for auditing and tracking down data issues.
d. The staging area acts as a repository for historical data, if it is not truncated.

2. What is the difference between the staging area and other areas of the data warehouse?
a. Normally, tables in a relational database are relational.
Normally tables are not stand-alone; tables have a relationship with at least one or more other tables. But the staging area greatly differs in this aspect: the tables are random in nature. They are more batch oriented. They are staged in the hope that in the next phase of the load there will be a process that identifies the relationships with other tables, and during such a load the relationships will be established.
3. What should the staging table look like?
a. The key shown is a meaningless surrogate key, but it has still been added. The reason is that many times the data coming from a source has no unique identifier, or the unique identifier is a composite key; in such cases, when a data issue is found with any of the rows it is very difficult to identify, or even refer to, the particular row.
When a unique row number is assigned to each row in the staging table, it becomes really easy to reference it.
b. Various dates have been added to the table; please refer to the date discussion.
c. The data type has been kept as string because this ensures that a row with a bad format or wrong data type will at least be populated in the stage table for further analysis or follow-up.
d. A source system column has been added to keep a data reference, so that the next process step can use this value and can have dynamic behavior based on the source system. It also supports reuse of the table, data partitioning, etc.
e. Note that the table has the source as a table qualifier prefix; this distinguishes the table from other source systems, for example a customer table from another system called MKT.
f. Other columns can be added, for example a processed flag to indicate whether the row has been processed by the downstream application. It also provides incremental restart abilities for the downstream process.
An exception flag can also be added to the table to indicate that an exception or error was raised while processing the row, and hence the row was not processed.

4. Which design to choose?
a. Should the table be truncated and loaded?
b. Should the table be append only?
c. Should the default data type be left as an alphanumeric string (VARCHAR)?
d. Should constraints be enforced?
e. Should there be a primary key?
It is normally based on the situation, but if you are not sure, or you don't want to think about it, then the design suggested here should more than suffice for your requirement.
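A minimal sketch of a staging table following the points above; the prefix, column names, data types and the MKT source system are illustrative (Oracle-style DDL):

    -- Staging table for customer data from the MKT source system.
    CREATE TABLE mkt_customer_stg (
        stg_row_num    NUMBER(18)    NOT NULL,  -- meaningless surrogate key, used to reference rows
        cust_code      VARCHAR2(100),           -- all source attributes kept as strings
        cust_name      VARCHAR2(200),
        cust_dob       VARCHAR2(50),            -- bad dates still land here for analysis/follow-up
        source_system  VARCHAR2(30),            -- e.g. 'MKT'; supports reuse and partitioning
        extract_dt     DATE,                    -- see the date discussion
        load_dt        DATE,
        processed_flag VARCHAR2(1),             -- supports incremental restart of the downstream process
        exception_flag VARCHAR2(1)              -- set when processing raised an error
    );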
Slow Running ETL process (read side) & Table Statistics (Oracle)
Sometimes an ETL process runs at a considerably slow speed. During tests on a small result set it might fly, but when a million rows are applied, the performance takes a nosedive. There can be many reasons for a slow ETL process. The process can be slow because of the read, the transformation, or the load.
Let's eliminate the transformation and the load for the sake of this discussion. For an ETL process to be slow on the read side, here are some reasons:
1. No indexes on joins and/or the 'where clause'
2. Badly written query
3. Source not analyzed
Out of these three, let's rule out 1 and 2. In the past, most databases had the RULE-based optimizer set in the INIT.ORA file, but with new development, and especially data warehouses, the CHOOSE optimizer is preferred. With the CHOOSE option the query uses the COST-based optimizer if statistics are available for the tables in the query. There are two methods to gather statistics: 1. the DBMS_STATS package, and 2. the ANALYZE command.
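For example, gathering statistics with DBMS_STATS might look like the following; the schema and table names are illustrative:

    -- Gather table, column and (with cascade) index statistics for one table.
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname => 'SALES_DW',
        tabname => 'CUSTOMER_DIM',
        cascade => TRUE);
    END;
    /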
Data integration basics - Dimension Conformation
The purpose of this topic is to establish the basics for the design of ETL processes. Without this understanding, an ETL process for data integration cannot be designed or developed. A database can have multiple sources. Multiple sources may contain data sets of entirely different subject areas, but some data sets will intersect. For example, sales data and salary data will have employee as the common set. Between two or more sources, the subjects, entities or even attributes can be common.
So can we integrate the data easily? Mathematically it seems very easy, but the real world is not just about numbers or exactly matching string values. Everything can be similar or the same, but it may not be represented in exactly the same manner in all the sources. The differences in representation of the same information and facts between two or more sources can create some of the most interesting challenges in data integration.

Data integration
The first step in data integration is identification of the common elements:
1. Identify the common entities. For example, the Employee dimension can come from the Sales system, the Payroll system, etc. Products can come from manufacturing, sales, purchasing, etc. Once the common entity is identified, its definition should be standardized. For example, does employee include full-time employees as well as temporary workers?
2. Identify the common attributes. What are the attributes that are common to employee: first name, second name, last name, date of joining, etc.? Each attribute should be defined.
3. Identify the common values. The same information can be represented in different forms in multiple source systems. For example, male sex can be represented as 'M' or '1' or 'male' or something else by each source system. A common representation must be decided (for example, 'Male').
Also, if necessary, a finite set of values should be established. For example, employee sex = ('Male', 'Female'), and no more than these two values will be allowed. The second step is the identification of the Data Steward who will own the responsibility for a particular set of data elements. The third step is to design an ETL process to integrate the data into the target. This is the most important area in the implementation of an ETL process for data integration; this topic will be discussed in more detail under its own heading. The final, fourth step is to establish a process of maintenance, review & reporting of such elements.

For every rule there is an exception; for each exception there are more exceptions
To implement an ETL process there are many steps that are followed.
One such step is creating a mapping document. This mapping document describes the data mapping between the source systems and the target, and the rules of data transformation. For example: the table/column map between source and target, rules to identify unique rows, not-null attributes, unique values, the range of an attribute, transformation rules, etc. Without going into further details of the document, let's analyze the very next step. It seems obvious and natural to start development of the ETL process. The ETL developer is all fired up, comes up with a design document and starts developing, and in a few days' time the code is ready for data loading. But unexpectedly (?) the code starts having issues every few days. Issues are found and fixed.
And then it fails again. What's happening? Analysis was done properly; rules were chalked out and implemented according to the mapping document. But why are issues popping up? Was something missed? Maybe not! Isn't it normal to have more issues in the initial lifetime of a process? Maybe yes!
You have surely missed 'Source System Data Profiling'. The business analyst has told you the rules of how the data is structured in the source system and how it is supposed to behave, but he/she has not told you the 'buts and ifs', called EXCEPTIONS, to those rules. To be realistic, it is not possible for anyone to just read you all the rules and exceptions like a parrot. You have to collaborate and dig out the truth. The actual choice is yours: do data profiling on the source system and try to break all the rules told by the analyst, or choose to wait for the process to go live and then wake up every night as the load fails. If you are lucky enough, you also get to deal with an unhappy user every morning you go to the office. Make the right choice; don't miss 'source system data profiling' before actually writing a single line of code. Question every rule.
Try to find exceptions to the rules. There must be at least 20 tables.
One table on average will have 30 columns; each column will have on average 100k values. If you make a matrix of the number of tables x columns x data values, it will give you the number of ways your assumptions may be wrong. It is like unit testing the source data even before loading it. There is a reason why machines alone cannot do your job; there is a reason why IT jobs pay more. Remember, 'for every rule there is an exception; for each exception there are more exceptions'.

ETL strategy to store data validation rules
Every time there is movement of data, the results have to be tested against the expected results.
For every ETL process, test conditions for testing data are defined before/during the design and development phase itself. Some that are missed can be added later on. Various test conditions are used to validate data when the ETL process is migrated from DEV to QA to PRD. These test conditions may exist only in the developer's/tester's mind, or be documented in Word or Excel. With time, the test conditions get lost, ignored, or scattered all around, too much to be really useful. In production, if the ETL process runs successfully without error, that is a good thing, but it does not really mean anything. You still need rules to validate the data processed by the ETL.
At this point you need data validation rules again! A better ETL strategy is to store the ETL business rules in a RULES table, by target table and source system. These rules can be stored as SQL text.
This will create a repository of all the rules in a single location, which can be called by any ETL process or auditor at any phase of the project life cycle. There is also no need to re-write or rethink the rules.
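A minimal sketch of such a rules repository; the table and column names are illustrative, and each rule is stored as SQL text that returns the offending rows (zero rows means the rule passes):

    CREATE TABLE etl_validation_rules (
        rule_id       NUMBER(10)     NOT NULL,
        target_table  VARCHAR2(64)   NOT NULL,
        source_system VARCHAR2(30)   NOT NULL,
        rule_desc     VARCHAR2(400),
        rule_sql      VARCHAR2(4000) NOT NULL   -- query that returns rows violating the rule
    );

    INSERT INTO etl_validation_rules (rule_id, target_table, source_system, rule_desc, rule_sql)
    VALUES (1, 'CUSTOMER_DIM', 'CRM', 'Date of birth cannot be in the future',
            'SELECT cust_id FROM customer_dim WHERE date_of_birth > sysdate');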