Automated completeness check
Implement an automated completeness check that produces a list of potentially missing or duplicate resources. The raw data in the source and target databases should be compared using timestamps. Timestamps are already used in the SQL statements to enable time filtering; to keep things simple, these same timestamps should be reused.
The DQA-Tool works with a two-column architecture. To ensure that the existing code is not affected by the added timestamp column, all new code should run directly after the SQL queries. Once the new completeness check has been executed, the timestamp column should be discarded.
So the following steps need to be done:
- adjust the SQL queries so that they return a timestamp column
- implement the completeness check right after the database query
- the completeness check should group the data by the timestamp
- then count the items in each group
- calculate differences in the count between the source and target database
- if a difference is detected, write the raw data associated with this timestamp to a .csv
- add a result tag to the rv dataframe. This result tag should include the total number of identified timestamps, the total number of identified resources, and a list of the inconsistent timestamps
- delete the timestamp column from the dataframe
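
The steps above could be sketched roughly as follows. This is only a language-neutral illustration in Python using the standard library, not the DQA-Tool's actual code. The function name `completeness_check`, the `(key, value, timestamp)` row layout, the CSV columns, and the keys of the result dictionary are all hypothetical. It also assumes that "number of identified resources" means the sum of the absolute count differences per timestamp, which would need to be confirmed.

```python
import csv
from collections import Counter


def completeness_check(source_rows, target_rows, csv_path=None):
    """Compare per-timestamp row counts between source and target.

    Each row is assumed to be a (key, value, timestamp) tuple
    (hypothetical layout: the two existing columns plus the timestamp).
    Returns (result, source_two_col, target_two_col).
    """
    # Group by timestamp and count the items in each group.
    source_counts = Counter(ts for _, _, ts in source_rows)
    target_counts = Counter(ts for _, _, ts in target_rows)

    # Difference in counts (source minus target); keep only non-zero ones.
    # Positive: potentially missing in target. Negative: potential duplicates.
    diffs = {}
    for ts in sorted(set(source_counts) | set(target_counts)):
        delta = source_counts[ts] - target_counts[ts]
        if delta != 0:
            diffs[ts] = delta

    # Write the raw data of every inconsistent timestamp to a .csv file.
    if csv_path is not None and diffs:
        with open(csv_path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["database", "key", "value", "timestamp"])
            for name, rows in (("source", source_rows), ("target", target_rows)):
                for row in rows:
                    if row[2] in diffs:
                        writer.writerow([name, *row])

    # Result tag (assumed structure) to be attached to the rv object.
    result = {
        "n_timestamps": len(diffs),
        "n_resources": sum(abs(d) for d in diffs.values()),
        "timestamps": list(diffs),
    }

    # Discard the timestamp column so downstream code sees two columns again.
    source_two_col = [(key, value) for key, value, _ in source_rows]
    target_two_col = [(key, value) for key, value, _ in target_rows]
    return result, source_two_col, target_two_col


# Small example: one resource missing and one duplicated in the target.
source = [
    ("id", "a", "2021-01-01"),
    ("id", "b", "2021-01-01"),
    ("id", "c", "2021-01-02"),
]
target = [
    ("id", "a", "2021-01-01"),
    ("id", "c", "2021-01-02"),
    ("id", "c", "2021-01-02"),
]
result, source_two_col, target_two_col = completeness_check(source, target)
```

In this example both timestamps are flagged: `2021-01-01` has one more row in the source (potentially missing in the target) and `2021-01-02` has one more row in the target (a potential duplicate). Counting per timestamp keeps the check cheap, but it cannot detect a missing and a duplicated resource that share the same timestamp, since their counts cancel out.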
FYI: @mangjn @lorenz.kapsner