One of the biggest challenges when testing in a project that involves personally identifiable information is the test data. It may be possible to use some data from the production environment and still respect GDPR, but in most cases, production data is not available for use by the testers.
Performance tests are most severely affected by this, as they need adequate volumes of representative data, exercised by the different functions of the system, in order to obtain valid measurements. As we will show in this article, a good test data strategy, along with an effective set of data generation techniques, can help to address these challenges.
Why is representative test data important?
As we have said, a major challenge in testing is obtaining a good test data set.
For functional testing, a lot of the effort is often in identifying and creating test data for the unusual “corner” cases, whereas when trying to reproduce a defect from production, the difficulty is knowing what aspect of the data set is needed to illustrate the problem, without having access to the production data set.
Non-functional testing complicates the situation further, especially when undertaking performance testing. The need is for a large volume of data which supports realistic scenarios that accurately simulate key user journeys, to determine if the application can handle the data volume while achieving the required performance. It is rarely possible to perform performance tests in the production environment as they are very disruptive. This means that the data sets in our test environments need to have very similar characteristics to the data found in production, so that we can perform representative tests.
In summary, without sufficient test data, many defects are not found until they appear in production, increasing both the disruption they cause and the cost of fixing them.
However, let us not forget that creating good quality, representative test data is a significant cost in the development and testing work, which might get overlooked given today’s emphasis on “shifting left” and performing testing activities earlier in the development lifecycle.
Metrics for test data relevance
The process of acquiring test data is not as easy as it often seems at first. There are generally two possibilities for procuring test data: gathering it from production or generating it ourselves.
A one-to-one copy of the data from the production environment seems like an obvious approach that ensures a representative data set to test against. However, a lot of production data is very sensitive (think of medical records, or financial data), so it is often simply not available for use in testing.
Therefore, we are left with the option of generating the data ourselves (so called “synthetic” test data). The immediate question that arises with this approach is how to identify a metric to determine how representative a synthetic test data set is compared to the corresponding production data. The first step towards such a measure is to gather some metrics from the production data set to try to characterise it.
Let’s take for example a database for a financial transaction processing system. Some of the things we could usefully measure without access to the actual data include:
- Number of transactions per time unit (hour/day/month/year) - whatever time frame is needed to make it relevant for the tests
- Number of users in the system, classified by the types of user
- Number of transactions per user in the time frame
- Distribution of the different types of transaction per time unit and user
- Average values of key transaction attributes (such as financial value) with a minimum, a maximum and a standard deviation
- Number of countries/currencies supported and the average transaction distribution per country/currency
Obviously, it would often be valuable to add other application specific measurements depending on the nature of the application and data set.
To determine how similar a generated data set is to production, run the same queries on both databases and compare the results. This gives a good indicator of whether your data is close to production, providing an overall metric for a “data resemblance factor”.
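As a sketch of how this comparison might be automated, the snippet below runs the same metric queries against a production extract and a synthetic database and computes a simple resemblance factor. The schema (a `transactions` table with `user_id` and `amount` columns), the particular queries, and the 10% tolerance are illustrative assumptions, not prescriptions:

```python
import sqlite3

# Illustrative metric queries; adapt these to your own schema.
METRIC_QUERIES = {
    "transaction_count": "SELECT COUNT(*) FROM transactions",
    "distinct_users": "SELECT COUNT(DISTINCT user_id) FROM transactions",
    "avg_amount": "SELECT AVG(amount) FROM transactions",
    "max_amount": "SELECT MAX(amount) FROM transactions",
}

def collect_metrics(db_path):
    """Run each metric query against one database and return the results."""
    with sqlite3.connect(db_path) as conn:
        return {name: conn.execute(sql).fetchone()[0]
                for name, sql in METRIC_QUERIES.items()}

def resemblance(prod_metrics, synth_metrics, tolerance=0.10):
    """Compare the two metric sets; a metric 'matches' when its relative
    difference is within the tolerance (an assumed 10% by default)."""
    report = {}
    for name, prod_value in prod_metrics.items():
        synth_value = synth_metrics[name]
        diff = abs(synth_value - prod_value) / prod_value if prod_value else 0.0
        report[name] = {"prod": prod_value, "synthetic": synth_value,
                        "relative_diff": diff, "match": diff <= tolerance}
    # Overall "data resemblance factor": fraction of metrics within tolerance
    factor = sum(m["match"] for m in report.values()) / len(report)
    return report, factor
```

In practice the same pattern works with any database driver; the important point is that identical queries are executed on both data sets so the numbers are directly comparable.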
Data protection techniques
In situations where we can use production data, we will have to process it to protect any personal data in the data set. The key techniques to consider include:
- Anonymisation – the use of randomisation and generalisation to replace the personally identifiable information (PII) with a realistic generated value, so that a record cannot be tied to any real-life person
- Tokenisation – a simpler process, where the sensitive data is again replaced with a placeholder value, but one that is more generic and does not necessarily preserve the format of the original
- Pseudo-anonymisation – a technique that uses a mapping table between the real PII data and the randomised data that replaces it, allowing the original data to be restored at some point if needed, but requiring that the mapping table is carefully protected
- Format-preserving encryption – encrypting the sensitive data in such a way that the format is preserved, so that the data is still relevant
- Synthetic data – entirely synthetic, generated data created for all of the fields in the data set, generated in such a way that the format is correct and data linkages between the tables are still valid
A diagram representing how each technique works is shown below:
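As a minimal sketch of the first three techniques, the code below applies anonymisation, tokenisation and pseudo-anonymisation to a name field. The field values, replacement name pool and token formats are invented for illustration; a real implementation would use a secured store for the pseudo-anonymisation mapping table rather than an in-memory dictionary:

```python
import random

# Hypothetical pool of replacement names used for anonymisation
FAKE_NAMES = ["Alex Smith", "Sam Jones", "Chris Taylor"]

def anonymise_name(name, rng=random):
    """Anonymisation: replace PII with a realistic generated value;
    the link to the real person is irreversibly lost."""
    return rng.choice(FAKE_NAMES)

def tokenise(value, _counter=[0]):
    """Tokenisation: replace with a generic placeholder that does not
    necessarily preserve the format of the original value."""
    _counter[0] += 1
    return f"TOKEN-{_counter[0]:06d}"

class PseudoAnonymiser:
    """Pseudo-anonymisation: keep a mapping table so the original value
    can be restored later; the table itself must be carefully protected."""
    def __init__(self):
        self._token_to_value = {}   # sensitive: must be stored securely
        self._value_to_token = {}

    def replace(self, value):
        if value not in self._value_to_token:
            token = f"P{len(self._token_to_value):06d}"
            self._token_to_value[token] = value
            self._value_to_token[value] = token
        return self._value_to_token[value]

    def restore(self, token):
        return self._token_to_value[token]
```

Format-preserving encryption is deliberately omitted from the sketch: it normally relies on dedicated algorithms (such as the NIST FF1 mode) rather than anything that should be hand-rolled.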
The most complex approach is using completely synthetic data, as this requires that the relationships between the data are created as well as the individual data values, and the data set must respect the data constraints and foreign keys that link the tables. Creating data generator tools to achieve this is quite complicated, and sometimes requires a lot of business logic to be implemented in the tool, which could make it infeasible for some situations.
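To make the relationship problem concrete, here is a sketch of a generator for a minimal two-table model: users (the parent table) and transactions (the child table, whose foreign key must always reference an existing user). The table names, user types, currencies and the log-normal amount distribution are all illustrative assumptions:

```python
import random

def generate_users(n, rng):
    # Parent table: user ids must exist before transactions reference them
    return [{"user_id": i, "type": rng.choice(["retail", "business"])}
            for i in range(1, n + 1)]

def generate_transactions(users, per_user_mean, rng):
    """Generate child rows whose user_id foreign key always points at an
    existing user, with an assumed log-normal amount distribution."""
    rows = []
    for user in users:
        # Vary the per-user volume so it is not uniform across users
        count = max(0, int(rng.gauss(per_user_mean, per_user_mean / 2)))
        for _ in range(count):
            rows.append({
                "user_id": user["user_id"],   # valid FK by construction
                "amount": round(rng.lognormvariate(3, 1), 2),
                "currency": rng.choice(["EUR", "USD", "GBP"]),
            })
    return rows

rng = random.Random(42)   # fixed seed for reproducible test data
users = generate_users(100, rng)
transactions = generate_transactions(users, per_user_mean=10, rng=rng)
```

Because child rows are generated from the parent rows, referential integrity holds by construction; in a real tool the distributions would be parameterised from the production metrics discussed earlier, and constraint checking would cover every foreign key in the schema.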
Test data generation strategy recommendations
While deciding what approach to use, the first thing that needs to be clarified is whether production data can be used or not, and if so, whether it contains personally identifiable information.
If you can use production data, then the amount required can be decided and a strategy for extracting it from production identified. Next the data protection techniques to use can be selected based on the protection level needed and the effort that the project can accommodate. Usually anonymisation and format-preserving encryption can provide a cost-effective option.
Choosing the most challenging approach, synthetic data generation, involves identifying the metrics to use for checking the match to the production data set, writing the SQL queries to derive those values from a data set, and implementing a tool that generates data which meets the needs of the testing process while accurately reflecting the characteristics of the production data set.
Make sure that you include the test data generation effort when planning the work, and create explicit stories so that it can be planned and scheduled. Without a good overview of what is needed to create the data, the overall testing process will be under pressure, resulting in bottlenecks during development. Also make sure that you are using the best approach for your situation to obtain relevant test data, and explain the effort needed to produce and manage it to your stakeholders.
PII and GDPR compliance in testing
The implementation of GDPR has given data an increased significance for both the customer and the data processor. GDPR Article 6 defines the lawful bases for data processing, including the need for consent, the performance of a contract acknowledged by the participants, and the legal obligations of the data processor, including safeguards for protecting the data. Articles 25 and 32 reinforce this, making security a mandatory activity when dealing with such data.
This means that using production data for testing purposes might violate GDPR if the customer did not give consent for such usage. Article 5 requires that data be collected fairly and transparently, and be adequate, accurate, limited and relevant, which obliges companies to take measures to ensure the integrity and confidentiality of the data collected.
In addition, Article 30 makes it clear that whenever a data processing event occurs, there should be a record of it.
The prevalence of PII in most databases, along with GDPR requirements, means that the use of production data by testers poses significant risks: they could unintentionally share data over unsafe communication channels, leading to data breaches. If a data breach does occur, it must be communicated to the owner of the data, as specified in Article 34 of the regulations.
The GDPR regulations allow extremely large fines to be imposed; we have already seen fines as large as $126 M (€100 M) in January 2020, and the sums could get even higher in the future. This is one of the reasons it is so important that we all understand the fundamentals of GDPR and the responsibilities it implies.
Generating test data is challenging in itself, and adding security constraints on top further increases the cost of delivering test quality. There are methods for complying with PII requirements that can be integrated into the overall process. Based on the data shape extracted from production, relevant new data can be generated and then quantified using defined metrics, increasing confidence in the testing process. Failing to produce relevant data can jeopardise the testing process, but failing to ensure PII/GDPR compliance can result in substantial fines and potential loss of reputation.