
Challenges in creating relevant test data without using personally identifiable information

Test Automation | Alex Gatu | 25 February 2020


One of the biggest challenges when testing in a project that involves personally identifiable information is the test data. It may be possible to use some data from the production environment and still respect GDPR, but in most cases, production data is not available for use by the testers.

Performance tests are the most severely affected, as they need adequate volumes of representative data, exercising the different functions of the system, in order to obtain valid measurements. As we will show in this article, a good test data strategy, along with an effective set of data generation techniques, can help to address these challenges.

Why is representative test data important?

As we have said, a major challenge in testing is obtaining a good test data set.

For functional testing, a lot of the effort is often in identifying and creating test data for the unusual “corner” cases, whereas when trying to reproduce a defect from production, the difficulty is knowing what aspect of the data set is needed to illustrate the problem, without having access to the production data set.

Non-functional testing complicates the situation further, especially when undertaking performance testing. The need is for a large volume of data which supports realistic scenarios that accurately simulate key user journeys, to determine if the application can handle the data volume while achieving the required performance. It is rarely possible to perform performance tests in the production environment as they are very disruptive. This means that the data sets in our test environments need to have very similar characteristics to the data found in production, so that we can perform representative tests.

In summary, without sufficient test data, the result is a plethora of defects not being found until they appear in production, thus increasing the disruption that they cause and the cost of fixing them.

However, let us not forget that creating good quality, representative test data is a significant cost in the development and testing work, which might get overlooked given today’s emphasis on “shifting left” and performing testing activities earlier in the development lifecycle.

Metrics for test data relevance

The process of acquiring test data is not as easy as it often seems at first. There are generally two possibilities for procuring test data - gathering it from production or generating it ourselves.

A one-to-one copy of the data from the production environment seems like an obvious approach that ensures a representative data set to test against. However, a lot of production data is very sensitive (think of medical records or financial data), and so it is often simply not available for use in testing.

Therefore, we are left with the option of generating the data ourselves (so called “synthetic” test data). The immediate question that arises with this approach is how to identify a metric to determine how representative a synthetic test data set is compared to the corresponding production data. The first step towards such a measure is to gather some metrics from the production data set to try to characterise it.

Let’s take for example a database for a financial transaction processing system. Some of the things we could usefully measure without access to the actual data include:

  • Number of transactions per time unit (hour/day/month/year) - whatever time frame is needed to make it relevant for the tests
  • Number of users in the system, classified by the types of user
  • Number of transactions per user in the time frame
  • Distribution of the different types of transaction per time unit and user
  • Average values of key transaction attributes (such as financial value) with a minimum, a maximum and a standard deviation
  • Number of countries/currencies supported and the average transaction distribution per country/currency

Obviously, it would often be valuable to add other application specific measurements depending on the nature of the application and data set.

To determine how similar a generated data set is to production, run the same queries on both databases and compare the results. This gives a good indication of whether your data is close to production, providing an overall “data resemblance factor” metric.
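As an illustration, this comparison could be scripted along the following lines. The `transactions` schema and the metric queries below are hypothetical stand-ins for the measurements listed above; a real implementation would use your own schema and the full set of metrics:

```python
import sqlite3

# Hypothetical metric queries over an assumed `transactions` table
# (user_id, amount, currency); substitute your own schema and metrics.
METRIC_QUERIES = {
    "transactions_per_user":
        "SELECT CAST(COUNT(*) AS REAL) / COUNT(DISTINCT user_id) FROM transactions",
    "avg_amount": "SELECT AVG(amount) FROM transactions",
    "currency_count": "SELECT COUNT(DISTINCT currency) FROM transactions",
}

def collect_metrics(conn):
    """Run each metric query against one database and collect the values."""
    return {name: conn.execute(sql).fetchone()[0]
            for name, sql in METRIC_QUERIES.items()}

def resemblance(prod_metrics, synth_metrics, tolerance=0.1):
    """Fraction of metrics where the synthetic value lies within a
    relative `tolerance` of the production value: a crude overall
    'data resemblance factor'."""
    matches = sum(
        1 for k in prod_metrics
        if prod_metrics[k]
        and abs(synth_metrics[k] - prod_metrics[k]) / abs(prod_metrics[k]) <= tolerance
    )
    return matches / len(prod_metrics)
```

Running `collect_metrics` against both databases and feeding the results to `resemblance` yields a single score between 0 and 1 that can be tracked as the data generator is tuned.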

Data protection techniques

In situations where we can use production data, we will have to process it to protect any personal data in the data set. The key techniques to consider include:

  • Anonymisation – the use of randomisation and generalisation to replace the personally identifiable information (PII) with a realistic generated value, so that a record cannot be tied to any real-life person
  • Tokenisation – a simpler process, where the sensitive data is again replaced with a placeholder value, but one that is more generic and does not necessarily preserve the format of the original
  • Pseudo-anonymisation – a technique that uses a mapping table between the real PII data and the randomised data that replaces it, allowing the original data to be restored at some point if needed, but requiring that the mapping table is carefully protected
  • Format-preserving encryption – encrypting the sensitive data in such a way that preserves the format so that the data is still relevant
  • Synthetic data – entirely synthetic, generated data created for all of the fields in the data set, generated in such a way that the format is correct and data linkages between the tables are still valid
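As a small sketch of the pseudo-anonymisation technique above (the token format here is invented for illustration), a mapping table lets tokens be reversed later, which is exactly why it must be protected as carefully as the original data:

```python
import secrets

class Pseudonymiser:
    """Replace PII values with random tokens, keeping a mapping table
    so the original values can be restored if needed. The mapping
    table itself is as sensitive as the data and must be protected."""

    def __init__(self):
        self._forward = {}  # real value -> token
        self._reverse = {}  # token -> real value

    def pseudonymise(self, value):
        # Reuse the existing token so the same person always maps
        # to the same pseudonym across the data set.
        if value not in self._forward:
            token = "user-" + secrets.token_hex(4)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def restore(self, token):
        return self._reverse[token]
```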

A diagram representing how each technique works is shown below:


The most complex approach is using completely synthetic data, as this requires that the relationships between the data are created as well as the individual data values, and the data set must respect the data constraints and foreign keys that link the tables. Creating data generator tools to achieve this is quite complicated, and sometimes requires a lot of business logic to be implemented in the tool, which could make it infeasible for some situations.
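One way to keep those relationships valid is to generate parent rows first and only ever reference them from child rows. The sketch below (with invented attributes and uniform placeholder distributions, where a real generator would use the distributions measured from production) preserves the user-to-transaction foreign key by construction:

```python
import random

def generate_users(n):
    """Parent table: one row per user."""
    return [{"id": i, "type": random.choice(["retail", "corporate"])}
            for i in range(1, n + 1)]

def generate_transactions(users, per_user_mean=3):
    """Child table: every transaction references an existing user id,
    so the foreign-key constraint between the tables always holds."""
    txns, txn_id = [], 0
    for user in users:
        for _ in range(random.randint(1, 2 * per_user_mean)):
            txn_id += 1
            txns.append({
                "id": txn_id,
                "user_id": user["id"],  # guaranteed to exist
                "amount": round(random.uniform(5, 5000), 2),
                "currency": random.choice(["EUR", "USD", "GBP"]),
            })
    return txns
```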

Test data generation strategy recommendations

While deciding what approach to use, the first thing that needs to be clarified is whether production data can be used or not, and if so, whether it contains personally identifiable information.

If you can use production data, then the amount required can be decided and a strategy for extracting it from production identified. Next the data protection techniques to use can be selected based on the protection level needed and the effort that the project can accommodate. Usually anonymisation and format-preserving encryption can provide a cost-effective option.
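To show what “format-preserving” means in practice, here is a deliberately naive masking sketch. It is not real format-preserving encryption (standards such as NIST FF1 cover that, and unlike encryption it is not reversible); it only illustrates how the shape of a value, for example a card-number-like string, can survive the transformation:

```python
import random
import string

def mask_preserving_format(value, rng=random):
    """Replace each digit with a random digit and each letter with a
    random letter of the same case, leaving punctuation untouched.
    A simplified, irreversible stand-in for format-preserving
    encryption, for illustration only."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)
```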

Choosing the most challenging approach, synthetic data generation, involves identifying the metrics to use for checking the match to the production data set, writing the SQL queries to derive those values from a data set, and implementing a tool that generates data which meets the needs of the testing process while accurately reflecting the characteristics of the production data set.

Make sure that you remember to include the test data generation effort when planning the work, and create explicit stories to allow it to be planned and scheduled. Without a good overview of what is needed in terms of creating the data, the overall testing process will be under pressure, and this will result in bottlenecks during development. Also make sure that you are using the best approach for your situation to obtain the relevant test data, and explain the effort needed to produce and manage it to your stakeholders.

PII and GDPR compliance in testing

The implementation of GDPR has given data an increased significance for both the customer and the data processor. GDPR Article 6 defines the lawful bases for processing data and enforces the need for consent, based on a contract acknowledged by the participants and the legal obligations of the data processor, including safeguards for protecting the data. This is also referred to in Articles 25 and 32, where security is mandated when dealing with such data.

This means that using production data for testing purposes might violate GDPR if the customer did not give their consent for such usage. Article 5 of the regulation states that data should be collected with fairness and transparency and must be adequate, accurate, limited and relevant, which requires companies to take measures to ensure the integrity and confidentiality of the data collected.

In addition, Article 30 makes it clear that whenever a data processing event occurs, there should be a record of it.

The prevalence of PII in most databases, along with GDPR requirements, means that the use of production data by testers poses significant risks, as they could unintentionally share data over unsafe communication channels, which could lead to data breaches. If a data breach does occur, it must be communicated to the owner of the data, as specified in Article 34 of the regulation.

The GDPR allows for extremely large fines to be imposed; we have already seen fines as large as $126 M (€100 M) in January 2020, and the sums could get even higher in the future. This is one of the reasons it is so important that we all understand the fundamentals of GDPR and the responsibilities it implies.


Generating test data comes with challenges of its own, and adding security constraints on top increases the cost of delivering test quality. There are methods for complying with the PII requirements that can be integrated into the overall process: based on the data shape extracted from production, relevant new data can be generated and quantified using defined metrics, increasing confidence in the testing process. Failing to produce relevant data could jeopardise the testing process, while failing to ensure PII/GDPR compliance would result in substantial fines and potential loss of reputation.

Alex Gatu

Senior Test Consultant

Alex is a passionate software testing engineer with a background in programming, who has dedicated the past decade to security and performance testing. He is involved in enhancing the technical excellence in Endava and can often be found creating custom performance and security automation frameworks. Outside work, Alex is enthusiastic about DJing (as long as it is a variety of electronic music) and spending time with family and friends.

