Generate and Secure Test Data

Intermediate · 20 min · 4 steps

The problem

Your staging environment needs realistic data, but using a copy of the production database can violate GDPR, India's DPDP Act, and most enterprise data security policies. This workflow builds a clean, privacy-safe test dataset from scratch, then audits it for any PII or credentials that may have slipped in through fixture files or copied API responses.

What you'll accomplish

Realistic fake user profiles covering edge cases in name formats, phone, and address
All real PII fields replaced with format-valid synthetic equivalents
The dataset scanned and cleared of accidentally included credentials and secrets
A validated, properly structured JSON dataset ready for staging or fixture use

Step-by-step

1

Generate realistic synthetic profiles for your test environment

Use the Fake Person Generator to create realistic fake user profiles (names, email addresses, phone numbers, physical addresses, dates of birth, and other PII fields) that are structurally valid but refer to no real individual. Realistic test data matters because: (a) validation rules for emails, phone formats (including Indian mobile numbers: +91 followed by 10 digits, the first being 6–9), and address structures are exercised the same way they are in production, (b) UI rendering that depends on string lengths (long names wrapping, short initials fitting avatars) behaves as in production, and (c) QA testers can run through realistic user journeys without confusion. Generate enough records to cover edge cases: very long names, international characters, names with apostrophes, email addresses with subdomains, and phone numbers in different country formats.

Tip: Generate 10–20% more records than you need. You may discard some during the clean-up in Steps 2 and 3, and having extras avoids regenerating mid-workflow.
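To make the idea in this step concrete, here is a minimal Python sketch of seeded synthetic profile generation. It is not how the Fake Person Generator works internally; the name pools, the field names, and the functions fake_profile and generate_profiles are all assumptions made up for illustration. It covers a few of the edge cases listed above, uses reserved example domains for email, and applies the overage from the tip.

```python
import random

# Hypothetical edge-case pools (assumptions); extend them for your own markets.
FIRST_NAMES = ["Li", "Aarav", "Siobhán", "María José", "Wolfeschlegelstein"]
LAST_NAMES = ["O'Brien", "Ng", "Müller", "Venkataraghavan-Subramaniam", "शर्मा"]
# Reserved example domains, including subdomains, so no address is deliverable.
EMAIL_DOMAINS = ["example.com", "mail.example.org", "eu.staging.example.net"]


def fake_profile(rng: random.Random, index: int) -> dict:
    """Build one synthetic profile that refers to no real individual."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # Indian mobile shape: +91, then 10 digits whose first digit is 6-9.
    digits = str(rng.randint(6, 9)) + "".join(
        str(rng.randint(0, 9)) for _ in range(9)
    )
    return {
        "id": index,
        "name": f"{first} {last}",
        "email": f"user{index}@{rng.choice(EMAIL_DOMAINS)}",
        "phone": f"+91 {digits}",
    }


def generate_profiles(count: int, overage: float = 0.15, seed: int = 42) -> list[dict]:
    """Generate `count` profiles plus a fractional overage, reproducibly."""
    rng = random.Random(seed)  # fixed seed: the same dataset on every run
    total = count + round(count * overage)
    return [fake_profile(rng, i) for i in range(1, total + 1)]


profiles = generate_profiles(100)  # 100 requested plus 15% overage
```

A fixed seed keeps fixtures reproducible across machines; change the seed when you want a fresh dataset.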

2

Anonymise any real PII fields that slipped into the dataset

Even when you intend to use only fake data, real PII can enter test environments through: copying a production database export, using a seed file from a real customer interaction, or copy-pasting a real API response as a test fixture. Use the Data Anonymizer to mask or replace any real names, email addresses, phone numbers, national IDs (Aadhaar, PAN, SSN), passport numbers, financial account numbers, IP addresses, and physical addresses found in your dataset. Anonymisation replaces real values with realistic-looking synthetic equivalents that preserve format validity — a real phone number becomes a fake phone number in the same format, not just '***'. This is different from redaction (replacing with asterisks) — anonymised data still passes validation rules and produces realistic API responses.

Tip: Run anonymisation on any dataset that originated from a production system, even if you believe it was already cleaned — this step is a safety net, not a primary control.
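The difference between format-preserving anonymisation and redaction can be sketched in a few lines of Python. This is an illustrative assumption, not the Data Anonymizer's actual algorithm: scramble_digits, anonymise_email, redact, and the salt value are hypothetical names. Digits are replaced while separators and length are kept, so the result still has the shape of a phone number.

```python
import hashlib
import random


def _rng_for(value: str, salt: str) -> random.Random:
    """Derive a repeatable RNG so the same input always maps to the same output."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return random.Random(int(digest, 16))


def scramble_digits(value: str, keep: int = 0, salt: str = "staging-salt") -> str:
    """Replace digits with pseudo-random digits; keep length and separators.

    The first `keep` digits are left untouched (for example a country code).
    """
    rng = _rng_for(value, salt)
    out = []
    digits_seen = 0
    for ch in value:
        if ch.isdigit():
            digits_seen += 1
            out.append(ch if digits_seen <= keep else str(rng.randint(0, 9)))
        else:
            out.append(ch)
    return "".join(out)


def anonymise_email(email: str, salt: str = "staging-salt") -> str:
    """Swap an address for a stable pseudonym on a reserved example domain."""
    token = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()[:10]
    return f"user_{token}@example.com"


def redact(value: str) -> str:
    """Redaction for comparison: the value is gone and so is its format."""
    return "***"


original = "+91 98765 43210"
masked = scramble_digits(original, keep=3)  # keeps "91" and the leading 9
```

The deterministic mapping is a design choice that keeps the same person consistent across tables. It also means the salt must be treated as a secret and kept out of the test environment, because anyone holding it can re-derive the mapping for a guessed input.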

3

Scan the dataset for accidentally included credentials and secrets

After generating and anonymising PII, use the Secret Scanner to scan your test dataset for a different category of sensitive data: credentials and secrets. These include: API keys (AWS, GCP, Stripe, Twilio, SendGrid format patterns), JWTs, OAuth tokens, private keys (RSA, EC, PGP headers), database connection strings with embedded passwords, .env variable patterns, and hardcoded passwords. Test datasets accumulate secrets because developers copy real API responses (which sometimes include tokens in response headers or bodies), copy fixture files from production debugging sessions, or paste Postman collections that contain real credentials. The Secret Scanner uses pattern matching to flag these before they reach version control or shared staging environments where they could be exposed.

Tip: Also scan your fixture files and seed data scripts in version control — secrets committed to git remain in history even after deletion from the working tree.
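Pattern-based secret scanning, as described in this step, can be approximated with a handful of regular expressions. The sketch below is an assumption about the general technique, not the Secret Scanner's rule set: the pattern names and the scan_text function are hypothetical, and the four patterns shown are common heuristics that will produce both false positives and misses.

```python
import re

# Heuristic patterns (assumptions); a real scanner carries many more rules.
SECRET_PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM or PGP private key headers (RSA, EC, OPENSSH, PGP ... BLOCK).
    "private_key_header": re.compile(
        r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY(?: BLOCK)?-----"
    ),
    # JWTs: three base64url segments; the first two start with "eyJ".
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    # Connection strings with an embedded password: scheme://user:pass@host
    "url_with_password": re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s:/@]+:[^\s/@]+@"),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings
```

Reporting line numbers rather than the matched text keeps the secret itself out of scan logs and CI output.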

4

Format and validate the final clean dataset before export

After completing Steps 1–3, you have a clean, PII-free, secret-free dataset. Use the JSON Formatter & Validator to format the final JSON output and confirm it is syntactically valid before deploying it to your staging environment or committing it as a test fixture. The formatter will surface: malformed JSON from incomplete anonymisation (a field replacement that broke the JSON structure), unclosed arrays or objects, duplicate keys (which some parsers accept but cause non-deterministic behaviour), and encoding issues with Unicode characters in names and addresses. Validate that the schema matches your expected structure — field names, data types, nesting depth, and array lengths should match your production data model. A valid JSON structure is the contract between your test data and your application.

Tip: Use the JSON Formatter's prettify mode to visually scan the structure and confirm field coverage, then switch to the minified version for the fixture files you commit.
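The same checks can be scripted for CI. The following Python sketch uses only the standard json module; the expected field list, DuplicateKeyError, and validate_dataset are assumptions for illustration and say nothing about how the JSON Formatter & Validator is built. It reports syntax errors, rejects duplicate keys (which json.loads would otherwise accept silently, keeping the last value), and checks field presence and types.

```python
import json

# Hypothetical schema (assumption): field name -> expected Python type.
EXPECTED_FIELDS = {"name": str, "email": str, "phone": str}


class DuplicateKeyError(ValueError):
    """Raised when a JSON object repeats a key."""


def _reject_duplicates(pairs):
    """object_pairs_hook that fails on repeated keys instead of overwriting."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise DuplicateKeyError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def validate_dataset(text: str) -> list[str]:
    """Return a list of problems; an empty list means the dataset passed."""
    try:
        data = json.loads(text, object_pairs_hook=_reject_duplicates)
    except DuplicateKeyError as exc:
        return [str(exc)]
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(data, list):
        return ["top level must be an array of records"]
    problems = []
    for i, record in enumerate(data):
        if not isinstance(record, dict):
            problems.append(f"record {i}: not an object")
            continue
        for field, expected in EXPECTED_FIELDS.items():
            if field not in record:
                problems.append(f"record {i}: missing field {field!r}")
            elif not isinstance(record[field], expected):
                problems.append(f"record {i}: {field!r} should be {expected.__name__}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a single run report every broken record.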

Why this workflow works

The sequence is layered defence: Step 1 generates clean synthetic data by design. Step 2 catches real PII that entered through external sources (API responses, fixture files, seed scripts). Step 3 catches credentials and secrets, a different category of sensitive data that PII anonymisation tools don't address. Step 4 validates the structure to ensure the cleaning steps didn't corrupt the JSON format. Running these in reverse, validating first and cleaning afterwards, confirms the structure too early: any breakage introduced by anonymisation would go undetected. Running Steps 2 and 3 in parallel works, but sequential is easier to audit and document for compliance purposes.

Frequently asked questions

Why can't I just use a copy of production data for testing?

Using production data in test environments is prohibited or heavily restricted under GDPR (EU), India's DPDP Act 2023, CCPA (California), HIPAA (healthcare), and PCI DSS (payment data). Even with contractual permission, production data in test environments creates risk: it may be exposed to developers, contractors, CI/CD pipelines, and third-party integrations that aren't authorised to access real customer data. A data breach in a staging environment is still a breach. The compliance and reputational cost of mishandled test data typically far exceeds the inconvenience of generating synthetic data.

What is the difference between fake data, anonymised data, and redacted data?

Fake data is synthetically generated data that never referred to a real person — it's fictional from the start (Fake Person Generator output). Anonymised data was originally real but has been transformed so the individual is no longer identifiable — real values are replaced with realistic synthetic equivalents in the same format. Redacted data has real values replaced with placeholders like '***' or '[REDACTED]' — still shows the field exists but removes the value. For testing purposes: fake data is the gold standard (starts clean). Anonymised data is acceptable (derived from real but deidentified). Redacted data is problematic (breaks validation rules and realistic rendering).

How do I check if my test data has accidentally included real PII?

The Secret Scanner catches credentials and tokens, not PII. For PII specifically, look for: email addresses on real corporate domains (e.g., @realcompany.com rather than @example.com), phone numbers outside the ranges reserved for fictional use (in North America, 555-0100 through 555-0199), Indian PAN numbers in valid format (AAAAA9999A), Aadhaar-like 12-digit numbers, and real-looking but semantically suspicious names (run a name plausibility check). The Data Anonymizer can scan for and replace common PII patterns. For high-stakes environments, consider a dedicated DLP (Data Loss Prevention) tool as a secondary check after this workflow.
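Some of the checks listed above are easy to script. The Python sketch below is a rough heuristic under stated assumptions: find_suspect_pii, the label names, and the list of safe domains are hypothetical, and the patterns match shape only. A 12-digit order number would be flagged as Aadhaar-like, for instance, so treat hits as prompts for manual review, not as verdicts.

```python
import re

# Shape-only patterns (assumptions); they do not verify checksums or registries.
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")  # AAAAA9999A
AADHAAR_LIKE_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")  # 12 digits
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")

# Domains reserved for documentation and examples; anything else is suspect.
SAFE_DOMAINS = ("example.com", "example.org", "example.net")


def is_safe_domain(domain: str) -> bool:
    """True for a reserved example domain or any subdomain of one."""
    domain = domain.lower()
    return any(domain == d or domain.endswith("." + d) for d in SAFE_DOMAINS)


def find_suspect_pii(record: dict) -> list[tuple[str, str]]:
    """Return (field, label) pairs for string values that look like real PII."""
    hits = []
    for field, value in record.items():
        if not isinstance(value, str):
            continue
        if PAN_RE.search(value):
            hits.append((field, "pan_like"))
        if AADHAAR_LIKE_RE.search(value):
            hits.append((field, "aadhaar_like"))
        for match in EMAIL_RE.finditer(value):
            if not is_safe_domain(match.group(1)):
                hits.append((field, "non_reserved_email_domain"))
    return hits
```

Run it over every record and review the flagged fields by hand before deciding whether to anonymise or drop them.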

What data regulations apply to test environments in India?

India's Digital Personal Data Protection (DPDP) Act 2023 applies to processing of personal data of Indian residents, including in test and staging systems. Key requirements: personal data must only be used for the purpose it was collected (production analytics data cannot be repurposed for testing without consent), data principals have the right to erasure (which must propagate to all copies including test databases), and significant data fiduciaries must maintain data security standards across all environments. Using synthetic data in testing environments — rather than production copies — is the cleanest way to comply.

How realistic does test data need to be for effective QA?

Realistic enough to exercise your validation rules, UI rendering, and business logic — but not so realistic it risks being confused with real data. Essential realism: valid email formats, structurally valid phone numbers for target markets, correct name character sets (including Devanagari for Indian names if your app supports them), realistic address formats (with PIN codes in valid ranges), and data distributions that match production (don't use only 'John Smith' as a test name — test with long names, short names, names with hyphens). Non-essential: the data doesn't need to be internally consistent (a fake person's address doesn't need to match the PIN code of their stated city).
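One way to confirm a dataset actually covers the name edge cases mentioned above is a small coverage report. This Python sketch is illustrative only: name_coverage, the category labels, and the length thresholds are assumptions to tune for your own UI.

```python
import unicodedata


def name_coverage(names: list[str], long_at: int = 30, short_at: int = 3) -> dict[str, bool]:
    """Report which name edge cases appear at least once in the dataset."""
    return {
        "long": any(len(n) >= long_at for n in names),
        "short": any(len(n) <= short_at for n in names),
        "hyphen": any("-" in n for n in names),
        "apostrophe": any("'" in n or "\u2019" in n for n in names),
        "non_ascii": any(not n.isascii() for n in names),
        # Unicode character names identify the script, e.g. "DEVANAGARI LETTER SHA".
        "devanagari": any(
            "DEVANAGARI" in unicodedata.name(ch, "") for n in names for ch in n
        ),
    }


def missing_cases(names: list[str]) -> list[str]:
    """List the edge cases the dataset does not cover yet."""
    return [case for case, present in name_coverage(names).items() if not present]
```

An empty result from missing_cases means every category is represented at least once; it says nothing about whether the proportions match production.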
