seanpedrickcase's picture
Added a test suite based on the functions in cli_redact.py
084af54

CLI Redaction Test Suite

This test suite provides comprehensive testing for the cli_redact.py script based on all the examples shown in the CLI epilog.

Overview

The test suite includes tests for:

  1. PDF Redaction Examples

    • Default settings (local OCR)
    • Text extraction only (no redaction)
    • Text extraction with whole page redaction
    • Redaction with allow lists
    • Limited pages with custom fuzzy matching
    • Custom deny/allow/whole page lists
    • Image redaction
  2. Tabular Anonymisation Examples

    • CSV anonymisation with specific columns
    • Different anonymisation strategies
    • Word document anonymisation
  3. AWS Services Examples

    • Textract and Comprehend redaction
    • Signature extraction
    • Layout extraction
  4. Duplicate Detection Examples

    • Duplicate pages in OCR files
    • Line-level duplicate detection
    • Tabular duplicate detection
  5. Textract Batch Operations

    • Submit documents for analysis
    • Retrieve results by job ID
    • List recent jobs

Running the Tests

Method 1: Run the test suite directly

cd test
python test.py

Method 2: Use the convenience script

cd test
python run_tests.py

Method 3: Run with unittest

cd test
python -m unittest test.test.TestCLIRedactExamples -v

Test Behavior

  • File Dependencies: Tests will be skipped if required example files are not found in the example_data/ directory
  • AWS Tests: AWS-related tests may fail if credentials are not configured, but this is expected
  • Temporary Output: All tests use temporary output directories that are cleaned up automatically
  • Timeout: Each test has a 10-minute timeout to prevent hanging

Test Structure

The test suite uses Python's unittest framework with the following structure:

  • TestCLIRedactExamples: Main test class containing all test methods
  • run_cli_redact(): Helper function that executes the CLI script with specified parameters
  • run_all_tests(): Main function that runs all tests and provides a summary

Example Output

================================================================================
DOCUMENT REDACTION CLI TEST SUITE
================================================================================
This test suite runs through all the examples from the CLI epilog.
Tests will be skipped if required example files are not found.
AWS-related tests may fail if credentials are not configured.
================================================================================

Test setup complete. Script: /path/to/cli_redact.py
Example data directory: /path/to/example_data
Temp output directory: /tmp/test_output_xyz

=== Testing PDF redaction with default settings ===
βœ… PDF redaction with default settings passed

=== Testing PDF text extraction only ===
βœ… PDF text extraction only passed

...

================================================================================
TEST SUMMARY
================================================================================
Tests run: 20
Failures: 0
Errors: 0
Skipped: 2

Overall result: βœ… PASSED
================================================================================

Requirements

  • Python 3.6+
  • All dependencies for the main CLI script
  • Example data files in the example_data/ directory (for full test coverage)
  • AWS credentials (for AWS-related tests)

Notes

  • Tests are designed to be robust and will skip gracefully if files are missing
  • AWS tests are marked as completed even if they fail due to missing credentials
  • The test suite provides detailed output for debugging
  • All temporary files are cleaned up automatically