Tonic Textual redacts PII from unstructured text and logs, and that was the gap in our data pipeline
Round 1 of this topic mostly covered Tonic Structural for database de-identification. I want to write about Tonic Textual specifically because it addresses a different and in some ways harder problem.
Most PII protection tooling is built for structured data. Named columns in a database. The email field, the name field, the phone number field. You identify the sensitive columns and replace the values. That is Tonic Structural's territory and it works well for that.
The problem we had was unstructured data. Support chat logs where a customer typed their home address into a free-text field. Application logs where error messages captured session data that happened to include personal details. Email thread exports where names and contact information appeared in the body text in unpredictable positions. You cannot point a database de-identification tool at a text blob and tell it which column to redact.
Tonic Textual uses NLP to read unstructured text and identify PII and PHI wherever it appears, regardless of format or position. It then redacts those spans while keeping the surrounding content useful for development and testing. The log file still makes sense as a log file. The support chat still reflects the conversation structure. The sensitive information is gone.
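To make the "redact in place, keep the structure" idea concrete, here is a minimal sketch. It is not Tonic Textual's API or method: it swaps the NLP detection for two simple regular expressions, which only catch rigidly formatted values such as emails and phone numbers. The pattern set, the `redact` function, and the sample log line are all made up for illustration.

```python
import re

# Illustrative patterns only (an assumption for this sketch). An NLP-based
# detector also finds names, street addresses, and other entities that
# regular expressions cannot reliably match.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each match with a typed placeholder, leaving other text intact."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Hypothetical log line: timestamp and message structure survive redaction.
log_line = (
    "2024-05-01 12:00:03 ERROR login failed for jane.doe@example.com, "
    "callback 555-867-5309"
)
print(redact(log_line))
```

The timestamp, severity, and message shape are untouched, so the line still parses and reads like a log line. What regexes cannot do is spot a home address typed into a chat message, which is the case that needs a model-based detector.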
The CI/CD integration means fresh de-identified test data is generated as part of the development pipeline, rather than in a manual step someone remembers to run occasionally.
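One way to back that up in a pipeline is a cheap safety check that runs after the de-identified data is generated. The sketch below is an assumption about how such a check could look, not part of Tonic's integration: it scans generated `.log` files for anything that still looks like an email address and returns a non-zero code so the pipeline step can fail. The function names and the file layout are hypothetical.

```python
import re
from pathlib import Path

# Email-like strings are used here as a simple canary for missed redactions.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_leaks(root: Path) -> list[tuple[Path, int]]:
    """Return (file, line number) pairs where an email-like string survives."""
    leaks = []
    for path in sorted(root.rglob("*.log")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if EMAIL.search(line):
                leaks.append((path, lineno))
    return leaks


def main(root: str) -> int:
    """Print each suspected leak and return an exit code for the CI step."""
    leaks = find_leaks(Path(root))
    for path, lineno in leaks:
        print(f"possible unredacted email: {path}:{lineno}")
    return 1 if leaks else 0
```

A pipeline step would pass the result of `main(...)` to `sys.exit` with whatever directory holds the generated test data. This only catches one obvious pattern, so it works as a tripwire alongside the redaction tool, not as a replacement for it.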
The compliance coverage for GDPR, HIPAA and CCPA applies to both Structural and Textual.