Computational methods for redacting identifying information in large text data

Organization: Pew Research Center

Tools: R, Python, NLP libraries, Hugging Face, transformers

Tags: data privacy, NLP, text analysis

This blog post explains how researchers used computational methods to redact identifying information from unstructured text data (a set of 1,314 mission statements from U.S. K-12 school districts) before releasing the data publicly. Removing identifiers like district names is straightforward in structured datasets, but much harder in free-form text, where names and addresses do not sit in fixed, labeled fields. To tackle this, the researchers combined three techniques:

Exact name matching against an external list of known district names,
Named Entity Recognition (NER) with pretrained models to detect organization names, and
Regular expressions to spot patterns like capitalized words preceding “school” or “district.”

Each approach had limitations on its own, so they were used together to maximize correctly redacted terms while minimizing false positives.
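
As a rough illustration of how two of these techniques can be layered, here is a minimal Python sketch. It is not the code from the blog post: the district names, the `[REDACTED]` placeholder, the `redact` function, and the exact regular expression are all assumptions made for this example. The NER step is left out because it depends on a pretrained model, but it would slot in as an additional pass between the two steps shown.

```python
import re

# Hypothetical external list of known district names (illustrative only).
KNOWN_DISTRICTS = [
    "Springfield Public Schools",
    "Lakeview Unified School District",
]

# Assumed placeholder; the original post may mark redactions differently.
PLACEHOLDER = "[REDACTED]"

# One or more capitalized words directly preceding "school(s)" or "district".
# This is an assumed pattern, not the one used in the original analysis.
NAME_PATTERN = re.compile(
    r"\b(?:[A-Z][a-z]+\s+)+(?:[Ss]chools?|[Dd]istrict)\b"
)


def redact(text, known_names=KNOWN_DISTRICTS):
    """Redact district names using exact matching, then a regex fallback."""
    # Pass 1: exact (case-insensitive) matches against the known-name list.
    # Longer names go first so a shorter name cannot split a longer one.
    for name in sorted(known_names, key=len, reverse=True):
        text = re.sub(re.escape(name), PLACEHOLDER, text, flags=re.IGNORECASE)

    # Pass 2: pattern-based matches for names missing from the list.
    return NAME_PATTERN.sub(PLACEHOLDER, text)


listed = redact("Students at Springfield Public Schools will thrive.")
unlisted = redact("the mission of Riverton School District is to inspire learners.")
untouched = redact("We prepare every student for success.")
```

In this sketch the exact-match pass catches names on the list, while the pattern pass catches an unlisted name such as the made-up "Riverton School District". The pattern pass is also where false positives would come from, since any run of capitalized words before "school" or "district" is redacted, including a sentence-initial word like "The".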

What I did

I developed and evaluated scalable NLP approaches for detecting and removing identifying information from large text corpora, and discussed the implications of these methods for research transparency and data privacy in a public-facing blog post.