---
license: apache-2.0
language:
- en
library_name: gliner
datasets:
- gretelai/gretel-pii-masking-en-v1
pipeline_tag: token-classification
tags:
- PII
- PHI
- GLiNER
- information extraction
- encoder
- entity recognition
- privacy
---

# Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection

This **Gretel GLiNER** model is a fine-tuned version of the GLiNER base model `knowledgator/gliner-bi-small-v1.0`, trained specifically to detect Personally Identifiable Information (PII) and Protected Health Information (PHI).
Gretel GLiNER supports privacy-compliant entity recognition across a range of industries and document types.
For more information about the base GLiNER model, including its architecture and general capabilities, see the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-small-v1.0).

The model was fine-tuned on the `gretelai/gretel-pii-masking-en-v1` dataset, a diverse collection of synthetic document snippets containing PII and PHI entities. The dataset splits were used as follows:

1. **Training:** Fine-tuned the model on the training split of the synthetic dataset.
2. **Validation:** Monitored performance on the validation set to adjust training parameters.
3. **Evaluation:** Assessed final performance on the test set, using the PII/PHI entity annotations as ground truth.

For detailed statistics on the dataset, including domain and entity type distributions, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1).

### Model Performance

All fine-tuned Gretel GLiNER models show substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score. The table below lists the scores of the fine-tuned models:

| Model                                 | Accuracy | Precision | Recall | F1 Score |
|---------------------------------------|----------|-----------|--------|----------|
| gretelai/gretel-gliner-bi-small-v1.0  | 0.89     | 0.98      | 0.91   | 0.94     |
| gretelai/gretel-gliner-bi-base-v1.0   | 0.91     | 0.98      | 0.92   | 0.95     |
| gretelai/gretel-gliner-bi-large-v1.0  | 0.91     | 0.99      | 0.93   | 0.95     |
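
For readers unfamiliar with these metrics, the sketch below shows one common way to compute precision, recall, and F1 for entity recognition by comparing predicted spans against gold spans. It assumes exact-match scoring on `(start, end, label)` tuples. This card does not state the matching criterion or how accuracy is defined, so the function name `span_prf`, the matching rule, and the sample spans are illustrative assumptions, not the evaluation code behind the table.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) spans.

    Assumption: a prediction counts as correct only if its offsets and
    label both match a gold span exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hand-written sample spans (not real model output)
gold = [(0, 10, "customer_id"), (20, 31, "city"), (33, 35, "state"), (40, 45, "postcode")]
predicted = [(0, 10, "customer_id"), (20, 31, "city"), (33, 35, "country")]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so the small model's reported precision of 0.98 and recall of 0.91 give an F1 of roughly 0.94, in line with the table.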

## Installation & Usage

Ensure you have Python installed. Then install or update the `gliner` package:

```bash
pip install gliner -U
```

Load the fine-tuned Gretel GLiNER model using the `GLiNER` class and its `from_pretrained` method. The example below uses the `gretelai/gretel-gliner-bi-small-v1.0` model for PII/PHI detection:

```python
from gliner import GLiNER

# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-small-v1.0")

# Sample text containing PII/PHI entities
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: [email protected]
"""

# Define the labels for PII/PHI entities
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin"
]

# Predict entities with a confidence threshold of 0.7
entities = model.predict_entities(text, labels, threshold=0.7)

# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")
```

Expected output:

```
CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
[email protected] => email
```
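
The `threshold=0.7` argument in the example applies one cut-off to every label. If some labels need stricter or looser cut-offs, one option is to predict with a lower global threshold and filter the results afterwards. The sketch below assumes each entity dict returned by `predict_entities` carries a `score` field alongside `text` and `label`. The helper `filter_by_label_threshold` and the threshold values are hypothetical, and the entity list is hand-written sample data rather than real model output.

```python
def filter_by_label_threshold(entities, label_thresholds, default_threshold=0.7):
    """Keep entities whose score meets the threshold configured for their label."""
    kept = []
    for entity in entities:
        # Fall back to the default when a label has no dedicated threshold
        threshold = label_thresholds.get(entity["label"], default_threshold)
        if entity["score"] >= threshold:
            kept.append(entity)
    return kept


# Hand-written stand-in for the list returned by model.predict_entities(...)
sample_entities = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.95},
    {"text": "Springfield", "label": "city", "score": 0.62},
    {"text": "62704", "label": "postcode", "score": 0.81},
]

# Hypothetical per-label thresholds: lenient on cities, strict on postcodes
label_thresholds = {"city": 0.5, "postcode": 0.9}

filtered = filter_by_label_threshold(sample_entities, label_thresholds)
for entity in filtered:
    print(f"{entity['text']} => {entity['label']}")
```

With these sample values the customer ID and the city are kept, while the postcode is dropped because its score falls below its stricter threshold.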

## Use Cases

Gretel GLiNER is ideal for applications requiring detection and redaction of sensitive information:

- Healthcare: Automating the extraction and redaction of patient information from medical records.
- Finance: Identifying and securing financial data such as account numbers and transaction details.
- Cybersecurity: Detecting sensitive information in logs and security reports.
- Legal: Processing contracts and legal documents to protect client information.
- Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
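
As a rough illustration of the redaction use case, detected spans can be replaced with label placeholders. This sketch assumes each entity dict includes character offsets `start` and `end` (end-exclusive) into the input text, plus a `label`. The `redact` helper is hypothetical, and the sample entities are hand-written rather than produced by the model.

```python
def redact(text, entities):
    """Replace each detected span with a placeholder such as [CITY]."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[:entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


sample_text = "Customer Name: CID-982305, Springfield, IL"

# Hand-written entities; offsets are assumed to index into sample_text
sample_entities = [
    {"start": 15, "end": 25, "label": "customer_id"},
    {"start": 27, "end": 38, "label": "city"},
    {"start": 40, "end": 42, "label": "state"},
]

print(redact(sample_text, sample_entities))
```

The sketch does not handle overlapping spans; a production pipeline would need to merge or prioritize them before replacing text.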

## Citation and Usage

If you use this model or its training dataset in your research or applications, please cite it as:

```bibtex
@dataset{gretel-pii-masking-en-v1,
  author = {Gretel AI},
  title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
  year = {2024},
  month = {10},
  publisher = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}
```

For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).