---
license: apache-2.0
language:
- en
library_name: gliner
datasets:
- gretelai/gretel-pii-masking-en-v1
pipeline_tag: token-classification
tags:
- PII
- PHI
- GLiNER
- information extraction
- encoder
- entity recognition
- privacy
---
# Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection
This **Gretel GLiNER** model is a fine-tuned version of the GLiNER base model `knowledgator/gliner-bi-small-v1.0`, specifically trained for the detection of Personally Identifiable Information (PII) and Protected Health Information (PHI).
Gretel GLiNER helps to provide privacy-compliant entity recognition across various industries and document types.
For more information about the base GLiNER model, including its architecture and general capabilities, please refer to the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-small-v1.0).
The model was fine-tuned on the `gretelai/gretel-pii-masking-en-v1` dataset, which provides a rich and diverse collection of synthetic document snippets containing PII and PHI entities. The dataset splits were used as follows:
1. **Training:** Utilized the training split of the synthetic dataset.
2. **Validation:** Monitored performance using the validation set to adjust training parameters.
3. **Evaluation:** Assessed final performance on the test set using PII/PHI entity annotations as ground truth.
For detailed statistics on the dataset, including domain and entity type distributions, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1).
### Model Performance
All fine-tuned Gretel GLiNER models demonstrate substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score. The table below lists the scores of the fine-tuned models only:
| Model | Accuracy | Precision | Recall | F1 Score |
|---------------------------------------|----------|-----------|--------|----------|
| gretelai/gretel-gliner-bi-small-v1.0 | 0.89 | 0.98 | 0.91 | 0.94 |
| gretelai/gretel-gliner-bi-base-v1.0 | 0.91 | 0.98 | 0.92 | 0.95 |
| gretelai/gretel-gliner-bi-large-v1.0 | 0.91 | 0.99 | 0.93 | 0.95 |
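As a quick sanity check on the table, recall that F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values listed above. Because those inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit. The helper name `f1_score` is illustrative and not part of the `gliner` package.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above (already rounded).
reported = {
    "gretelai/gretel-gliner-bi-small-v1.0": (0.98, 0.91),
    "gretelai/gretel-gliner-bi-base-v1.0": (0.98, 0.92),
    "gretelai/gretel-gliner-bi-large-v1.0": (0.99, 0.93),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 ~ {f1_score(precision, recall):.3f}")
```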
## Installation & Usage
Ensure you have Python installed. Then, install or update the `gliner` package:
```bash
pip install gliner -U
```
Load the fine-tuned Gretel GLiNER model using the `GLiNER` class and its `from_pretrained` method. Below is an example using the `gretelai/gretel-gliner-bi-small-v1.0` model for PII/PHI detection:
```python
from gliner import GLiNER
# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-small-v1.0")
# Sample text containing PII/PHI entities
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: [email protected]
"""
# Define the labels for PII/PHI entities
labels = [
"medical_record_number",
"date_of_birth",
"ssn",
"date",
"first_name",
"email",
"last_name",
"customer_id",
"employee_id",
"name",
"street_address",
"phone_number",
"ipv4",
"credit_card_number",
"license_plate",
"address",
"user_name",
"device_identifier",
"bank_routing_number",
"date_time",
"company_name",
"unique_identifier",
"biometric_identifier",
"account_number",
"city",
"certificate_license_number",
"time",
"postcode",
"vehicle_identifier",
"coordinate",
"country",
"api_key",
"ipv6",
"password",
"health_plan_beneficiary_number",
"national_id",
"tax_id",
"url",
"state",
"swift_bic",
"cvv",
"pin"
]
# Predict entities with a confidence threshold of 0.7
entities = model.predict_entities(text, labels, threshold=0.7)
# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")
```
Expected Output:
```
CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
[email protected] => email
```
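Each item returned by `predict_entities` is a dictionary. Besides `text` and `label`, it carries `start` and `end` character offsets and a confidence `score`. The following is a minimal sketch of post-processing such results: it groups detections by label and applies a stricter score floor to selected labels. The helper `group_by_label`, its `min_scores` parameter, and the sample entities (including their scores) are hypothetical stand-ins for real model output, with offsets omitted for brevity.

```python
from collections import defaultdict


def group_by_label(entities, min_scores=None, default_min=0.0):
    """Group entity texts by label, keeping only those at or above a score floor."""
    min_scores = min_scores or {}
    grouped = defaultdict(list)
    for ent in entities:
        # Use the label-specific floor if one is given, else the default.
        floor = min_scores.get(ent["label"], default_min)
        if ent["score"] >= floor:
            grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


# Hand-written stand-in for model output; the scores are made up for illustration.
sample_entities = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.95},
    {"text": "(312) 555-7890", "label": "phone_number", "score": 0.92},
    {"text": "555-876-5432", "label": "phone_number", "score": 0.74},
]

# Require a higher confidence for phone numbers than for other labels.
print(group_by_label(sample_entities, min_scores={"phone_number": 0.9}))
```

A per-label floor like this is one way to trade recall for precision on entity types where false positives are costly, while keeping the global `threshold` passed to `predict_entities` permissive.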
## Use Cases
Gretel GLiNER is ideal for applications requiring detection and redaction of sensitive information:
- Healthcare: Automating the extraction and redaction of patient information from medical records.
- Finance: Identifying and securing financial data such as account numbers and transaction details.
- Cybersecurity: Detecting sensitive information in logs and security reports.
- Legal: Processing contracts and legal documents to protect client information.
- Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
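For the redaction use cases above, the character offsets in each detected entity (`start` and `end`) can be used to mask the original text. The sketch below is one possible approach, not a function provided by `gliner`: the helper names `redact` and `make_entity` are hypothetical, the sample entities are written by hand instead of coming from the model, and the spans are assumed not to overlap.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    Assumes the spans do not overlap. Spans are processed from the end of
    the string backwards so that earlier offsets remain valid.
    """
    redacted = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['label'].upper()}]"
        redacted = redacted[: ent["start"]] + placeholder + redacted[ent["end"] :]
    return redacted


def make_entity(text, value, label):
    """Build an entity dict shaped like model output (illustration only)."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "text": value, "label": label}


sample_text = "Customer CID-982305 lives in Springfield, IL."
sample_entities = [
    make_entity(sample_text, "CID-982305", "customer_id"),
    make_entity(sample_text, "Springfield", "city"),
    make_entity(sample_text, "IL", "state"),
]

print(redact(sample_text, sample_entities))
# Customer [CUSTOMER_ID] lives in [CITY], [STATE].
```

In practice, `sample_entities` would be replaced by the list returned from `model.predict_entities(text, labels, threshold=...)`.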
## Citation and Usage
If you use this model or the dataset it was trained on in your research or applications, please cite it as:
```bibtex
@dataset{gretel-pii-masking-en-v1,
author = {Gretel AI},
title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
year = {2024},
month = {10},
publisher = {Gretel},
howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}
```
For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).