---
license: apache-2.0
language:
- en
library_name: gliner
datasets:
- gretelai/gretel-pii-masking-en-v1
pipeline_tag: token-classification
tags:
- PII
- PHI
- GLiNER
- information extraction
- encoder
- entity recognition
- privacy
---
# Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection
This **Gretel GLiNER** model is a fine-tuned version of the GLiNER base model `knowledgator/gliner-bi-base-v1.0`, specifically trained for the detection of Personally Identifiable Information (PII) and Protected Health Information (PHI).
Gretel GLiNER helps to provide privacy-compliant entity recognition across various industries and document types.
For more information about the base GLiNER model, including its architecture and general capabilities, please refer to the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-base-v1.0).
The model was fine-tuned on the `gretelai/gretel-pii-masking-en-v1` dataset, a diverse collection of synthetic document snippets containing PII and PHI entities. The dataset splits were used as follows:
1. **Training:** Utilized the training split of the synthetic dataset.
2. **Validation:** Monitored performance using the validation set to adjust training parameters.
3. **Evaluation:** Assessed final performance on the test set using PII/PHI entity annotations as ground truth.
For detailed statistics on the dataset, including domain and entity type distributions, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1).
### Model Performance
All fine-tuned Gretel GLiNER models demonstrate substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score. The table below lists the scores of the fine-tuned models:
| Model | Accuracy | Precision | Recall | F1 Score |
|---------------------------------------|----------|-----------|--------|----------|
| gretelai/gretel-gliner-bi-small-v1.0 | 0.89 | 0.98 | 0.91 | 0.94 |
| gretelai/gretel-gliner-bi-base-v1.0 | 0.91 | 0.98 | 0.92 | 0.95 |
| gretelai/gretel-gliner-bi-large-v1.0 | 0.91 | 0.99 | 0.93 | 0.95 |
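This model card does not state how the metrics above were computed (for example exact versus partial span matching, or micro versus macro averaging). As a rough illustration only, the sketch below computes precision, recall, and F1 under the assumption of exact-match scoring over `(start, end, label)` spans. The function name `span_metrics` and the toy spans are hypothetical and are not taken from the Gretel evaluation code.

```python
from typing import Iterable, Tuple

# A span is (start offset, end offset, label). This layout is an assumption
# made for illustration, not the format used in the original evaluation.
Span = Tuple[int, int, str]


def span_metrics(gold: Iterable[Span], predicted: Iterable[Span]) -> dict:
    """Exact-match precision, recall, and F1 over labeled spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)  # spans that match in offsets and label
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: four gold spans, three predictions, one with the wrong label.
gold = [(0, 10, "customer_id"), (20, 31, "city"), (33, 35, "state"), (40, 45, "postcode")]
predicted = [(0, 10, "customer_id"), (20, 31, "city"), (33, 35, "country")]

scores = span_metrics(gold, predicted)
print(scores)
```

Under this definition a prediction with the right offsets but the wrong label counts as both a false positive and a false negative, which is one reason reported scores depend heavily on the matching rule chosen.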
## Installation & Usage
Ensure you have Python installed. Then, install or update the `gliner` package:
```bash
pip install gliner -U
```
Load the fine-tuned Gretel GLiNER model using the `GLiNER` class and its `from_pretrained` method. Below is an example using the `gretelai/gretel-gliner-bi-base-v1.0` model for PII/PHI detection:
```python
from gliner import GLiNER
# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-base-v1.0")
# Sample text containing PII/PHI entities
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: [email protected]
"""
# Define the labels for PII/PHI entities
labels = [
"medical_record_number",
"date_of_birth",
"ssn",
"date",
"first_name",
"email",
"last_name",
"customer_id",
"employee_id",
"name",
"street_address",
"phone_number",
"ipv4",
"credit_card_number",
"license_plate",
"address",
"user_name",
"device_identifier",
"bank_routing_number",
"date_time",
"company_name",
"unique_identifier",
"biometric_identifier",
"account_number",
"city",
"certificate_license_number",
"time",
"postcode",
"vehicle_identifier",
"coordinate",
"country",
"api_key",
"ipv6",
"password",
"health_plan_beneficiary_number",
"national_id",
"tax_id",
"url",
"state",
"swift_bic",
"cvv",
"pin"
]
# Predict entities with a confidence threshold of 0.7
entities = model.predict_entities(text, labels, threshold=0.7)
# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")
```
Expected Output:
```
CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
[email protected] => email
```
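Beyond the `text` and `label` fields printed above, each dictionary returned by `predict_entities` also carries a confidence `score`. The sketch below shows one possible way to post-filter and group detections by label. The helper `group_by_label`, its `min_score` parameter, and the score values in `sample_entities` are hypothetical illustrations, not model output.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Group entity texts by label, keeping only detections at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        # Entities without a score are kept; this fallback is an assumption.
        if entity.get("score", 1.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written stand-ins for model output; the scores are made up.
sample_entities = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.97},
    {"text": "(312) 555-7890", "label": "phone_number", "score": 0.93},
    {"text": "555-876-5432", "label": "phone_number", "score": 0.74},
    {"text": "Springfield", "label": "city", "score": 0.88},
]

high_confidence = group_by_label(sample_entities, min_score=0.8)
print(high_confidence)
```

Filtering after prediction lets you run the model once with a permissive `threshold` and then apply stricter cutoffs per use case without re-running inference.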
## Use Cases
Gretel GLiNER is ideal for applications requiring detection and redaction of sensitive information:
- Healthcare: Automating the extraction and redaction of patient information from medical records.
- Finance: Identifying and securing financial data such as account numbers and transaction details.
- Cybersecurity: Detecting sensitive information in logs and security reports.
- Legal: Processing contracts and legal documents to protect client information.
- Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
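For the redaction use cases listed above, detected spans can be replaced with placeholders using the character offsets in each entity. The following is a minimal sketch, assuming entities are dictionaries with `start`, `end`, and `label` keys as returned by `predict_entities`. The helpers `redact` and `make_entity` are hypothetical names introduced here, the entities are built by hand rather than predicted, and overlapping spans are not handled.

```python
def make_entity(text, value, label):
    """Build an entity dict by locating value in text (stand-in for model output)."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "text": value, "label": label}


def redact(text, entities):
    """Replace each entity span with a [LABEL] placeholder."""
    redacted = text
    # Process spans right to left so earlier offsets stay valid after each edit.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


document = "Customer CID-982305 lives in Springfield, IL."
detected = [
    make_entity(document, "CID-982305", "customer_id"),
    make_entity(document, "Springfield", "city"),
    make_entity(document, "IL", "state"),
]

redacted_document = redact(document, detected)
print(redacted_document)
```

In practice `detected` would come from `model.predict_entities(document, labels, threshold=...)`. Any entity the model misses is left in the text, so the recall figures matter when relying on this approach for compliance.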
## Citation and Usage
If you use these models or the underlying dataset in your research or applications, please cite them as:
```bibtex
@dataset{gretel-pii-masking-en-v1,
author = {Gretel AI},
title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
year = {2024},
month = {10},
publisher = {Gretel},
howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}
```
For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).