---
license: apache-2.0
language:
  - en
library_name: gliner
datasets:
  - gretelai/gretel-pii-masking-en-v1
pipeline_tag: token-classification
tags:
  - PII
  - PHI
  - GLiNER
  - information extraction
  - encoder
  - entity recognition
  - privacy
---

# Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection
This **Gretel GLiNER** model is a fine-tuned version of the GLiNER base model `knowledgator/gliner-bi-base-v1.0`, trained to detect Personally Identifiable Information (PII) and Protected Health Information (PHI).
It supports privacy-focused entity recognition across industries and document types.
For details on the base GLiNER model, including its architecture and general capabilities, see the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-base-v1.0).

The model was fine-tuned on the `gretelai/gretel-pii-masking-en-v1` dataset, a diverse collection of synthetic document snippets containing PII and PHI entities. The dataset splits were used as follows:

1. **Training:** The model was fine-tuned on the training split of the synthetic dataset.
2. **Validation:** Performance on the validation split was monitored to adjust training parameters.
3. **Evaluation:** Final performance was assessed on the test split, using the PII/PHI entity annotations as ground truth.
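
The evaluation step above can be sketched as a comparison between predicted and annotated spans. The card does not say how the metrics were computed, so this is a minimal illustration under two assumptions: spans are represented as `(start, end, label)` tuples, and a prediction counts as correct only on an exact match. The function name `span_prf` and the sample spans are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A span is (start_offset, end_offset, label); this representation is an assumption.
Span = Tuple[int, int, str]


def span_prf(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over labeled spans."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical annotations for one snippet (not taken from the dataset).
gold = [(15, 25, "customer_id"), (43, 54, "city"), (60, 62, "state")]
pred = [(15, 25, "customer_id"), (43, 54, "city"), (70, 75, "postcode")]

precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice; a partial-overlap criterion would give higher scores on the same predictions.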

For detailed statistics on the dataset, including domain and entity type distributions, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1).

## Model Performance

All fine-tuned Gretel GLiNER models improve substantially on their base counterparts in accuracy, precision, recall, and F1 score. The table below lists the results for the fine-tuned models:

| Model                                | Accuracy | Precision | Recall | F1 Score |
|--------------------------------------|----------|-----------|--------|----------|
| gretelai/gretel-gliner-bi-small-v1.0 | 0.89     | 0.98      | 0.91   | 0.94     |
| gretelai/gretel-gliner-bi-base-v1.0  | 0.91     | 0.98      | 0.92   | 0.95     |
| gretelai/gretel-gliner-bi-large-v1.0 | 0.91     | 0.99      | 0.93   | 0.95     |


## Installation & Usage

Ensure you have Python installed. Then, install or update the `gliner` package:

```bash
pip install gliner -U
```

Load the fine-tuned Gretel GLiNER model with `GLiNER.from_pretrained`. The example below uses the `gretelai/gretel-gliner-bi-base-v1.0` model for PII/PHI detection:

```python
from gliner import GLiNER

# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-base-v1.0")

# Sample text containing PII/PHI entities
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: [email protected]
"""

# Define the labels for PII/PHI entities
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin"
]

# Predict entities with a confidence threshold of 0.7
entities = model.predict_entities(text, labels, threshold=0.7)

# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")
```

Expected output:

```
CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
[email protected] => email
```
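
Each entity returned by `predict_entities` is a dictionary that also carries a confidence `score`. One way to explore the effect of the threshold, sketched here, is to predict once with a low threshold and then filter the results locally. The helper `filter_by_score` and the sample scores are hypothetical and for illustration only; results filtered this way may not be identical to calling the model again with a higher threshold.

```python
from typing import Dict, List


def filter_by_score(entities: List[Dict], min_score: float) -> List[Dict]:
    """Keep only entities whose confidence score is at least min_score."""
    return [e for e in entities if e["score"] >= min_score]


# Hypothetical predictions with illustrative scores, shaped like the
# dictionaries returned by predict_entities.
sample_entities = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.95},
    {"text": "Springfield", "label": "city", "score": 0.82},
    {"text": "Oak", "label": "last_name", "score": 0.41},
]

# Compare how many entities survive at different cutoffs.
for cutoff in (0.3, 0.7, 0.9):
    kept = filter_by_score(sample_entities, cutoff)
    print(cutoff, [e["text"] for e in kept])
```

A higher cutoff generally favors precision, while a lower one favors recall.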

## Use Cases

Gretel GLiNER is suited to applications that need to detect and redact sensitive information:

- Healthcare: Automating the extraction and redaction of patient information from medical records.
- Finance: Identifying and securing financial data such as account numbers and transaction details.
- Cybersecurity: Detecting sensitive information in logs and security reports.
- Legal: Processing contracts and legal documents to protect client information.
- Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
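
The redaction use cases above can be sketched with the character offsets (`start`, `end`) that `predict_entities` includes in each entity dictionary. The `redact` helper, the `[LABEL]` placeholder format, and the sample entities are assumptions for illustration; the sketch also assumes the detected spans do not overlap.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        text = text[: entity["start"]] + placeholder + text[entity["end"]:]
    return text


sample_text = "Customer CID-982305 lives in Springfield."
# Hypothetical entities; offsets are character positions in sample_text.
sample_entities = [
    {"start": 9, "end": 19, "label": "customer_id"},
    {"start": 29, "end": 40, "label": "city"},
]

redacted = redact(sample_text, sample_entities)
print(redacted)
```

In practice the `entities` argument would be the list returned by `model.predict_entities(text, labels, threshold=0.7)` for the same `text`.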

## Citation and Usage

If you use this model or its training dataset in your research or applications, please cite it as:

```bibtex
@dataset{gretel-pii-masking-en-v1,
  author       = {Gretel AI},
  title        = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
  year         = {2024},
  month        = {10},
  publisher    = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}
```

For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).