---
license: apache-2.0
language:
- en
library_name: gliner
datasets:
- gretelai/gretel-pii-masking-en-v1
pipeline_tag: token-classification
tags:
- PII
- PHI
- GLiNER
- information extraction
- encoder
- entity recognition
- privacy
---

# Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection

This **Gretel GLiNER** model is a fine-tuned version of the GLiNER base model `knowledgator/gliner-bi-base-v1.0`, specifically trained for the detection of Personally Identifiable Information (PII) and Protected Health Information (PHI). Gretel GLiNER helps to provide privacy-compliant entity recognition across various industries and document types.

For more information about the base GLiNER model, including its architecture and general capabilities, please refer to the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-base-v1.0).

The model was fine-tuned on the `gretelai/gretel-pii-masking-en-v1` dataset, which provides a rich and diverse collection of synthetic document snippets containing PII and PHI entities.

1. **Training:** Utilized the training split of the synthetic dataset.
2. **Validation:** Monitored performance using the validation set to adjust training parameters.
3. **Evaluation:** Assessed final performance on the test set using PII/PHI entity annotations as ground truth.

For detailed statistics on the dataset, including domain and entity type distributions, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1).

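
The evaluation step described above compares predicted entities against ground-truth annotations. As a rough illustration only, the sketch below computes micro-averaged precision, recall, and F1 over exact `(text, label)` matches. The function name `span_prf` and the sample annotations are hypothetical, and the matching rule is an assumption; the metrics reported for these models may have been computed differently.

```python
from collections import Counter


def span_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over exact (text, label) matches."""
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    # Multiset intersection: each prediction can match at most one gold item.
    true_positives = sum((gold_counts & pred_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical annotations for one snippet (not taken from the dataset).
gold = [("CID-982305", "customer_id"), ("Springfield", "city"), ("62704", "postcode")]
pred = [("CID-982305", "customer_id"), ("Springfield", "city"), ("IL", "state")]

precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=0.67 f1=0.67
```

Exact matching is the strictest choice; a partial-overlap or offset-based criterion would give different numbers on the same predictions.
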
### Model Performance

All fine-tuned Gretel GLiNER models demonstrate substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score:

| Model                                 | Accuracy | Precision | Recall | F1 Score |
|---------------------------------------|----------|-----------|--------|----------|
| gretelai/gretel-gliner-bi-small-v1.0  | 0.89     | 0.98      | 0.91   | 0.94     |
| gretelai/gretel-gliner-bi-base-v1.0   | 0.91     | 0.98      | 0.92   | 0.95     |
| gretelai/gretel-gliner-bi-large-v1.0  | 0.91     | 0.99      | 0.93   | 0.95     |

## Installation & Usage

Ensure you have Python installed. Then, install or update the `gliner` package:

```bash
pip install gliner -U
```

Load the fine-tuned Gretel GLiNER model using the `GLiNER` class and its `from_pretrained` method. Below is an example using the `gretelai/gretel-gliner-bi-base-v1.0` model for PII/PHI detection:

```python
from gliner import GLiNER

# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-base-v1.0")

# Sample text containing PII/PHI entities
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: janedoe@company.com
"""

# Define the labels for PII/PHI entities
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin"
]

# Predict entities with a confidence threshold of 0.7
entities = model.predict_entities(text, labels, threshold=0.7)

# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")
```

Expected Output:

```
CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
janedoe@company.com => email
```

## Use Cases

Gretel GLiNER is ideal for applications requiring detection and redaction of sensitive information:

- Healthcare: Automating the extraction and redaction of patient information from medical records.
- Finance: Identifying and securing financial data such as account numbers and transaction details.
- Cybersecurity: Detecting sensitive information in logs and security reports.
- Legal: Processing contracts and legal documents to protect client information.
- Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.

## Citation and Usage

If you use this model or its training dataset in your research or applications, please cite it as:

```bibtex
@dataset{gretel-pii-masking-en-v1,
  author = {Gretel AI},
  title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
  year = {2024},
  month = {10},
  publisher = {Gretel},
  howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
}
```

For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).
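
## Example: Redacting Detected Entities

The use cases above mention redaction as well as detection. The following is a minimal sketch of one way to mask detected spans, assuming each entity dictionary carries `start` and `end` character offsets alongside `text` and `label`. The helpers `make_entity` and `redact` are hypothetical names introduced here, and the entity list is hand-written so the snippet runs without downloading the model.

```python
def make_entity(text, value, label):
    """Build an entity dict for the first occurrence of `value` (illustration only)."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "text": value, "label": label}


def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['label'].upper()}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


sample = "Email: janedoe@company.com, Phone: (312) 555-7890"

# Hand-written stand-in for the list a prediction call would return.
entities = [
    make_entity(sample, "janedoe@company.com", "email"),
    make_entity(sample, "(312) 555-7890", "phone_number"),
]

print(redact(sample, entities))
# prints: Email: [EMAIL], Phone: [PHONE_NUMBER]
```

This sketch assumes the spans do not overlap; overlapping predictions would need to be merged or filtered before redaction.
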