	---
	license: apache-2.0
	language:
	- en
	library_name: gliner
	datasets:
	- gretelai/gretel-pii-masking-en-v1
	pipeline_tag: token-classification
	tags:
	- PII
	- PHI
	- GLiNER
	- information extraction
	- encoder
	- entity recognition
	- privacy
	---

	# Gretel GLiNER: Fine-Tuned Models for PII/PHI Detection
	This Gretel GLiNER model is a fine-tuned version of the GLiNER base model `knowledgator/gliner-bi-small-v1.0`, specifically trained for the detection of Personally Identifiable Information (PII) and Protected Health Information (PHI).
	Gretel GLiNER helps to provide privacy-compliant entity recognition across various industries and document types.
	For more information about the base GLiNER model, including its architecture and general capabilities, please refer to the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-small-v1.0).

	The model was fine-tuned on the `gretelai/gretel-pii-masking-en-v1` dataset, which provides a rich and diverse collection of synthetic document snippets containing PII and PHI entities.

	1. Training: Utilized the training split of the synthetic dataset.
	2. Validation: Monitored performance using the validation set to adjust training parameters.
	3. Evaluation: Assessed final performance on the test set using PII/PHI entity annotations as ground truth.

	For detailed statistics on the dataset, including domain and entity type distributions, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1).

	## Model Performance

	All fine-tuned Gretel GLiNER models demonstrate substantial improvements over their base counterparts in accuracy, precision, recall, and F1 score:

	| Model                                 | Accuracy | Precision | Recall | F1 Score |
	|---------------------------------------|----------|-----------|--------|----------|
	| gretelai/gretel-gliner-bi-small-v1.0  | 0.89     | 0.98      | 0.91   | 0.94     |
	| gretelai/gretel-gliner-bi-base-v1.0   | 0.91     | 0.98      | 0.92   | 0.95     |
	| gretelai/gretel-gliner-bi-large-v1.0  | 0.91     | 0.99      | 0.93   | 0.95     |

	## Installation & Usage

	Ensure you have Python installed. Then, install or update the `gliner` package:

	```bash
	pip install gliner -U
	```

	Load the fine-tuned Gretel GLiNER model using the `GLiNER` class and its `from_pretrained` method. The example below uses the `gretelai/gretel-gliner-bi-small-v1.0` model for PII/PHI detection:

	```python
	from gliner import GLiNER

	# Load the fine-tuned GLiNER model
	model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-small-v1.0")

	# Sample text containing PII/PHI entities
	text = """
	Purchase Order
	----------------
	Date: 10/05/2023
	----------------
	Customer Name: CID-982305
	Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
	Phone: (312) 555-7890 (555-876-5432)
	Email: [email protected]
	"""

	# Define the labels for PII/PHI entities
	labels = [
	    "medical_record_number",
	    "date_of_birth",
	    "ssn",
	    "date",
	    "first_name",
	    "email",
	    "last_name",
	    "customer_id",
	    "employee_id",
	    "name",
	    "street_address",
	    "phone_number",
	    "ipv4",
	    "credit_card_number",
	    "license_plate",
	    "address",
	    "user_name",
	    "device_identifier",
	    "bank_routing_number",
	    "date_time",
	    "company_name",
	    "unique_identifier",
	    "biometric_identifier",
	    "account_number",
	    "city",
	    "certificate_license_number",
	    "time",
	    "postcode",
	    "vehicle_identifier",
	    "coordinate",
	    "country",
	    "api_key",
	    "ipv6",
	    "password",
	    "health_plan_beneficiary_number",
	    "national_id",
	    "tax_id",
	    "url",
	    "state",
	    "swift_bic",
	    "cvv",
	    "pin",
	]

	# Predict entities with a confidence threshold of 0.7
	entities = model.predict_entities(text, labels, threshold=0.7)

	# Display the detected entities
	for entity in entities:
	    print(f"{entity['text']} => {entity['label']}")
	```

	Expected Output:


	```
	CID-982305 => customer_id
	1234 Oak Street, Suite 400 => street_address
	Springfield => city
	IL => state
	62704 => postcode
	(312) 555-7890 => phone_number
	555-876-5432 => phone_number
	[email protected] => email
	```
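
Each item returned by `predict_entities` is a dictionary that, besides `text` and `label`, also carries a confidence `score`. A minimal sketch of post-processing under that assumption: group detections by label and apply a stricter per-label threshold. The `group_by_label` helper and the `sample_entities` list are illustrative, and the scores are made up rather than real model output.

```python
from collections import defaultdict


def group_by_label(entities, min_scores=None, default_min=0.0):
    """Group entity texts by label, dropping low-confidence detections.

    `min_scores` maps a label to its minimum accepted score; labels not
    listed fall back to `default_min`.
    """
    min_scores = min_scores or {}
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_scores.get(entity["label"], default_min):
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written stand-in for model output (scores are made up).
sample_entities = [
    {"text": "CID-982305", "label": "customer_id", "score": 0.93},
    {"text": "Springfield", "label": "city", "score": 0.88},
    {"text": "IL", "label": "state", "score": 0.72},
]

# Require a higher score for "state" than for the other labels.
print(group_by_label(sample_entities, min_scores={"state": 0.8}))
```

Per-label thresholds like this can help when short labels such as `state` produce more false positives than longer identifiers.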

	## Use Cases

	Gretel GLiNER is ideal for applications requiring detection and redaction of sensitive information:

	- Healthcare: Automating the extraction and redaction of patient information from medical records.
	- Finance: Identifying and securing financial data such as account numbers and transaction details.
	- Cybersecurity: Detecting sensitive information in logs and security reports.
	- Legal: Processing contracts and legal documents to protect client information.
	- Data Privacy Compliance: Ensuring data handling processes adhere to regulations like GDPR and HIPAA by accurately identifying PII/PHI.
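
For the redaction use cases above, detections can be turned into masked text with a few lines of string handling. The sketch below assumes each entity carries `start` and `end` character offsets with `end` exclusive, as `predict_entities` output does to our knowledge. The `redact` helper and the sample input are hypothetical, and overlapping spans are not handled.

```python
def redact(text, entities):
    """Replace each detected span with a `[LABEL]` placeholder.

    Spans are applied from right to left so that earlier character
    offsets stay valid while the string is being edited.
    """
    redacted = text
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


# Hand-written example; in practice `entities` would come from
# model.predict_entities(text, labels, threshold=...).
text = "Customer CID-982305 lives in Springfield."
entities = [
    {"start": 9, "end": 19, "label": "customer_id"},
    {"start": 29, "end": 40, "label": "city"},
]

print(redact(text, entities))
```

Using the label as the placeholder keeps the redacted text readable for downstream review; a fixed mask such as `***` would work the same way.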

	## Citation and Usage

	If you use this model or its training dataset in your research or applications, please cite it as:

	```bibtex
	@dataset{gretel-pii-masking-en-v1,
	author = {Gretel AI},
	title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
	year = {2024},
	month = {10},
	publisher = {Gretel},
	howpublished = {https://huggingface.co/gretelai/gretel-pii-masking-en-v1}
	}
	```

	For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).