---
language: en
tags:
  - token-classification
  - pii-detection
license: apache-2.0
datasets:
  - custom_dataset
---

Model name

PII Detection Model Based on DistilBERT

Model description

This is a token classification model trained to detect personally identifiable information (PII) entities such as names, addresses, dates of birth, and credit card numbers. It is based on the DistilBERT architecture and was fine-tuned on a custom dataset for PII detection.

Intended use

The model is intended to automatically identify and extract PII entities from text. It can be incorporated into data processing pipelines for tasks such as data anonymization, redaction, and compliance with privacy regulations.
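For the redaction use case, the model's span predictions can feed a simple masking step. A minimal sketch, assuming the entity dicts follow the shape produced by Hugging Face's token-classification pipeline with `aggregation_strategy="simple"` (character `start`/`end` offsets plus an `entity_group` label); the sample sentence and predictions below are mock data, not actual model output:

```python
# Hypothetical redaction helper operating on token-classification output.
# Each entity dict mimics transformers' pipeline("token-classification",
# aggregation_strategy="simple") format: entity_group, start, end.

def redact(text, entities, mask="[{label}]"):
    """Replace each detected PII span with a mask, processing spans
    right to left so earlier character offsets remain valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + mask.format(label=ent["entity_group"]) + text[ent["end"]:]
    return text

# Mock predictions for the sample sentence (offsets are character indices):
sample = "John Doe lives at 12 Main St."
preds = [
    {"entity_group": "NAME", "start": 0, "end": 8},
    {"entity_group": "ADDRESS", "start": 18, "end": 28},
]
print(redact(sample, preds))  # [NAME] lives at [ADDRESS].
```

Masking right to left is a small but necessary detail: replacing a span changes the string length, so left-to-right replacement would invalidate the offsets of later entities.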

Evaluation results

The model's performance was evaluated on a held-out validation set using the following metrics:

  • Precision: 94%
  • Recall: 96%
  • F1 Score: 95%
  • Accuracy: 99%
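For reference, these metrics derive from the usual true-positive / false-positive / false-negative tallies. The counts below are hypothetical, chosen only so that the rounded results line up with the figures above (the actual evaluation tallies are not published):

```python
# Precision/recall/F1 from raw counts. The counts are illustrative,
# NOT the real evaluation tallies for this model.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(tp=96, fp=6, fn=4)  # hypothetical counts
print(f"precision={p:.0%} recall={r:.0%} f1={f1:.0%}")
# precision=94% recall=96% f1=95%
```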

Limitations and bias

  • The model's performance may vary depending on the quality and diversity of the input data.
  • It may exhibit biases present in the training data, such as overrepresentation or underrepresentation of certain demographic groups or types of PII.
  • The model may struggle with detecting PII entities in noisy or poorly formatted text.

Ethical considerations

  • Care should be taken when deploying the model in production to ensure that it does not inadvertently expose sensitive information or violate individuals' privacy rights.
  • Data used to train and evaluate the model should be handled with caution to avoid the risk of exposing PII.
  • Regular monitoring and auditing of the model's predictions may be necessary to identify and mitigate any potential biases or errors.

Model Training and Evaluation Results

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 Score | Accuracy |
|-------|---------------|-----------------|-----------|--------|----------|----------|
| 1     | 0.0470        | 0.051537        | 91.35%    | 95.23% | 93.25%   | 98.56%   |
| 2     | 0.0307        | 0.043873        | 93.27%    | 96.10% | 94.66%   | 98.75%   |
| 3     | 0.0208        | 0.047020        | 91.83%    | 95.49% | 93.62%   | 98.54%   |
| 4     | 0.0147        | 0.046979        | 93.27%    | 94.97% | 94.11%   | 98.77%   |
| 5     | 0.0094        | 0.057863        | 93.41%    | 95.92% | 94.65%   | 98.70%   |
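If checkpoints were saved per epoch, the figures above point to epoch 2 as the natural pick: it has both the lowest validation loss and the highest F1. A small sketch that restates the table's validation-loss and F1 columns and selects the best row programmatically:

```python
# (epoch, validation_loss, f1) rows, restated from the table above.
history = [
    (1, 0.051537, 0.9325),
    (2, 0.043873, 0.9466),
    (3, 0.047020, 0.9362),
    (4, 0.046979, 0.9411),
    (5, 0.057863, 0.9465),
]

best = max(history, key=lambda row: row[2])  # select by F1
print(f"best epoch by F1: {best[0]} (F1={best[2]:.2%}, val_loss={best[1]})")
# best epoch by F1: 2 (F1=94.66%, val_loss=0.043873)
```

Note also that validation loss rises again after epoch 4 while training loss keeps falling, the usual sign that further epochs would start to overfit.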

Authors