Model Card for ai4privacy-mdeberta-v3-base-general-preprocessed

This model detects PII (Personally Identifiable Information). It was trained by the "The Last Ones" team at the NeuralWave Hackathon.
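As a rough illustration of how the detections could be used, the sketch below masks character spans in a text. It assumes the predictions arrive as a list of dicts with `entity_group`, `start`, and `end` keys, which is the shape returned by the Hugging Face `token-classification` pipeline with `aggregation_strategy="simple"`. The helper name `mask_entities`, the sample text, and the hand-written predictions are hypothetical.

```python
def mask_entities(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hand-written stand-in for model output (hypothetical values).
# With transformers installed, real predictions might come from:
#   from transformers import pipeline
#   ner = pipeline("token-classification", model="<this repository's id>",
#                  aggregation_strategy="simple")
#   entities = ner(text)
text = "Write to Maria at maria@example.com."
entities = [
    {"entity_group": "GIVENNAME", "start": 9, "end": 14},
    {"entity_group": "EMAIL", "start": 18, "end": 35},
]
masked = mask_entities(text, entities)
print(masked)  # Write to [GIVENNAME] at [EMAIL].
```

Replacing from the last span to the first avoids recomputing offsets after each substitution, the same idea the preprocessing code in this card uses.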

Model Details

This model was fine-tuned from microsoft/mdeberta-v3-base on the ai4privacy/pii-masking-400k dataset.

We used the following training arguments for fine-tuning:

learning_rate=3e-5,
per_device_train_batch_size=58,
per_device_eval_batch_size=58,
num_train_epochs=3,
weight_decay=0.01,
bf16=True,
seed=42

All other hyperparameters use the TrainingArguments defaults.
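A minimal sketch of how these values might be passed to `TrainingArguments`. The `output_dir` argument and the helper name `build_training_args` are assumptions, since the original training script is not shown here; `bf16=True` generally requires hardware with bfloat16 support.

```python
# Values listed above; everything else is left at the TrainingArguments defaults.
training_kwargs = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 58,
    "per_device_eval_batch_size": 58,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "bf16": True,
    "seed": 42,
}


def build_training_args(output_dir):
    """Create TrainingArguments; `output_dir` is an assumed, caller-chosen path."""
    # Imported lazily so the dict above can be inspected without transformers.
    from transformers import TrainingArguments
    return TrainingArguments(output_dir=output_dir, **training_kwargs)
```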

Training Data

ai4privacy/pii-masking-400k
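The preprocessing code in this card refers to `label_set` and `label2id` without defining them. One way they might be derived, assuming a BIO tagging scheme over the label names found in each record's `privacy_mask`, is sketched below. The helper name `build_label_maps`, the tag ordering, and the toy records are hypothetical and may differ from what was actually used.

```python
def build_label_maps(records):
    """Collect label names from `privacy_mask` entries and derive BIO tag ids."""
    label_set = {item["label"] for rec in records for item in rec["privacy_mask"]}
    tags = ["O"]
    for name in sorted(label_set):
        tags += [f"B-{name}", f"I-{name}"]
    label2id = {tag: idx for idx, tag in enumerate(tags)}
    id2label = {idx: tag for tag, idx in label2id.items()}
    return label_set, label2id, id2label


# Toy records mimicking the dataset's `privacy_mask` field (values are made up).
records = [
    {"privacy_mask": [{"label": "GIVENNAME", "start": 9, "end": 14, "value": "Maria"}]},
    {"privacy_mask": [{"label": "EMAIL", "start": 18, "end": 35, "value": "maria@example.com"}]},
]
label_set, label2id, id2label = build_label_maps(records)
print(label2id)
# {'O': 0, 'B-EMAIL': 1, 'I-EMAIL': 2, 'B-GIVENNAME': 3, 'I-GIVENNAME': 4}
```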

Preprocessing

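import re

# label_set, label2id and tokenizer are assumed to be defined elsewhere (not shown here)

def generate_sequence_labels(text, privacy_mask):
    # sort privacy mask by start position in descending order, so that
    # replacing a span does not shift the offsets of the spans before it
    privacy_mask = sorted(privacy_mask, key=lambda x: x['start'], reverse=True)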
    
    # replace sensitive pieces of text with labels
    for item in privacy_mask:
        label = item['label']
        start = item['start']
        end = item['end']
        value = item['value']
        # count the number of words in the value
        word_count = len(value.split())
        
        # replace the sensitive information with one label placeholder per word
        replacement = " ".join([label] * word_count)
        text = text[:start] + replacement + text[end:]
        
    words = text.split()
    # assign labels to each word
    labels = []
    for word in words:
        match = re.search(r"(\w+)", word)  # match any word character
        if match:
            label = match.group(1)
            if label in label_set:
                labels.append(label)
            else:
                # any other word is labeled as "O"
                labels.append("O")
        else:
            labels.append("O")
    return labels

k = 0  # number of examples whose word-level labels could not be aligned
def tokenize_and_align_labels(examples):
    words = [t.split() for t in examples["source_text"]]
    tokenized_inputs = tokenizer(words, truncation=True, is_split_into_words=True, max_length=512)
    source_labels = [
        generate_sequence_labels(text, mask)
        for text, mask in zip(examples["source_text"], examples["privacy_mask"])
    ]

    labels = []
    for i, label in enumerate(source_labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map tokens to their respective word.
        previous_label = None
        label_ids = [-100]
        try:
            for word_idx in word_ids:
                if word_idx is None:
                    continue
                elif label[word_idx] == "O":
                    label_ids.append(label2id["O"])
                    continue
                elif previous_label == label[word_idx]:
                    label_ids.append(label2id[f"I-{label[word_idx]}"])
                else:
                    label_ids.append(label2id[f"B-{label[word_idx]}"])
                previous_label = label[word_idx]
            label_ids = label_ids[:511] + [-100]
            labels.append(label_ids)
        except Exception:
            # the word-level labels could not be aligned with the tokenizer's words;
            # count the example and exclude all of its tokens from the loss
            global k
            k += 1
            labels.append([-100] * len(tokenized_inputs["input_ids"][i]))

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

We use these two functions to generate word-level labels from the source text and then align them with the tokenizer's tokens to obtain token-level labels. Because the labels are derived at the word level, the same preprocessing can be used with other models and tokenizers when training on ai4privacy/pii-masking-400k.
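To make the word-level step concrete, here is a condensed restatement of it applied to a toy sentence. The function name `word_labels`, the helper `span`, and the sample text and labels are hypothetical; the logic is intended to mirror `generate_sequence_labels`, with the label set passed in explicitly.

```python
import re


def word_labels(text, privacy_mask, label_set):
    """Condensed restatement of the word-level labeling step."""
    # Replace spans right-to-left so earlier character offsets stay valid.
    for item in sorted(privacy_mask, key=lambda m: m["start"], reverse=True):
        placeholder = " ".join([item["label"]] * len(item["value"].split()))
        text = text[:item["start"]] + placeholder + text[item["end"]:]
    labels = []
    for word in text.split():
        match = re.search(r"\w+", word)
        labels.append(match.group(0) if match and match.group(0) in label_set else "O")
    return labels


def span(text, label, value):
    """Hypothetical helper: build a privacy_mask entry by locating `value` in `text`."""
    start = text.index(value)
    return {"label": label, "start": start, "end": start + len(value), "value": value}


text = "Contact Maria Rossi at maria@example.com today."
mask = [
    span(text, "GIVENNAME", "Maria"),
    span(text, "SURNAME", "Rossi"),
    span(text, "EMAIL", "maria@example.com"),
]
labels = word_labels(text, mask, {"GIVENNAME", "SURNAME", "EMAIL"})
print(list(zip(text.split(), labels)))
# [('Contact', 'O'), ('Maria', 'GIVENNAME'), ('Rossi', 'SURNAME'),
#  ('at', 'O'), ('maria@example.com', 'EMAIL'), ('today.', 'O')]
```

One consequence of this approach is that an ordinary word in the text that happens to equal a label name would also receive that label, so it relies on label names not occurring as regular words.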

Evaluation

Evaluation results for this model (model 2) on the validation set are shown in the table.

Disclaimer of Non-Affiliation

The publisher of this repository is not affiliated with Ai4Privacy or Ai Suisse SA.

@NeuralWave 2024 - The Last Ones Team.
