
SentenceTransformer based on distilbert/distilbert-base-uncased-finetuned-sst-2-english

This is a sentence-transformers model finetuned from distilbert/distilbert-base-uncased-finetuned-sst-2-english. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: distilbert/distilbert-base-uncased-finetuned-sst-2-english
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: sentence-transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
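
The modules above are a DistilBERT encoder followed by mean pooling over token embeddings. Loading the published checkpoint by name (see Usage below) is the normal path; the sketch below only illustrates how an equivalent two-module model could be assembled by hand with the models API, assuming the same 512-token limit and mean pooling.

from sentence_transformers import SentenceTransformer, models

# Illustrative only: rebuild the Transformer + mean-pooling stack shown above.
word_embedding = models.Transformer(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    max_seq_length=512,
)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 768
    pooling_mode_mean_tokens=True,                  # mean pooling, as in module (1) above
)
model = SentenceTransformer(modules=[word_embedding, pooling])
print(model)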

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("wasabibish/similarity-code-ai-generated")
# Run inference
sentences = [
    'def move_zeroes(nums):\n  count = 0\n  for i in range(len(nums)):\n    if nums[i] != 0:\n      nums[count], nums[i]= nums[i], nums[count]\n      count += 1\n  for i in range(count, len(nums)):\n    nums[i] =0\n\ninput = [int(x) for x in input("Enter integers separated by spaces: ").split()]\nmove_zeroes(input)\n\nprint(input)',
    'def move_zeros_to_end(lst):\n    zero_count = 0\n    for i in range(len(lst)):\n        if lst[i] != 0:\n            lst[i], lst[zero_count] = lst[zero_count], lst[i]\n            zero_count += 1\n\n# Test cases\nlst1 = [0, 1, 0, 3, 12]\nmove_zeros_to_end(lst1)\nprint(lst1)  # Output: [1, 3, 12, 0, 0]\n\nlst2 = [0, 0, 1]\nmove_zeros_to_end(lst2)\nprint(lst2)  # Output: [1, 0, 0]\n',
    'using System;\nusing System.Collections.Generic;\n\nclass BracketChecker\n{\n    private readonly Dictionary<char, char> bracketPairs = new Dictionary<char, char>\n    {\n        { \'(\', \')\' },\n        { \'[\', \']\' },\n        { \'{\', \'}\' }\n    };\n\n    public bool CheckBalancedBrackets(string input)\n    {\n        if (string.IsNullOrEmpty(input))\n        {\n            return true;\n        }\n\n        Stack<char> stack = new Stack<char>();\n\n        foreach (char c in input)\n        {\n            if (bracketPairs.ContainsValue(c))\n            {\n                if (stack.Count == 0 || bracketPairs[stack.Peek()] != c)\n                {\n                    return false;\n                }\n                stack.Pop();\n            }\n            else if (bracketPairs.ContainsKey(c))\n            {\n                stack.Push(c);\n            }\n        }\n\n        return stack.Count == 0;\n    }\n}\n\nclass Program\n{\n    static void Main()\n    {\n        BracketChecker bracketChecker = new BracketChecker();\n\n        string input1 = "(a+[b*c]-{d/e})";\n        Console.WriteLine("Input: \\"{0}\\"", input1);\n        Console.WriteLine("Output: {0}\\n", bracketChecker.CheckBalancedBrackets(input1));\n\n        string input2 = "(a+[b*c)-{d/e}]";\n        Console.WriteLine("Input: \\"{0}\\"", input2);\n        Console.WriteLine("Output: {0}", bracketChecker.CheckBalancedBrackets(input2));\n    }\n}\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
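
Beyond pairwise similarity, the model card lists semantic search as a use case. The following is a minimal, hypothetical sketch of ranking a query snippet against a small corpus with sentence_transformers.util.semantic_search; the corpus and query strings are placeholders, not data from this model's training set.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("wasabibish/similarity-code-ai-generated")

# Placeholder corpus and query; any short code snippets would do.
corpus = [
    "def reverse_string(s):\n    return s[::-1]",
    "for (int i = 0; i < n; i++) { sum += a[i]; }",
]
query = "def reversed_text(text):\n    return text[::-1]"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns one result list per query, each entry with 'corpus_id' and 'score'.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), corpus[hit["corpus_id"]])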

Evaluation

Metrics

Semantic Similarity

Metric Value
pearson_cosine 0.9
spearman_cosine 0.9014
pearson_manhattan 0.862
spearman_manhattan 0.802
pearson_euclidean 0.8685
spearman_euclidean 0.8234
pearson_dot 0.8495
spearman_dot 0.8948
pearson_max 0.9
spearman_max 0.9014
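
These Pearson/Spearman correlations are the kind reported by the library's EmbeddingSimilarityEvaluator, which correlates embedding similarities with gold scores. A minimal sketch follows; the pairs and scores are placeholders, not the actual 76-sample evaluation set.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("wasabibish/similarity-code-ai-generated")

# Placeholder pairs with gold similarity scores in [0, 1].
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["def add(a, b): return a + b", "print('hi')", "x = [i*i for i in range(10)]"],
    sentences2=["def sum_two(x, y): return x + y", "SELECT 1;", "squares = [n**2 for n in range(10)]"],
    scores=[0.9, 0.0, 0.8],
    name="toy-dev",
)
print(evaluator(model))  # Pearson/Spearman correlations for cosine (and other) similarities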

Training Details

Training Dataset

Unnamed Dataset

  • Size: 302 training samples
  • Columns: sentence1, sentence2, and score
  • Approximate statistics based on the first 302 samples:
    sentence1: string, min: 3 tokens, mean: 206.43 tokens, max: 512 tokens
    sentence2: string, min: 27 tokens, mean: 244.9 tokens, max: 512 tokens
    score: float, min: 0.0, mean: 0.29, max: 0.9
  • Samples:
    Sample 1, sentence1:
    from django.views.generic import ListView

    class PersonListView(ListView):
    model = Person
    template_name = 'person_list.html'

    def get_queryset(self):
    return Person.objects.filter(birthdate__year__lte=2005)
    Sample 1, sentence2:
    from myapp.models import Customer # Import the Customer model from your Django app

    def get_customers_with_zip_code_starting_with_123():
    customers = Customer.objects.filter(zip_code__startswith='123').values() # Query to filter customers with zip_code starting with '123'
    return list(customers) # Return a list of dictionaries for matching records
    Sample 1, score: 0.4

    Sample 2 (truncated):
    Welcome to our website!



    function createSentence(words, maxChars) {
    if (words.length === 0
    Sample 3 (truncated):
    AAAAAA #include
    #include

    class KMP {
    public:
    std::vector findPatternIndices(const CString& text, const CString& pattern) {
    std::vector indices;
    if (pattern.IsEmpty()
  • Loss: CosineSimilarityLoss with these parameters (see the sketch below):
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
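
As a rough illustration of the loss configuration above: CosineSimilarityLoss regresses the cosine similarity of the two sentence embeddings onto the gold score, with MSELoss as the inner loss_fct. The rows below are placeholders in the same sentence1/sentence2/score layout, not the actual training data.

import torch.nn as nn
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# Placeholder rows mirroring the sentence1 / sentence2 / score columns above.
train_dataset = Dataset.from_dict({
    "sentence1": ["def add(a, b): return a + b", "print('hi')"],
    "sentence2": ["def sum_two(x, y): return x + y", "SELECT 1;"],
    "score": [0.9, 0.0],
})

# MSE between cos(embedding1, embedding2) and the gold score, matching the parameters above.
# (Passed to the trainer together with train_dataset; see Training Hyperparameters below.)
train_loss = losses.CosineSimilarityLoss(model, loss_fct=nn.MSELoss())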
    

Evaluation Dataset

Unnamed Dataset

  • Size: 76 evaluation samples
  • Columns: sentence1, sentence2, and score
  • Approximate statistics based on the first 76 samples:
    sentence1: string, min: 5 tokens, mean: 216.92 tokens, max: 512 tokens
    sentence2: string, min: 54 tokens, mean: 254.78 tokens, max: 512 tokens
    score: float, min: 0.0, mean: 0.33, max: 0.9
  • Samples:
    Sample 1, sentence1:
    function stripHtmlTags(str) {
    return str.replace(/<[^>]*>/g, '');
    }

    const input = '

    Hello World!

    ';

    const output = stripHtmlTags(input);

    console.log(output);
    Sample 1, sentence2:
    function stripHtmlTags(input) {
    if (!input) return '';

    const tagRegex = /<[^>]*>/g;
    return input.replace(tagRegex, '');
    }
    Sample 1, score: 0.6
    Sample 2, sentence1:
    function getTopThreeWords($text) {
    // Remove punctuation and convert to lowercase
    $words = str_word_count(strtolower(preg_replace('/[^\p{L}\p{N}\s]/u', ' ', $text)), 1);

    // Count the frequency of each word
    $wordFrequency = array_count_values($words);

    // Sort the words by frequency in descending order
    arsort($wordFrequency);

    // Get the top three words
    $topThreeWords = array_slice($wordFrequency, 0, 3, true);

    // Format the output
    $output = [];
    foreach ($topThreeWords as $word => $count) {
    $output[] = "('$word', $count)";
    }

    return '[' . implode(', ', $output) . ']';
    }

    // Example usage:
    $inputText = "The quick brown fox jumps over the lazy dog. The dog was lazy!";
    echo getTopThreeWords($inputText);
    ?>

    Sample 2, sentence2:
    function countTopWords($inputString) {
    // Convert the input string to lowercase and remove punctuation
    $cleanString = preg_replace("/[\W_]+/", " ", strtolower($inputString));

    // Split the string into an array of words
    $words = explode(" ", $cleanString);

    // Count the frequency of each word
    $wordCount = array_count_values($words);

    // Sort the words by frequency in descending order
    arsort($wordCount);

    // Get the top three most common words
    $topWords = array_slice($wordCount, 0, 3);

    // Format the output as an array of tuples
    $output = [];
    foreach ($topWords as $word => $count) {
    $output[] = [$word, $count];
    }

    return $output;
    }

    // Test the function with the example input
    $inputString = "The quick brown fox jumps over the lazy dog. The dog was lazy!";
    $output = countTopWords($inputString);
    print_r($output);

    ?>
    Sample 2, score: 0.3
    Sample 3 (truncated):
    AAAAAA #include
    #include

    class KMP {
    public:
    std::vector findPatternIndices(const CString& text, const CString& pattern) {
    std::vector indices;
    if (pattern.IsEmpty()
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • weight_decay: 0.2
  • max_steps: 100
  • warmup_steps: 150
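
As a rough illustration, the non-default values above map onto SentenceTransformerTrainingArguments as in the sketch below; the model, data, and output directory are placeholders, not the actual training setup (which used the 302/76-sample datasets described above).

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# Placeholder data in the sentence1 / sentence2 / score layout used by CosineSimilarityLoss.
toy = Dataset.from_dict({
    "sentence1": ["def add(a, b): return a + b"] * 8,
    "sentence2": ["def sum_two(x, y): return x + y"] * 8,
    "score": [0.9] * 8,
})

args = SentenceTransformerTrainingArguments(
    output_dir="outputs",             # placeholder path
    eval_strategy="steps",            # non-default values listed above
    weight_decay=0.2,
    max_steps=100,
    warmup_steps=150,
    per_device_train_batch_size=8,    # default batch size, as in the full list below
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=toy,
    eval_dataset=toy,                 # placeholder; a real run uses a held-out split
    loss=losses.CosineSimilarityLoss(model),
)
trainer.train()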

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.2
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3.0
  • max_steps: 100
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 150
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step loss spearman_max
0.5263 20 0.3765 0.5421
1.0526 40 0.1518 0.5774
1.5789 60 0.0501 0.8533
2.1053 80 0.0217 0.8900
2.6316 100 0.0168 0.9014

Framework Versions

  • Python: 3.9.10
  • Sentence Transformers: 3.1.0
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cpu
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}