v2ray committed on Mar 30

Commit 3363bac • 1 Parent(s): 042dfd0

Fixed LoRA targeting.

This view is limited to 50 files because it contains too many changes.

Files changed (50)

LICENSE.txt +176 -0
NOTICE.txt +1 -0
README.md +159 -0
__init__.py +2 -0
config.json +39 -0
configuration_dbrx.py +264 -0
generation_config.json +4 -0
model-00001-of-00054.safetensors +3 -0
model-00002-of-00054.safetensors +3 -0
model-00003-of-00054.safetensors +3 -0
model-00004-of-00054.safetensors +3 -0
model-00005-of-00054.safetensors +3 -0
model-00006-of-00054.safetensors +3 -0
model-00007-of-00054.safetensors +3 -0
model-00008-of-00054.safetensors +3 -0
model-00009-of-00054.safetensors +3 -0
model-00010-of-00054.safetensors +3 -0
model-00011-of-00054.safetensors +3 -0
model-00012-of-00054.safetensors +3 -0
model-00013-of-00054.safetensors +3 -0
model-00014-of-00054.safetensors +3 -0
model-00015-of-00054.safetensors +3 -0
model-00016-of-00054.safetensors +3 -0
model-00017-of-00054.safetensors +3 -0
model-00018-of-00054.safetensors +3 -0
model-00019-of-00054.safetensors +3 -0
model-00020-of-00054.safetensors +3 -0
model-00021-of-00054.safetensors +3 -0
model-00022-of-00054.safetensors +3 -0
model-00023-of-00054.safetensors +3 -0
model-00024-of-00054.safetensors +3 -0
model-00025-of-00054.safetensors +3 -0
model-00026-of-00054.safetensors +3 -0
model-00027-of-00054.safetensors +3 -0
model-00028-of-00054.safetensors +3 -0
model-00029-of-00054.safetensors +3 -0
model-00030-of-00054.safetensors +3 -0
model-00031-of-00054.safetensors +3 -0
model-00032-of-00054.safetensors +3 -0
model-00033-of-00054.safetensors +3 -0
model-00034-of-00054.safetensors +3 -0
model-00035-of-00054.safetensors +3 -0
model-00036-of-00054.safetensors +3 -0
model-00037-of-00054.safetensors +3 -0
model-00038-of-00054.safetensors +3 -0
model-00039-of-00054.safetensors +3 -0
model-00040-of-00054.safetensors +3 -0
model-00041-of-00054.safetensors +3 -0
model-00042-of-00054.safetensors +3 -0
model-00043-of-00054.safetensors +3 -0

LICENSE.txt ADDED

	@@ -0,0 +1,176 @@

+Databricks Open Model License
+By using, reproducing, modifying, distributing, performing or displaying
+any portion or element of DBRX or DBRX Derivatives, or otherwise accepting
+the terms of this Agreement, you agree to be bound by this Agreement.
+Version Release Date: March 27, 2024
+Section 1: Definitions
+“Agreement” means these terms and conditions that govern the use, reproduction,
+modification, distribution, performance or display of DBRX and/or DBRX
+Derivatives and any terms and conditions incorporated by reference.
+“Databricks” or “we” means Databricks, Inc.
+“Licensee” or “you” means you, or your employer or any other person or entity
+(if you are entering into this Agreement on such person or entity’s behalf),
+of the age required under applicable laws, rules or regulations to provide
+legal consent and that has legal authority to bind your employer or such other
+person or entity if you are entering in this Agreement on their behalf.
+“DBRX Derivatives” means all (i) modifications to DBRX, (ii) works based on
+DBRX and (iii) any other derivative works thereof. Outputs are not deemed DBRX
+Derivatives.
+“DBRX” means the foundational large language models and software and
+algorithms, including machine-learning model code, trained model weights,
+inference-enabling code, training-enabling code, fine-tuning enabling code,
+documentation and other elements of the foregoing identified by Databricks at
+https://github.com/databricks/dbrx, regardless of the source that you obtained
+it from.
+“Output” means the results of operating DBRX or DBRX Derivatives.
+As used in this Agreement, “including” means “including without limitation.”
+Section 2: License Rights and Conditions on Use and Distribution
+2.1 Grant of Rights
+You are granted a non-exclusive, worldwide, non-transferable and royalty-free
+limited license under Databricks’ intellectual property or other rights owned
+by Databricks embodied in DBRX to use, reproduce, distribute, copy, modify,
+and create derivative works of DBRX in accordance with the terms of this
+Agreement.
+2.2 Reproduction and Distribution
+	1. All distributions of DBRX or DBRX Derivatives must be accompanied by a
+    "Notice" text file that contains the following notice: "DBRX is provided
+    under and subject to the Databricks Open Model License, Copyright ©
+    Databricks, Inc. All rights reserved."
+	2. If you distribute or make DBRX or DBRX Derivatives available to a third
+    party, you must provide a copy of this Agreement to such third party.
+	3. You must cause any modified files that you distribute to carry prominent
+    notices stating that you modified the files.
+You may add your own intellectual property statement to your modifications of
+DBRX and, except as set forth in this Section, may provide additional or
+different terms and conditions for use, reproduction, or distribution of DBRX
+or DBRX Derivatives as a whole, provided your use, reproduction, modification,
+distribution, performance, and display of DBRX or DBRX Derivatives otherwise
+complies with the terms and conditions of this Agreement. Any additional or
+different terms and conditions you impose must not conflict with the terms of
+this Agreement and in the event of a conflict, the terms and conditions of this
+Agreement shall govern over any such additional or different terms and conditions.
+2.3 Use Restrictions
+You will not use DBRX or DBRX Derivatives or any Output to improve any other
+large language model (excluding DBRX or DBRX Derivatives).
+You will not use DBRX or DBRX Derivatives:
+	1. for any restricted use set forth in the Databricks Open Model Acceptable
+    Use Policy identified at
+    https://www.databricks.com/legal/acceptable-use-policy-open-model
+    ("Acceptable Use Policy"), which is hereby incorporated by reference into
+    this Agreement; or
+	2. in violation of applicable laws and regulations.
+To the maximum extent permitted by law, Databricks reserves the right to
+restrict (remotely or otherwise) usage of DBRX or DBRX Derivatives that
+Databricks reasonably believes are in violation of this Agreement.
+Section 3: Additional Commercial Terms
+If, on the DBRX version release date, the monthly active users of the products
+or services made available by or for Licensee, or Licensee’s affiliates, is
+greater than 700 million monthly active users in the preceding calendar month,
+you must request a license from Databricks, which we may grant to you in our
+sole discretion, and you are not authorized to exercise any of the rights under
+this Agreement unless or until Databricks otherwise expressly grants you such
+rights.
+If you receive DBRX or DBRX Derivatives from a direct or indirect licensee as
+part of an integrated end user product, then this section (Section 3) of the
+Agreement will not apply to you.
+Section 4: Additional Provisions
+4.1 Updates
+Databricks may update DBRX from time to time, and you must make reasonable
+efforts to use the latest version of DBRX.
+4.2 Intellectual Property
+a. No trademark licenses are granted under this Agreement, and in connection
+with DBRX or DBRX Derivatives, neither Databricks nor Licensee may use any name
+or mark owned by or associated with the other or any of its affiliates, except
+as required for reasonable and customary use in describing and redistributing
+DBRX or DBRX Derivatives.
+b. Subject to Databricks’ ownership of DBRX and DRBX Derivatives made by or for
+Databricks, with respect to any DBRX Derivatives that are made by you, as
+between you and Databricks, you are and will be the owner of such DBRX
+Derivatives.
+c. Databricks claims no ownership rights in Outputs. You are responsible for
+Outputs and their subsequent uses.
+d. If you institute litigation or other proceedings against Databricks or any
+entity (including a cross-claim or counterclaim in a lawsuit) alleging that
+DBRX or Outputs or results therefrom, or any portion of any of the foregoing,
+constitutes infringement of intellectual property or other rights owned or
+licensable by you, then any licenses granted to you under this Agreement shall
+terminate as of the date such litigation or claim is filed or instituted. You
+will indemnify and hold harmless Databricks from and against any claim by any
+third party arising out of or related to your use or distribution of DBRX or
+DBRX Derivatives.
+4.3 DISCLAIMER OF WARRANTY
+UNLESS REQUIRED BY APPLICABLE LAW, DBRX AND ANY OUTPUT AND RESULTS THEREFROM
+ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER
+EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE,
+NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU
+ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR
+REDISTRIBUTING DBRX OR DBRX DERIVATIVES AND ANY OUTPUT AND ASSUME ANY RISKS
+ASSOCIATED WITH YOUR USE OF DBRX OR DBRX DERIVATIVES AND ANY OUTPUT AND RESULTS.
+4.4 LIMITATION OF LIABILITY
+IN NO EVENT WILL DATABRICKS OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF
+LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR
+OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT,
+SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF
+DATABRICKS OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE
+FOREGOING.
+4.5 Term and Termination
+The term of this Agreement will commence upon your acceptance of this Agreement
+or access to DBRX or DBRX Derivatives and will continue in full force and
+effect until terminated in accordance with the terms and conditions herein.
+Databricks may terminate this Agreement if you are in breach of any term or
+condition of this Agreement. Upon termination of this Agreement, you shall
+delete and cease use of DBRX or any DBRX Derivatives. Sections 1, 4.2(d), 4.3,
+4.4, and 4.6 shall survive the termination of this Agreement.
+4.6 Governing Law and Jurisdiction
+This Agreement will be governed and construed under the laws of the State of
+California without regard to choice of law principles, and the UN Convention
+on Contracts for the International Sale of Goods does not apply to this
+Agreement. The courts of California shall have exclusive jurisdiction of any
+dispute arising out of this Agreement.

NOTICE.txt ADDED

	@@ -0,0 +1 @@


+DBRX is provided under and subject to the Databricks Open Model License, Copyright © Databricks, Inc. All rights reserved.

README.md ADDED

	@@ -0,0 +1,159 @@

+---
+inference: false
+license: other
+license_name: databricks-open-model-license
+license_link: https://www.databricks.com/legal/open-model-license
+---
+# Fix for the DBRX Code
+The original DBRX implementation code has a few bugs which only affect training, which I fixed in this re-upload.
+The issues - How I fixed them:
+1. Error when using gradient checkpointing - Fixed by using positional arguments instead because `_gradient_checkpointing_func` doesn't support kwargs.
+2. VRAM usage balloons and `CUDA Out of Memory` is raised when backpropagating through the MLP layer - Fixed by separating the experts' weights into different tensors instead of using a single tensor for all the experts. I don't know why this fixed it, but **maybe** it's because torch tries to compute gradients for every expert at once, which shouldn't happen since it's a MoE model.
+# DBRX Base
+* DBRX Base is a mixture-of-experts (MoE) large language model trained from scratch by Databricks.
+* We are releasing both DBRX Base, a pretrained base model, and DBRX Instruct, a fine-tuned version for few-turn interactions, under [an open license](https://www.databricks.com/legal/open-model-license).
+* This is the repository for DBRX Base. DBRX Instruct can be found [here](https://huggingface.co/databricks/dbrx-instruct).
+* For full details on the DBRX models, please read our [technical blog post](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).
+## Model Overview
+DBRX is a [transformer-based](https://www.isattentionallyouneed.com/) decoder-only large language model (LLM) that was trained using next-token prediction.
+It uses a *fine-grained* mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input.
+It was pre-trained on 12T tokens of text and code data.
+Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2.
+This provides 65x more possible combinations of experts and we found that this improves model quality.
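The 65x figure follows from counting unordered expert subsets: choosing 4 of 16 experts versus 2 of 8. A minimal sketch of that arithmetic, using only the expert counts quoted above (the helper name `routing_combinations` is illustrative and not part of the DBRX code):

```python
from math import comb


def routing_combinations(num_experts: int, top_k: int) -> int:
    """Count the distinct unordered expert subsets a router can pick per token."""
    return comb(num_experts, top_k)


dbrx = routing_combinations(16, 4)     # DBRX: 16 experts, 4 chosen
mixtral = routing_combinations(8, 2)   # Mixtral-8x7B / Grok-1: 8 experts, 2 chosen
ratio = dbrx / mixtral                 # how many times more subsets DBRX can route to
print(dbrx, mixtral, ratio)
```

This only counts possible subsets; it says nothing about how evenly a trained router actually uses them.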
+DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA).
+It uses the GPT-4 tokenizer as provided in the [tiktoken](https://github.com/openai/tiktoken) repository.
+We made these choices based on exhaustive evaluation and scaling experiments.
+DBRX was pretrained on 12T tokens of carefully curated data and a maximum context length of 32K tokens.
+We estimate that this data is at least 2x better token-for-token than the data we used to pretrain the MPT family of models.
+This new dataset was developed using the full suite of Databricks tools, including Apache Spark™ and Databricks notebooks for data processing, and Unity Catalog for data management and governance.
+We used curriculum learning for pretraining, changing the data mix during training in ways we found to substantially improve model quality.
+* **Inputs:** DBRX only accepts text-based inputs and accepts a context length of up to 32768 tokens.
+* **Outputs:** DBRX only produces text-based outputs.
+* **Model Architecture:** More detailed information about DBRX Instruct and DBRX Base can be found in our [technical blog post](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).
+* **License:** [Databricks Open Model License](https://www.databricks.com/legal/open-model-license)
+* **Acceptable Use Policy:** [Databricks Open Model Acceptable Use Policy](https://www.databricks.com/legal/acceptable-use-policy-open-model)
+* **Version:** 1.0
+* **Owner:** Databricks, Inc.
+## Usage
+There are several general ways to use the DBRX models:
+* DBRX Base and DBRX Instruct are available for download on Hugging Face (see our Quickstart guide below). This is the HF repository for DBRX Base; DBRX Instruct can be found [here](https://huggingface.co/databricks/dbrx-instruct).
+* The DBRX model repository can be found on GitHub [here](https://github.com/databricks/dbrx).
+* DBRX Base and DBRX Instruct are available with [Databricks Foundation Model APIs](https://docs.databricks.com/en/machine-learning/foundation-models/index.html) via both *Pay-per-token* and *Provisioned Throughput* endpoints. These are enterprise-ready deployments.
+* For more information on how to fine-tune using LLM-Foundry, please take a look at our LLM pretraining and fine-tuning [documentation](https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/README.md).
+## Quickstart Guide
+**NOTE: This is DBRX Base, and has not been instruction finetuned. It has not been trained for interactive chat and is only a completion model.**
+If you are looking for the finetuned model, please use [DBRX Instruct](https://huggingface.co/databricks/dbrx-instruct).
+Getting started with DBRX models is easy with the `transformers` library. The model requires ~264GB of RAM and the following packages:
+```bash
+pip install transformers tiktoken
+```
+If you'd like to speed up download time, you can use the `hf_transfer` package as described by Hugging Face [here](https://huggingface.co/docs/huggingface_hub/en/guides/download#faster-downloads).
+```bash
+pip install hf_transfer
+export HF_HUB_ENABLE_HF_TRANSFER=1
+```
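The ~264GB figure quoted above is consistent with holding 132B parameters in bfloat16 at 2 bytes each. A rough sketch of that estimate, assuming weights dominate memory and ignoring activations, the KV cache, and framework overhead (`weight_memory_gb` is a hypothetical helper, not a library function):

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_params = 132e9  # total parameter count quoted in the model overview

bf16_gb = weight_memory_gb(total_params)      # bfloat16: 2 bytes per parameter
fp32_gb = weight_memory_gb(total_params, 4)   # float32 would need twice as much
print(bf16_gb, fp32_gb)
```

Actual usage will be higher than the weight-only estimate once generation starts, so treat it as a lower bound.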
+### Run the model on a CPU:
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+tokenizer = AutoTokenizer.from_pretrained("v2ray/dbrx-base-fixed", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("v2ray/dbrx-base-fixed", device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)
+input_text = "Databricks was founded in "
+input_ids = tokenizer(input_text, return_tensors="pt")
+outputs = model.generate(**input_ids, max_new_tokens=100)
+print(tokenizer.decode(outputs[0]))
+```
+### Run the model on multiple GPUs:
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+tokenizer = AutoTokenizer.from_pretrained("v2ray/dbrx-base-fixed", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("v2ray/dbrx-base-fixed", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
+input_text = "Databricks was founded in "
+input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
+outputs = model.generate(**input_ids, max_new_tokens=100)
+print(tokenizer.decode(outputs[0]))
+```
+If your GPU system supports [FlashAttention2](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2), you can add `attn_implementation="flash_attention_2"` as a keyword to `AutoModelForCausalLM.from_pretrained()` to achieve faster inference.
+## Limitations and Ethical Considerations
+### Training Dataset Limitations
+The DBRX models were trained on 12T tokens of text, with a knowledge cutoff date of December 2023.
+The training mix used for DBRX contains both natural-language and code examples. The vast majority of our training data is in the English language. We did not test DBRX for non-English proficiency. Therefore, DBRX should be considered a generalist model for text-based use in the English language.
+DBRX does not have multimodal capabilities.
+### Associated Risks and Recommendations
+All foundation models are novel technologies that carry various risks, and may output information that is inaccurate, incomplete, biased, or offensive.
+Users should exercise judgment and evaluate such output for accuracy and appropriateness for their desired use case before using or sharing it.
+Databricks recommends [using retrieval augmented generation (RAG)](https://www.databricks.com/glossary/retrieval-augmented-generation-rag) in scenarios where accuracy and fidelity are important.
+We also recommend that anyone using or fine-tuning either DBRX Base or DBRX Instruct perform additional testing around safety in the context of their particular application and domain.
+## Intended Uses
+### Intended Use Cases
+The DBRX models are open, general-purpose LLMs intended and licensed for both commercial and research applications.
+They can be further fine-tuned for various domain-specific natural language and coding tasks.
+DBRX Base can be used as an off-the-shelf model for text completion for general English-language and coding tasks.
+Please review the Associated Risks section above, as well as the [Databricks Open Model License](https://www.databricks.com/legal/open-model-license) and [Databricks Open Model Acceptable Use Policy](https://www.databricks.com/legal/acceptable-use-policy-open-model) for further information about permissible uses of DBRX Base and its derivatives.
+### Out-of-Scope Use Cases
+DBRX models are not intended to be used out-of-the-box in non-English languages and do not support native code execution, or other forms of function-calling.
+DBRX models should not be used in any manner that violates applicable laws or regulations or in any other way that is prohibited by the [Databricks Open Model License](https://www.databricks.com/legal/open-model-license) and [Databricks Open Model Acceptable Use Policy](https://www.databricks.com/legal/acceptable-use-policy-open-model).
+## Training Stack
+MoE models are complicated to train, and the training of DBRX Base and DBRX Instruct was heavily supported by Databricks’ infrastructure for data processing and large-scale LLM training (e.g., [Composer](https://github.com/mosaicml/composer), [Streaming](https://github.com/mosaicml/streaming), [Megablocks](https://github.com/stanford-futuredata/megablocks), and [LLM Foundry](https://github.com/mosaicml/llm-foundry)).
+Composer is our core library for large-scale training.
+It provides an optimized training loop, easy [checkpointing](https://docs.mosaicml.com/projects/composer/en/latest/trainer/checkpointing.html) and [logging](https://docs.mosaicml.com/projects/composer/en/latest/trainer/logging.html#wood-logging),
+[FSDP](https://pytorch.org/docs/stable/fsdp.html)-based [model sharding](https://docs.mosaicml.com/projects/composer/en/latest/notes/distributed_training.html#fullyshardeddataparallel-fsdp),
+convenient [abstractions](https://docs.mosaicml.com/projects/composer/en/latest/trainer/time.html), extreme customizability via [callbacks](https://docs.mosaicml.com/projects/composer/en/latest/trainer/callbacks.html), and more.
+Streaming enables fast, low cost, and scalable training on large datasets from cloud storage. It handles a variety of challenges around deterministic resumption as node counts change, avoiding redundant downloads across devices, high-quality shuffling at scale, sample-level random access, and speed.
+Megablocks is a lightweight library for MoE training. Crucially, it supports “dropless MoE,” which avoids inefficient padding and is intended to provide deterministic outputs for a given sequence no matter what other sequences are in the batch.
+LLM Foundry ties all of these libraries together to create a simple LLM pretraining, fine-tuning, and inference experience.
+DBRX was trained using proprietary optimized versions of the above open source libraries, along with our [LLM training platform](https://www.databricks.com/product/machine-learning/mosaic-ai-training).
+## Evaluation
+We find that DBRX outperforms established open-source and open-weight base models on the [Databricks Model Gauntlet](https://www.databricks.com/blog/llm-evaluation-for-icl), the [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), and HumanEval.
+The Databricks Model Gauntlet measures performance on more than 30 tasks across six categories: world knowledge, common sense reasoning, language understanding, reading comprehension, symbolic problem solving, and programming.
+The Hugging Face Open LLM Leaderboard measures the average of ARC-Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande and GSM8k.
+HumanEval measures coding ability.
+Full evaluation details can be found in our [technical blog post](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).
+## Acknowledgements
+The DBRX models were made possible thanks in large part to the open-source community, especially:
+* The [MegaBlocks](https://arxiv.org/abs/2211.15841) library, which established a foundation for our MoE implementation.
+* [PyTorch FSDP](https://arxiv.org/abs/2304.11277), which we built on for distributed training.

__init__.py ADDED

	@@ -0,0 +1,2 @@


+from .configuration_dbrx import *
+from .modeling_dbrx import *

config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "_name_or_path": "dbrx_old",
+  "architectures": [
+    "DbrxForCausalLM"
+  ],
+  "attn_config": {
+    "clip_qkv": 8,
+    "kv_n_heads": 8,
+    "model_type": "",
+    "rope_theta": 500000
+  },
+  "auto_map": {
+    "AutoConfig": "configuration_dbrx.DbrxConfig",
+    "AutoModelForCausalLM": "modeling_dbrx.DbrxForCausalLM"
+  },
+  "d_model": 6144,
+  "emb_pdrop": 0.0,
+  "ffn_config": {
+    "ffn_hidden_size": 10752,
+    "model_type": "",
+    "moe_jitter_eps": 0.01,
+    "moe_loss_weight": 0.05,
+    "moe_num_experts": 16,
+    "moe_top_k": 4
+  },
+  "initializer_range": 0.02,
+  "max_seq_len": 32768,
+  "model_type": "dbrx",
+  "n_heads": 48,
+  "n_layers": 40,
+  "output_router_logits": false,
+  "resid_pdrop": 0.0,
+  "router_aux_loss_coef": 0.05,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.39.1",
+  "use_cache": true,
+  "vocab_size": 100352
+}
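As a sanity check, the values in this config roughly reproduce the parameter counts quoted in the README (132B total, 36B active). The sketch below assumes a layer layout that the config alone does not spell out: bias-free Q/K/V and output projections with grouped-query attention, a linear router, three weight matrices per GLU expert, and untied input and output embeddings (per `tie_word_embeddings: false`). Normalization parameters are ignored as negligible.

```python
# Values copied from config.json above.
d_model, n_heads, kv_n_heads, n_layers = 6144, 48, 8, 40
ffn_hidden_size, num_experts, top_k = 10752, 16, 4
vocab_size = 100352

head_dim = d_model // n_heads                 # width of one attention head
queries_per_kv_head = n_heads // kv_n_heads   # GQA group size

# Assumed attention layout: Q, K, V projections plus an output projection, no biases.
attn = d_model * (d_model + 2 * kv_n_heads * head_dim) + d_model * d_model
router = d_model * num_experts                # assumed linear router
expert = 3 * d_model * ffn_hidden_size        # assumed GLU: gate, up, and down matrices
embeddings = 2 * vocab_size * d_model         # untied input and output embeddings

# Every token passes through attention and the router, but only top_k experts.
total = n_layers * (attn + router + num_experts * expert) + embeddings
active = n_layers * (attn + router + top_k * expert) + embeddings

print(f"head_dim={head_dim}, queries per KV head={queries_per_kv_head}")
print(f"total ~ {total / 1e9:.1f}B, active per token ~ {active / 1e9:.1f}B")
```

Under these assumptions the estimate lands close to the README's figures, which suggests the assumed layout is at least approximately right.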

configuration_dbrx.py ADDED

	@@ -0,0 +1,264 @@

+"""Dbrx configuration."""
+from typing import Any, Optional
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+DBRX_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+class DbrxAttentionConfig(PretrainedConfig):
+    """Configuration class for Dbrx Attention.
+    [`DbrxAttention`] class. It is used to instantiate attention layers
+    according to the specified arguments, defining the layers architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        attn_pdrop (`float`, *optional*, defaults to 0.0):
+            The dropout probability for the attention layers.
+        clip_qkv (`float`, *optional*, defaults to None):
+            If not `None`, clip the queries, keys, and values in the attention layer to this value.
+        kv_n_heads (Optional[int]): For grouped_query_attention only, allow user to specify number of kv heads.
+        rope_theta (float): The base frequency for rope.
+    """
+    def __init__(
+        self,
+        attn_pdrop: float = 0,
+        clip_qkv: Optional[float] = None,
+        kv_n_heads: int = 1,
+        rope_theta: float = 10000.0,
+        **kwargs: Any,
+    ):
+        super().__init__(**kwargs)
+        self.attn_pdrop = attn_pdrop
+        self.clip_qkv = clip_qkv
+        self.kv_n_heads = kv_n_heads
+        self.rope_theta = rope_theta
+        for k in ['model_type']:
+            if k in kwargs:
+                kwargs.pop(k)
+        if len(kwargs) != 0:
+            raise ValueError(f'Found unknown {kwargs=}')
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path: str,
+                        **kwargs: Any) -> 'PretrainedConfig':
+        cls._set_token_in_kwargs(kwargs)
+        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path,
+                                                  **kwargs)
+        if config_dict.get('model_type') == 'dbrx':
+            config_dict = config_dict['attn_config']
+        if 'model_type' in config_dict and hasattr(
+                cls,
+                'model_type') and config_dict['model_type'] != cls.model_type:
+            logger.warning(
+                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
+                +
+                f'{cls.model_type}. This is not supported for all configurations of models and can yield errors.'
+            )
+        return cls.from_dict(config_dict, **kwargs)
+class DbrxFFNConfig(PretrainedConfig):
+    """Configuration class for Dbrx FFN.
+    [`DbrxFFN`] class. It is used to instantiate feedforward layers according to
+    the specified arguments, defining the layers architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        ffn_act_fn (dict, optional): A dict specifying activation function for the FFN.
+            The dict should have a key 'name' with the value being the name of
+            the activation function along with any additional keyword arguments.
+        ffn_hidden_size (int, optional): The hidden size of the feedforward network.
+        moe_num_experts (int, optional): The number of experts in the mixture of experts layer.
+        moe_top_k (int, optional): The number of experts to use in the mixture of experts layer.
+        moe_jitter_eps (float, optional): The jitter epsilon for the mixture of experts layer.
+        moe_loss_weight (float, optional): The loss weight for the mixture of experts layer.
+        moe_normalize_expert_weights (float, optional): The normalization factor for the expert weights.
+        uniform_expert_assignment (bool, optional): Whether to use uniform expert assignment.
+            This should only be used for benchmarking purposes.
+    """
+    def __init__(
+        self,
+        ffn_act_fn: Optional[dict] = None,
+        ffn_hidden_size: int = 3584,
+        moe_num_experts: int = 4,
+        moe_top_k: int = 1,
+        moe_jitter_eps: Optional[float] = None,
+        moe_loss_weight: float = 0.01,
+        moe_normalize_expert_weights: Optional[float] = 1,
+        uniform_expert_assignment: bool = False,
+        **kwargs: Any,
+    ):
+        super().__init__()
+        if ffn_act_fn is None:
+            ffn_act_fn = {'name': 'silu'}
+        self.ffn_act_fn = ffn_act_fn
+        self.ffn_hidden_size = ffn_hidden_size
+        self.moe_num_experts = moe_num_experts
+        self.moe_top_k = moe_top_k
+        self.moe_jitter_eps = moe_jitter_eps
+        self.moe_loss_weight = moe_loss_weight
+        self.moe_normalize_expert_weights = moe_normalize_expert_weights
+        self.uniform_expert_assignment = uniform_expert_assignment
+        for k in ['model_type']:
+            if k in kwargs:
+                kwargs.pop(k)
+        if len(kwargs) != 0:
+            raise ValueError(f'Found unknown {kwargs=}')
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path: str,
+                        **kwargs: Any) -> 'PretrainedConfig':
+        cls._set_token_in_kwargs(kwargs)
+        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path,
+                                                  **kwargs)
+        if config_dict.get('model_type') == 'dbrx':
+            config_dict = config_dict['ffn_config']
+        if 'model_type' in config_dict and hasattr(
+                cls,
+                'model_type') and config_dict['model_type'] != cls.model_type:
+            logger.warning(
+                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
+                +
+                f'{cls.model_type}. This is not supported for all configurations of models and can yield errors.'
+            )
+        return cls.from_dict(config_dict, **kwargs)
+class DbrxConfig(PretrainedConfig):
+    """Configuration class for Dbrx.
+    [`DbrxModel`]. It is used to instantiate a Dbrx model according to the
+    specified arguments, defining the model architecture.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        d_model (`int`, *optional*, defaults to 2048):
+            Dimensionality of the embeddings and hidden states.
+        n_heads (`int`, *optional*, defaults to 16):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        n_layers (`int`, *optional*, defaults to 24):
+            Number of hidden layers in the Transformer decoder.
+        max_seq_len (`int`, *optional*, defaults to 2048):
+            The maximum sequence length of the model.
+        vocab_size (`int`, *optional*, defaults to 32000):
+            Vocabulary size of the Dbrx model. Defines the maximum number of different tokens that can be represented by
+            the `input_ids` passed when calling [`DbrxModel`].
+        resid_pdrop (`float`, *optional*, defaults to 0.0):
+            The dropout probability applied to the attention output before combining with residual.
+        emb_pdrop (`float`, *optional*, defaults to 0.0):
+            The dropout probability for the embedding layer.
+        attn_config (`dict`, *optional*):
+            A dictionary used to configure the model's attention module.
+        ffn_config (`dict`, *optional*):
+            A dictionary used to configure the model's FFN module.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models).
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        output_router_logits (`bool`, *optional*, defaults to `False`):
+            Whether or not the router logits should be returned by the model. Enabling this will also
+            allow the model to output the auxiliary loss. See [here]() for more details
+        router_aux_loss_coef (`float`, *optional*, defaults to 0.05):
+            The aux loss factor for the total loss.
+    Example:
+    ```python
+    >>> from transformers import DbrxConfig, DbrxModel
+    >>> # Initializing a Dbrx configuration
+    >>> configuration = DbrxConfig()
+    >>> # Initializing a model (with random weights) from the configuration
+    >>> model = DbrxModel(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```
+    """
+    model_type = 'dbrx'
+    attribute_map = {
+        'num_attention_heads': 'n_heads',
+        'hidden_size': 'd_model',
+        'num_hidden_layers': 'n_layers',
+        'max_position_embeddings': 'max_seq_len'
+    }
+    def __init__(
+        self,
+        d_model: int = 2048,
+        n_heads: int = 16,
+        n_layers: int = 24,
+        max_seq_len: int = 2048,
+        vocab_size: int = 32000,
+        resid_pdrop: float = 0.0,
+        emb_pdrop: float = 0.0,
+        attn_config: Optional[DbrxAttentionConfig] = None,
+        ffn_config: Optional[DbrxFFNConfig] = None,
+        use_cache: bool = True,
+        initializer_range: float = 0.02,
+        output_router_logits: bool = False,
+        router_aux_loss_coef: float = 0.05,
+        **kwargs: Any,
+    ):
+        if attn_config is None:
+            self.attn_config = DbrxAttentionConfig()
+        elif isinstance(attn_config, dict):
+            self.attn_config = DbrxAttentionConfig(**attn_config)
+        else:
+            self.attn_config = attn_config
+        if ffn_config is None:
+            self.ffn_config = DbrxFFNConfig()
+        elif isinstance(ffn_config, dict):
+            self.ffn_config = DbrxFFNConfig(**ffn_config)
+        else:
+            self.ffn_config = ffn_config
+        self.d_model = d_model
+        self.n_heads = n_heads
+        self.n_layers = n_layers
+        self.max_seq_len = max_seq_len
+        self.vocab_size = vocab_size
+        self.resid_pdrop = resid_pdrop
+        self.emb_pdrop = emb_pdrop
+        self.use_cache = use_cache
+        self.initializer_range = initializer_range
+        self.output_router_logits = output_router_logits
+        self.router_aux_loss_coef = router_aux_loss_coef
+        tie_word_embeddings = kwargs.pop('tie_word_embeddings', False)
+        if tie_word_embeddings:
+            raise ValueError(
+                'tie_word_embeddings is not supported for Dbrx models.')
+        super().__init__(
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
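
The constructor above accepts `attn_config` and `ffn_config` as `None`, a plain `dict`, or an already-built config object, and it rejects `tie_word_embeddings=True`. The following is a minimal standalone sketch of those two behaviors, not the code from this commit. `AttnConfigSketch`, its fields, `coerce_sub_config`, and `check_tie_word_embeddings` are hypothetical names introduced only for illustration, so that the snippet runs without `transformers` installed.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class AttnConfigSketch:
    # Hypothetical stand-in for DbrxAttentionConfig; the field names are assumptions.
    attn_pdrop: float = 0.0
    clip_qkv: Optional[float] = None


def coerce_sub_config(value: Any, cls: type) -> Any:
    """Mirror the None / dict / object branching used for attn_config and ffn_config."""
    if value is None:
        return cls()          # fall back to the sub-config defaults
    if isinstance(value, dict):
        return cls(**value)   # e.g. a dict loaded from config.json
    return value              # already a config object, use it unchanged


def check_tie_word_embeddings(kwargs: Dict[str, Any]) -> bool:
    """Pop the flag from kwargs and refuse tied embeddings, as the constructor does."""
    tie = kwargs.pop('tie_word_embeddings', False)
    if tie:
        raise ValueError('tie_word_embeddings is not supported for Dbrx models.')
    return tie


default_cfg = coerce_sub_config(None, AttnConfigSketch)
from_dict_cfg = coerce_sub_config({'attn_pdrop': 0.1}, AttnConfigSketch)
```

Normalizing the sub-configs at construction time means the rest of the model code can rely on attribute access regardless of how the caller supplied the settings.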

generation_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "_from_model_config": true,
+  "transformers_version": "4.39.1"
+}

model-00001-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:19fbfca36b21fae548f2fe7ddf86ea444420a06501952f817b5888be59df5ddd
+size 4976767312
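
Each `.safetensors` entry in this diff is a Git LFS pointer file rather than the weights themselves: three `key value` lines giving the spec version, the object id (hash algorithm and digest), and the size in bytes. As a rough sketch, a pointer like the one above could be parsed as follows. `parse_lfs_pointer` is a hypothetical helper written for this illustration and is not part of Git LFS or of this repository.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(' ')
        fields[key] = value
    # The oid has the form "<hash-algorithm>:<hex digest>".
    algo, _, digest = fields['oid'].partition(':')
    return {
        'version': fields['version'],
        'hash_algo': algo,
        'digest': digest,
        'size': int(fields['size']),  # size of the real file in bytes
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:19fbfca36b21fae548f2fe7ddf86ea444420a06501952f817b5888be59df5ddd
size 4976767312
"""
info = parse_lfs_pointer(pointer)
```

Summing the `size` fields across all shards gives the total download size without fetching any of the weight files.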

model-00002-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:60e92feb3fbaa9a5fe4cded206020fdac8373e63ef101fc56003cb97dbcb867f
+size 4932728256

model-00003-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6edb0b60e27d77d404df003bd53f525220b03c1e5b8fee199a6837d4d6d893b7
+size 4932728256

model-00004-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f013307c64633a918029b4f5bba2c8303d271f58041264e3fe5c723e1b4f51ed
+size 4888466376

model-00005-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5828fd1e9fd5a7843df77ab428aa1e8a2ec65984a5d06a6edf58678f24c734d4
+size 4932728248

model-00006-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:752d18fa174550b38e0f57cba20d0881a3cb4ad343b63801e7ea6ef877c000ef
+size 4932728256

model-00007-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:236dab52058e35463b0df20f1b2d6e3a73309db0a64344824008406e5e508f1d
+size 4932728256

model-00008-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:427eb68d0aa3aace4eebd05ff029446086f69f79a74091aaba8b296a943eaa26
+size 4888466376

model-00009-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:317349f07d0538415ace7dbf4935873b5214bdae84342d6491105e71b85060b6
+size 4932728248

model-00010-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b25c985829b95966535b2f89d5c35a5d18535b71c172f4cc6e57b2d0c44c7912
+size 4932728256

model-00011-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ad7b9c00f70301577b5505e81b9cbee42dff4937c784cc4fc21c5923cf947955
+size 4932728256

model-00012-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:014272db51c88d6899f698baf7adceed1b52490205aa618cd9a4a75756640cac
+size 4888466376

model-00013-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7875904e40352508740d81090cf838036174fd45e19f94f58d9eae21d9c06312
+size 4932728240

model-00014-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ad94f408f30f8e47292547ee9bd550f6799d54f2d3bd38834abeec4104b79475
+size 4932728280

model-00015-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1780de45e106eb7eb2e36b02667bd5271a73958addcf0e4da55fe7fa004181cb
+size 4932728296

model-00016-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8ea5dd7e1bd9b18553bb5c2933b03c597cd1e5318e6e0e048ac41726c2086a6d
+size 4888466416

model-00017-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9d913f299740d1d485c636442d84b909dfe892add1f320fd0ab92395006b64de
+size 4932728288

model-00018-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8b8b8c3ec11f6a87b7267a1eaa58f7928449161c1fb2a3a2d36bf7549e239828
+size 4932728296

model-00019-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:660062506cf4f17f78e4ef06c26814a940a890f510a39cc4e3c5e696543c43e1
+size 4932728296

model-00020-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:df6d9fea4010988f8eb078bc0264c24ddf2adc45546fff3a8ba1fbf303453fda
+size 4888466416

model-00021-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fc30253d5c9a92c14b5e673557784dd3f107a5e914ad15a5b42de69bc4cdf837
+size 4932728288

model-00022-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ab138aef999b5d52f7499d118b7ea3ca15f0099557bbdf74e325764b4bd65a0f
+size 4932728296

model-00023-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:042526a2936e7fb570abd769da46aedd6fcc65745aaee13f00bcdc70a5e06b81
+size 4932728296

model-00024-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:67fd60628e07b5d9949dcece5a3df3401103254236dfb529bc8611495d717b44
+size 4888466416

model-00025-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:608f8b7adaa5ee361a3bcfc861cbd19dd4d25cdfacba9e66d65c71f4f1a12590
+size 4932728288

model-00026-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:125927560348254678be9980be2e2f5356752a12fe5c10083fd2cd01c9771825
+size 4932728296

model-00027-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4aaa116bfa3aada012f42dba1e9a947e3ec71603cbd08ccd42b952770f5c26b9
+size 4932728296

model-00028-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1585e6fefa1db90f1da8fc755dfd5a85586c7a965426f6bbf51f607fce1f509d
+size 4888466416

model-00029-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2bf4270c4e2ad7736d243ee204e1a0dc2edf7b7b5bb468cb10eac60991b1268b
+size 4932728288

model-00030-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a6c5cec08dde81c379e81e3889ccad8f582c7d20cd9199b0c30e4c73bb9cbc6d
+size 4932728296

model-00031-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:23a95281a3b86e707a5b49fa545ba11b532bd3caeb2e28bba7a7fc2a82f76fd8
+size 4932728296

model-00032-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:abb9e3f9e4b4c7b54eab2c13625a47669613a7e23a4ac6d49e220f05095a0f22
+size 4888466416

model-00033-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c6787a8612292eaabb2e86c2202b58538013bcb813d1a5d89835f3e9feb4fd78
+size 4932728288

model-00034-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fc24fee799cd1a543c7431a5fe91b5c0a618b4334790e5bbc1f91decbb497fb7
+size 4932728288

model-00035-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0801edd2ee96f386a3fc72198da027a7eb5d129c073a8efb89f0e3282b0c5f01
+size 4932728296

model-00036-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4e638ca6b4d67de41d2b45a71b0093892c3cfb9f4bd02ab456f6c3c9b5674d65
+size 4989142256

model-00037-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:52ccefb4eac2e288a0bbc3fdc8ad46e4766dd3de2138450986a7318e5dc894f1
+size 4964173160

model-00038-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:baa596dfbde688b0596e47272ede67b059c3b50ae6c97ec3713e93d1c8d38c83
+size 4932728288

model-00039-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c3370713b9fd67d33b19e436912495607e4d928bd84197474263f9a417e6959f
+size 4932728296

model-00040-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dd31b2109e45f5eb2fc906a53b8e20f35ffba88959c4f9cf17bcb3d73f439ecc
+size 4932728296

model-00041-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:35d5effa40761808648e92c7902c7fc7c9769260ada2df4a0ce884e356f1295b
+size 4888466408

model-00042-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b94fc632abb875f851c52234873dac13fd31fc3d35f68415ed8b41ed5822c1bc
+size 4932728288

model-00043-of-00054.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f5c1ced190ebe41c3fdc0650c234ffe648db7f685850b94c54261edeb2977a2b
+size 4932728296