ylacombe (HF staff) committed
Commit 4f10945 • 1 parent: 70a8a7d

Update README.md

Files changed (1): README.md (+42 -0)
README.md CHANGED
@@ -127,6 +127,48 @@ scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cp
For more details on using the Bark model for inference using the 🤗 Transformers library, refer to the [Bark docs](https://huggingface.co/docs/transformers/model_doc/bark).

### Optimization tips

Refer to this [blog post](https://huggingface.co/blog/optimizing-bark#benchmark-results) to find out more about the following methods and a benchmark of their benefits.

#### Get significant speed-ups:

**Using 🤗 Better Transformer**

Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🤗 Better Transformer:

```python
model = model.to_bettertransformer()
```

Note that 🤗 Optimum must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/optimum/installation)
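To show where that one-liner fits, here is a minimal sketch (not part of this commit) that assumes the same `suno/bark` checkpoint and `BarkModel` class used earlier in this README:

```python
from transformers import BarkModel

# load Bark as usual
model = BarkModel.from_pretrained("suno/bark")

# swap in the kernel-fused Better Transformer implementation;
# generate() is called exactly the same way afterwards
model = model.to_bettertransformer()
```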
**Using Flash Attention 2**

Flash Attention 2 is an even faster, optimized version of the previous optimization.

```python
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16, use_flash_attention_2=True).to(device)
```

Make sure to load your model in half-precision (e.g. `torch.float16`) and to [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2.

**Note:** Flash Attention 2 is only available on newer GPUs; refer to 🤗 Better Transformer in case your GPU doesn't support it.
#### Reduce memory footprint:

**Using half-precision**

You can speed up inference and reduce memory footprint by 50% simply by loading the model in half-precision (e.g. `torch.float16`).
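As a minimal illustrative sketch (assuming a CUDA device and the same checkpoint as above), half-precision loading only changes the `torch_dtype` argument:

```python
import torch
from transformers import BarkModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# float16 weights take roughly half the memory of the default float32
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)
```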
**Using CPU offload**

Bark is made up of 4 sub-models, which are called up sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.

If you're using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the sub-models from the GPU when they're idle. This operation is called CPU offloading. You can use it with one line of code:

```python
model.enable_cpu_offload()
```

Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)
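These optimizations can be combined. The sketch below (illustrative only, assuming a CUDA device and that 🤗 Optimum and 🤗 Accelerate are installed) stacks half-precision, Better Transformer, and CPU offload on one model:

```python
import torch
from transformers import AutoProcessor, BarkModel

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("suno/bark")

# half-precision weights
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)
# kernel fusion via 🤗 Better Transformer (requires 🤗 Optimum)
model = model.to_bettertransformer()
# keep idle sub-models on the CPU (requires 🤗 Accelerate)
model.enable_cpu_offload()

inputs = processor("Hello, my dog is cooler than you!").to(device)
speech_values = model.generate(**inputs)
```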
## Suno Usage

You can also run Bark locally through the original [Bark library](https://github.com/suno-ai/bark):