(modelopt) PS E:\ModelOpt_Windows_Scripts_2\modelopt-windows-scripts\ONNX_PTQ> python quantize_script.py --model_name=nvidia/Nemotron-Mini-4B-Instruct --onnx_path=E:\model_store\genai\nemotron-mini-4b-instruct-fp16-dml-genai\opset_21\model.onnx --output_path="E:\model_store\genai\nemotron-mini-4b-instruct-fp16-dml-genai\opset_21\default_quant_dml_ep_calib\model.onnx"
--Quantize-Script-- algo=awq_lite, dataset=cnn, calib_size=32, batch_size=1, block_size=128, add-position-ids=True, past-kv=True, rcalib=False, device=cpu, use_zero_point=False
--Quantize-Script-- awqlite_alpha_step=0.1, awqlite_fuse_nodes=False, awqlite_run_per_subgraph=False, awqclip_alpha_step=0.05, awqclip_alpha_min=0.5, awqclip_bsz_col=1024, calibration_eps=['dml']
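The awqclip settings above sweep a clipping ratio from 1.0 down to awqclip_alpha_min (0.5) in steps of awqclip_alpha_step (0.05), keeping the clip that best preserves the layer output. A minimal numpy sketch of that idea follows; the function name and details are illustrative assumptions, not the actual ModelOpt internals:

```python
import numpy as np

def awq_clip_search(w, x, alpha_min=0.5, alpha_step=0.05, num_bits=4):
    """Grid-search a weight clipping ratio that minimizes output MSE.

    Hypothetical re-implementation of the awqclip idea for illustration.
    w: (k, n) weight, x: (m, k) calibration activations.
    """
    qmax = 2 ** (num_bits - 1) - 1          # symmetric int4 -> 7
    ref = x @ w                             # full-precision reference output
    best_alpha, best_err = 1.0, np.inf
    for alpha in np.arange(1.0, alpha_min - 1e-9, -alpha_step):
        clip = alpha * np.abs(w).max(axis=0, keepdims=True)   # per-column clip
        wc = np.clip(w, -clip, clip)
        scale = np.maximum(clip, 1e-8) / qmax
        wq = np.round(wc / scale) * scale   # fake-quantize round-trip
        err = np.mean((x @ wq - ref) ** 2)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```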
C:\Users\vrl\miniconda3\envs\modelopt\Lib\site-packages\transformers\models\auto\configuration_auto.py:1002: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
C:\Users\vrl\miniconda3\envs\modelopt\Lib\site-packages\transformers\models\auto\tokenization_auto.py:809: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
--Quantize-Script-- number_of_batched_samples=32, batch-input-ids-list-len=32, batched_attention_mask=32
--Quantize-Script-- number of batched inputs = 32
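Since the run uses add-position-ids=True, each of the 32 batched calibration inputs also carries a position_ids tensor alongside input_ids and attention_mask. A common, padding-aware way to derive it from the attention mask, shown here as an assumption about what the script feeds the calibration session:

```python
import numpy as np

def build_position_ids(attention_mask):
    """Derive position_ids from an attention mask (illustrative sketch).

    Positions count only the attended tokens; padded slots are set to 0.
    """
    position_ids = np.cumsum(attention_mask, axis=1) - 1
    position_ids[attention_mask == 0] = 0   # padded slots get position 0
    return position_ids.astype(np.int64)
```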
INFO:root:
Quantizing the model....
INFO:root:Quantization Mode: int4
INFO:root:Finding quantizable weights and augmenting graph output with input activations
INFO:root:Augmenting took 0.03900003433227539 seconds
INFO:root:Saving the model took 35.37520098686218 seconds
2024-11-05 06:08:38.8247274 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-11-05 06:08:38.8385074 [W:onnxruntime:, session_state.cc:1170 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Getting activation names maps...: 100%|██████████████████████████████████████████████████| 192/192 [00:00<?, ?it/s]
Running AWQ scale search per node...: 100%|██████████████████████████████████████| 192/192 [05:08<00:00,  1.61s/it]
INFO:root:AWQ scale search took 308.7233784198761 seconds
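The scale search logged above is the AWQ-lite step: for each node it grid-searches an exponent alpha (awqlite_alpha_step=0.1) that trades off activation versus weight magnitudes to pick per-input-channel scales. A rough numpy sketch of the idea, as a hypothetical re-implementation rather than the ModelOpt code:

```python
import numpy as np

def awq_lite_scale_search(w, x, alpha_step=0.1, num_bits=4):
    """Search per-channel scales s = amax(x)^alpha / amax(w)^(1-alpha).

    Illustrative sketch. w: (k, n) weight, x: (m, k) activations.
    Weights are scaled up by s, activations down, before fake-quantizing.
    """
    qmax = 2 ** (num_bits - 1) - 1
    x_absmax = np.abs(x).mean(axis=0) + 1e-8   # per-in-channel activation magnitude
    w_absmax = np.abs(w).max(axis=1) + 1e-8    # per-in-channel weight magnitude
    ref = x @ w
    best_s, best_err = np.ones_like(x_absmax), np.inf
    for alpha in np.arange(0.0, 1.0 + 1e-9, alpha_step):
        s = x_absmax ** alpha / w_absmax ** (1 - alpha)
        ws = w * s[:, None]
        scale = np.abs(ws).max(axis=0, keepdims=True) / qmax + 1e-12
        wq = np.round(ws / scale) * scale      # fake int4 round-trip
        err = np.mean(((x / s[None, :]) @ wq - ref) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s
```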
Quantizing the weights...: 100%|█████████████████████████████████████████████████████| 192/192 [00:05<00:00, 32.75it/s]
INFO:root:Quantizing actual weights took 5.864110231399536 seconds
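The weight quantization itself is symmetric int4 in blocks of 128 along the input dimension (block_size=128, use_zero_point=False). A self-contained sketch of that scheme, with the caveat that ModelOpt's actual layout and packing differ:

```python
import numpy as np

def quantize_int4_blockwise(w, block_size=128):
    """Symmetric per-block int4 quantization, no zero point (sketch)."""
    k, n = w.shape
    assert k % block_size == 0, "input dim must be a multiple of block_size"
    blocks = w.reshape(k // block_size, block_size, n)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0   # map to [-7, 7]
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q.reshape(k, n), scale.squeeze(1)                  # (k, n), (k//bs, n)

def dequantize_int4_blockwise(q, scale, block_size=128):
    """Reconstruct float weights from int4 codes and per-block scales."""
    k, n = q.shape
    blocks = q.reshape(k // block_size, block_size, n).astype(np.float32)
    return (blocks * scale[:, None, :]).reshape(k, n)
```

Round-trip error per element is bounded by half the block scale, which is what makes the AWQ scale/clip searches above worthwhile: they shrink the scales that matter most for the layer output.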
INFO:root:Inserting DQ nodes and input_pre_quant_scale node using quantized weights and scales ...
INFO:root:Inserting nodes took 0.1272134780883789 seconds
INFO:root:Exporting the quantized graph ...
Loading extension modelopt_round_and_pack_ext...
INFO:root:Exporting took 33.892990589141846 seconds
INFO:root:
Quantization process took 394.4490396976471 seconds
INFO:root:Saving to E:\model_store\genai\nemotron-mini-4b-instruct-fp16-dml-genai\opset_21\default_quant_dml_ep_calib\model.onnx took 33.43196678161621 seconds
Done
(modelopt) PS E:\ModelOpt_Windows_Scripts_2\modelopt-windows-scripts\ONNX_PTQ>