(modelopt) PS E:\ModelOpt_Windows_Scripts_2\modelopt-windows-scripts\ONNX_PTQ> python quantize_script.py --model_name=nvidia/Nemotron-Mini-4B-Instruct --onnx_path=E:\model_store\genai\nemotron-mini-4b-instruct-fp16-dml-genai\opset_21\model.onnx --output_path="E:\model_store\genai\nemotron-mini-4b-instruct-fp16-dml-genai\opset_21\default_quant_dml_ep_calib\model.onnx"
--Quantize-Script-- algo=awq_lite, dataset=cnn, calib_size=32, batch_size=1, block_size=128, add-position-ids=True, past-kv=True, rcalib=False, device=cpu, use_zero_point=False
--Quantize-Script-- awqlite_alpha_step=0.1, awqlite_fuse_nodes=False, awqlite_run_per_subgraph=False, awqclip_alpha_step=0.05, awqclip_alpha_min=0.5, awqclip_bsz_col=1024, calibration_eps=['dml']
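The awqclip settings above sweep a clipping ratio from 1.0 down to awqclip_alpha_min (0.5) in steps of awqclip_alpha_step (0.05), keeping the clip that best preserves the layer output. A minimal numpy sketch of that idea follows; the function name and details are illustrative assumptions, not the actual ModelOpt internals:

```python
import numpy as np

def awq_clip_search(w, x, alpha_min=0.5, alpha_step=0.05, num_bits=4):
    """Grid-search a weight clipping ratio that minimizes output MSE.

    Hypothetical re-implementation of the awqclip idea for illustration.
    w: (k, n) weight, x: (m, k) calibration activations.
    """
    qmax = 2 ** (num_bits - 1) - 1          # symmetric int4 -> 7
    ref = x @ w                             # full-precision reference output
    best_alpha, best_err = 1.0, np.inf
    for alpha in np.arange(1.0, alpha_min - 1e-9, -alpha_step):
        clip = alpha * np.abs(w).max(axis=0, keepdims=True)   # per-column clip
        wc = np.clip(w, -clip, clip)
        scale = np.maximum(clip, 1e-8) / qmax
        wq = np.round(wc / scale) * scale   # fake-quantize round-trip
        err = np.mean((x @ wq - ref) ** 2)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```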
C:\Users\vrl\miniconda3\envs\modelopt\Lib\site-packages\transformers\models\auto\configuration_auto.py:1002: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
C:\Users\vrl\miniconda3\envs\modelopt\Lib\site-packages\transformers\models\auto\tokenization_auto.py:809: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
--Quantize-Script-- number_of_batched_samples=32, batch-input-ids-list-len=32, batched_attention_mask=32
--Quantize-Script-- number of batched inputs = 32
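Since the run uses add-position-ids=True, each of the 32 batched calibration inputs also carries a position_ids tensor alongside input_ids and attention_mask. A common, padding-aware way to derive it from the attention mask, shown here as an assumption about what the script feeds the calibration session:

```python
import numpy as np

def build_position_ids(attention_mask):
    """Derive position_ids from an attention mask (illustrative sketch).

    Positions count only the attended tokens; padded slots are set to 0.
    """
    position_ids = np.cumsum(attention_mask, axis=1) - 1
    position_ids[attention_mask == 0] = 0   # padded slots get position 0
    return position_ids.astype(np.int64)
```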
INFO:root:
Quantizing the model....
INFO:root:Quantization Mode: int4
INFO:root:Finding quantizable weights and augmenting graph output with input activations
INFO:root:Augmenting took 0.03900003433227539 seconds
INFO:root:Saving the model took 35.37520098686218 seconds
2024-11-05 06:08:38.8247274 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-11-05 06:08:38.8385074 [W:onnxruntime:, session_state.cc:1170 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Getting activation names maps...: 100%|██████████████████████████████████████████████████| 192/192 [00:00<?, ?it/s]
Running AWQ scale search per node...: 100%|██████████████████████████████████████| 192/192 [05:08<00:00,  1.61s/it]
INFO:root:AWQ scale search took 308.7233784198761 seconds
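The scale search logged above is the AWQ-lite step: for each node it grid-searches an exponent alpha (awqlite_alpha_step=0.1) that trades off activation versus weight magnitudes to pick per-input-channel scales. A rough numpy sketch of the idea, as a hypothetical re-implementation rather than the ModelOpt code:

```python
import numpy as np

def awq_lite_scale_search(w, x, alpha_step=0.1, num_bits=4):
    """Search per-channel scales s = amax(x)^alpha / amax(w)^(1-alpha).

    Illustrative sketch. w: (k, n) weight, x: (m, k) activations.
    Weights are scaled up by s, activations down, before fake-quantizing.
    """
    qmax = 2 ** (num_bits - 1) - 1
    x_absmax = np.abs(x).mean(axis=0) + 1e-8   # per-in-channel activation magnitude
    w_absmax = np.abs(w).max(axis=1) + 1e-8    # per-in-channel weight magnitude
    ref = x @ w
    best_s, best_err = np.ones_like(x_absmax), np.inf
    for alpha in np.arange(0.0, 1.0 + 1e-9, alpha_step):
        s = x_absmax ** alpha / w_absmax ** (1 - alpha)
        ws = w * s[:, None]
        scale = np.abs(ws).max(axis=0, keepdims=True) / qmax + 1e-12
        wq = np.round(ws / scale) * scale      # fake int4 round-trip
        err = np.mean(((x / s[None, :]) @ wq - ref) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s
```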
Quantizing the weights...: 100%|█████████████████████████████████████████████████████| 192/192 [00:05<00:00, 32.75it/s]
INFO:root:Quantizing actual weights took 5.864110231399536 seconds
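The weight quantization itself is symmetric int4 in blocks of 128 along the input dimension (block_size=128, use_zero_point=False). A self-contained sketch of that scheme, with the caveat that ModelOpt's actual layout and packing differ:

```python
import numpy as np

def quantize_int4_blockwise(w, block_size=128):
    """Symmetric per-block int4 quantization, no zero point (sketch)."""
    k, n = w.shape
    assert k % block_size == 0, "input dim must be a multiple of block_size"
    blocks = w.reshape(k // block_size, block_size, n)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0   # map to [-7, 7]
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q.reshape(k, n), scale.squeeze(1)                  # (k, n), (k//bs, n)

def dequantize_int4_blockwise(q, scale, block_size=128):
    """Reconstruct float weights from int4 codes and per-block scales."""
    k, n = q.shape
    blocks = q.reshape(k // block_size, block_size, n).astype(np.float32)
    return (blocks * scale[:, None, :]).reshape(k, n)
```

Round-trip error per element is bounded by half the block scale, which is what makes the AWQ scale/clip searches above worthwhile: they shrink the scales that matter most for the layer output.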
INFO:root:Inserting DQ nodes and input_pre_quant_scale node using quantized weights and scales ...
INFO:root:Inserting nodes took 0.1272134780883789 seconds
INFO:root:Exporting the quantized graph ...
Loading extension modelopt_round_and_pack_ext...
INFO:root:Exporting took 33.892990589141846 seconds
INFO:root:
Quantization process took 394.4490396976471 seconds
INFO:root:Saving to E:\model_store\genai\nemotron-mini-4b-instruct-fp16-dml-genai\opset_21\default_quant_dml_ep_calib\model.onnx took 33.43196678161621 seconds
Done
(modelopt) PS E:\ModelOpt_Windows_Scripts_2\modelopt-windows-scripts\ONNX_PTQ>