1. Problem description (with error log context):
I pulled the master branch and followed the README (https://gitee.com/ascend/mxDriving/tree/master/model_examples/CenterPoint) to set up the environment and preprocess the dataset, both successfully. I then launched single-node 8-card training with `bash train_centerpoint3d_full_8p.sh`. Shortly after the first epoch started iterating, training crashed with the error below.
2. Log output:
2025-01-13 16:52:11,317 INFO Train: 1/20 ( 5%) [ 0/204 ( 0%)] Loss: 435.8 (436.) LR: 1.000e-04 Time cost: 00:33/1:53:51 [00:33/37:57:13] Acc_iter 1 Data time: 0.73(0.73) Forward time: 15.43(15.43) Batch time: 16.16(16.16)
Traceback (most recent call last):
  File "train.py", line 232, in <module>
    main()
  File "train.py", line 177, in main
    train_model(
  File "/data/jindan/mxDriving/model_examples/CenterPoint/OpenPCDet/tools/train_utils/train_utils.py", line 176, in train_model
    accumulated_iter = train_one_epoch(
  File "/data/jindan/mxDriving/model_examples/CenterPoint/OpenPCDet/tools/train_utils/train_utils.py", line 56, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/data/jindan/mxDriving/model_examples/CenterPoint/OpenPCDet/tools/../pcdet/models/__init__.py", line 44, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1515, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1409, in _pre_forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 4: 0 3 7 11 15 19 22 26 30 34 38 41 45 49 53 57 60 64 68 72 76
Parameter indices which did not receive grad for rank 5: 0 3 7 11 15 19 22 26 30 34 38 41 45 49 53 57 60 64 68 72 76
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
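For context on the workaround the error message itself suggests: the sketch below is a minimal, self-contained illustration (not the CenterPoint/OpenPCDet code) of a module with a parameter that never contributes to the loss, wrapped in DistributedDataParallel with `find_unused_parameters=True` so the reducer tolerates it. The single-process gloo group, port, and ToyModel are all made up for the demo; in the real run the DDP wrapping happens inside the OpenPCDet training scripts, and whether the flag is the right fix (vs. the model genuinely having dead branches) would need checking there.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group so the sketch runs standalone on CPU
# (hypothetical setup; the real job uses 8 NPU ranks).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 4)
        self.unused = torch.nn.Linear(4, 4)  # never touches the loss

    def forward(self, x):
        # self.unused gets no gradient -> without the flag, DDP's reducer
        # raises "Expected to have finished reduction ..." on later iterations
        return self.used(x)

model = DDP(ToyModel(), find_unused_parameters=True)
for _ in range(2):  # two iterations, where the reducer error would surface
    model.zero_grad()
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()  # succeeds: unused parameters are detected and skipped

dist.destroy_process_group()
```

Setting `TORCH_DISTRIBUTED_DEBUG=DETAIL` in the launch script before `torchrun`/`train.py`, as the message suggests, prints the names of the parameters behind indices 0, 3, 7, ..., which is likely the faster path to finding which head or branch is not contributing to the loss.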