PyTorch GPU memory leak on double backward
I am trying to implement the example-reweighting algorithm described in the paper cited in the code below for my recurrent neural network. The code runs without errors, but GPU memory leaks somewhere I cannot locate, and it quickly exhausts GPU memory and causes an OOM error. Specifically, I have written the following method for my trainer class:
def get_resampled_weights(self, batch_train, batch_val):
    """
    Use the resampling technique (Learning to Reweight Examples for Robust Deep Learning,
    https://arxiv.org/pdf/1803.09050.pdf) to train the model.
    """
    # prepare two models inside the function so that GPU memory can be released when exiting the function
    pred_model = deepcopy(self.model)
    dual_model = deepcopy(self.model)
    # first forward pass with gradient-tracking sample weights
    preds_train = pred_model(batch_train)
    sample_weights = torch.zeros(
        batch_train.masks.shape, requires_grad=True, dtype=torch.float
    ).to(self.device[0])
    train_loss = self.loss_fn(preds_train, batch_train.targets, sample_weights=sample_weights)
    train_loss.backward(create_graph=True)
    gradient_containing_params = {k: v for k, v in pred_model.named_parameters()}
    # do one "proposed" gradient update step and evaluate on validation data using the dual model
    updated_state_dict = {
        k: gradient_containing_params[k].detach() - self.current_lr * gradient_containing_params[k].grad
        for k in gradient_containing_params
    }
    # set model parameters by manually setting the attributes of the model to the corresponding tensors;
    # using load_state_dict does not work because it would lose the gradients of the parameters
    # with respect to the sample weights
    set_model_parameters(dual_model, updated_state_dict)
    # calculate the batch validation loss with the dual model
    preds_val = dual_model(batch_val)
    val_loss = self.loss_fn(preds_val, batch_val.targets)
    # calculate gradients on the sample weights using autograd
    sample_weight_grads = torch.autograd.grad(val_loss, sample_weights)[0].detach()
    # calculate adjusted sample weights from the gradients
    clamped_grads = torch.maximum(-sample_weight_grads, torch.tensor(0., device=sample_weight_grads.device))
    grads_sum = torch.sum(clamped_grads) + 1e-8
    new_sample_weights = clamped_grads / grads_sum
    # delete the models and try to release GPU memory
    del pred_model
    del dual_model
    torch.cuda.empty_cache()
    return new_sample_weights
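For context, self.loss_fn accepts a sample_weights keyword argument; a minimal sketch of a loss function with that interface is below. It is illustrative only, not my exact implementation (the per-sample cross-entropy and the reduction are placeholders):

# illustrative weighted loss with the same interface as self.loss_fn above;
# the per-sample loss (cross-entropy here) and the reduction are placeholders
import torch
import torch.nn.functional as F

def weighted_loss(preds, targets, sample_weights=None):
    per_sample = F.cross_entropy(preds, targets, reduction="none")
    if sample_weights is None:
        return per_sample.mean()
    # with all-zero initial weights the loss value is 0, but its gradient
    # with respect to sample_weights is still well defined
    return torch.sum(sample_weights * per_sample)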
Here is what set_model_parameters does, borrowed from https://github.com/danieltan07/learning-to-reweight-examples/blob/182aad0ddd11a38d86b7abfb34e35b54bc7efb39/meta_layers.py:
def set_model_parameters(model, parameter_dict):
    for item in parameter_dict:
        set_param(model, item, parameter_dict[item])


def set_param(curr_mod, name, param):
    if '.' in name:
        n = name.split('.')
        module_name = n[0]
        rest = '.'.join(n[1:])
        for name, mod in curr_mod.named_children():
            if module_name == name:
                set_param(mod, rest, param)
                break
    else:
        # replace the registered nn.Parameter with a plain (graph-connected) tensor
        delattr(curr_mod, name)
        setattr(curr_mod, name, param)
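To illustrate the comment above about load_state_dict: assigning the updated tensors directly as module attributes keeps them connected to the autograd graph, so the validation loss can still be differentiated with respect to the sample weights. A small standalone sketch of that behaviour (a toy linear model, not my actual network):

# toy check that gradients flow back to the sample weights after
# set_model_parameters; load_state_dict would copy raw values instead
# and break this connection
import torch
import torch.nn as nn
from copy import deepcopy

model = nn.Linear(4, 1)
dual = deepcopy(model)
w = torch.zeros(8, requires_grad=True)                      # per-sample weights
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.sum(w * (model(x) - y).pow(2).squeeze())
loss.backward(create_graph=True)                            # param .grad now depends on w
updated = {k: p.detach() - 0.1 * p.grad for k, p in model.named_parameters()}
set_model_parameters(dual, updated)                         # attributes are graph-connected tensors
val_loss = (dual(x) - y).pow(2).mean()
print(torch.autograd.grad(val_loss, w)[0].shape)            # torch.Size([8])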
I call this function inside the training loop to get the resampled weights for each batch. In principle, new_sample_weights is detached from the computation graph, and everything else should be destroyed once the function exits. However, after every iteration I see a net GPU memory increase of roughly 1 GB when monitoring with nvidia-smi, and the run goes OOM within just a few training steps. What am I missing?
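For reference, the per-step growth can also be seen from inside the loop with PyTorch's own counters; a minimal sketch of how I check it (the loop structure and names like trainer and train_batches are placeholders, not my exact training code):

# placeholder loop showing where the allocated-memory counters are read;
# trainer, train_batches and val_batches stand in for my real objects
import torch

for step, (batch_train, batch_val) in enumerate(zip(train_batches, val_batches)):
    weights = trainer.get_resampled_weights(batch_train, batch_val)
    # ... rest of the training step using `weights` ...
    torch.cuda.synchronize()
    print(f"step {step}: allocated={torch.cuda.memory_allocated() / 1e9:.2f} GB, "
          f"reserved={torch.cuda.memory_reserved() / 1e9:.2f} GB")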
Any help is greatly appreciated!