
PyTorch GPU memory leak with double backward

Posted: 2022-04-14 01:51:55

I am trying to implement the example-reweighting algorithm described in the paper linked below for my recurrent neural network. The code runs without errors, but GPU memory leaks somewhere I cannot locate; it quickly exhausts GPU memory and causes an OOM error. Specifically, I have added the following method to my trainer class:

def get_resampled_weights(self, batch_train, batch_val):
        """
        Using the resample technique (Learning to Reweight Examples for Robust Deep Learning, 
        https://arxiv.org/pdf/1803.09050.pdf) to train the model
        """
        # prepare two models inside the function so that GPU memory could be released when exit function
        pred_model = deepcopy(self.model)
        dual_model = deepcopy(self.model)
        
        # first forward pass with gradient-tracking sample weights
        preds_train = pred_model(batch_train)
        sample_weights = torch.zeros(batch_train.masks.shape, requires_grad=True, dtype=torch.float).to(self.device[0])
        train_loss = self.loss_fn(preds_train, batch_train.targets, sample_weights=sample_weights)
        train_loss.backward(create_graph=True)
        gradient_containing_params = {k:v for k,v in pred_model.named_parameters()}

        # do one "proposed" gradient update step and evaluate on validation data using the dual model
        updated_state_dict = {k: gradient_containing_params[k].detach() - self.current_lr * gradient_containing_params[k].grad 
            for k in gradient_containing_params}
        
        # set model parameters by manually setting the attributes of the model to the corresponding tensors
        # using load_state_dict does not work because it will lose the gradients of the parameters with respect to sample weights
        set_model_parameters(dual_model, updated_state_dict)
        

        # calculate batch validation loss with the dual model
        preds_val = dual_model(batch_val)
        val_loss = self.loss_fn(preds_val, batch_val.targets)

        # calculate gradients on the sample weights using autograd
        sample_weight_grads = torch.autograd.grad(val_loss, sample_weights)[0].detach()


        # calculate adjusted sample weights from gradients
        clamped_grads = torch.maximum(-sample_weight_grads, torch.tensor(0., device=sample_weight_grads.device))
        grads_sum = torch.sum(clamped_grads) + 1e-8
        new_sample_weights = clamped_grads / grads_sum


        # delete models and try to release GPU memory
        del pred_model
        del dual_model
        torch.cuda.empty_cache()

        return new_sample_weights

# set_model_parameters is implemented as follows, borrowed from
# https://github.com/danieltan07/learning-to-reweight-examples/blob/182aad0ddd11a38d86b7abfb34e35b54bc7efb39/meta_layers.py

def set_model_parameters(model, parameter_dict):
    # replace each named parameter of the model with the corresponding tensor
    for item in parameter_dict:
        set_param(model, item, parameter_dict[item])

def set_param(curr_mod, name, param):
    # recurse down the module tree following the dotted parameter name
    if '.' in name:
        n = name.split('.')
        module_name = n[0]
        rest = '.'.join(n[1:])
        for name, mod in curr_mod.named_children():
            if module_name == name:
                set_param(mod, rest, param)
                break
    else:
        # drop the registered nn.Parameter and re-attach the new tensor as a plain
        # attribute, so it keeps its grad_fn (load_state_dict would discard it)
        delattr(curr_mod, name)
        setattr(curr_mod, name, param)
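
For context, the reweighting scheme in the paper multiplies each example's loss by its sample weight before reduction. The sketch below only illustrates that idea with a hypothetical weighted cross-entropy; it is not the actual self.loss_fn used above:

import torch
import torch.nn.functional as F

def weighted_loss(preds, targets, sample_weights=None):
    # per-example losses, no reduction yet
    per_example = F.cross_entropy(preds, targets, reduction='none')
    if sample_weights is None:
        return per_example.mean()
    # scale each example's loss by its (already normalized) weight and sum
    return torch.sum(sample_weights * per_example)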

I call this function in my training loop to obtain resampled weights for each batch of data (roughly as in the sketch below). In principle, new_sample_weights is detached from the computation graph, and everything else should be destroyed once the function exits. However, after every iteration I see a net increase of about 1 GB of GPU memory when monitoring with nvidia-smi, and OOM is hit after only a few training steps. What am I missing?
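
Roughly, each training step looks like the following sketch (the optimizer, data loaders, and other surrounding names are illustrative placeholders, not my exact code):

for batch_train, batch_val in zip(train_loader, val_loader):
    # meta step: compute per-example weights for this batch
    sample_weights = trainer.get_resampled_weights(batch_train, batch_val)

    # ordinary training step using the reweighted loss
    optimizer.zero_grad()
    preds = trainer.model(batch_train)
    loss = trainer.loss_fn(preds, batch_train.targets, sample_weights=sample_weights)
    loss.backward()
    optimizer.step()

    # allocated memory grows by roughly 1 GB per iteration
    print(torch.cuda.memory_allocated() / 1e9, "GB allocated")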

Any help is greatly appreciated!
