Horovod with MXNet

Horovod supports Apache MXNet and regular TensorFlow in similar ways.

See the full MNIST and ImageNet training examples. The script below provides a simple code skeleton based on the Apache MXNet Gluon API.

import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank
context = mx.gpu(hvd.local_rank())
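# (hvd.local_rank() is this process's index on its host; use mx.cpu() if no GPU is available)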
num_workers = hvd.size()

# Build model
model = ...
model.hybridize()

# Create optimizer
optimizer_params = ...
opt = mx.optimizer.create('sgd', **optimizer_params)

# Initialize parameters
model.initialize(initializer, ctx=context)

# Fetch and broadcast parameters
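# so that every worker starts from rank 0's initial weights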
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create DistributedTrainer, a subclass of gluon.Trainer
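# which allreduces gradients across workers so each update uses the averaged gradient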
trainer = hvd.DistributedTrainer(params, opt)

# Create loss function
loss_fn = ...

# Train model
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = batch.data[0].as_in_context(context)
        label = batch.label[0].as_in_context(context)
        with autograd.record():
            output = model(data.astype(dtype, copy=False))
            loss = loss_fn(output, label)
        loss.backward()
        trainer.step(batch_size)
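
For reference, here is a minimal, self-contained sketch that fills in the placeholders above with made-up choices; the two-layer Gluon MLP, plain SGD settings, synthetic random data, and hyperparameters such as batch_size = 64 are illustrative assumptions, not part of the official example.

import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd, gluon

hvd.init()

# Use the GPU matching this worker's local rank, or fall back to CPU.
context = mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu()

batch_size = 64      # illustrative hyperparameters
num_epoch = 3
learning_rate = 0.01

# Synthetic 10-class classification data (each worker draws its own random batches).
data = mx.nd.random.uniform(shape=(1024, 32))
label = mx.nd.random.randint(0, 10, shape=(1024,)).astype('float32')
train_data = mx.io.NDArrayIter(data, label, batch_size, shuffle=True)

# A small MLP standing in for "model = ..." above.
model = gluon.nn.HybridSequential()
model.add(gluon.nn.Dense(128, activation='relu'))
model.add(gluon.nn.Dense(10))
model.hybridize()
model.initialize(mx.init.Xavier(), ctx=context)

opt = mx.optimizer.create('sgd', learning_rate=learning_rate)

# Broadcast rank 0's parameters, then wrap the optimizer in a DistributedTrainer.
params = model.collect_params()
hvd.broadcast_parameters(params, root_rank=0)
trainer = hvd.DistributedTrainer(params, opt)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        x = batch.data[0].as_in_context(context)
        y = batch.label[0].as_in_context(context)
        with autograd.record():
            output = model(x)
            loss = loss_fn(output, y)
        loss.backward()
        trainer.step(batch_size)
    if hvd.rank() == 0:
        print('finished epoch %d' % epoch)

With a working MPI or Gloo setup, a script like this is typically launched with horovodrun, for example horovodrun -np 4 python train.py to start four workers on one machine.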

Note

Some MXNet versions do not work with Horovod (a quick version check is sketched after the list):

  • MXNet 1.4.0 and earlier have GCC incompatibility issues. Use MXNet 1.4.1 or later with Horovod 0.16.2 or later to avoid these incompatibilities.
  • MXNet 1.5.1, 1.6.0, 1.7.0, and 1.7.0.post1 are missing MKLDNN headers, so they do not work with Horovod. Use 1.5.1.post0, 1.6.0.post0, and 1.7.0.post0, respectively.
  • MXNet 1.6.0.post0 and 1.7.0.post0 are only available as mxnet-cu101 and mxnet-cu102.
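
As a rough sanity check against these constraints, the installed versions can be compared before training. The snippet below is only a hedged sketch; the thresholds and the set of broken releases are taken verbatim from the list above.

import mxnet
import horovod
from distutils.version import LooseVersion

# Releases listed above as missing MKLDNN headers.
broken = {'1.5.1', '1.6.0', '1.7.0', '1.7.0.post1'}

mx_ver = mxnet.__version__
hvd_ver = horovod.__version__

assert LooseVersion(mx_ver) >= LooseVersion('1.4.1'), \
    'MXNet %s is too old; 1.4.0 and earlier have GCC incompatibility issues' % mx_ver
assert LooseVersion(hvd_ver) >= LooseVersion('0.16.2'), \
    'Horovod %s is too old for MXNet 1.4.1+' % hvd_ver
assert mx_ver not in broken, \
    'MXNet %s is missing MKLDNN headers; install the matching .post0 release' % mx_ver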