Bugs in ML code are notoriously hard to catch: they don't cause
compile errors or crashes, they silently regress accuracy. Once you
have endured the pain and fixed one of these, the lesson is forever
etched into your brain, right?
Wrong. Recently, an
old foe made a comeback: a familiar bug bit me again. As before,
model performance improved significantly once I fixed it.
The bug was subtle and easy to introduce. How many other projects
has it damaged? Curious, I downloaded over a hundred thousand
repositories from GitHub that import PyTorch, and analysed their
source code. I kept projects that define a custom dataset, use
NumPy’s random number generator with multi-process data loading, and
are more or less straightforward to analyse using abstract syntax
trees. Out of these, over 95% of the repositories are plagued by
this problem. It's inside PyTorch's official tutorial, OpenAI's
code, and NVIDIA's projects. Even Karpathy admitted falling prey
to it.
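
To make the flagged pattern concrete, here is a minimal sketch of the kind of code the analysis looks for: a custom dataset that draws its randomness from NumPy's global generator while the DataLoader uses worker processes. The dataset and its contents are hypothetical; the essential combination is np.random together with num_workers > 0.

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    """Hypothetical toy dataset: each item should be an independent random draw."""

    def __len__(self):
        return 16

    def __getitem__(self, index):
        # NumPy's global RNG state is copied into each worker process when
        # the DataLoader forks, so different workers can return identical
        # "random" values instead of independent ones.
        return np.random.randint(0, 1000, size=3)

if __name__ == "__main__":
    # num_workers > 0 is what triggers the duplication on platforms that
    # start workers with fork (e.g. Linux).
    loader = DataLoader(RandomDataset(), batch_size=2, num_workers=4)
    for batch in loader:
        print(batch)
```

On a fork-based platform, batches served by different workers print the same numbers: augmentations that were meant to be independent come out as duplicates, which is exactly the kind of silent accuracy regression described above.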