Vulnerabilities in Deep Learning File Formats
Neural networks are trained through backpropagation. After the network is fed a sample input with a known correct output, an error is computed by comparing the network's prediction to the desired output (for example, by calculating their squared difference). A neural network can be thought of as a chain of simple operations, the individual neurons. Using the chain rule, we can compute how a small change to any individual neuron affects the final error. By adjusting the strength of each connection between neurons in the direction that reduces the error, the network learns to produce outputs closer to the correct answers when given similar inputs again.
This process of incremental adjustment produces a set of floating-point numbers referred to as the neural network's weights (and biases). Some frameworks, like PyTorch, offer ways to store the weights in a file separate from the one describing the architecture of the network; other formats always combine both sets of information into one file. In any case, every trained neural network has a set of weights.
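As a rough illustration of that update rule, here is a minimal sketch with a single weight and a squared-difference error. The numbers (`x`, `target`, `learning_rate`) are arbitrary values chosen for the example, not anything from a real model.

```python
# One "neuron": prediction = w * x, error = (prediction - target) ** 2
x, target = 2.0, 6.0      # a sample input and its known correct output
w = 0.0                   # the weight we want to learn
learning_rate = 0.05      # illustrative step size

for _ in range(50):
    prediction = w * x
    # Chain rule: d(error)/d(w) = d(error)/d(prediction) * d(prediction)/d(w)
    d_error_d_prediction = 2.0 * (prediction - target)
    d_prediction_d_w = x
    gradient = d_error_d_prediction * d_prediction_d_w
    # Step against the gradient so the error shrinks
    w -= learning_rate * gradient

print(w)  # approaches 3.0, since 3.0 * 2.0 equals the target
```

The value of `w` left at the end of the loop is exactly the kind of number that ends up in a weights file.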
Serialization, deserialization, and Python pickles
In many frameworks, PyTorch included, it is common to store these neural network weights in a serialized format. Python's built-in object serialization library is called pickle, and that is what we will build an example around today. Keep in mind, though, that a similar vulnerability exists in many other object serialization paradigms across many other languages, including the npz format in NumPy (.npz), RData in R (.rds, .rdata, .rda), the Julia serialization format (.jls, .jldata, .juliaserial), and the built-in serialization in Java (.ser).
Arbitrary Code Execution in Pickle files
What these formats have in common is that they allow arbitrary code in their respective language to be included in the serialized data, and this code is executed when the data is deserialized.
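For pickle specifically, the mechanism is the `__reduce__` hook, which lets an object tell pickle which callable to invoke when it is loaded. The sketch below uses a deliberately harmless callable (`os.getcwd`); the class name `Demo` is hypothetical and exists only for this illustration.

```python
import os
import pickle


class Demo:
    """Hypothetical class used only for this illustration."""

    def __reduce__(self):
        # pickle stores "call os.getcwd with no arguments" instead of the object.
        # An attacker would substitute a callable such as os.system and a command.
        return (os.getcwd, ())


payload = pickle.dumps(Demo())

# Loading the bytes invokes os.getcwd(); no Demo instance is ever rebuilt.
result = pickle.loads(payload)
print(type(result).__name__, result == os.getcwd())
```

The point is that `pickle.loads` returned whatever the embedded callable produced, which means the callable ran on the machine doing the loading.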
Scanning Pickle files for Arbitrary Code Execution
If you just want to know whether it is safe to download and work with a pickle (.pkl) file, the consensus is that there is no fully reliable way to verify the safety of a pickle file without executing it. If you can't avoid downloading a pickle from an untrusted source, the main tool in your toolbelt will be fickling, which can scan pickle files for well-known dangerous patterns but cannot guarantee detection of more complex ones (such as tensor steganography).
# Basic scan
fickling pickle_file.pkl
# Scan and display trace
fickling --trace pickle_file.pkl
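To give a feel for what static inspection of a pickle involves, here is a minimal sketch using only the standard library's `pickletools`. It is not how fickling is implemented; the helper name `imported_globals` is an assumption made for this example, and the string-tracking heuristic for `STACK_GLOBAL` can be defeated by a crafted pickle (for example, one that fetches the names from the memo).

```python
import os
import pickle
import pickletools


def imported_globals(data: bytes) -> list[str]:
    """List 'module.name' references found in a pickle, without loading it."""
    found = []
    recent_strings = []  # string arguments seen so far, used for STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Older protocols carry "module name" directly in the opcode argument
            module, name = arg.split(" ", 1)
            found.append(f"{module}.{name}")
        elif opcode.name == "STACK_GLOBAL":
            # Newer protocols push module and name as the two preceding strings
            if len(recent_strings) >= 2:
                module, name = recent_strings[-2:]
                found.append(f"{module}.{name}")
            else:
                found.append("<unresolved>")
        elif isinstance(arg, str):
            recent_strings.append(arg)
    return found


benign = pickle.dumps({"layer.weight": [0.1, 0.2]})
suspicious = pickle.dumps(os.getcwd)  # a pickle that references an os-level function

print(imported_globals(benign))
print(imported_globals(suspicious))
```

A plain dictionary of numbers references no globals at all, while the second pickle names a function from the operating system module, which is the kind of reference a reviewer would want to look at before loading anything.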
We have covered pickle vulnerabilities before, but not in the context of deep learning. A similar vulnerability exists in .pt and .pth files, because the underlying implementation of these files relies on the same pickle library mentioned above. fickling cannot be used directly to scan these files, though you may be able to unpack them and scan the pickle files they contain. However, because fickling cannot always guarantee the safety of a pickle file on its own, it is advisable to test files of any of these types in a sandbox environment, at least whenever you cannot trust the source.
Underlying File Format (pth/pt) | Description | Uses Pickle | fickling support for inserting code |
PyTorch v1.3 | ZIP file containing data.pkl (1 pickle file) | Yes | Yes |
TorchScript v1.3 | ZIP file with data.pkl and constants.pkl | Yes | Yes |
TorchScript v1.4 | ZIP file with data.pkl, constants.pkl, and version set at 2 or higher (2 pickle files and a folder) | Yes | Experimental |
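Because the formats in the table are ZIP archives, the embedded pickle files can be pulled out without loading the model. The following is a minimal sketch, assuming the layout described above; the archive is built in memory as a stand-in, and the folder name `model/` and the helper `pickle_members` are hypothetical names chosen for the example.

```python
import io
import pickle
import zipfile


def pickle_members(archive_bytes: bytes) -> dict[str, bytes]:
    """Return the raw bytes of every .pkl member in a ZIP archive, without unpickling."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            name: zf.read(name)
            for name in zf.namelist()
            if name.endswith(".pkl")
        }


# Build a stand-in archive that mimics the layout in the table above
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("model/data.pkl", pickle.dumps({"layer.weight": [0.1, 0.2]}))
    zf.writestr("model/version", "3\n")

members = pickle_members(buffer.getvalue())
print(list(members))
```

The extracted bytes can then be written to disk and handed to a scanner, ideally inside a sandbox, rather than being passed to `pickle.loads`.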
A more secure approach with Safetensors
A more secure representation of neural network weights, and one that has been widely adopted, is safetensors. The safetensors format contains only raw tensor data and associated metadata.
The security benefit of the safetensors format is that it doesn't allow arbitrary Python code to be serialized, and the architecture of the neural network is defined separately.
Formats like ONNX and TensorFlow's pb are also safer than pickles, because they do not serialize the weights in a format that can be exploited for arbitrary code execution. Each of these can still contain custom neural network layers, though, whose implementation could include arbitrary Python code. In contrast, the safetensors format contains only raw tensor data, so there are no custom neural network layers in the file itself.
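To make the contrast concrete, here is a minimal sketch of reading a safetensors-style file with only the standard library. It assumes the published layout: an 8-byte little-endian header length, a JSON header describing each tensor, and then the raw bytes. The tensor name `w` and the helper `read_safetensors_header` are hypothetical, and real code should use the safetensors library rather than this hand-rolled parser.

```python
import json
import struct


def read_safetensors_header(blob: bytes) -> tuple[dict, bytes]:
    """Split a safetensors-style blob into its JSON header and raw data section."""
    (header_len,) = struct.unpack("<Q", blob[:8])           # 8-byte little-endian length
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, blob[8 + header_len:]


# Hand-build a tiny file holding one float32 tensor named "w" with two values
header_json = json.dumps(
    {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
).encode("utf-8")
blob = struct.pack("<Q", len(header_json)) + header_json + struct.pack("<2f", 1.0, 2.0)

header, payload = read_safetensors_header(blob)
start, end = header["w"]["data_offsets"]
values = struct.unpack("<2f", payload[start:end])
print(header["w"]["shape"], values)
```

Nothing in this path resolves a name to a callable: loading amounts to parsing JSON and slicing bytes, which is why the format leaves no room for the deserialization exploit described earlier.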
File Format | Associated Ecosystems | Can Contain Deserialization Code Execution Exploit(s) |
Arrow | Spark | No |
dill | scikit-learn | Yes |
HDF5 (h5) | Keras | Yes |
Java serialization | Java language | Yes |
joblib | scikit-learn | Yes |
json | Multiple | No |
Julia Serialization | Flux.jl | Yes |
MOJO | H2O.ai | Yes |
MsgPack | Flax | No |
NumPy | NumPy | Yes |
ONNX | Multiple | Yes |
pickle | PyTorch, scikit-learn | Yes |
POJO | H2O.ai | Yes |
RDS | R language | Yes |
SafeTensors | Multiple | No |
SavedModel | TensorFlow | No |
TFLite (FlatBuffers) | TensorFlow | No |
TorchScript | PyTorch | Yes |
Snyk Open Source Security for Deep Learning Libraries
Standing up a secure service reliant on neural networks begins with managing neural network weights in a secure format. Then, one must consider the possibility of malicious code in custom-implemented layers within the architecture of the network. Finally, even if the file representation is understood to be safe and no malicious layers are present in the architecture, out-of-date versions of popular deep learning libraries can still have other vulnerabilities. Examples include the ONNX Directory Traversal vulnerability, the TorchServe ShellTorch vulnerability, and a number of historical TensorFlow vulnerabilities.
You can find many relevant vulnerabilities in the Snyk Security database just by searching for the name of a package, like torchserve. This article, though, focuses on the secure representation of neural network weights themselves, particularly those affected by pickle. In the next article in this series, we will explore poisoning pickles with malicious code, with some hands-on examples.