AI & Data Privacy with Valerii Babushkin
Ahead of our online summit AI and Data Privacy on June 10th, we sat down for an interview with Valerii Babushkin, WhatsApp User Data Privacy Tech Lead at Facebook, who will be joining the event.
How do you see privacy in 2021 in your line of work?
It's impossible to answer how I see privacy overall, but in my line of work, privacy means striving to store as little data as possible, because you can have no problem with data you don't have. So privacy, in my line of work, is storing no data at all, or using as little of it as possible.
Let's look at voice recognition services for example. Voice assistants have to be listening to everything we say to know when to respond. How does the balance between privacy and convenience work?
Voice recognition services don't actually listen to everything we say, because that isn't feasible: processing all of that audio would require a lot of computing power, a lot of expensive CPU time, and a lot of electricity. Voice assistants usually listen for just two or three trigger words. When one of these words triggers the device, it connects to a server that hosts the models which actually recognize speech; the device itself can't recognize anything beyond those two or three words. So it isn't a big problem. If you want to be more protected, don't buy these devices, and if you do, switch the microphone off mechanically; this is possible on devices such as Alexa. But the device doesn't constantly stream everything you say: the small CPU inside can only recognize the two or three trigger words, and only then does the system connect to the server.
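To illustrate the architecture Valerii describes, here is a runnable toy of the wake-word gate. Audio frames are modeled as plain strings, and every name here is a hypothetical placeholder, not any vendor's actual API.

```python
# A runnable toy of on-device wake-word gating. All names are
# hypothetical placeholders, not a real vendor API. The point: the
# cheap on-device check decides only "trigger word or not", and audio
# reaches the server only after a trigger.

WAKE_WORDS = {"alexa", "echo"}  # the two or three words the device knows

def detect_wake_word(frame: str) -> bool:
    # Stand-in for the tiny on-device model.
    return any(word in frame.lower() for word in WAKE_WORDS)

def recognize_on_server(frame: str) -> str:
    # Stand-in for the heavy server-side speech model; in reality this
    # is the first point where audio leaves the device.
    return f"server understood: {frame!r}"

def run_device(frames):
    for frame in frames:
        if not detect_wake_word(frame):
            continue  # no trigger: this audio never leaves the device
        print(recognize_on_server(frame))

run_device(["what's the weather", "alexa, set a timer"])
# prints: server understood: 'alexa, set a timer'
```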
What do you see in the future to protect privacy from a policy perspective? From a technology perspective?
My field is machine learning, and in machine learning we have many different areas of research for protecting privacy. One is federated learning, where there is no central place that gathers all the data; each edge node operates independently and sends only a small portion of output, which can later be accumulated, so there is no data sharing. Another research area is called differential privacy. It is a way to protect the data that was used to train a machine learning model. For example, there is a model called GPT-3 which was trained on a huge amount of data, and as far as I remember, there was some kind of reverse engineering: the model memorized parts of its training data and reproduced them in the text it generated, so it's technically possible to extract that data from the model. Differential privacy is a research area aimed at preventing exactly that. That said, the answer to this question is related to the answer to the first one: the less data you have, the less can be leaked, and if you have no data at all, it can't be leaked and you won't have any privacy issues with it.
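To make the differential privacy idea concrete, here is a minimal sketch of the Laplace mechanism, the textbook building block of the field: calibrated noise is added to a query result so that no single person's record can be reliably inferred from the output. This is a generic illustration, not how GPT-3 or any production system is trained; the dataset and epsilon value are invented for the example.

```python
import random

# Minimal sketch of the Laplace mechanism. The dataset and epsilon are
# invented for the example; real systems compose many such noisy steps.

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float) -> float:
    # A counting query has sensitivity 1 (adding or removing one person
    # changes the count by at most 1), so noise with scale 1/epsilon
    # gives epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 38]  # toy dataset
print(private_count(ages, lambda a: a > 30, epsilon=0.5))  # true count 4, plus noise
```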
What about data sharing for machine learning that benefits all? For example, genetic info used for medical research and machine learning to find cures for cancer. Do we need to change privacy regulations to more easily enable such situations?
I would love to speed up these processes, because I have a dream that my mother will live for the next 100 years. That is unfortunately impossible right now, but I hope it will become possible thanks to developments in the field of medicine. Of course, it's not that easy, because what could be more sensitive than medical data? Still, I think privacy regulations should be changed, because the potential here is immense.
Do you see any new technologies helping with such situations?
We can talk about privacy again: there is federated learning, where there is no central entity, so you don't have to share the data at all; you can keep the data where it is and share only some results, which are irreversible. If we speak about machine learning for genetics and for medicine, recently there was a huge breakthrough: AlphaFold, an artificial intelligence program released by Google's DeepMind, which helps scientists who work with proteins to model and predict the structure of protein molecules and use that information to get results much faster.
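As a sketch of the federated learning idea mentioned here, the toy loop below has each client compute a model update on its own local data and share only that update with the server, which averages them; the raw data never leaves the clients. This is a generic illustration under simplified assumptions (a one-parameter linear model), not any production system.

```python
# Toy federated averaging on a one-parameter linear model y = w * x.
# Only model updates leave the clients; raw data stays local.

def local_update(w, local_data, lr=0.1):
    # One gradient step of mean squared error on the client's
    # private (x, y) pairs.
    grad = sum(2 * x * (w * x - y) for x, y in local_data) / len(local_data)
    return w - lr * grad

def federated_round(global_w, clients):
    updates = [local_update(global_w, data) for data in clients]  # only numbers leave
    return sum(updates) / len(updates)                            # server just averages

clients = [
    [(1.0, 2.1), (2.0, 3.9)],  # client A's private data
    [(1.5, 3.0), (3.0, 6.2)],  # client B's private data
]
w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
print(f"learned slope: {w:.2f}")  # close to 2, without pooling the raw data
```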
Based on your experience, what technologies are most effective for maintaining an individual's privacy going forward?
End-to-end encryption is a good example. What is end-to-end encryption? WhatsApp, for example, is an end-to-end encrypted messenger, which means there is no middleman: neither the server nor any other entity has a key to your message content. Your message can be read only on the device you send it from or on the device it is sent to. By default. This follows the same principle: if you have no data, you will have no issue with the data. Behind end-to-end encryption lies a brilliant idea, the Diffie-Hellman key exchange protocol, and this is the best technology so far. If you have end-to-end encryption, like in WhatsApp, you can rest assured that no one can see the content of your messages. Let's say it would be much easier to hack your phone and read the messages there than to break the encryption itself, because every message has its own key; every message is encrypted on its own, so it's practically impossible to break.
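For illustration, here is the Diffie-Hellman idea in a few lines, with textbook-sized (insecurely small) toy parameters; real messengers such as WhatsApp use elliptic-curve variants of this exchange inside the Signal protocol. The point is that both sides derive the same key without the key itself ever crossing the wire.

```python
import secrets

# A toy Diffie-Hellman key exchange with textbook-sized numbers.
# These parameters are insecurely small and purely illustrative.

P, G = 23, 5  # public prime modulus and generator (toy values)

alice_secret = secrets.randbelow(P - 2) + 1  # private, never transmitted
bob_secret = secrets.randbelow(P - 2) + 1    # private, never transmitted

alice_public = pow(G, alice_secret, P)  # this is what goes over the wire
bob_public = pow(G, bob_secret, P)

# Each side combines its own secret with the other's public value:
alice_key = pow(bob_public, alice_secret, P)
bob_key = pow(alice_public, bob_secret, P)

assert alice_key == bob_key  # same key, though it never crossed the wire
print("shared key:", alice_key)
```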
Valerii Babushkin joined Facebook in November 2020 as the WhatsApp User Data Privacy Tech Lead. Before Facebook, Valerii was VP of Machine Learning at Alibaba Russia, where he led all machine learning initiatives, and Senior Director of Data Science at X5 Retail Group, where he led a team of 140+ people. Valerii is also a Kaggle Competitions Grandmaster, ranked in the global top 30.