In the latest episode of our podcast, Machine Learning that Works, I had the great pleasure of talking to Gabriel Preda, a Lead Data Scientist at Endava and a Kaggle Grandmaster.
We talked about:
- His work in Endava,
- Kaggle competitions,
- And how to put yourself in a position to be always learning.
For those of you who want to see the full interview, here is the video version.
If, on the other hand, you prefer to read, I prepared a summary as well. It’s not a faithful transcript of our conversation, but a structured and rephrased version of the interview that includes the key points and observations.
Without further ado, let’s meet Gabriel Preda!
What are you working on right now?
I work for Endava, which is a software services company, and our projects are actually our clients’ projects. So in the beginning, I want to say that when I talk about my activities, the projects that I’m working on, the structure of the team, and the skills, toolset, and technologies we put into practice to answer our clients’ needs, I can’t always share all the details.
Anyway, when we deliver services to our clients, we work in multifunctional teams, and this is not different from data projects in general, or more specifically data projects where you have a Data Science or Machine Learning component.
To give you an example, for a recent project that we conducted (it was in the healthcare industry) the main Endava team included:
For the POC part of the project:
- Data Architect,
- Data Analyst,
- Data Engineer,
- and Data Scientist.
For the production part:
- we added a DevOps Developer, as well as a Data Scientist,
- we adjusted the Data Analyst and Data Engineer roles to be just part-time.
So we started by focusing on understanding the problem and generating a possible solution for the client. Then, when we agreed on one particular solution that the client wanted us to implement in production, our focus turned from developing ideas to implementing them. And so the structure of the team changed.
The size and structure of the team depend on the structure and stage of the project of course.
Are your clients’ projects similar, do they have common focus points, or can they be completely different?
We have developed certain aptitudes or specialization in specific fields, and we continuously develop these skills. But, most of the time, we have to pick up the client problem, understand it, and come up with a solution. And this solution has to respond to the business needs of the client, but also fit the specific environment.
So there are cases when we have full liberty, when we can, for example, build our solution in a cloud and implement it in Azure and AWS. Sometimes, on the other hand, we have very strict restrictions.
Then, we have to build something under a lot of constraints, on-premises, and so, our solution has to be adapted to these conditions. Generally, for many clients, we have to do a lot of NLP. This is because a lot of data from our clients is in the form of text information like documents. It really depends on the business, and of course, I will not be specific here, but I will say that we do a lot of NLP. Sometimes it is a combination of computer vision and NLP to extract various content from documents.
We use existing NLP techniques, but sometimes we need to be extremely creative. For example, for one of our projects, we had to develop techniques to cope with very nonstandard English. It included a lot of tedious, manual work, but sometimes you have to do this kind of cleaning work as well. In a lot of projects, it’s about exploring the data, cleaning the data, and preparing the data, before even starting to work with models.
You do a lot of explorations. I got to know your work through some of the beautiful data exploration notebooks you shared with people on Kaggle. Is it some sort of toolkit that you have? Do you reuse them at work?
Of course, the knowledge you acquire, doing all kinds of exercises, publishing content on some industry sites, doing personal projects integrates in some way. You will, at some point, use it for a new project.
I don’t say to copy and paste it, of course, but it will serve you for a new purpose. I think this is a way that most of us learn and progress.
How did you become a Data Scientist?
I think there were two or three steps that led to this. Around 20 years ago, I was in a smaller group that was doing machine learning, without knowing that there was a word for it back then. Sometimes I joke about it. I was doing a postdoc at the university in the Faculty of Nuclear Engineering, and the problem we were working on was very simple. It was about water reactors whose water cooling systems had cracks. These cracks could develop and lead to accidents. To solve it, we were trying to guess the geometry of the defect. So we tried training a neural network, and at that time, training took us two or three days for just a few hundred cases. The network was written in Fortran, as was our simulation code.
So I would say that I was always curious and I was publishing a bit at that time. But only recently, I started to be more curious about the application in data science. About 3 or 4 years ago, a colleague invited me to an internal professional conference to do a presentation about data science in front of a big audience. But I didn’t really know what Data Science was. Besides, I had to do a presentation about using R for data science. And I also didn’t know much about R. So basically, for two months, I was studying every day. I remember it was a holiday season. My family was having fun somewhere at the Black Sea while I was spending nights studying. But I did it, I faced the challenge, and I did a decent presentation.
In the meantime, when I was looking for all the information and knowledge, I also discovered Kaggle.
At that time, you were managing software projects. Is that right?
Yes, I was a project manager. You can imagine that as a project manager, you don’t have a lot of time. It seems as if project managers don’t do anything, but that is the secret. If you do your work well, it appears that you’re not doing anything. So, I guess I was a good project manager because people were quite happy with my projects.
I think this is the best definition of a good project manager – make things go smoothly.
It’s common for project managers to constantly create crises so that they can solve them. And people tend to say, this is the guy who solved that crisis. But the problem is that he’s the guy who created the crisis as well. So I try to avoid this.
Coming back to my Kaggle activity – I didn’t have much time, so I was a project manager by day and a Kaggler by night. When I developed my knowledge in data science and gained experience on Kaggle, I started to do quite a bit of machine learning at that time. Also, in my company, people started to get interested in Data Science. At that time, we had some projects related to data, but we didn’t consider it to be Data Science.
We had a lot of professional communities in Endava, and we started to invite people to give presentations on technical topics or even a project topic. We invited people from the company, but sometimes also external people. It was happening almost everywhere (Endava has offices in a lot of countries in Europe and in South America as well); people were interested and wanted to learn. The goal was to build some structure before having a formal organization.
Anyway, most of the time I was also learning outside the company. About two years ago, I was very active and started to do competitions. I’m not very successful in competitions. So I would say that I’m not a good data scientist from a competitive predictive modeling point of view. Some people argue that this is not so relevant. I would say that it’s extremely relevant. It’s very important not only for your status; it actually shows your level of knowledge in problem-solving.
The most important way that I learn new things in Data Science, or Machine Learning specifically, is when I take part in competitions even if I don’t get a very high rank.
The effort you put into solving typically quite a complex problem in a very short time while competing with others is what accelerates your learning curve.
At a certain moment, I felt that I reached a certain level and was not progressing. So I came back to learning – I learned a lot from Coursera and technical articles.
So you were learning during those times, competing but also changing your role at Endava, right?
I stopped working as a project manager and I started to work as a Data Scientist, and it’s still my position. Sometimes it’s a bit frustrating because I see project management issues there. But I try to support the existing project managers.
What does your average day look like as a Lead Data Scientist? Do you do a lot of management these days, or are your tasks more technical?
At Endava, depending on your seniority, you can have some management tasks. The role may also change from project to project. For example, in the current project, I’m not leading the team, I’m just a part of the team. My role now is just to develop solutions. I work on a proof of concept, and then, with a team, we work on implementing it inside the current application. So I had to re-develop my development skills as well, because going from the POC to writing production code is a journey. But I can easily recognize when my code is crap.
That’s a good thing, I guess. I feel that once you start looking at your code from half a year ago, and it doesn’t look that bad, you think it’s pretty good, it means that you haven’t really progressed that much.
Yes, that’s a good way to put it. I wanted to be part of the team that implements solutions because you learn a lot from this. I wanted to be on a fast learning curve in my day-to-day job, not only in competitions and the Kaggle world. I also wanted to be able to cover the entire lifecycle of procurement from early investigation to writing code for production. It’s not always successful, but not being comfortable is also a good thing. Because the moment I say I’m done and there’s no more progress to be made, it will be the moment to rest.
Do you think this moment ever comes?
If you get tired of having to fight, I guess.
If I were to say in just a few words, how to be a good Data Scientist, I’d say you have to keep learning and put yourself as much as possible in the roles or environments where you can continuously learn from various professional experiences.
Be active in communities. And as I said, bring that experience into your projects as well. And, of course, it also helps the other way around. The organization, the robustness of the code that you’re writing can also help you in competitive predictive modeling. Because it pushes you to write more robust and reusable code.
Keeping yourself busy while trying to solve difficult problems is a very good way to cope with the current conditions, when a lot of people have to stay at home. It’s good to use this as an opportunity to develop your skills.
Do you have any other methods or tricks to keep learning all the time?
I try to learn about domains that are new for me in data science or machine learning. It’s not always a great success. Sometimes I’m just scratching a bit in one direction and I feel uncomfortable, so I go and study something else, and then I return with more energy and determination. I never give up, because I keep processing the first thing somewhere in the back of my head.
What are the core skills that you think are important in your job, especially if you want to develop your Data Science career?
- I got a Ph.D. in Computational Electromagnetics, and I think that the knowledge of all the mathematical methodology has been helping me a lot. The right background is helpful.
- Another thing is that every time I’m learning something, I feel like I need to see it to understand it. I try to visualize things.
- Finally, if you progress in a field, try to use the knowledge on a certain practical problem.
When it comes to the tools that I use, I don’t have specific tools. I have a basic toolset for NLP (NLTK, spaCy, Gensim), but sometimes I have to switch. I started to use PyTorch. I used to work with Keras and TensorFlow. I had to get familiar with cloud-based tools, like SageMaker.
It’s always good to use tools that can accelerate the process.
At the same time, you need to be ready for projects where you don’t have those tools. It has happened that, although I had a good, prepared solution, I couldn’t use it because of the constraints (a missing library, a memory restriction). I had to go back to the early days when I was programming in Fortran and we were reusing variables because we couldn’t fit things into memory. Sometimes there are budget constraints as well. Also, when you create something for the business, and it’s supposed to work for years, you shouldn’t use very complicated resources. So, yes, it’s good to have flexibility when it comes to tools.
If you were to educate Data Scientists (people at the beginning of the path, or even already working but with little experience), what would you advise them to do?
The first and very important thing is to start from the basics. Now, everybody wants to learn the newest solutions, but they should really understand how the very simple algorithms work. It’s especially useful later, when they have to explain their work to the business and gain its trust. Interpreting the results is as important as building a good solution. It’s also important to be able to visualize these results.
And when you have a problem to solve, try the simple solution first.
You always have to start as simple as possible. I also recommend learning the whole Machine Learning pipeline, as it gives you more opportunities.
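To make "start as simple as possible" concrete, here is a minimal sketch of a majority-class baseline for a classification problem. It is only an illustration: the function names and the toy labels are hypothetical and not taken from any of Gabriel's projects. The idea is that any more complex model should clearly beat this number before it earns its complexity.

```python
from collections import Counter


def fit_majority_baseline(labels):
    """Return the most frequent label in the training data."""
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels).most_common(1)[0][0]


def accuracy(predictions, targets):
    """Fraction of predictions that match the targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical toy labels, used only for illustration.
train_labels = ["spam", "ham", "ham", "ham", "spam", "ham"]
test_labels = ["ham", "spam", "ham", "ham"]

# The "model" is just the most common training label, predicted for every test case.
majority = fit_majority_baseline(train_labels)
baseline_predictions = [majority] * len(test_labels)
baseline_accuracy = accuracy(baseline_predictions, test_labels)

print(f"Majority class: {majority}, baseline accuracy: {baseline_accuracy:.2f}")
```

A baseline like this is also easy to explain to the business, which fits the point above about gaining trust before moving to more complicated models.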
Do you have any final thoughts you’d like to share?
I’ll turn your question into a joke. There’s no final, so we’ll continue… and keep learning!