There’s an old saying in Argentina: “A río revuelto, ganancia de pescadores”, which roughly translates to “when the river is stirred, the fishermen profit”.
2022 will be remembered as a defining year for the crypto ecosystem. Whether that will be for good or bad reasons is yet to be known, but there’s no doubt that the massive waves caused by the ecosystem’s volatility (including the collapse of both an entire cryptocurrency and arguably the second most important exchange) impacted everything they touched.
And that includes data.
Given that the whole theory of machine learning assumes that today will behave at least somewhat like yesterday, what can algorithms and models do for you in such a chaotic context? Even more: how do you harness the power of the enormous amount of data inherent to crypto, keeping track of these extreme changes and actually pulling value out of them?
Those were the questions that the guys at CTF Capital (a trading fund) had. We at deployr worked alongside them to find the best possible answers for everyone involved and build their data and ML pipelines.
We hope our experience dealing with these challenges can help you understand the complexity of the crypto world and perhaps give you cool insights on how to deal with your own data problems and team management.
A quick note: the focus of this article is not to discuss the intricacies of crypto trading per se (an economic perspective, so to speak) but rather to talk about how we used the best practices of the MLOps methodology to lead a transformation process for a company working in one of the most technically and computationally demanding fields.
With that out of the way, let’s dig in!
Building data and ML pipelines: from the ground to the cloud
It was the beginning of 2022, and things were looking bright after the lockdown’s end. At that time, we received a message from a friend and colleague of ours working at a crypto trading fund: “I managed to assemble an amazing data team, but we lack direction and focus”.
To give you some insight into how things are on our side of the world (we’re from Argentina, would you like a mate?), in Latin America, we face the unique challenge of reaching top-level performance while addressing three main problems:
- Lack of learning resources in Spanish: while luckily both teams are fluent in English, the harsh reality is that most technical frameworks are designed and coded in English, and searching for examples and guidelines in Spanish most of the time leads to poor-quality resources or no results at all.
- The inherent cost of cloud computing: to illustrate the point, Argentina’s minimum wage is currently around 200 dollars per month. It may not seem like a real problem, but when you have to spend money to implement something in the cloud, even a dollar or two can make the cost of failing (and failing and iterating fast is a natural part of developing these solutions) very steep.
- The senior-level brain drain: caused mainly by the immense disparity between the peso and the dollar.
The highest highs and the lowest lows
The guys at CTF Capital were developing a model that could be integrated into a larger decision-making system. The idea was to help the traders (actual humans) decide whether to sell or buy certain cryptocurrency pairs (mostly against a stablecoin, such as USDC or USDT).
When it comes to crypto, every single part of a pipeline can turn into something extreme and nasty very quickly. And while that is true of pretty much every project in every single industry, keep in mind that changes in the crypto world can happen out of nowhere.
Being a trading fund that deals with millions of dollars (translated into several cryptocurrencies), the decision-making process must be as sharp as possible, and tracking and accountability are paramount for that.
With that scenario in mind, before we could all start getting our hands dirty and writing a single line of code, we had to set the strongest possible foundations for everything we were going to build.
Building a map of the data & ML pipelines
If you’ve ever been a part of this type of transformational process, I’m sure you already know that this is an all-hands-on-deck moment.
Reaching this level of excellence and performance requires that everyone have clear visibility of the situation. To do that, you must have an honest diagnosis of the organization as it is on a technical, business, and human level. Using that as a starting point, it’s a matter of designing and guiding your client to be where they want to be.
We also believe that communication matters, and there’s no point in being mysterious about what will happen every step of the way. If you want to get data scientists, engineers, architects, stakeholders, third-party consultants, and a whole myriad of other actors on board, you have to build two things:
- 1 Bridges between stakeholders and members from all over an organization—from marketing to sales to engineering—working with data on different theoretical and practical levels.
- 2 Excitement about what the end goal is: to be better at what we do, using best software practices to bring the most out of what we do, and using the most appropriate technology to do it.
The way you deal with these problems makes the entire difference. It takes a careful balance between the highly technical side (“can we build, maintain, and deploy this?”, “what does reasonable scale mean for us?”) and the human side of the business (“how many positions will rotate between now and the next months?”, “what is needed from me and my team, and how will we be able to grow?”) to design a clear and consistent path from where you are now to where you want to be.
That’s why we took our combined years of being part of development and management teams and the best of several methodologies, and we developed the Architecture Canvas: a framework for helping organizations to move forward with their data projects.
Talk is cheap (and useful)
Without diving too much into details, the Architecture Canvas consists of two big stages:
Stage 1: Architecture
Here we try to get a clear picture of the technological, cultural, and human capabilities of a given organization in terms of data and ML adoption, being kind in pointing out the weakest points and honest in providing suitable improvement alternatives.
This is the bottom half of the diagram above, and it aims to take a deep look at the company’s status in terms of:
- 1 Value proposition
- 2 Data sources and versioning
- 3 Data analysis and experimentation
- 4 Feature store and workflows
- 5 Foundations
- 6 ML Pipeline orchestrations
- 7 Model registry and versioning
- 8 Model deployment
- 9 Prediction serving
- 10 Data and model monitoring
- 11 Metadata store
Stage 2: ML and AI Maturity Scale
This methodology starts from the assumption that every team can be analyzed along four big dimensions: people, technology, data, and process. Pairing each dimension with another yields six methodological axes:
- 1 Learn: People + Tech
- 2 Lead: People + Process
- 3 Access: People + Data
- 4 Scale: Data + Technology
- 5 Secure: Data + Process
- 6 Automate: Technology + Process
The goal of this stage is to score the team on each of those six axes against three maturity levels: tactical, strategic, and transformational.
You can take a look at the upper half of the diagram above to see how these intersections end up on a scale that goes from 1 (an organization completely immature in terms of data adoption) to 5 (an organization completely mature, professional and advanced in terms of data adoption).
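To make the scoring idea concrete, here’s a minimal sketch of how per-axis scores could roll up into a maturity label. The rubric and thresholds below are purely illustrative (they are not the exact scoring CTF Capital received, nor the Canvas’s actual formula):

```python
# Hypothetical sketch: rolling up the six methodological axes of the
# maturity scale into one label. Thresholds are illustrative only.

AXES = ["learn", "lead", "access", "scale", "secure", "automate"]

def maturity_level(scores: dict) -> str:
    """Map per-axis scores (1-5) to a tactical/strategic/transformational label."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    average = sum(scores[axis] for axis in AXES) / len(AXES)
    if average < 2.5:
        return "tactical"
    if average < 4.0:
        return "strategic"
    return "transformational"

team = {"learn": 4, "lead": 3, "access": 2, "scale": 2, "secure": 3, "automate": 1}
print(maturity_level(team))  # → strategic
```

The point of a rubric like this isn’t the arithmetic; it’s forcing an honest, comparable number onto each axis so the weakest ones stand out.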
We code for (and work with) fellow humans
The framework requires complete honesty, transparency, and collaboration. To gather all this information, we hold sessions with the different areas of the team (scientists, engineers, QA, managers, and C-level executives if the need arises) in order to fully understand each one’s expectations for the project and reach a common understanding.
Luckily, the sessions were fruitful and fun: we were able to get to the bottom of things and design a new plan that allowed us to grow fast and measure progress quarterly, providing value and peace of mind to every area of the business.
To promote the aforementioned transparency and collaboration, right after the sessions were done and the plan was laid out, our first step was to rearrange the management and follow-up tool and adapt it to our needs.
Seriously: we can’t stress enough how very important it is to keep and maintain a serious tracking tool. Use Trello, Asana, or anything of your choice, but keeping full visibility of what everyone’s doing is crucial if you want to get everybody on the same page.
On that note, we use Notion, and we consider it an amazing tool on its own, but our ace up the sleeve is this amazing free template by Thomas Frank: getting started can be a little complex and intimidating, but once you get the hang of it, it’s a very intuitive way of keeping work organized with the best of Kanban and a calendarized approach.
We also defined new Slack channels to speed some things up, and they would eventually become an integral part of the Automated Notification Center for the monitoring of the entire system (more on that later!).
Knowing what needs to be done and in what order (the whole process and management side of data) is often overlooked, and we know sometimes keeping everyone up to date can be a bit tedious in its own way, but if you can orchestrate pipelines with dozens of steps in your sleep, you surely can take a moment to write what you’re up to, right?
Phase 1—Data pipeline: getting the house in order
Once the dust had settled, the Architecture Canvas was completed, and the plan was clear to everyone involved, the next step was to take a closer look at the architecture. And that’s when what usually happens, happened:
We came for the ML models, we stayed for the ETLs.
Back in the day, the main pain points were:
- 1 The architecture deployment wasn’t clearly tagged, and there were parts that heavily relied on manual executions.
- 2 The entry point of the whole system was dependent on multiple web sockets, which made it difficult to scale.
- 3 The SQS queues were messy to maintain.
What’s in the box?
First of all, the data originates from the two biggest exchanges. But even though the ETLs were well thought out, they were a bit “outdated” in their approach. Since the team was mostly doing PoCs for testing, most of the architecture was compute-oriented: setting things up on EC2 via ssh, no CI/CD, almost unused RDSs, some problems with missing data, and close to no containerization at all.
We quickly realized that while the dev team was outstanding in the data science and engineering aspect of ML (algorithmics, ETLs, and how to use those to move a business forward), they lacked the knowledge of how to squeeze the most value out of the cloud services and the new best practices in the field.
With that scenario, we agreed on three things:
- 1 To slowly shift towards a more modern architecture (diving deeply into serverless).
- 2 To teach them how to use the stack considered best for them (mostly focusing on fundamentals of MLOps and AWS Sagemaker / Sagemaker Studio).
- 3 To redesign and rewrite the architecture as Infrastructure as Code (using AWS Cloudformation).
Luckily, we were able to deliver quickly, and in one month we had already built a whole new system.
- The pipeline is triggered by Eventbridge, either manually or on a cron schedule.
- This triggers a series of Lambda functions that initialize the pairs: that is, how one cryptocurrency stands compared to another (BTC/USDC, ETH/BTC, etc.). Quick shout-out to the amazing data engineering team at CTF Capital; they really poured their hearts and brains into this!
- We decided to do one exchange at a time: first, the biggest one, and once everything worked fine, we integrated the second.
- The key here was a good partitioning of the data. On each trigger, we check every pair to see if there’s new data, and if there is, we dump it into S3.
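To give a feel for what that partitioning looks like, here’s a minimal, stdlib-only sketch of a time-based S3 key layout. The bucket structure, names, and file format are illustrative assumptions, not CTF Capital’s actual schema:

```python
# Hypothetical sketch of time-based partitioning for exchange data in S3.
# Exchange name, pair format, and key layout are illustrative assumptions.
from datetime import datetime, timezone

def partition_key(exchange: str, pair: str, ts: float) -> str:
    """Build an S3 key partitioned by exchange, pair, and UTC date."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    # e.g. binance/BTC-USDT/year=2022/month=06/day=15/1655294400.json
    return (
        f"{exchange}/{pair.replace('/', '-')}/"
        f"year={dt.year}/month={dt.month:02d}/day={dt.day:02d}/{int(ts)}.json"
    )

print(partition_key("binance", "BTC/USDT", 1655294400.0))
# → binance/BTC-USDT/year=2022/month=06/day=15/1655294400.json
```

Partitioning by date (and pair) like this keeps the “is there new data?” check cheap: each trigger only has to look at the latest partition rather than scan the whole bucket.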
Phase 2—ML pipeline: And now, the models
Finally, the time came to take a look and start working on the ML pipeline. Yay!
The first thing we noticed was that it wasn’t “framework compliant”: in other words, every model (a .gz file) was stored in S3 after the execution of a manual process. That model wasn’t used in a production environment and was only used in some test batch predictions.
We promoted the use of Sagemaker Pipelines: behind a neat UI, the service orchestrates the pipeline’s steps, which makes it easier to integrate the Docker images needed for certain critical steps and AWS Lambdas. Besides, it makes it simpler to keep track of the artifacts since it handles a lot of the back and forth with S3 for you.
The CI/CD was crucial for preventing accidents such as unwanted pipeline executions, and we implemented the use of GitHub Actions to trigger some tasks, such as the data pipeline deployment.
Once that side was taken care of, we moved on to the model registry. We started from a situation where some models were stored in S3 and used for small batch predictions, but since the team wanted real-time predictions, we needed an endpoint and a model registry.
ML training pipeline
Considering we framed this as a time series problem, here’s a brief summary of what the training pipeline does:
- The first stage handles the filtering, labelling, and feature generation, which then gets divided into training and test sets.
- Next comes a stage of hyperparameter tuning for the models.
- Then comes a very exhaustive evaluation stage. We took special care at this point since we need the models to be fully tested: we submit them to five different validation methods and some feature-importance extraction processes as well.
- With all of that, the model gets retrained with all the data and stored in the Sagemaker Model Registry.
- After that, a chosen model gets deployed and used in the model pipeline.
This is a relatively straightforward process that handles training with cross-validation, optimization, and, later on, full dataset training.
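Because this is a time series, the cross-validation above can’t shuffle freely: each fold must train on the past and validate on the immediate future. Here’s a minimal walk-forward sketch of that idea (fold sizes and function names are ours, not the actual pipeline code):

```python
# Sketch of walk-forward (time-series) cross-validation: train on the past,
# validate on the immediate future, never the other way around.
def walk_forward_splits(n_samples: int, n_folds: int, min_train: int):
    """Yield (train_indices, validation_indices) pairs in time order."""
    fold_size = (n_samples - min_train) // n_folds
    for fold in range(n_folds):
        train_end = min_train + fold * fold_size
        val_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))

for train, val in walk_forward_splits(n_samples=10, n_folds=3, min_train=4):
    print(len(train), len(val))  # training window grows, validation slides forward
```

After the folds agree a configuration is sound, retraining on the full dataset (as the last step above describes) makes sure the deployed model has seen the most recent data.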
ML model pipeline
The model pipeline was also challenging, as we were shifting from batch inference to an event-driven one, which would allow us to reach real time. While we can’t show the architecture diagram for confidentiality reasons, here’s what this pipeline does:
- Take the raw data and perform some feature engineering to obtain a common representation known as OHLC (open-high-low-close).
- It’s worth mentioning that the data gets persisted through the entire process on hierarchically designated buckets (raw, bronze, silver, and gold).
- Taking that as input, there’s a step that builds the features and then another one that filters the data, allowing the process to continue only if the data looks promising in terms of our scientific and methodological approach.
- Next, it builds a general model from a combination of the outputs of some algorithms, business rules, and how much the data changes over a period, in order to estimate how much money should be invested and whether to go long or short.
- That prediction is provided to BI dashboards, used by traders to decide what to do at any given time.
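The OHLC step mentioned above is conceptually simple: collapse the raw trades inside a time bucket into a single candle. A minimal sketch (the bucketing and volume handling of the real pipeline are omitted):

```python
# Minimal sketch of the OHLC (open-high-low-close) aggregation: one time
# bucket of trade prices in, one candle out. Prices below are made up.
def ohlc(prices: list) -> dict:
    """Aggregate the trade prices of one time bucket into an OHLC candle."""
    if not prices:
        raise ValueError("empty bucket")
    return {
        "open": prices[0],    # first trade in the bucket
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],  # last trade in the bucket
    }

candle = ohlc([20100.0, 20250.5, 20080.0, 20190.2])
print(candle)
# → {'open': 20100.0, 'high': 20250.5, 'low': 20080.0, 'close': 20190.2}
```

Persisting the inputs and outputs of each such step in the raw/bronze/silver/gold buckets is what makes the process auditable: any candle can be traced back to the trades that produced it.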
Eventually, we grew bored of watching the console and waiting for jobs to end. So we decided to do a tiny side project with some “fun but useful” coding, and we built for ourselves an Automated Notification Center.
It’s a very simple but elegant process that uses Eventbridge to monitor job state changes and triggers a lambda function with Notion and Slack clients.
At the moment, we’re using it to notify about processing job changes: in the Cloudformation template, we listen for the events generated automatically by AWS Sagemaker, specifically the processing job state changes. Within the lambda function, one of the following happens:
- In case of a successful job, some key parameters get logged in a Notion table (squeezing the most out of the Notion API).
- If something goes wrong, using Webhooks, we send a notification to a Slack channel, so our engineers and scientists can take a look at it.
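A simplified sketch of that routing logic is below. The event field names follow SageMaker’s automatically generated EventBridge events (to the best of our knowledge, the detail payload carries `ProcessingJobName` and `ProcessingJobStatus`); the Notion and Slack calls are stubbed out as return values:

```python
# Simplified sketch of the Automated Notification Center's lambda handler.
# The Notion/Slack side effects are stubbed; the real version uses the
# Notion API and a Slack incoming webhook.
def route_event(event: dict) -> str:
    """Decide where a processing-job state-change event should go."""
    detail = event.get("detail", {})
    status = detail.get("ProcessingJobStatus")    # per SageMaker event payloads
    job = detail.get("ProcessingJobName", "unknown-job")
    if status == "Completed":
        # Real handler: log key parameters to a Notion table.
        return f"notion:{job}"
    if status in ("Failed", "Stopped"):
        # Real handler: POST an alert to a Slack channel.
        return f"slack:{job}"
    return "ignore"  # e.g. InProgress events

sample = {"detail": {"ProcessingJobName": "ohlc-features", "ProcessingJobStatus": "Failed"}}
print(route_event(sample))  # → slack:ohlc-features
```

Keeping the routing decision in one small, pure function like this also makes the handler trivially testable without touching AWS, Notion, or Slack.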
The good and the bad
After months of development, we finally got to production!
We’d like to share the things we liked and those we didn’t so much to give you our perspective in case it may come in handy.
- Let’s start with the not-so-cool first. While AWS Sagemaker makes things a lot easier, as we all know, not everything is as it looks in the documentation. An example of this is the multi-model-server library, which is what AWS recommends teams use when dealing with certain DL models, but it hasn’t been maintained in years.
Because of that, we had to use outdated versions of Python and Ubuntu, which forced us into some unwanted refactoring to adapt our workflow to the tool and AWS standards (when it really should be the other way around).
- We ran into a couple of limitations in the serverless paradigm, which mostly boiled down to memory issues. We know that Pandas isn’t the top-performing library, and it can be a bit RAM-intensive. Still, we didn’t want to switch to another framework unless it was essential, in order to make things easier for the data scientists.
- Luckily, we were able to sort them out, and since Lambda’s limits have expanded lately (more uptime and power), it’s likely only a matter of time until this gets a bit more polished.
- Building custom containers to perform inferences using custom libraries wasn’t as easy as the documentation may make it seem. We fought hard battles (and won) against it, mostly when it came to sizing.
- Without a doubt, even with the limitations we found, serverless is the way to go if you want to be as efficient as possible. The setup and technical head start are always a bit bumpy at first, but we managed to drop the costs to a third by shifting from a resource-based architecture to an event-driven one.
- The roadmap was clear to everyone involved, the responsibilities well defined, and the tasks correctly framed in time. As we mentioned before, we can’t stress enough the importance of giving full visibility of the entire project to everyone involved.
- Switching to an Infrastructure-as-Code approach proved to be the best choice for a lot of reasons, and not all of them are necessarily economic. Here’s a funny one, very typical of this industry: dodging certain regulations.
- To operate with these exchanges, you have to choose carefully where you ping them from, since several countries’ regulations prevent a certain level of access (we were clearly being tagged by our origin IP addresses). It was only a matter of changing the AWS region, and since we had the template for the infrastructure, that meant modifying a single line of code. As easy as it gets!
- Working together helped the team really level up their coding game. We started promoting code reviews and the use of linters (Pylint in our case), and we made it so that the lead data scientist approves every pull request.
- While the guys at CTF Capital learned a lot about cloud and infrastructure and got the chance to deploy a lot of their developments in a much more independent and consistent way, we at deployr learned a lot (and we mean A LOT) about the crypto ecosystem and how to solve certain specific problems in the industry.
Of course, we still have a long way to go (as it always is). The most important next steps are:
- We track and monitor experiments in a somewhat “rustic and artisanal” way. Since we want to take it to the next level and don’t want to depend even more on AWS solutions, the next quarter will find us diving deep into tools like neptune.ai to track our experiments and metadata.
- We’re setting things up to dive deep into real-time: the endgame is to automate some of the trading operations, and for that, we have to work on the prediction pipeline.
- We’re working towards a monitoring and logging system for the prediction endpoints, covering both prediction performance and infrastructure. This includes model quality drift, feature importance drift, and model bias drift.
- Also, we’re looking into further development of the Automated Notification Center, building on top of that one so it can notify in case of underperformance or interruption of service and re-run the training pipeline if needed.
If you’ve reached here, thanks for reading until the very end! If you have any doubts, questions, or comments, please feel free to write to me.