How do I automate the training pipeline with my CI/CD framework?

When automating any pipeline, whether an ML pipeline such as model training or a data-focused pipeline such as feature engineering, the first task is to remove or replace every step that requires a human to do the work. Some human steps are fine to keep in a pipeline, such as “Review” and “Approve”, but everything else should be able to execute without help from us. There are many tools for creating pipelines and job workflows. Here at Iguazio, we use Kubeflow Pipelines.

Next, design the pipeline so that it is robust and resilient. Think about how you would answer questions like these:

  1. If my pipeline fails, can I easily restart it without manual fixes?
  2. If my pipeline executes multiple times in a row, do I still get the desired results?
  3. Is my pipeline dynamic in how it handles new data, parameters, scale, etc.?
  4. If my pipeline routinely has issues/risks, what are they and how do I mitigate them?
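One way to make questions 1 and 2 concrete is to write each step so that rerunning it is safe. The sketch below is only an illustration under stated assumptions: `run_step`, `train`, and the file layout are hypothetical names, not part of Kubeflow Pipelines or any other tool. Each step stores its output under a key derived from its parameters, so a restart skips work that already finished and a repeated run returns the same result.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_step(name, params, compute, workdir):
    """Run one pipeline step idempotently (hypothetical helper).

    The output file is keyed by the step name and a hash of its parameters,
    so a rerun with the same inputs reuses the earlier result.
    Returns (result, ran), where ran is False if the step was skipped.
    """
    digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode())
    out_path = Path(workdir) / f"{name}-{digest.hexdigest()[:12]}.json"
    if out_path.exists():
        # Completed in an earlier run: reuse the stored output.
        return json.loads(out_path.read_text()), False

    result = compute(params)
    # Write to a temporary file, then rename, so a crash mid-write
    # never leaves a partial output that looks complete.
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(result))
    tmp_path.replace(out_path)
    return result, True


def train(params):
    # Stand-in for real model training (assumption for this sketch).
    return {"model": "demo", "learning_rate": params["learning_rate"]}


with tempfile.TemporaryDirectory() as workdir:
    first, ran_first = run_step("train", {"learning_rate": 0.1}, train, workdir)
    # Same parameters: the step is skipped and the stored result is returned.
    second, ran_second = run_step("train", {"learning_rate": 0.1}, train, workdir)
    # New parameters: the step runs again under a different key.
    third, ran_third = run_step("train", {"learning_rate": 0.2}, train, workdir)
```

Keying outputs by parameters also touches on question 3: new data or new parameters produce a new key, so they trigger fresh work instead of silently reusing stale results.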

Before involving any CI/CD framework, you should be able to manually “trigger” the pipeline and answer these questions through some basic tests. Once you are satisfied with the pipeline, it’s time to automate it. What are the conditions under which this pipeline should run? If the answer is something along the lines of “every night at X time”, then you can accomplish that with a simple scheduling mechanism, which is often part of the pipeline tool you are using. If the pipeline should run in response to some event, then your CI/CD framework (Jenkins, CircleCI, GitHub Actions) should be able to cover that. Generally, these events are development focused, like “new commit to master branch”, but they don’t have to be.
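As a rough illustration of "conditions under which this pipeline should run", the decision can be written as a small function that a CI job evaluates. Everything here is an assumption made for the sketch: the event dictionary shape, the `new_data` event type, and the 1000-row threshold are hypothetical, and real CI/CD frameworks each define their own event payloads.

```python
def should_trigger(event):
    """Return True if this event should start the training pipeline.

    The event is a plain dict with a hypothetical shape:
    {"type": "push", "branch": "master"} or {"type": "new_data", "rows": 5000}.
    """
    kind = event.get("type")
    if kind == "push":
        # Development-focused event: only commits to master retrain the model.
        return event.get("branch") == "master"
    if kind == "new_data":
        # Data-focused event: retrain once enough new rows have arrived.
        return event.get("rows", 0) >= 1000
    # Unknown events never start the pipeline.
    return False
```

Keeping the rule in one function makes it easy to test before any CI/CD framework is involved.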

Finally, we need to connect our two major tools. Can my event be captured by my CI/CD framework? Can my CI/CD framework connect to my pipeline tool (for example, through an API call)? If the answer to both questions is yes, then you can automate your pipeline by having your CI/CD tool trigger it on whatever events you can define in that tool.
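A minimal sketch of the API-call connection might look like the following script, which a CI job could run as its last step. It is not the API of Kubeflow Pipelines or any specific tool: the `/runs` endpoint, the JSON body, and the environment variable names `PIPELINE_API_URL`, `PIPELINE_API_TOKEN`, and `GIT_COMMIT` are all placeholders to be replaced with whatever your pipeline tool and CI/CD framework actually provide.

```python
import json
import os
import urllib.request


def build_trigger_request(base_url, token, pipeline_name, params):
    """Build the HTTP request that asks the pipeline tool to start a run.

    The "/runs" path and the body layout are hypothetical placeholders.
    """
    body = json.dumps({"pipeline": pipeline_name, "parameters": params}).encode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/runs",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def trigger(request, timeout=30):
    """Send the request and return the HTTP status code.

    urlopen raises an exception on 4xx/5xx responses, which makes the
    CI job fail visibly instead of passing silently.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def main():
    # The CI/CD framework is assumed to inject these as environment variables.
    request = build_trigger_request(
        base_url=os.environ["PIPELINE_API_URL"],
        token=os.environ["PIPELINE_API_TOKEN"],
        pipeline_name="training",
        params={"commit": os.environ.get("GIT_COMMIT", "unknown")},
    )
    print("pipeline trigger status:", trigger(request))

# A CI job would call main() as its final step.
```

Separating request construction from sending keeps the part that differs between pipeline tools in one place and lets it be checked without network access.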
