What is an Open Source Model?

What Are Open Source Models in Machine Learning?

Open source has always been an integral part of artificial intelligence. By definition, the term open source refers to software for which the original source code is made publicly available. Thus, anyone can become a contributor, redistributor, or user of this software within the terms of its open-source license.

The values of open source are the same values to which the machine learning (ML) community has always aspired: collaboration, peer review, transparency, reliability, flexibility, and accessibility. Originating from academia, this mindset of sharing and transparency now also permeates industry, with leading companies such as Google and Microsoft being among the most well-known contributors of open-source machine learning models.

Due to the complexity and fast pace of the ML world, the most widely adopted models, pipelines, frameworks, and infrastructures have generally been open source. Following this trend, the number of publicly available open-source machine learning models keeps increasing.

This article defines open-source AI models, reviews the most popular releases, discusses pros and cons of the current open-source landscape, and concludes with an overview of how to productionize these models.

What Are Open-Source Models?

Open-source models are binaries of machine learning algorithms pre-trained on often-large datasets in order to achieve state-of-the-art performance in a machine learning application. These model binaries are released to the public for everyone to use, for either model inference or transfer learning, as we’ll explore in the last section of this article.

Usually, these trained models are released with the code that implements the underlying machine learning algorithm, and sometimes the data is also publicly available. When code and data are both available, the training can in principle be reproduced, and users can also review, modify, and contribute to the solution, as per the standard open-source definition.

What Open-Source Models Are Available?

Most open-source AI models are deep learning models. This is because neural networks benefit from huge datasets and sophisticated architectures that can grow to encompass a vast number of parameters; thus, training them requires extensive time and hardware. Even in the few cases where code and data are both available, fully replicating the training of these models is extremely resource-intensive and thus, for most individuals and organizations, infeasible.

One example of a deep learning open-source model, and one of the largest ever released, is a GPT-like model named YaLM 100B, which was trained on 1.7 TB of text for 65 days on a pool of 800 high-grade A100 graphics cards.
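To put those training figures in perspective, the short Python sketch below converts them into GPU-hours and GPU-years. It assumes that all 800 cards ran continuously for the full 65 days, which the figures above do not state explicitly, so the result is best read as a rough upper bound.

```python
# Training figures quoted above for YaLM 100B.
NUM_GPUS = 800
TRAINING_DAYS = 65

HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU-hours, ASSUMING every GPU is busy for the whole period."""
    return num_gpus * days * HOURS_PER_DAY


def gpu_years(hours: int) -> float:
    """The same compute expressed as years of work for a single GPU."""
    return hours / (HOURS_PER_DAY * DAYS_PER_YEAR)


total_hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
total_years = gpu_years(total_hours)

print(f"GPU-hours: {total_hours:,}")      # GPU-hours: 1,248,000
print(f"GPU-years: {total_years:.1f}")    # GPU-years: 142.5
```

In other words, a single card would need well over a century to do the same work, which is why most users download the released weights rather than retrain.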

To find this and other open-source models, we can look at some of the most well-established providers:

  • Model Zoo is probably the most popular collection of deep learning code and models for a variety of frameworks, platforms, and applications.
  • Framework-specific collections exist too, such as TensorFlow Hub and PyTorch Hub.
  • For those looking for open-source code to replicate, tailor, or just better understand specific ML algorithms, Papers with Code is the place to find open-source machine learning algorithms, and the source of many of the latest state-of-the-art ML models.
  • Hugging Face is the rising star for open-source models with a focus on natural language processing (NLP) applications. This platform has a unique production focus, and deploying Hugging Face models is as simple as calling a function from their Python library.

These platforms predominantly redirect to GitHub, which is, ultimately, the largest open-source model repository.

Pros and Cons of Open-Source Machine Learning Models

Open-source models offer various benefits, which help boost adoption of a wider variety of AI applications:

  • Time and cost savings: Open-source models are pre-trained, eliminating the most expensive phase of data science workloads.
  • Quality: Open-source models are extensively tested, implement best practices, and often achieve state-of-the-art performance.
  • Minimal entry requirements: Open-source models support the democratization of AI for individuals and companies by lowering the otherwise high entry costs of good data, big data, computational power, budget, and talent.

Open-source AI is sometimes criticized too, with some of the reasons being:

  • Environmental impact: The computational requirements of open-source models are often extremely high, with models training for days or weeks on large hardware pools.
  • Lack of regulation: Open-source models are often trained on large datasets of web-scraped data, with data owners and creators not having clear control of ownership rights.
  • Lack of comprehensive testing: Open-source models are often released without complete testing, in part due to model interpretation and theoretical analysis being an open field of study. This could lead to unexpected—sometimes “evil”—model behaviors and applications. Meta’s Galactica model offers a case study.

Productionizing Open-Source Machine Learning Models

The AI community seems to have reached the consensus that adopting open-source models is the new standard when delivering AI applications, especially for NLP and computer vision tasks. An alternative often considered is AutoML.

To productionize open-source models, the first step is to download the model library to the development and production environments. From here, two paths are most commonly taken:

  1. Use the pre-trained model directly for inference.
  2. Use transfer learning on the pre-trained model to extend its trainable parameters and layers, and then tune it for the specific proprietary data set and use case.

When deploying the original or the tuned pre-trained model in production, the same considerations apply as when deploying a model trained from scratch, with the exception that path one does not need a continuous retraining pipeline.
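The two paths can be illustrated with a deliberately tiny, dependency-free Python sketch. Everything in it is a stand-in: `BASE_WEIGHTS` and `ORIGINAL_HEAD` are hypothetical values playing the role of downloaded pre-trained weights, `PROPRIETARY_DATA` is a made-up dataset, and the helper names are ours. Path one calls the model as shipped; path two keeps the pre-trained base frozen and trains only a new output layer, which is one common form of transfer learning rather than the only one.

```python
import math

# Hypothetical "downloaded" weights standing in for a real pre-trained model.
BASE_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0]]  # frozen feature extractor
ORIGINAL_HEAD = [1.0, -1.0]                # output layer shipped with the model
ORIGINAL_BIAS = 0.0

# Made-up proprietary task: label 1 when the two inputs differ a lot.
PROPRIETARY_DATA = [
    ([3.0, 0.0], 1),
    ([0.0, 3.0], 1),
    ([1.0, 1.0], 0),
    ([0.5, 0.4], 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def extract_features(x):
    """Frozen base: one linear layer followed by ReLU. Never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in BASE_WEIGHTS]


def predict(x, head, bias):
    """Probability of class 1 for input x, using the given output layer."""
    feats = extract_features(x)
    return sigmoid(sum(h * f for h, f in zip(head, feats)) + bias)


def fine_tune_head(data, epochs=200, lr=0.5):
    """Path 2: train a new output layer with SGD; the base stays frozen."""
    head = [0.0] * len(BASE_WEIGHTS)
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            feats = extract_features(x)
            p = sigmoid(sum(h * f for h, f in zip(head, feats)) + bias)
            error = p - y  # gradient of the log loss w.r.t. the logit
            head = [h - lr * error * f for h, f in zip(head, feats)]
            bias -= lr * error
    return head, bias


# Path 1: use the pre-trained model directly for inference.
as_shipped = predict([0.0, 3.0], ORIGINAL_HEAD, ORIGINAL_BIAS)

# Path 2: adapt the model to the proprietary data, then run inference.
new_head, new_bias = fine_tune_head(PROPRIETARY_DATA)
tuned = predict([0.0, 3.0], new_head, new_bias)

print(f"as shipped: {as_shipped:.2f}, after tuning: {tuned:.2f}")
```

In this toy setup the model as shipped scores the input `[0.0, 3.0]` low, because it was built for a different task, while the tuned head learns the new labels without touching the base weights. Path two is also the one that needs the retraining pipeline: `fine_tune_head` would have to be rerun whenever the proprietary data changes.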

The philosophy of productionizing machine learning models and pipelines built upon open-source principles and software is often referred to as “open-source model infrastructure.” Iguazio supports this approach by offering MLRun, the first end-to-end open-source MLOps orchestration framework.