There is a new player in the hot MLOps market: Galileo. Founded by Google and Uber AI alumni, the company emerged from stealth this week with $5.1 million in seed funding.
Galileo is an intelligence platform for unstructured data, and the company claims it can help data scientists find and fix “critical ML data errors 10x faster across the entire ML lifecycle, from pre-training to post-training and post-production.”
A major hurdle in training AI models is the need to inspect and correct the data used for training. Much of that data is unstructured (e.g. text, images, or speech), and data scientists often have to fall back on Python scripts and spreadsheets to identify and fix errors in it. This process consumes valuable time, lacks transparency, and can lead to biased and erroneous production models.
“The motivation for Galileo came from our personal experiences at Apple, Google and Uber AI and conversations with hundreds of ML teams working with unstructured data where we noticed that although they have a long list of model-focused MLOps tools to choose from, the biggest bottleneck and time-waster for high-quality ML is always fixing the data they’re working with. It’s essential, but overly manual, ad hoc and slow, resulting in poor model predictions and avoidable model biases creeping into enterprise output,” said Vikram Chatterji, co-founder and CEO of Galileo. “With unstructured data generated at unprecedented scale and now rapidly leveraged for ML, we are building Galileo to be the smart databank for data scientists to systematically and quickly inspect, fix and track their ML data in one place.”
Chatterji, a former head of product management at Google AI, co-founded Galileo with Atindriyo Sanyal, a senior software engineer formerly at Apple and Uber AI, and Yash Sheth, a former software engineer at Google. All three had experience with ML using unstructured data. At Google AI, Chatterji experienced firsthand the slow and expensive process of training models, spending weeks analyzing data in his ML workflow. Sanyal was a co-architect of Uber’s feature store and was also an early member of Apple’s Siri team. In both roles, he was instrumental in creating ML data quality tools and infrastructure. Sheth led Google’s voice recognition platform and gained experience with unstructured voice data while building and promoting its cloud-based voice API.
Galileo says its approach is unique: “With just a few lines of code added by the data scientist when training a model, Galileo automatically logs data, leverages some advanced statistical algorithms the team created, and then intelligently surfaces model failure points with actions and integrations to fix them immediately, all in one platform. This shortens the time needed to proactively find critical errors in ML data in training models and production, from a few weeks to a few minutes with Galileo.”
Calling its platform “a collaborative record system” for training models, Galileo says it brings transparency to the process through its ability to show how specific data and model parameter changes affect overall performance.
“It is common knowledge that we often achieve greater gains in our model performance by improving data quality rather than tuning the model. Data errors can creep into your datasets and cause catastrophic repercussions in many ways,” said the founders in a company blog post.
These errors come in many forms. Sampling errors can result from inefficient data curation, and blind spots can arise in modeling if data scientists overlook important aspects of their data. According to the founders of Galileo, data scientists need to “look for the right sources of data, [having] a good mix of features, avoiding sample waste, ensuring data generalization and much more.”
Human or synthetic labeling errors are also common. Because labeled datasets are often reused for long periods due to the high cost of labeling, models launched with labeling errors are rarely retrained with fresh, properly labeled data. “This leads the model to serve new/unseen traffic in production, which forces ML teams to react to customer complaints due to model prediction errors, caused by data staleness and inability to proactively train with the right training data,” said the founders of Galileo.
The founders want the collaborative nature of Galileo’s platform to serve everyone in the ML workflow: sales engineers who have to manually fix a customer’s data dump, data scientists fixing model training data, subject matter experts reviewing data errors to provide expert advice on next steps, and product and engineering managers keeping track of ROI on data sourcing and annotation costs.
This week’s $5.1 million seed funding was led by The Factory with participation from other investors, including Anthony Goldbloom, founder and CEO of ML and data science coding community Kaggle. Pete Warden, co-creator of TensorFlow, is among the company’s advisors.
“Finding and fixing data errors is one of the biggest hurdles to effective ML in the enterprise. The founders of Galileo felt that pain themselves while leading ML products at Apple, Google and Uber,” said Andy Jacques, investor at The Factory and Galileo board member. “Galileo has built an incredible team, made product innovations across the stack, and created an industry-first platform for ML data intelligence of its kind. It has been exciting to see the rapid market adoption and positive feedback with one of the customers even calling the product ‘magical’!”