How Chipotle leveraged curated data for a COVID risk assessment model
During Transform 2021, Ahmad Khan, head of AI/ML Strategy at Snowflake, sat down with Mash Syed, lead data scientist at Chipotle Mexican Grill, to talk about the importance of data, and how high-quality data sets can improve the performance and accuracy of machine learning models and their ability to provide business value. For Chipotle, the business value was the company’s ability to create a risk assessment model during the pandemic.
At the start of the lockdown, the company put together an internal COVID task force to make decisions around employee and customer safety protocols, supplier constraints, how suppliers were being affected, and even whether travel for employees was safe.
The team tapped Syed to lead a project to use publicly available data for COVID case numbers, along with their internal data, to help them better understand how quickly cases were rising and what the risk level was at the county level. This data was then turned into a visualization tool to plot the risk level for restaurants in each county. The team assigned color codes to the risk levels: green, yellow, and red. With these basic classifications, they created a geographical map of the U.S. where they could see a gradient occurring across the country.
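To make the color coding concrete, here is a minimal sketch of how a county's recent case rate might be mapped to a tier. The article does not say which metric or cutoffs Chipotle used, so the per-100,000 rate, the threshold values, and the `risk_tier` function are all assumptions made for illustration.

```python
# Hypothetical cutoffs, in new cases per 100,000 residents over seven days.
# The article does not state the thresholds Chipotle actually used.
YELLOW_AT = 50.0
RED_AT = 100.0


def risk_tier(new_cases_7d: int, population: int) -> str:
    """Map a county's recent case rate to a green/yellow/red tier."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing so whole-number inputs give exact rates.
    rate = new_cases_7d * 100_000 / population
    if rate >= RED_AT:
        return "red"
    if rate >= YELLOW_AT:
        return "yellow"
    return "green"
```

Normalizing by population is one common way to keep large and small counties comparable on the same map.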
They used an XGBoost classification model to predict whether a given county in the U.S. was going to move up in its classification ranking in terms of case count. The Python code was straightforward. The challenge was the third-party source they relied on for data, and the need to build infrastructure just to get that data to show up in their Azure data storage folders, so they could pull it into Databricks.
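A classifier of this kind needs a target to predict. The sketch below shows one plausible way to derive a binary "will this county move up?" label from a county's day-by-day tiers; those labels could then be fed to a classifier such as XGBoost. The article does not describe the features, the prediction horizon, or the label definition, so the next-day horizon and the `escalation_labels` helper are assumptions.

```python
# Order the tiers so they can be compared numerically.
TIER_ORDER = {"green": 0, "yellow": 1, "red": 2}


def escalation_labels(tiers: list[str]) -> list[int]:
    """For each day except the last, return 1 if the next day's tier is
    higher than that day's tier, else 0 (assumed one-day horizon)."""
    ranks = [TIER_ORDER[t] for t in tiers]
    return [int(nxt > cur) for cur, nxt in zip(ranks, ranks[1:])]


# One hypothetical county's history, oldest day first.
history = ["green", "green", "yellow", "yellow", "red", "yellow"]
labels = escalation_labels(history)
```

Each label lines up with the day it was computed from, so it can be paired with that day's case counts as training features.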
“Leveraging the county-level data got us to that insight, but we were beholden to that data source,” Syed said. “The days when it was available, it did its job and it was a beautiful thing, but there were definitely many instances where it wasn’t available when we needed it to be.”
At the county level, they’d get case counts one day, and then nothing the next, and in a lot of instances, data would not get updated at all until some time later in the afternoon.
“Time was of the essence,” he said. “We needed to make decisions as soon as possible so that we’re staying safe and our employees are safe and we’re meeting those objectives in terms of what our task force was set up for.”
He and another data scientist on the team had to use Data Factory to ping the API, retrieve the file, and check whether the max date of the data matched the previous day’s date. If it did not, the pipeline kept looping in 15-minute increments until the data showed up. The other challenge with that data source was that not all counties would report case counts on a given day.
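The polling logic Syed describes can be sketched in a few lines of Python. This is not Data Factory configuration, which is what the team actually used; it only illustrates the same check-and-retry idea. The `fetch_dates` callable is a hypothetical stand-in for the API call and file download, and the `max_attempts` cap is an added assumption so the loop cannot run forever.

```python
import time
from datetime import date, timedelta
from typing import Callable, Iterable


def wait_for_fresh_data(
    fetch_dates: Callable[[], Iterable[date]],
    today: date,
    interval_seconds: float = 15 * 60,
    max_attempts: int = 48,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the newest report date in the feed equals yesterday.

    fetch_dates is a hypothetical stand-in for the API call; it returns
    the report dates found in the retrieved file.
    """
    expected = today - timedelta(days=1)
    for attempt in range(max_attempts):
        dates = list(fetch_dates())
        if dates and max(dates) == expected:
            return True  # the feed is current, so the pipeline can run
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # wait 15 minutes, then check again
    return False  # gave up: the feed never caught up
```

Passing `sleep` in as a parameter makes the loop easy to test without actually waiting 15 minutes between checks.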
The solution to missing and unreliable third-party data was the Snowflake marketplace, Syed said, which offers access to more than 500 live and ready-to-query data sets from over 140 third-party data providers and data-service providers.
“It’s curated for you, it’s being maintained, it’s in a nice format, so a lot of that legwork I was previously doing was taken out of the picture,” he said.
Before he made the leap to switch out that third-party data source, Syed did a deep dive on the Snowflake data to make sure it had the attributes they needed: case counts, deaths, and a county identifier, specifically the FIPS code. Once he implemented it, he was able to go into the notebooks where he maintained the pipeline code, which had needed many different components to work with the third-party data, and remove major parts to simplify it. The data is also available to stakeholders in the company.
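A check like the one Syed describes might look something like the sketch below, which confirms that sample rows carry the three required attributes and that each county identifier looks like a five-digit FIPS code. The column names and the `validate_rows` helper are hypothetical; the article does not describe the schema of the Snowflake data set.

```python
# Hypothetical column names; the real data set's schema is not given.
REQUIRED_COLUMNS = {"cases", "deaths", "fips"}


def validate_rows(rows: list[dict]) -> list[str]:
    """Return a list of problems found; an empty list means the sample passed."""
    problems = []
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
            continue
        # County FIPS codes are five digits: two for state, three for county.
        fips = str(row["fips"])
        if not (len(fips) == 5 and fips.isdigit()):
            problems.append(f"row {i}: bad FIPS code {fips!r}")
    return problems
```

The FIPS check is worth having because leading zeros are easily lost when codes are stored as integers, which turns a valid code such as 06037 into 6037.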
They’re planning on using the marketplace for demographic and marketing-related enriched data sets in the future, such as a curated U.S. Census data set, for marketing use cases including lifetime value (LTV) and recency, frequency, monetary (RFM) analysis, as well as projects aimed at understanding customers, building personalized journeys, and more.
“The key takeaway technically is that not only do you have a win from the curated, reliable data sets being readily available to your customer, and to your stakeholders, but you’ve also streamlined the process so it’s less of a headache to maintain for you as a data scientist,” Syed said. “That’s been really satisfying.”