The Data Science ROI Of Moving To The Data Lakehouse
Many CDOs/CDAOs (and even CMOs) have raised some interesting questions about these three data management and data science titans: Snowflake, Databricks, and DataRobot. Some of the questions include:
- Can Snowflake replace the CDP?
- Do I need Databricks if I have ML Ops capabilities or other capabilities that are similar?
- What is the difference between DataRobot and Databricks?
- How do we prevent the siloing of marketing data and smooth access by other domains?
Later in this post, I’ll be discussing these questions with Ed Lucio, a data science expert at Spark (the New Zealand telecom provider) and former lead data scientist for ASB Bank. We’ll give our POV on these questions and highlight a few data analytics use cases that these tools can drive once they are in place. I would love to hear from you regarding other use cases and your experiences with data lakehouses.
Before diving into my conversation with Ed, a quick overview of environments and tools…
Types Of Storage Environments
We, as an industry, have gone from the data warehouse to data lakes, and now to data lakehouses. Here’s a brief summary of each.
The data warehouse: Closed format, good for reporting. Very rigid data models that require moving data, and ETL processes. Most cannot handle unstructured data. Most of these are on-prem and expensive and resource-intensive to run.
The data lake:
- Handles ALL data, supporting data science and machine learning needs. Can handle data with structure variability.
- Difficult to append data, modify existing data, or stream data
- Costly to keep history
- Metadata too large
- File-oriented architecture impacting performance
- Poor data quality
- Data duplication – hard to implement BI tasks, leading to two data copies: one in the lake, and another in a warehouse, often creating sync issues.
- Requires heavy data ops infrastructure
The data lakehouse: Merges the benefits of its predecessors. A transactional layer on top of the data lake lets you do both BI and data science on one platform. The data lakehouse addresses the data lake issues listed above while supporting structured, unstructured, semi-structured, and streaming data.
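To make the idea of a transactional layer more concrete, here is a minimal Python sketch under my own simplifying assumptions. The class name `ToyTableLog`, the `_log` folder, and the file naming are all hypothetical, and this is not how Snowflake, Databricks, or any real lakehouse format is implemented. It only illustrates the principle: a data file becomes visible to readers when a small log entry is published, so readers see either all of a write or none of it.

```python
import json
import os
import tempfile
import uuid


class ToyTableLog:
    """Toy append-only commit log over a folder of data files (illustration only)."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Log entries are named 00000.json, 00001.json, ...
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def commit(self, rows):
        """Write a data file first, then publish it with a single log entry."""
        version = len(self._versions())
        data_name = f"part-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.root, data_name), "w") as f:
            json.dump(rows, f)
        entry = os.path.join(self.log_dir, f"{version:05d}.json")
        # Mode "x" fails if the entry already exists, so two writers
        # cannot both publish the same version number.
        with open(entry, "x") as f:
            json.dump({"add": data_name}, f)
        return version

    def read(self, as_of=None):
        """Return rows from committed files only, optionally as of an older version."""
        rows = []
        for version in self._versions():
            if as_of is not None and version > as_of:
                break
            with open(os.path.join(self.log_dir, f"{version:05d}.json")) as f:
                data_name = json.load(f)["add"]
            with open(os.path.join(self.root, data_name)) as f:
                rows.extend(json.load(f))
        return rows


root = tempfile.mkdtemp()
table = ToyTableLog(root)
table.commit([{"id": 1}])
table.commit([{"id": 2}])

# A data file that was written but never committed stays invisible to readers.
with open(os.path.join(root, "part-orphan.json"), "w") as f:
    json.dump([{"id": 99}], f)

print(table.read())         # rows from both committed versions
print(table.read(as_of=0))  # the table as it looked after the first commit
```

Because every version is kept in the log, reading an older state of the table is just a matter of stopping early, which is one way a log-based design can make history cheaper to keep than copying whole datasets.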
Current Data Environments and Tools
The following tools summary comes from my experience deploying the tools as a CDO/CDAO and executive general manager, not as an architect or engineer. It is a synopsis of the top-line features of each; if you want to add your own experience with the features, please reply to the post and add to the synopsis.
What is Snowflake?
Snowflake is a highly flexible cloud-based big data warehouse with some unique and specialized data security capabilities that allow businesses to transition their data to the cloud as well as to partner and share data. Snowflake has made much progress in building partnerships, APIs, and integrations. One interesting possibility that marketers may want to consider is that Snowflake can be used directly as the CDP, with campaign data activated through a number of its partners. See their website for more details.
Snowflake is a data lakehouse that, like its competitors, is indifferent to structure variability and can support structured, semi-structured, and unstructured data. For me, its uniqueness is several-fold:
- Ability to create highly secure data zones (a key strength) – You can set security at the field and user level. Strong partners include Alation and Hightouch (a reverse ETL tool).
- Ability to migrate structured and SQL-based databases to the cloud.
- Ability to build unstructured data in the cloud for new data science applications.
- Ability to use Snowflake in a variety of contexts as a CDP or a marketing subject area. If Snowflake becomes your CDP, you save the expense and other issues of having multiple marketing subject areas.
Many organizations today are using data clouds to create a single source of truth. Snowflake can ingest data from any source, or format, using any method (batched, streaming, etc.), from anywhere. In addition, Snowflake can provide data in real time. Overall, it is good practice to have the marketing and analytics environments reside in one place such as Snowflake. Many times, as you generate insights, you want to operationalize them into campaigns, so having both in one CDP environment improves efficiency. High-touch marketers, supported by their data analytics colleagues and Snowflake, can activate their data and conduct segmentation and analysis all in one place. Snowflake data clouds enable many other use cases:
- One version of the truth.
- Identity resolution can live in the Snowflake data cloud. Native integrations include Acxiom, LiveRamp, Experian, and Neustar.
- You don’t have to move your data, so you increase consumer privacy with Snowflake. There are advanced security and PII protection features.
- Clean room concept: No need to match PII to other data providers and move data. Snowflake has a media data cloud, so working with media publishers who are on Snowflake (such as Disney ad sales and other advertising platforms) simplifies targeting. As a marketer, you can work with publishers who built their business models on Snowflake without exposing PII, etc. Given the transformation that is happening due to the death of the third-party cookie, this functionality/capability could be quite impactful.
What is Databricks?
Databricks is a large company that was founded by some of the original creators of Apache Spark. A key strength of Databricks is that it’s an open unified lakehouse platform with tooling that helps clients collaborate, store, clean, and monetize data. Data science teams report that the collaboration features are incredible. See the interview below with Ed Lucio.
It supports data science and ML, BI, real-time, and streaming activities:
- It’s software as a service with cloud-based data engineering at its core.
- The lakehouse paradigm allows for every type of data.
- Few or no performance issues.
- Databricks uses a Delta Lake storage layer to improve data reliability, using ACID transactions, scalable metadata, and table-level and row-level access control (RLAC).
- Able to specify the data schema
- Delta Lake allows you to do SQL Analytics, an easy-to-use interface for analysts.
- Can easily connect to Power BI or Tableau.
- Supports workflow collaboration via Microsoft Teams connectivity.
- Azure Databricks is another version for the Azure Cloud.
- Databricks allows access to open-source tools, such as MLflow, TensorFlow, and more.
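As a rough illustration of what specifying a schema buys you in terms of data reliability, here is a small Python sketch. It is not the Delta Lake API; the names `SCHEMA`, `validate`, and `append_rows` are hypothetical, and the all-or-nothing batch rule is my own simplification of the idea that a bad write should not leave a table half-updated.

```python
# Hypothetical declared schema: field name -> expected Python type.
SCHEMA = {"customer_id": int, "email": str, "spend": float}


def validate(record, schema):
    """Return a list of problems; an empty list means the record matches the schema."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            got = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {got}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def append_rows(table, rows, schema):
    """All-or-nothing append: reject the whole batch if any row is invalid."""
    errors = {}
    for i, row in enumerate(rows):
        problems = validate(row, schema)
        if problems:
            errors[i] = problems
    if errors:
        raise ValueError(f"batch rejected: {errors}")
    table.extend(rows)


table_rows = []
good = {"customer_id": 1, "email": "a@example.com", "spend": 10.0}
bad = {"customer_id": "2", "email": "b@example.com"}

append_rows(table_rows, [good], SCHEMA)
try:
    append_rows(table_rows, [good, bad], SCHEMA)
except ValueError as err:
    print(err)

print(len(table_rows))  # the mixed batch was rejected as a whole
```

The point for a data leader is less the code than the behavior: bad records are stopped at write time with a clear reason, rather than being discovered later in a dashboard or a model.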
Based on managing data scientists and large analytics teams, I would say that Databricks is preferred over other tools due to its interface and collaboration capabilities. But as always it depends on your business objectives in terms of which tool you select.
What is DataRobot?
DataRobot is a data science tool that can also be considered an autoML approach: it automates data science activities and thus furthers the democratization of machine learning and AI. The automation of the modeling process is excellent. This tool is different from Databricks, which deals with data collection and other tasks. It helps fill the gap in skill sets given the shortage of data scientists. DataRobot:
- Builds machine learning models rapidly.
- Has very robust ML Ops to deploy models quickly into production. ML Ops brings the monitoring of models into one central dashboard.
- Creates a repository of models and methods.
- Allows you to compare models by methods and assess the performance of models.
- Easily exports scoring code to connect the model to the data via an API.
- Offers a historical view of model performance, including how the model was trained. (Models can easily be retrained.)
- Includes a machine learning resource to manage model compliance.
- Has automated feature engineering; it stores the data and the catalog.
Using Databricks and DataRobot together helps with both data engineering AND data science.
Now that we have a level set on the tools and vendors in the space, let’s turn to our interview with Ed Lucio.
Interview With Ed Lucio
Many firms struggle to deploy machine learning and data operations tools in the cloud and to get the data needed for data science into the cloud. Why is that? How have you seen firms resolve these challenges?
Ed, could you unpack this one in detail? Thanks, Tony.
From my experience, the challenge of migrating data infrastructure and deploying cloud-based advanced analytics models is more common in traditional/larger organizations that have at least one of the following: significant processes built on top of legacy systems; vendor/tooling lock-in; a ‘comfortable’ position where cloud-based advanced analytics adoption is not an immediate need; or an information security team that is not yet well versed in how this modern technology aligns with their data security requirements.
However, in progressive and/or smaller organizations where there is an alignment from senior leaders down to the front line (coupled with the immediate need to innovate), cloud-based migration for infrastructures and deployment of models is almost natural. It is scalable, more cost-effective, and flexible enough to adjust to dynamic business environments.
I have seen some large organizations resolve these obstacles through strong senior leadership support where the organization starts building cloud-based models and deploys smaller use cases with less critical components for the business. The objective is just to prove the value of the cloud first, then once a pattern has been established, the company can scale up to accommodate bigger processes.
Why is Databricks so popular as one tool category (Data Ops/ML Ops), and what does their lakehouse concept give us that we could not get from Azure, AWS, or other tools?
How does Databricks help data scientists?
What I personally like about Databricks is the unified environment that helps promote collaboration across teams and reduces overhead when navigating through the “data discovery to production” phases of an advanced analytics solution. Whether you belong to the Data Science, Data Engineering, or Insights Analytics team, the tool provides a familiar interface where teams can collaborate and solve business problems together. Databricks provides a smooth flow from managing data assets, through data exploration, dashboarding, and visualization, to machine learning model prototyping, experiment logging, and code and model version control. When the team deems the models ready, deploying them through a job scheduler and/or an API endpoint is just a couple of clicks away, and it is flexible enough to meet business needs for either batch or real-time scoring. Lastly, it is built on open-source technology, which means that when you need help, the internet community will almost always have an answer (if not, your teammates or the Databricks Solutions Architect are there to assist). Other sets of cloud tools can provide similar functionality, but I haven’t seen one as seamless as Databricks.
On Snowflake, AutoML tools, and others, how do you view these tools, and what is your view on best practices?
Advanced analytics discovery is a journey where you have a business problem, a set of data and hypotheses, and a toolkit of mathematical algorithms to play with. For me, there is no “unicorn” tool (yet) in the market able to serve all the data science use-case needs for the business. Each tool has its own strengths, and it takes some tinkering to work out how every component fits into the puzzle of achieving the end business objective. For instance, Snowflake has nice features for managing the data asset in an organization, while the AutoML tooling (DataRobot/H2O) is good for automated machine learning model building and deployment.
However, even before proceeding to create an ML model, an analyst would need to explore the dataset for quality checks, understand relationships, and test basic statistical hypotheses. The data science process is iterative, and organizations would need the tools to be linked together so that interim outputs are communicated to the stakeholders to pivot or confirm the initial hypothesis, and to share the outputs with the wider business to create value. Outputs from each step would typically need to be stitched together to get the most from the data. For example, a data asset could be fed into a machine learning model where the output would be used in a dashboarding tool for reporting. Then, the same ML model output could be further enhanced by business rules and another ML model to be fit for purpose on certain use cases.
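The stitching described above can be sketched in a few lines of Python. Everything here is hypothetical: `churn_score` is a hand-written stand-in for a trained model, the customer fields and the 0.5 threshold are invented for illustration, and the final count stands in for what a dashboarding tool would report.

```python
from collections import Counter


def churn_score(customer):
    """Hypothetical stand-in for a trained model: returns a score in [0, 1]."""
    score = 0.2 + 0.1 * customer["complaints"] - 0.05 * customer["tenure_years"]
    return min(max(score, 0.0), 1.0)


def apply_business_rules(customer, score):
    """Overlay business rules on the model output, e.g. never target opted-out customers."""
    if customer["opted_out"]:
        return "exclude"
    return "retention_offer" if score >= 0.5 else "no_action"


# Step 1: the data asset (a tiny in-memory sample).
customers = [
    {"id": 1, "complaints": 4, "tenure_years": 1, "opted_out": False},
    {"id": 2, "complaints": 0, "tenure_years": 6, "opted_out": False},
    {"id": 3, "complaints": 5, "tenure_years": 2, "opted_out": True},
]

# Step 2: the model output is attached to each record.
scored = [{**c, "score": churn_score(c)} for c in customers]

# Step 3: business rules refine the model output for a specific use case.
actions = [{**c, "action": apply_business_rules(c, c["score"])} for c in scored]

# Step 4: an aggregate that a reporting dashboard could display.
summary = Counter(row["action"] for row in actions)
print(summary)
```

Each step produces an interim output that can be shared with stakeholders on its own, which is why the tools behind each step need to be linked rather than isolated.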
On top of these, there must be a proper change control environment governing code and model versioning and the promotion of code and models across development, pre-prod, and prod environments. After the ML model is deployed, it needs to be monitored to ensure that model performance stays within a tolerable range and that the underlying data has not drifted from the training set.
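One common way to check whether data has drifted from the training set is the Population Stability Index (PSI). The sketch below is a minimal pure-Python version under stated assumptions: equal-width bins taken from the training sample, a small floor to avoid dividing by zero, and made-up sample data. The thresholds in the closing note are a widely quoted rule of thumb, not a standard, and monitoring products may compute drift differently.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all training values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the training range fall into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up feature values: the training sample and two "production" samples.
train = [i / 100 for i in range(100)]
same = list(train)
shifted = [x + 0.5 for x in train]

print(psi(train, same))     # identical distributions give zero
print(psi(train, shifted))  # a large shift gives a large PSI
```

A frequently cited rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a shift worth investigating, but the tolerable range should be agreed with the business for each model.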
Are there any tips and tricks you’d recommend to leaders in data analytics (DA) or data science (DS) to help us evaluate these tools?
Be objective when evaluating the data science tooling and work with the data science and engineering teams to gather requirements. Design the business architecture which supports the organization’s goals, then work backward together with the enterprise architect and platform team to see which tools would enable these objectives. If the information security team objects to any of the candidate tools, have a solution-oriented mindset to find alternative configurations to make things work.
Lastly, strong support from the senior leadership team and business stakeholders is essential. A strong focus on the potential business value always comes in handy when making the case for enabling the data science tools.
What is the difference between a data engineer and a data scientist, and an ML engineer (in some circles referred to as a data science data engineer)? Is it where they report or do they have substantial skill differences? Should they be on the same team? How do we define roles more clearly?
I see the data engineers, ML engineers, and data scientists being part of a wider team working together to achieve a similar set of objectives: to solve business problems and deliver value using data. Without going into too much detail:
- Data engineers build reliable data pipelines to be used by insight analysts and data scientists.
- Data scientists experiment (i.e., apply the scientific process) and explore the data on hand to build models addressing business problems.
- ML engineers work collaboratively with the data scientists and data engineers to ensure that the developed model is consumable by the business within the acceptable range of standards (e.g., batch or real-time scoring? Will the output be surfaced in a mobile app? What is the acceptable latency?).
Each of these groups has its own set of specialized skills, but at the same time, they should share a common understanding of how their roles work side by side.
Many thanks to Ed Lucio, Senior Data Scientist at Spark in New Zealand, for his contributions to this article.
In summary, this article has provided a primer on data lakehouses and three of what I consider to be the leading-edge tools in the cloud data lakehouse and machine learning space. I hope Ed Lucio’s POV on the tools and their importance to data science was helpful to those considering their use. At the end of the day, all of this—selection of environments and tools—depends on the business needs and goals: what are the problems that need solving, as well as the level of automation the business is driving towards.
As always, I’d love to hear about your experiences. What has your experience been with data lakehouses, ML Ops, and data science tooling? I look forward to hearing from you regarding this post.