Debunking 5 Myths on Adopting Databricks

Eshwaran Venkat
11 min read · Dec 24, 2023



This article is primarily for any teams within companies that are performing a survey or due diligence for their data engineering, analytics, machine learning or business intelligence needs. I’m not affiliated with Databricks, and all thoughts and opinions in the article are my own, based on tech stack evaluations at our company.

About Databricks

Databricks is a company founded by the creators of Apache Spark, one of the major frameworks for big data processing in the modern era, born out of UC Berkeley. They still maintain a major portion of the Spark open source project, and have created and continue to maintain additional open source tooling such as Delta Lake, a data storage format, and MLflow, a tool for tracking machine learning experiments.

When using Databricks, all data is stored within your company's cloud provider, and all compute used to transform and work on that data comes from your company's own cloud resources. As such, your data never leaves your tech ecosystem (without your explicit directive).

The Myths / Misconceptions About Databricks

  1. It’s specifically built only for MS Azure 🧢
  2. It’s a Jupyter notebook with add-ons 🧶
  3. It’s only for big enterprises with big data 👔
  4. It’s no good if you don’t use Apache Spark ⚡️
  5. It’s cost compared to alternatives is unjustifiable 💰

Databricks is only on Azure 🧢

While Azure Databricks is indeed a Microsoft Azure service, Databricks can be deployed on AWS and Google Cloud as well, and has no Azure dependencies when run elsewhere. Deployment is straightforward because Databricks maintains IaC that can run the platform out of your cloud without tedious configuration (for example, using existing CloudFormation templates to get fully set up on AWS).

Microsoft offers a cloud service with “Databricks” in the name, but Databricks is the one that’s running it ~ CNBC

Databricks even provides cloud-specific documentation for each provider (like S3 for AWS, and GCS for Google Cloud). Image by Author. Screenshot from docs.databricks.com

Microsoft has invested significant capital in Databricks, which is part of the reason why Azure Databricks exists. Databricks features, services and support, however, are very much apples-to-apples on AWS and Azure, with minor caveats on GCP. AWS and other companies have also recently taken part in Databricks' funding rounds, so the stakeholder portfolio is certainly getting diverse.

Databricks is a Jupyter Notebook with Frills 🧶

Databricks offers a Jupyter-Notebook-like interface to develop code within their platform, and they have improved on the interface in multiple ways, such as Google-Docs-like live editing of notebooks, highlighting code and leaving comments, a notebook section browser and more. However, the core philosophy of Databricks is to have all aspects of your company's data lifecycle in one place, on one managed platform. Three (of many) additional features include:

Workflows: A fully managed orchestration tool to run, manage and track jobs. Each task of a job can be a notebook, SQL query, event, etc., and tasks can be put together to run batch or streaming pipelines.

Unity Catalog: A managed view of the data lake as a set of coherent tables, schemas and catalogs with lineage information and controlled access for data governance.

Databricks SQL: A SQL interface for querying data stored in Delta Lake format on object stores like S3, GCS or Azure Blob Storage, in a fast and clean way. Databricks SQL extends ANSI SQL with useful functions that make slicing and dicing data a lot easier, and even incorporates frameworks like H3 directly into its SQL language reference, usable on large swaths of data with plain SELECT statements.
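For a flavour of what that looks like in practice, here is a minimal sketch run from a Databricks notebook via spark.sql (the spark session is pre-created in notebooks). The catalog, table and column names are hypothetical, and h3_longlatash3 assumes a recent Databricks runtime where the built-in H3 functions are available.

```python
# Hypothetical example: bucket sales events into H3 hexagons (resolution 7)
# and aggregate per hexagon. Table and column names are made up.
hex_sales = spark.sql("""
    SELECT
        h3_longlatash3(longitude, latitude, 7) AS h3_cell,   -- built-in H3 function
        COUNT(*)                               AS num_events,
        SUM(sale_amount)                       AS total_sales
    FROM main.analytics.sales_events
    GROUP BY h3_cell
    ORDER BY total_sales DESC
""")

display(hex_sales)  # renders as an interactive table in a Databricks notebook
```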

Databricks is only for Big Data & Big Enterprises 👔

While Databricks is already in use by large enterprises, including many Fortune 500s, the goal here is to assess its viability for startups or small data teams. At Dotlas, we've been using Databricks for a while now, and I'm personally of the opinion that choosing and evaluating options like Databricks is just as crucial as getting set up with platforms like AWS or Azure when starting out.

If you’re a company that has an application or platform that produces data in one or many databases at the beginning, that’s still good impetus for executives to start pushing for basic reporting. It’s preferable to use a platform like this instead of rebuilding the wheel, or even worse: thinking that software and data engineering are identical, and underestimating the problem in terms of getting a basic reporting pipeline and dashboard setup.

The Databricks pay-as-you-go model can help you get set up with a grassroots data lake ecosystem that scales up as your business grows, while keeping initial costs at no more than a couple hundred USD per month.

Databricks can’t be adopted without up-skilling Apache Spark ⚡️

The short answer is yes, Databricks is heavily built on top of Spark, and when you're not using Spark, you're not harnessing the full capability of Databricks. But here's the catch: when starting out — or migrating from an existing solution — you don't always need to go full throttle, and there are a number of reasons for this.

  1. Databricks uses a Spark runtime, but a lot of it is abstracted away

For SQL queries within your data lake, Spark knowledge isn't necessary, as the underlying engine uses Spark SQL. Query efficiency depends on how your Delta tables are laid out, but features like liquid clustering can enhance performance without manual partition tuning. You can query in SQL as if you're using a regular database, with full table and schema support.
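As a rough illustration of "SQL as if it were a regular database", here is a sketch that creates a Delta table with liquid clustering and queries it. The table and column names are made up, and the CLUSTER BY clause assumes a runtime recent enough to support liquid clustering.

```python
# Hypothetical table and columns; no Spark-specific knowledge needed to write this.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.daily_orders (
        order_id    STRING,
        ordered_at  TIMESTAMP,
        city        STRING,
        amount      DECIMAL(10, 2)
    )
    CLUSTER BY (ordered_at)  -- liquid clustering instead of hand-tuned partitions
""")

top_cities = spark.sql("""
    SELECT city, SUM(amount) AS revenue
    FROM main.sales.daily_orders
    WHERE ordered_at >= current_date() - INTERVAL 30 DAYS
    GROUP BY city
    ORDER BY revenue DESC
    LIMIT 10
""")
```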

For complex pipelines involving Python, Scala, or R, Spark isn't mandatory. Tools like Pandas or Polars are viable for tasks like processing a 12M row dataset, especially with a large machine (32GB RAM) and optimized techniques like vectorized functions. Databricks also offers a smooth transition between Pandas and Spark DataFrames, with compatible data types on both sides.
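As a minimal sketch of that transition (reusing the hypothetical table from above), a small pipeline can read via Spark, do the heavy lifting in Pandas, and hand the result back to Spark to persist:

```python
import pandas as pd

# Read a Delta table via Spark, but do the actual transformation in Pandas.
# Catalog/table/column names here are hypothetical.
sdf = spark.read.table("main.sales.daily_orders")
pdf = sdf.toPandas()  # workable for low millions of rows on a 32 GB single node

# Plain, vectorized Pandas work: cast timestamps, drop bad rows, aggregate
pdf["ordered_at"] = pd.to_datetime(pdf["ordered_at"], utc=True)
pdf = pdf[pdf["amount"] > 0]
city_revenue = (
    pdf.groupby("city", as_index=False)["amount"]
       .sum()
       .rename(columns={"amount": "revenue"})
)

# Hand the result back to Spark to write it out as a Delta table
(
    spark.createDataFrame(city_revenue)
         .write.mode("overwrite")
         .saveAsTable("main.sales.city_revenue")
)
```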

However, Spark becomes crucial for handling large batch or streaming workloads at substantial data volumes.

  2. For basic pipelines, don't use multi-node clusters

Let's say that you have a use-case where you have to transform sales data and store it in the warehouse. You may want to perform some aggregations, timestamp casting, fixing of erroneous sales values and more. Perhaps your daily sales ingestion volume is a (1 Million x 25 Col) dataset. You may be tempted to bust out your Jupyter notebook and write a Pandas transformation, or perhaps set up a job that loads this into a Postgres or MySQL database and then use SQL to do your thing.

Databricks will by default recommend multi-node clusters for the job. You can define a cluster configuration per specific job or pipeline requirement, or spin up a cluster for development too. Here’s an example in the image — when you click “Create Compute” and view the default (customisable) configs:

Default Configuration when Creating a New Cluster. This can be customized to upsize or downsize the hardware requirement. Image by Author.

It starts with 2–8x 30.5 GB machines. However, you can customize the hardware needs for that job, and in this case — perhaps a single node (1x 16 GB) or (1x 32 GB) machine will suffice. Keep in mind that Pandas does not natively use additional machines or cores even if they're available, and Polars, while it does use multiple cores, won't use additional machines either. Databricks allows you to track cluster metrics such as memory and CPU usage as well, so that you can make more informed decisions about downsizing or upsizing.
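For reference, here is a sketch of what a single-node job cluster spec can look like. The spark_version and node_type_id values are assumptions that vary by cloud and runtime release; the part that matters is zero workers plus the single-node Spark configuration.

```python
# Hypothetical single-node job cluster spec (AWS node type shown as an example).
# Values like spark_version and node_type_id change over time; check your workspace.
single_node_cluster = {
    "spark_version": "14.3.x-scala2.12",   # pick a current LTS runtime
    "node_type_id": "m5.2xlarge",          # 1x 32 GB machine on AWS
    "num_workers": 0,                      # no worker nodes: the driver does the work
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

# This dict can be passed as the `new_cluster` of a job task, e.g. via the
# Databricks SDK, the Jobs REST API or a Terraform databricks_job resource.
```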

Default Job Cluster Recommendation by Databricks. Image by Author.

I've heard a story or two where prototyping Databricks for small jobs broke the bank due to the default configurations. Take this cluster sizing article by Databricks for example, where the cluster recommendations for data analysis are attached in the image (Omg!). While this could certainly be relevant for certain volumes, it largely depends on the size of the data you're working with, and our initial example of sales data would not require this.

Databricks recommended Cluster Sizing Configurations for Data Analysis. Source

Databricks' infrastructure recommendations tend to assume mid-size jobs, since its target market is companies with a threshold data volume and level of data-related activity. If you're willing to forego some support, talking to an account manager, etc., then you can use the pay-as-you-go model until you've scaled to the point where a dedicated plan or support becomes beneficial.

Databricks vs. Competition & Cost 💰

Understanding Databricks’ Cost Structure

With Databricks, you only pay for compute: specifically Databricks Units (DBUs), a unit of processing capability per hour, billed per second of usage. Databricks doesn't invoice you for storing your data, orchestrating workflows, lineage information, etc. Let's take some simple scenarios and unpack the cost factors.

  1. Scenario A: Running a data ingestion job for 32 minutes on AWS-hosted Databricks to ingest data from a Postgres DB in Australia and save it to a data lake in California. The cost factors would be:
    1. Databricks compute cost (DBUs) to run the AWS VM for 32 minutes
    2. AWS EC2 compute cost to run the VM for 32 minutes
    3. AWS data transfer costs between Australia and California
    4. AWS S3 storage costs for the ingested GBs in California

Keep in mind that there's only one VM on AWS: you pay Databricks through DBUs, and you pay AWS its standard VM charges for that same machine. The implication is not that you pay for two separate VMs.
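To make the shape of that bill concrete, here is a back-of-the-envelope sketch of Scenario A. Every rate below is an assumed placeholder (actual DBU rates depend on your plan, workload type and cloud; EC2, transfer and S3 prices depend on region and instance), so treat it as the structure of the calculation rather than real numbers.

```python
# Back-of-the-envelope cost model for Scenario A. All rates are ASSUMED
# placeholders -- look up your actual DBU, EC2, transfer and S3 prices.
runtime_hours   = 32 / 60   # the 32-minute ingestion job

dbu_per_hour    = 0.75      # assumed DBUs consumed per hour by the instance
dbu_price       = 0.15      # assumed USD per DBU for jobs compute
ec2_per_hour    = 0.384     # assumed USD/hour for the underlying VM

gb_transferred  = 5         # assumed GB moved from the Australian Postgres DB
transfer_per_gb = 0.09      # assumed USD/GB for cross-region data transfer
gb_stored       = 5         # assumed GB landed in S3 (California)
s3_per_gb_month = 0.023     # assumed USD/GB-month for S3 standard storage

databricks_cost = runtime_hours * dbu_per_hour * dbu_price   # paid to Databricks
ec2_cost        = runtime_hours * ec2_per_hour               # paid to AWS (same VM)
transfer_cost   = gb_transferred * transfer_per_gb           # paid to AWS
storage_cost    = gb_stored * s3_per_gb_month                # paid to AWS, monthly

total_one_off = databricks_cost + ec2_cost + transfer_cost
print(f"One-off run cost: ${total_one_off:.2f}, plus ${storage_cost:.2f}/month storage")
```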

  2. Scenario B: Training a machine learning model on a GPU in AWS-hosted Databricks, from data in Unity Catalog (Delta Lake). The cost factors would be:
    1. Databricks GPU compute cost (DBUs) for the training minutes
    2. AWS GPU compute cost for the same minutes
    3. AWS data transfer cost from S3 (Delta) to the AWS/Databricks VM, in GBs
    4. AWS S3 storage costs where the initial data is stored

Scenario B is interesting because you don't pay anything extra for the lineage insights and governance that Unity Catalog provides.

Comparison Overview

This section will further compare (non-meticulously, because that's for you to do):

  1. Databricks with Snowflake
  2. Databricks with Native Cloud Analytics Tools (AWS / GCP)
  3. Databricks with Open Source & Other Managed Tooling

Snowflake

Snowflake is probably one of the closest competitors to Databricks, and has seen a meteoric rise (partially due to effective marketing) in the last few years. Snowflake also bills customers for compute only. Databricks excels in advanced data processing, particularly for machine learning and data science tasks, leveraging Apache Spark. Snowflake, while also powerful, is more focused on data warehousing and SQL-based analytics, although it has recently introduced Snowpark to allow Python-based data science and big data engineering capabilities.

Snowflake's acquisitions of Streamlit, a popular data science web application development kit, and Ponder, a UC Berkeley School of Information startup born from the creators of Modin, suggest that it wants to play the data science platform game head-on with Databricks for market share. The larger industry almost treats Spark as a standard for big-data processing, which is Databricks' home turf. This is a bridge Snowflake (and its users) have to cross, and acquisitions of data processing frameworks like Ponder can be the first step towards building a big-data processing engine to rival Spark.

Both Snowflake and Databricks are considered leaders in this space by Gartner, placed on a tier right after the big tech giants:

Gartner Magic Quadrant for Data Analytics Providers December 2023. Source.

Native Cloud Tools

Each cloud provider has native tools for data ingestion, engineering, analysis and machine learning. Here are some examples:

  1. AWS: AWS Glue for ingestion, Athena for analysis, SageMaker for training models, Step Functions & Lambda for launching jobs.
  2. Azure: Data Factory (ADF) for data ingestion, integrating with various data stores. Synapse Analytics combines big data and data warehousing for analysis. Azure ML for building and training machine learning models. Logic Apps and Azure Functions for orchestrating and launching jobs.
  3. GCP: Dataflow for ingestion and data processing, particularly for stream and batch data. BigQuery for large-scale data analysis using SQL queries. AI Platform for training and deploying machine learning models. Cloud Functions and Cloud Composer for job orchestration and automation.

In contrast to Databricks' unified approach, AWS, Azure, and GCP offer a more compartmentalized suite of services, each tailored to specific data management tasks. While Databricks tends to offer a more streamlined workflow, the choice between it and native cloud services depends on your budget and team expertise. If your team is adept at managing cloud services (through certifications or experience), then these may be viable options. Furthermore, it's important to consider the user-friendliness of cloud services, especially for team members more directly involved in business operations. For example, accessibility is crucial when considering tools for team members who may not be as technically versed.

Hence, consider not just infrastructure costs, but also the savings in developer time, ease of maintenance and accessibility (coupled with security). For instance, setting up workflows in Databricks can be more straightforward than configuring AWS CloudWatch triggers and Lambda functions. Ultimately, the choice should align with your business goals, emphasizing efficiency and outcome.
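To illustrate that point, here is a hedged sketch of a two-task nightly workflow expressed as a Jobs API 2.1 payload. The notebook paths, cluster sizing and cron schedule are made up; the equivalent can also be built in the Workflows UI, the Databricks SDK or Terraform.

```python
# Hypothetical two-task nightly job, expressed as a Jobs API 2.1 payload.
small_cluster = {
    "spark_version": "14.3.x-scala2.12",   # assumed runtime; check your workspace
    "node_type_id": "m5.xlarge",
    "num_workers": 0,                      # single node: driver only
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

nightly_job = {
    "name": "nightly-sales-pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/data/ingest_sales"},
            "new_cluster": small_cluster,
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],   # runs after ingest succeeds
            "notebook_task": {"notebook_path": "/Repos/data/transform_sales"},
            "new_cluster": small_cluster,
        },
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # 02:00 daily
        "timezone_id": "UTC",
    },
}

# POST this payload to /api/2.1/jobs/create (or pass the equivalent arguments
# to the Databricks Python SDK / Terraform provider) to register the workflow.
```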

Open Source & Other Managed Services

Finally, there's been an explosion of data infrastructure and analytics products in the past few years that either start off as, or build on, an existing open source technology. You just need to search "Modern Data Stack" for web or image results and they come pouring in. The truth is, you do need to use some service: building a platform for your pipelines from scratch is often worse, almost akin to setting up your own data center in your garage in 2023. That service could be the in-built cloud offerings described in the last section, or an integrated open source or managed tool.

Some common open source examples include Mage, Airflow, dbt, Prefect, Dagster, Ploomber, Apache Superset, etc. Managed services include Airbyte, Fivetran, dbt Cloud, Census, Mode, Dataiku, Metabase and many more.

A lot of these tools focus on doing a specific part of the data lifecycle and they do it well. The choice of whether to use these or opt for a more comprehensive solution like Databricks depends on your unique requirements and willingness to invest in learning. Opting for a comprehensive platform like Databricks could streamline learning and application, as it integrates multiple features of these diverse tools into one.

Keep in mind that Databricks is also a managed service built on top of open source. Databricks has built some proprietary additions on top of these projects and is transparent about them in its documentation. Databricks as a company is also very agile in adding new features and re-working the platform based on user input. Just take a look at the volume of changes and additions they've shipped in 2023.

Ultimately, the decision hinges on prioritizing efficient business outcomes over the intricacy of development processes. Databricks, in my opinion, is geared more towards getting to results than towards tinkering with a myriad of tools along the way.
