While some people are very familiar with Databricks, others might not know as much. We thought it would be a good idea to break down what Databricks is, explore what Databricks can do, who uses Databricks, and answer some commonly asked questions like: ‘what is a data lakehouse?’ and ‘what is a Databricks certification?’
What is Databricks?
Databricks is a single, cloud-based platform that can handle all of your data needs, which means it’s also a single platform on which your entire data team can collaborate. Not only does it unify and simplify your data systems, it’s also fast, cost-effective and scales inherently to very large data. Databricks is available on top of your existing cloud, whether that’s Amazon Web Services (AWS), Microsoft Azure, Google Cloud, or even a multi-cloud combination of those.
What is Databricks used for?
Many organisations currently run a complex mix of data lakes and data warehouses, with parallel “pipelines” to process data that comes in scheduled batches or real-time streams. And then they layer on top a variety of other tools for analytics, business intelligence or data science. With Databricks you no longer need all of that. You can just use Databricks. Using Databricks, you can:
• Pull all your data together into one place
• Easily handle both batched data and real-time data streams
• Transform and organise data
• Perform calculations on data
• Query data
• Analyse data
• Use the data for machine learning and AI
• And then generate reports to present the results to your business
You’ll see this idea referred to as the “data lakehouse”.
Or, if you prefer, you can use Databricks for just some of the activities above, mixing it with other technologies within your cloud data system. That’s often a way to get started and see what it’s capable of doing.
Who uses Databricks?
Large enterprises, small businesses and those in between all use Databricks. Some of Australia’s and the world’s most well-known companies, like Coles, Shell, Microsoft, Atlassian, Apple, Disney and HSBC, use Databricks to address their data needs quickly and efficiently. In terms of users, Databricks’ breadth and performance mean that it’s used by all members of a data team, including data engineers, data analysts, business intelligence practitioners, data scientists and machine learning engineers.
So, what exactly is Databricks? And what is it used for?
Databricks processes data
At its core, Databricks reads, writes, transforms and performs calculations on data. You’ll see this variously referred to in terms like “processing” data, “ETL” or “ELT” (which stands for “extract, transform, load” or “extract, load, transform”). They all basically mean the same thing.
That might not sound like a lot, but it is. Do this well, and you can undertake pretty much any data-related workload.
You see, this processing — these transformations and calculations — can be nearly anything. For example, they could be aggregations (e.g. counts, finding the maximum or minimum value), joining data to other data, or even something more complex like training or using a machine learning model.
To tell Databricks what processing to do, you write code. Databricks is very flexible in the language you choose: SQL, Python, Scala, Java and R are all options, and all of them are common skills among data professionals.
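To make that concrete, here’s a minimal PySpark sketch of the kind of code you might write in a Databricks notebook. The table and column names are made up for illustration, and the `spark` session is the one a notebook provides automatically.

```python
from pyspark.sql import functions as F

# Hypothetical tables: names and columns are illustrative only.
orders = spark.read.table("sales.orders")        # order_id, customer_id, amount
customers = spark.read.table("sales.customers")  # customer_id, region

# Join the two datasets and aggregate: count, total and largest order per region.
summary = (
    orders.join(customers, on="customer_id", how="inner")
          .groupBy("region")
          .agg(
              F.count("order_id").alias("order_count"),
              F.sum("amount").alias("total_sales"),
              F.max("amount").alias("largest_order"),
          )
)

summary.show()  # in a Databricks notebook you could also use display(summary)
```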
Databricks uses Apache Spark to process data
Sitting at the heart of Databricks is the engine that does this data processing: an open-source technology called Apache Spark. And this is no surprise. Spark is the dominant data processing tool in the world of big data, and Databricks was founded by the creators of Spark.
So why not just use Spark instead? Well, you can if you really want to. To do the data processing — to run Apache Spark — you’ll need a cluster of computers. That’s multiple computers (called “nodes”) working together, each with their own memory and each with multiple cores. The data is distributed and the tasks that form the data processing workload are performed in parallel across the nodes and their cores. This distributed and parallel design is critical for working with large data and for scaling into the future.
But spinning up, configuring, altering and maintaining a cluster is a pain. And installing, configuring, optimising and maintaining Spark is a pain too. It’s easy to spend your time and effort just looking after these, rather than focusing on processing your data, and thereby generating value. (And, yes, that includes using cloud virtual machines or cloud-native, managed Spark services.)
Databricks takes away that pain. Databricks allows you to define what you want in your clusters, and then looks after the rest. Clusters only come into existence when you need them and disappear when you’re not using them. Spark is already installed and configured. Databricks even auto-scales the clusters within your predefined limits, adding or removing nodes as the scale of the processing increases or decreases. It all means you can focus on your data processing, and therefore on generating value, rather than on managing and supporting the infrastructure.
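To give a sense of how little you have to specify, here’s a rough sketch of a cluster definition. The same handful of settings appears in the UI, the Clusters REST API and Terraform; the values below are placeholders, not recommendations.

```python
# Rough sketch of a Databricks cluster specification (values are placeholders).
cluster_spec = {
    "cluster_name": "etl-cluster",
    "spark_version": "13.3.x-scala2.12",   # a Databricks Runtime version
    "node_type_id": "i3.xlarge",           # cloud-specific VM type for each node
    "autoscale": {                          # Databricks adds or removes workers
        "min_workers": 2,                   # within these limits as load changes
        "max_workers": 8,
    },
    "autotermination_minutes": 30,          # shut the cluster down when idle
}
```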
Even better, the Spark that runs on Databricks is heavily optimised, as are the clusters that Databricks uses. This means that Spark runs faster and more efficiently on Databricks than anywhere else. (Remember, the Databricks folks are the very same ones who created Spark.)
Ok, so Databricks is essentially about processing data. It does it using the dominant data processing technology for big data. And it then runs that better than anywhere else. However, the real trick is that Databricks then builds on such a flexible and performant core to extend it into an entire data platform.
How’s Databricks different from a database or data warehouse?
Databases and data warehouses can process data too. But their engines are fundamentally designed to query data with low latency; basically, to be responsive when you ask questions of your data, particularly on smaller quantities of data.
Databricks, using Spark, is designed for throughput. It’s a workhorse that’s designed to process data at scale. To perform those transformations and calculations super-efficiently, and to shine as data gets large.
In addition, to improve its query performance, Databricks has introduced another engine called Photon, which is compatible with, and complementary to, Spark. Spark plus Photon is how Databricks covers the length of the data processing spectrum.
However, when comparing Databricks with databases or data warehouses, there’s another key difference: how and where your data is stored.
Databricks reads and writes data, but you control where and how your data is stored
A database or data warehouse not only processes your data using its own query engine, it also stores your data in its own format. You can only access that data through the database or data warehouse itself. And in some cases, once you put your data in there, you need to pay to read that data out.
Databricks doesn’t store data. (Granted, there are some subtleties here. But this statement, and the ones that follow, hold when implementing Databricks using best practices.)
Databricks reads data from storage and writes data to storage, but that storage is your own — depending on your cloud of choice, your data will be in Amazon S3, Azure Data Lake Storage Gen2 or Google Cloud Storage.
And Databricks doesn’t require the use of a proprietary data storage format: it uses open-source formats, although it can read from and write to databases too. The choice is yours.
The net result is that you always have full control of your data. You know exactly where it is and how it is stored. You’re not locked in either: if you want to access your data without using Databricks, then you can.
Databricks combines your data lake and data warehouse into the data lakehouse
Basic object storage, like that offered by the cloud providers, is super flexible. It’s how you make a data lake, which is one of the keys to having a successful data science and machine learning capability. But data lakes provide few guarantees and little robustness.
So, Databricks has developed and released its own open-source data storage format, called Delta Lake. Delta Lake builds on the open-source Apache Parquet storage format (which is Spark’s preferred storage format) by adding a “transaction log”, which is a list of all operations performed on your data. But the data itself remains in the well-known Parquet format, and can be accessed without using Databricks or even Spark.
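To illustrate that openness, here’s a small sketch that reads a Delta table straight from storage without Spark or Databricks, using the open-source `deltalake` Python package (delta-rs). The path is a placeholder and credential configuration is omitted.

```python
# Assumes the open-source delta-rs bindings: pip install deltalake pandas
from deltalake import DeltaTable

# Path to a Delta table in cloud object storage (placeholder path;
# storage credentials would normally be supplied as well).
table = DeltaTable("s3://my-bucket/bronze/events")

df = table.to_pandas()   # the underlying files are plain Parquet
print(table.version())   # current version, taken from the transaction log
print(df.head())
```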
Using Delta Lake provides “ACID compliance” (atomicity, consistency, isolation and durability) to your stored data. This means you get:
• Guarantees on reading and writing your data that you normally don’t get without database-style storage
• The ability to read and write batches of data and streams of real-time data to the same place
• Schema enforcement or modification, like you would with a database
• “Time travel”, which means you can read or revert to older versions of your data
Bottom line: With Delta Lake, Databricks can treat your data that sits in a data lake on cloud storage much like it’s in a data warehouse. You get the benefits of both the data lake and data warehouse. And so, Databricks allows you to combine the concepts of a data lake and data warehouse into the “data lakehouse”. It’s a very powerful concept and a great way of simplifying your data systems.
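A couple of those capabilities, sketched in PySpark with placeholder paths (and assuming `df` is an existing Spark DataFrame), look roughly like this:

```python
# Write a DataFrame as a Delta table (placeholder path)
df.write.format("delta").mode("overwrite").save("/mnt/lake/silver/customers")

# Time travel: read the table as it was at an earlier version
v0 = (
    spark.read.format("delta")
         .option("versionAsOf", 0)
         .load("/mnt/lake/silver/customers")
)

# Or revert the table to that earlier version entirely
spark.sql("RESTORE TABLE delta.`/mnt/lake/silver/customers` TO VERSION AS OF 0")
```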
If you read material from Databricks, including their website, you’ll see they’re big on the Lakehouse. Now you know why.
As important as Spark and Delta Lake are, Databricks is more than just those
On top of its data processing engine, Spark, and its preferred storage format, Delta Lake, Databricks has a variety of other features that allow you to make the most of your data.
It enables an end-to-end workflow for data science and machine learning projects. Databricks clusters can be spun up with machine learning packages and even GPUs for exploring data and training models. Data scientists and machine learning engineers write their code in interactive notebooks, which are similar to (but distinct from) Jupyter notebooks.
Databricks then supports the whole “MLOps” (DevOps for machine learning) lifecycle with another piece of integrated open-source software called MLflow, plus a slew of machine learning features packaged together under the banner of Databricks Machine Learning.
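As a flavour of what MLflow tracking looks like in practice, here’s a minimal sketch. The model, parameters and training data (`X_train`, `y_train`, `X_test`, `y_test`) are purely illustrative and assumed to exist already.

```python
import mlflow
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

with mlflow.start_run(run_name="illustrative-run"):
    # Record the parameters used for this experiment run
    params = {"n_estimators": 100, "max_depth": 6}
    mlflow.log_params(params)

    # Train a simple model (placeholder data assumed to exist)
    model = RandomForestRegressor(**params).fit(X_train, y_train)

    # Record how well it performed
    mae = mean_absolute_error(y_test, model.predict(X_test))
    mlflow.log_metric("mae", mae)

    # Log the trained model so it can be registered and served later
    mlflow.sklearn.log_model(model, "model")
```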
For data analysts and business intelligence professionals, Databricks also offers Databricks SQL. This is an interface and engine that looks and feels like a database or data warehouse interactive development environment. They can write SQL queries and execute them like they would against more traditional SQL-based systems.
From there, it’s even possible to build visuals, reports and dashboards. Or they can hook Databricks up to their preferred business intelligence tools like Power BI, Tableau or Looker.
There are heaps more features to Databricks that further round out its capabilities as an all-around data platform, and more are consistently being added. Conceptually, the goal is to make it the one place that a data team can go to do whatever data-related work they need to accomplish.
Databricks runs in the cloud
Databricks is available on top of your existing cloud, whether that’s Amazon Web Services (AWS), Microsoft Azure, Google Cloud, or even a multi-cloud combination of those. Databricks does not operate on-premises.
It uses the cloud providers for:
• Compute clusters. In AWS they’re EC2 virtual machines, in Azure they’re Azure VMs, and in Google Cloud the cluster runs in Google Kubernetes Engine.
• Storage. As mentioned earlier, Databricks doesn’t store data itself. Instead data is stored in native cloud storage. In AWS that’s S3, in Azure it’s Azure Data Lake Storage Gen2, and in Google Cloud it’s Google Cloud Storage.
• Networking and security. This includes integrating with your existing networks, identity and access management, and storing and accessing secrets.
If you want, you can connect and use Databricks with other cloud native tools and services. But it plays really well on its own too.
Once deployed and configured, your data team accesses a Databricks workspace through its own browser interface. You don’t need to go through a cloud console or the like. The team can effectively just do its work through Databricks and, in general, doesn’t need to know about the details of the cloud underneath.
Databricks is a single data platform for all your needs
Bringing all of this together, you can see how Databricks is a single, cloud-based platform that can handle all of your data needs. It’s the data lakehouse. It’s the place to do data science and machine learning.
Databricks can therefore be the one-stop-shop for your entire data team, their Swiss-army knife for data. A place where they can all collaborate, together, rather than using a complex mix of technologies.
It can unify and simplify your data systems, mixing all sorts of data that arrives in all sorts of different ways.
Plus, Databricks is fast, cost-effective and inherently scales to very large data. Done well, you can architect it once and then let it scale to meet your needs.
What data areas is Databricks able to support?
Databricks offers three important layers for working with data: data engineering, Databricks SQL, and Databricks Machine Learning.
Data Engineering
The data engineering layer focuses on simplifying data transportation — and transformation — with high performance.
Using the power of Apache Spark, Databricks supports both streaming and batch data processing use cases, with the results stored in Delta Lake format on your cloud provider’s data lake.
Thankfully, you don’t even need to learn a new language to use Spark. Databricks supports commonly used programming languages such as SQL, Python, Scala, Java, and R.
The Delta Lake format also supports atomicity, consistency, isolation, and durability (ACID) transactions, which ensure the integrity of the data as it moves through your pipelines. Data is then transformed through the use of Spark and Delta Live Tables (DLT). As soon as data is loaded into Delta Lake tables, it’s available for both analytical and AI use cases.
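To give a feel for Delta Live Tables, here’s a minimal sketch of a two-step pipeline. The path, table names and the data quality rule are purely illustrative, and this style of code only runs inside a DLT pipeline rather than a regular notebook.

```python
import dlt
from pyspark.sql import functions as F

# Ingest raw JSON files from cloud storage as a streaming "bronze" table.
@dlt.table(comment="Raw events ingested from cloud storage (illustrative)")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")          # Databricks Auto Loader
             .option("cloudFiles.format", "json")
             .load("s3://my-bucket/landing/events/")   # placeholder path
    )

# Clean the data into a "silver" table, dropping rows that fail a quality rule.
@dlt.table(comment="Cleaned events (illustrative)")
@dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")
def silver_events():
    return dlt.read_stream("bronze_events").withColumn(
        "ingested_at", F.current_timestamp()
    )
```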
Databricks SQL
Databricks SQL is reliable, simplified, and unified, allowing you to run SQL queries on your data lake to create simple data visuals and dashboards for sharing important insights. It also integrates with visualisation tools such as Tableau and Microsoft Power BI to query the most complete and recent data in your data lake.
Under the hood of Databricks SQL is an active server fleet, fully managed by Databricks, that can allocate compute capacity to user queries in minimal time. This means that, compared with traditional data warehouses, Databricks SQL can execute similar workloads up to six times faster.
Because Databricks SQL is a managed compute engine, it provides instant compute with minimal management and lower costs for BI and SQL — thanks to a central log that records usage across virtual clusters, users, and time.
Finally, not only can you connect your preferred business intelligence tools, Databricks SQL fetches your data in parallel, rather than through a single thread, reducing those pesky bottlenecks that slow down your data processing.
Databricks Machine Learning and Data Science
Traditional big data clusters can be rigid and inflexible, and often don’t allow for the experimentation and innovation necessary to uncover new insights. The Databricks Machine Learning platform combines services for tracking and managing experiments, trained models, feature development and management, and feature and model serving.
With Databricks Machine Learning, you can train models, track them using experiments, create feature tables, and share, manage, and serve models. Databricks also offers Databricks Runtime for Machine Learning, which includes popular machine learning libraries like TensorFlow, PyTorch, Keras, and XGBoost, as well as distributed training frameworks such as Horovod.
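As an illustrative sketch of the feature table side, here’s roughly what registering a feature table looks like using the Databricks Feature Store client. The table names, columns and feature logic are placeholders, and the exact client API has evolved across Databricks releases, so treat this as a shape rather than a recipe.

```python
from databricks.feature_store import FeatureStoreClient
from pyspark.sql import functions as F

fs = FeatureStoreClient()

# Hypothetical source table and feature logic (illustrative only)
customer_features = (
    spark.read.table("sales.orders")
         .groupBy("customer_id")
         .agg(
             F.count("order_id").alias("order_count"),
             F.avg("amount").alias("avg_order_value"),
         )
)

# Register the DataFrame as a feature table keyed on customer_id
fs.create_table(
    name="ml.customer_features",
    primary_keys=["customer_id"],
    df=customer_features,
    description="Basic per-customer order features (illustrative)",
)
```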
Who uses Databricks? And what do they use it for?
Databricks isn’t just for people who love data. Databricks helps everyone from Fortune 500 companies to government agencies and academics get the most out of the mountains of information available to them.
Companies such as Coles, Shell, ZipMoney, Health Direct, Atlassian and HSBC all use Databricks because it allows them to build and run big data jobs quickly and easily — even with large data sets and multiple processors running simultaneously.
For example, Shell uses Databricks to monitor data from over two million valves at petrol stations to predict ahead of time if any will break. This instant access to information, and AI-driven decision making, can save the company time and money, and allows it to provide a better experience for its customers.
In Australia, the National Health Services Directory uses Databricks to eliminate data redundancy. This ensures the quality, reliability, and integrity of their data while providing analytics that help improve forecasting and clinical outcomes in aged care and preventative health services.
Coles also uses Databricks as a central processing technology to enable data to be easily discoverable, streamed and used in real-time, and stored in one place. Having all this information on a unified platform has helped the supermarket chain reduce model training jobs from three days to just three hours.
What is a data lakehouse?
Rather than leaving you swimming in a whole lake of data, Databricks provides a data lakehouse: a place where all that information is organised in a way that combines the structure and data management features of a data warehouse with the low-cost, flexible storage of a data lake. It’s a happy medium between the two.
This data lakehouse holds a vast amount of raw data in its native format until it’s needed. It’s a great place for investigating, exploring, experimenting with, and refining data, in addition to archiving it. Similar to data lakes, this includes unstructured data like images, video, audio, and text, as well as semi-structured data like XML and JSON files.
The Databricks data lakehouse supports ACID transactions that ensure consistency when multiple parties read and write data at the same time. It also supports schemas for structured data, and implements schema enforcement to ensure that the data uploaded to a table matches the schema.
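To make schema enforcement concrete, here’s a small PySpark sketch. Table names are placeholders, and `new_rows` is assumed to be an existing Spark DataFrame whose columns don’t quite match the target table.

```python
# new_rows is assumed to be a Spark DataFrame with, say, an extra column
# compared with the existing Delta table.

# A plain append fails with an AnalysisException: schema enforcement at work.
new_rows.write.format("delta").mode("append").saveAsTable("lakehouse.customers")

# If the change is intentional, explicitly allow the schema to evolve instead.
(
    new_rows.write.format("delta")
            .mode("append")
            .option("mergeSchema", "true")
            .saveAsTable("lakehouse.customers")
)
```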
Because the data lakehouse runs on a cloud platform, it’s highly scalable. Storage resources are decoupled from compute resources, so you can scale each one separately to meet the needs of your workloads — from machine learning and business intelligence to analytics and data science.
Obviously, data is everywhere, and it’s only going to continue to grow. So, using technology to simplify this large amount of information is quickly becoming a necessity for businesses of all sizes. With Databricks, your data is set up for your imagination and success. Not only is it an easy-to-use and powerful platform for building, testing, and deploying machine learning and analytics applications, it’s also flexible, making your approach to data analysis so much more compelling.
Are there any more Databricks learning materials available?
Yes, in fact there are tonnes out there and it can be a bit overwhelming. We have done you a favour and curated a list of learning materials that we found useful when we started our Databricks journey and that we share with new employees. We’ll be able to share the link to this shortly 🙂
This list is kept up to date with the latest resources we find, so check back regularly. Or, if you sign up to our Databricks newsletter, we’ll keep you up to date with new Databricks information we’ve found useful and also let you know about any upcoming bootcamps we have.
What is a Databricks certification?
Databricks offers several courses to prepare you for its certifications, and you can choose from multiple certifications depending on your role and the work you will be doing within Databricks. We’re always happy to answer any questions you might have about Databricks, and we even run Databricks bootcamps to get you started – check out our events page here. For those looking to earn a Databricks certification, the Databricks Academy offers official Databricks training for businesses looking to gain a better understanding of the platform.
Within the Databricks Academy you’ll find custom-fit learning paths for multiple roles and careers, spanning e-learning and corporate training certifications, all aimed at training you to become a master of data and analytics. From learning the fundamentals of the Databricks Lakehouse to earning a data scientist certification, the Databricks Academy has learning paths for all roles, whether you’re a business leader or a SQL analyst. They even offer free training vouchers for partners and customers.
How long does it take to study for a Databricks Certification?
When you have a deadline for taking an exam, you have more reason and pressure to study. As a rough guide, five to seven weeks of preparation should set you up for a successful result, especially if you already have work experience with Apache Spark.
Databricks FAQs
What is Databricks?
Databricks is a cloud platform that simplifies complex data management. It’s available on Amazon Web Services (AWS), Microsoft Azure and Google Cloud, so it sits on top of whichever cloud your organisation already uses.
What is Databricks used for?
Databricks is used for building, testing, and deploying machine learning and analytics applications to help achieve better business outcomes.
Who are Databricks’ customers?
Some of the world’s largest companies like Shell, Microsoft, and HSBC use Databricks to run big data jobs quickly and more efficiently. Australian-based businesses such as ZipMoney, Health Direct and Coles also use Databricks.
What is a Databricks certification?
The Databricks Academy is the main source of all official Databricks training. There are various learning paths available, not only to provide in-depth technical training but also to allow business users to become comfortable with the platform. Best of all, free vouchers are also available for Databricks partners and customers.
What is a data lakehouse?
A data lakehouse holds a vast amount of raw data in its native format until it’s needed. It combines the structure and data management features of a data warehouse with the low-cost, flexible storage of a data lake. It’s a happy medium between the two, and much more efficient.
What is the Databricks Delta Lake?
Delta Lake is an independent, open-source project supporting Lakehouse architecture built on top of data lakes. Delta Lake enables ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Some of the organisations using and contributing to Delta Lake include Databricks, Tableau, and Tencent.
What is the best data lake software?
There are a variety of cloud data lake providers, each with its own unique offering. Determining which data lake software is best for you means choosing a service that fits your needs.
What is the difference between Databricks and Snowflake?
While similar in theory, Databricks and Snowflake have some noticeable differences. Databricks can work with all data types in their original format, while Snowflake requires that structure be added to your unstructured data before you can work with it. Databricks also focuses more on the data processing and application layers, meaning you can leave your data wherever it is, even on-premises, in any format, and Databricks can process it.
Like Databricks, Snowflake provides ODBC & JDBC drivers to integrate with third parties. However, unlike Snowflake, Databricks can also work with your data in a variety of programming languages, which is important for data science and machine learning applications.