
Fabric under the hood: the Big Bad Cluster

24 May 2026 · 9 min read

Microsoft Fabric has been out for a couple of years now. Have you ever wondered what is under the hood? There are lots of interesting things hiding in plain sight. Today, let’s take a look at Fabric’s Big Bad Cluster, a key component powering all of your Python/PySpark workloads in Fabric.

While doing my due diligence for this post, I did not find a single Google search result about this topic. As you’ll notice, the name Big Bad Cluster is also a guess, since it’s abbreviated everywhere as BBC.

Another disclaimer before I continue: as a Microsoft MVP, I have access to a lot of internal information under NDA, but everything shared in this post is based on publicly available information and my own experiments in Fabric. I have never discussed this topic with Microsoft staff, and all information is sourced from what is available in any Fabric Notebook. Every Fabric user can quite easily discover and verify the information shared in this post.

One last disclaimer: while I did find some evidence, this post is mostly speculation. I am quite confident that I got the bigger picture right, but some of the details might be wrong.

On a lazy Sunday afternoon…

I work a lot with the different Fabric runtimes and I wanted to better understand how they work. As you probably know, next to Python/PySpark, Scala, SQL, and a few other languages, you can also run plain shell commands in the underlying VM.

The %%sh magic command lets you run any bash command in the underlying VM. From there, you can explore all the files and processes running in the VM. You’ll quickly spot all the typical Spark components, but some components might not look familiar, even to experienced Spark architects. Let’s take a deeper look at the /etc/bbc/ folder.
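If you prefer to stay in Python rather than bash, the same exploration can be sketched with the standard library. This is a minimal sketch: the `list_tree` helper is my own illustration, and the only thing taken from the VM is the /etc/bbc path. Outside a Fabric session, the folder won't exist and the function returns an empty list.

```python
import os

def list_tree(root, max_depth=2):
    """Return relative paths under root, at most max_depth levels deep.

    Folders get a trailing path separator. Returns [] if root does not exist.
    """
    entries = []
    if not os.path.isdir(root):
        return entries
    root = os.path.abspath(root)
    for current, dirs, files in os.walk(root):
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        dirs.sort()  # makes the walk order deterministic
        for name in sorted(files):
            entries.append(os.path.normpath(os.path.join(rel, name)))
        for name in dirs:
            entries.append(os.path.normpath(os.path.join(rel, name)) + os.sep)
        if depth + 1 >= max_depth:
            dirs[:] = []  # stop os.walk from descending any further
    return entries

# /etc/bbc is the folder discussed in this post; adjust the path to explore others.
for path in list_tree("/etc/bbc"):
    print(path)
```

I'm deliberately not showing any output here, since the folder contents may differ per runtime version.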

What happens when you start a Python/PySpark notebook?

In every region where Microsoft Fabric is available, Microsoft keeps a pool of VMs running. If you use the starter pools, you’re assigned one or more of these VMs when you start a Python/PySpark notebook. It is my understanding that Microsoft pre-provisions them based on expected demand. At the time of writing, PySpark notebooks with Fabric Runtime 1.3 are the most popular ones, so there are a lot of these VMs sitting there waiting for you to use them. This is why your session starts so quickly: all Fabric has to do is link your session to one of these pre-provisioned VMs and you are good to go.

This means that there should be VMs sitting there waiting for you to use them for every type of VM you can choose:

A Python session in Fabric is basically just a single VM, similar to Spark’s driver node, but with different software installed and different configuration.

Tip: you can see which Python packages are preinstalled in your PySpark/Python session by following this blog post or inspecting this GitHub repository.

Background: Spark

Before we dive into the BBC, let’s quickly recap how Spark works. Spark is a distributed computing framework that allows you to process large amounts of data across a cluster of machines. It consists of a driver node that coordinates the execution of tasks across a cluster of worker nodes. The driver node is responsible for scheduling tasks, managing resources, and handling failures.

When you submit a Spark job, the driver node creates a DAG (Directed Acyclic Graph) of tasks that need to be executed. The tasks are then scheduled to run on the worker nodes, which execute the tasks and return the results to the driver node.
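To make the DAG idea a bit more tangible, here is a toy model in plain Python. It is not Spark's actual scheduler, and the task names are made up for illustration. It only shows the core idea: a task can run once everything it depends on has finished, and independent tasks can run in the same wave.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each (hypothetical) task maps to the set of tasks it depends on.
dag = {
    "read_orders": set(),
    "read_customers": set(),
    "join": {"read_orders", "read_customers"},
    "aggregate": {"join"},
}

def schedule_in_waves(dependencies):
    """Group tasks into waves; all tasks within a wave could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # mark them finished, which unlocks their dependents
    return waves

for number, wave in enumerate(schedule_in_waves(dag), start=1):
    print(f"wave {number}: {wave}")
```

In this toy graph, the two reads end up in the first wave, followed by the join and then the aggregation.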

Every Spark cluster needs a driver and at least one worker. In small Spark clusters, a single machine sometimes fulfills both roles. Personally, I find Spark overkill in that situation and advise you to consider alternatives like Polars or DuckDB if you’re using these kinds of clusters.

Spark architecture

If you want to expand your Spark knowledge, I recommend reading the book Spark: The Definitive Guide by Bill Chambers and Matei Zaharia, the creator of Spark and current CTO of Databricks. You can get it for free thanks to Databricks.

Microsoft codenames

While I have written extensively about how Microsoft Fabric is not just a rebranding, it did take a few components from Azure Synapse. How do I know? When you do the same kind of exploration in Azure Synapse and Fabric, you can quickly spot the internal codenames Microsoft uses for the bigger projects:

  • Project Arcadia: Azure Synapse Analytics
  • Project Trident: Microsoft Fabric
  • BBC: the cluster system in Azure Synapse and Microsoft Fabric that powers Spark workloads (presumably short for Big Bad Cluster)
  • Sibyl: the internal Microsoft system that collects health information from all the different components in Azure Synapse and Microsoft Fabric

The BBC is clearly linked to arcadia components, so it’s safe to assume that it is a component that was inherited from Azure Synapse, where we can find similar Spark pools as a feature for your Spark workloads.

Want to inspect this for yourself? Take a look at the /etc/bbc/node.json file in your Fabric VM. This also shows you the underlying type of Azure instance that the BBC is using in your session.
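A minimal sketch of how you could read that file from a Python cell follows. I'm not assuming anything about which keys the file contains, so the `summarize` helper (my own naming) just walks whatever JSON it finds. If the file is missing or unreadable, `load_node_info` returns None.

```python
import json
from pathlib import Path

def load_node_info(path="/etc/bbc/node.json"):
    """Return the parsed JSON document, or None if it is missing or invalid."""
    file = Path(path)
    if not file.is_file():
        return None
    try:
        return json.loads(file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None

def summarize(doc, indent=0):
    """Return one line per key: scalars show their value, lists only their length."""
    lines = []
    if not isinstance(doc, dict):
        return lines
    pad = " " * indent
    for key, value in doc.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(summarize(value, indent + 2))
        elif isinstance(value, list):
            lines.append(f"{pad}{key}: <list of {len(value)} items>")
        else:
            lines.append(f"{pad}{key}: {value!r}")
    return lines

node_info = load_node_info()
if node_info is None:
    print("No readable node.json found; are you inside a Fabric session?")
else:
    print("\n".join(summarize(node_info)))
```

Have a look at the output before sharing it anywhere, as I can't promise the file contains nothing session-specific.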

BBC, YARN, and ZooKeeper

As the name implies, the BBC is responsible for allocating machines to your Spark cluster or to your Python session. From what I understand, it’s tightly integrated with the Spark cluster manager.

Apache Spark itself is an open-source project, and it integrates with other open-source projects for cluster management so that it doesn’t have to reinvent the wheel.

These days, when you run your own Spark cluster, you’d typically set it up on top of Kubernetes and use Kubernetes as the cluster manager. When you submit a Spark job, the Spark driver program will request resources from Kubernetes to run the worker nodes.

However, a more classic Spark setup would use Apache Hadoop’s YARN (Yet Another Resource Negotiator) as the cluster manager. Some setups might even use Apache Mesos (now retired to the Apache Attic), but that’s less common. YARN is also what Synapse and Fabric use under the hood.

Another typical component in this context is Apache ZooKeeper, a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. It is often used in distributed systems to coordinate tasks and elect leaders. While YARN is the system that will allocate resources for your Spark cluster, ZooKeeper is what keeps the YARN ResourceManager highly available and coordinates the different components internally.

BBC architecture

The BBC is the component that will interact with YARN and ZooKeeper to allocate nodes for your Spark cluster and to keep track of the state of the cluster. It is responsible for managing the lifecycle of the Spark cluster, including starting and stopping the cluster, monitoring the health of the cluster, and handling failures.

Some other related components (also used in Fabric) not worth going deeper into right now:

  • Apache Hive metastore: a component that provides a centralized repository for metadata, such as table and partition definitions, about the data stored in your Spark cluster.

  • Apache Livy: a component that provides a REST interface for interacting with your Spark cluster. It is used to submit Spark jobs, monitor the status of jobs, and retrieve results from the Spark cluster.

Finally, to monitor all those components, the BBC also has a healthagent. This reports the health of the cluster and its components to another internal Microsoft system called Sibyl. Sibyl seems to be a self-healing system that quickly recovers dead components and keeps the overall system healthy.

How does it work?

This is largely speculation: we can only make assumptions based on the files available in our VMs.

There seems to be a PubSub agent running in the background. PubSub is used to communicate between the BBC and the cluster components. The PubSub agent polls every second for updates from the BBC (over HTTP) and relays commands over HTTP(S) to YARN, Livy, the Spark History Server, and the Jupyter Gateway.
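To illustrate the poll-and-relay pattern, here is a minimal sketch of how I imagine such an agent could work. Everything in it is an assumption on my side: the command format, the target names, and the function names are hypothetical, and plain Python callables stand in for the HTTP calls so the example runs anywhere.

```python
import time

def run_relay(fetch_commands, handlers, interval_seconds=1.0,
              max_polls=None, sleep=time.sleep):
    """Poll fetch_commands() and forward every command to its target's handler.

    A command is a dict with a "target" and a "payload" (hypothetical format).
    Returns the commands for which no handler was registered.
    """
    unrouted = []
    polls = 0
    while max_polls is None or polls < max_polls:
        for command in fetch_commands():
            handler = handlers.get(command["target"])
            if handler is None:
                unrouted.append(command)  # unknown component: keep for inspection
            else:
                handler(command["payload"])  # in reality: an HTTP(S) call
        polls += 1
        sleep(interval_seconds)  # the post observes a one-second interval
    return unrouted

if __name__ == "__main__":
    # Fake "BBC" queue: one command on the first poll, nothing on the second.
    pending = [[{"target": "yarn", "payload": "made-up-command"}], []]
    received = []
    leftover = run_relay(
        fetch_commands=lambda: pending.pop(0) if pending else [],
        handlers={"yarn": received.append},
        max_polls=2,
        sleep=lambda seconds: None,  # skip the real waiting in this demo
    )
    print("relayed:", received, "unrouted:", leftover)
```

Polling keeps the agent simple: the VM only ever makes outbound requests, which would fit nicely with pre-provisioned VMs that don't know yet which session they will serve. Again, that is my interpretation, not something I found documented.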

It also comes with something called ConfGen, which generates configuration files for all the Spark components. ConfGen has templates for all the configuration files and substitutes the correct values into them, similar to how Jinja works.
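The template idea can be sketched with Python's built-in `string.Template`. I don't know ConfGen's actual template syntax, so the placeholders and values below are made up. The two Spark property names are real Spark settings, but I'm not claiming these are the ones ConfGen fills in.

```python
from string import Template

# Hypothetical template: ${...} marks the values to fill in at generation time.
SPARK_DEFAULTS_TEMPLATE = Template(
    "spark.executor.memory ${executor_memory}\n"
    "spark.executor.cores  ${executor_cores}\n"
)

def render(template, values):
    """Fill in every placeholder; raises KeyError if a value is missing."""
    return template.substitute(values)

# Made-up values, purely for illustration.
print(render(SPARK_DEFAULTS_TEMPLATE, {"executor_memory": "28g", "executor_cores": 4}))
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing value fails loudly instead of leaving a raw placeholder in a generated config file.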

What did we learn from this?

There are a few interesting takeaways from this.

  • First off, I think Microsoft should discuss these topics more publicly. How awesome would it be if Microsoft had sessions about how Fabric actually works at conferences like FabCon? Databricks does these kinds of things a lot more, and I think this is partially why a lot of data/software/platform engineers are more excited about Databricks. Databricks also still has the reputation of being the creator of some of these components, which are now open-source and used by Microsoft, Google, AWS, and so on. Do you see Microsoft contributing these kinds of components to the open-source community? An open question…

  • Second, those codenames are a lot cooler than the typical product names the marketing team comes up with. Imagine telling your relatives you’re swinging Tridents at work! 😉

  • At the time of writing, my Python notebooks seem to be running on Azure Standard_E2ads_v5 nodes. My PySpark notebooks seem to be running on Standard_E8ads_v5 nodes (I have the default medium Spark node size configuration). These nodes use AMD’s third- and fourth-generation EPYC CPUs.

  • It gives us better insight into how Fabric’s Spark sessions are able to start so quickly and which components from the open-source world are actually running under the hood.

Before Fabric and Synapse existed, I set up and managed a few Spark clusters myself. It was a lot of work to set up and maintain, and it was not always reliable. With Fabric, Microsoft has taken care of all the heavy lifting for us, so we can just focus on our data and our code. It’s quite impressive how they have been able to build such a complex system and keep it running smoothly for all the users.

The arcadia references also explain why we don’t see Kubernetes being used anywhere. When Azure Synapse was introduced, Spark cluster management on Kubernetes was not as mature as it is now, so it made sense to use YARN.
