Welcome to the Bacalhau Project, a new way to compute, manage, and use data generated anywhere. If you're reading this, chances are you're a data scientist, developer, or big data engineer. We created this project with you in mind, though we'd like to make it possible for anyone to work with data in this new age using familiar and powerful tools. Please read on to find out why we took on this challenge!
The Current State of Big Data
The growth of big data has shocked and impressed everyone. IDC projects that by 2025 we will be generating 175 zettabytes of data annually. That's 175 trillion gigabytes. That's a lot, and it's growing fast: the same report projects that we will soon be generating many times more data than we do today. Woah.
These troves of information contain critical insights to help us make better decisions and improve our world. Yet, they are quickly becoming too big and too decentralized to process. That's a problem.
What does that mean for you? Well, if you're a developer or a data scientist, you're probably already working with big data. If you're not, you probably will be soon. And wherever you are in that journey, it's highly likely you'll be consuming the results of big data pipelines. So getting these pipelines right, so that they run quickly and reliably and are easy to manage, is critical.
Challenges With Big Data
There are a few significant challenges with big data.
It’s hard to centralize: Unlike traditional data, which is generated and stored in a single location, big data is often generated across thousands of devices in many different places. These could be servers, end-user PCs, IoT sensors, mobile devices, or even external sources (e.g., vehicles or other services). Because information is generated in all these locations, the data itself is inherently decentralized, and simply moving it to a central location costs enormous amounts of time and money.
It’s hard to manage: With all these inputs, it's hard to keep track of what's going on. You need to monitor the health of your data pipeline and make sure it's running smoothly, scale it up and down as needed, and make changes to your pipeline without disrupting the rest of your system. This is especially true when you're working with a large number of inputs and need to manage them all from a central location. Even if you could process all your data on a single machine in the cloud, you'd still need to control and configure the tools you use, and this becomes exponentially harder as data generation expands beyond a single data center.
It’s hard to store: Most devices in the world can store the data they generate, often several times over. But when you generate data on hundreds or thousands of devices, the problem becomes much more complicated. For example, 1,000 devices each creating 1 GB of logs daily add up to 1 TB of new logs per day. So do 100 devices each running 10 services that log 1 GB a day, or 20,000 users each uploading 5 minutes of video (roughly 50 MB), or 1M devices each generating 1 MB of information. After 100 days at that rate, you're sitting on 100 TB, and moving that much data around takes roughly 24 hours even over a fast 10 Gbps link. And the data keeps coming, growing more and more expensive to hold and to move.
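To make the scale concrete, here's the back-of-the-envelope arithmetic in Python. Each scenario above works out to roughly 1 TB of new data per day; the 10 Gbps uplink is our assumption for illustration, not something any particular fleet guarantees:

```python
# Back-of-the-envelope estimate of fleet-wide data volume and transfer time.
# The fleet sizes mirror the examples above; the 10 Gbps link is an assumption.

GB = 10**9   # bytes
TB = 10**12  # bytes

# Each scenario works out to roughly 1 TB of new data per day.
scenarios = {
    "1,000 devices x 1 GB of logs":        1_000 * 1 * GB,
    "100 devices x 10 services x 1 GB":    100 * 10 * 1 * GB,
    "20,000 users x 5 min video (~50 MB)": 20_000 * 50 * 10**6,
    "1M devices x 1 MB":                   1_000_000 * 1 * 10**6,
}

LINK_BPS = 10 * 10**9  # assumed 10 Gbps uplink to a central location
DAYS = 100

for name, bytes_per_day in scenarios.items():
    backlog = bytes_per_day * DAYS          # data accumulated after 100 days
    transfer_s = backlog * 8 / LINK_BPS     # bits divided by bits-per-second
    print(f"{name}: {bytes_per_day / TB:.1f} TB/day, "
          f"{backlog / TB:.0f} TB after {DAYS} days, "
          f"~{transfer_s / 3600:.0f} h to move at 10 Gbps")
```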
What Is Missing?
A solution that addresses the issues above would have the following functionality:
Uplevel edge/fog devices: Giving edge/fog devices more advanced functionality would let them process data locally and send only the results to a central location (see the sketch after this list). This would reduce both the amount of data that needs to be moved and the time it takes to process it.
Manage all these devices and all the data they’re generating: This would make it easy to monitor your data pipeline's health and ensure it's running smoothly, and it would feel as easy as managing a single central service.
Store data locally and keep it readily available: The final step would be to make it easy to store all this data, using local storage where available and centralizing only the critical information when it's needed. It would also make it possible to query and analyze the data, even when it remains remote.
And it would do all this without asking developers and data scientists to rewrite their code and processes. It would allow a seamless move to an efficient, highly scalable distributed processing system.
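To make the first point concrete, here's a minimal sketch of edge-side aggregation: a node reduces a day of raw readings to a summary of a few hundred bytes, and only the summary ever leaves the device. The record format and summary fields are hypothetical, chosen just for illustration:

```python
# Minimal sketch of edge-side aggregation: process raw data where it is
# generated and ship only a small summary upstream. The record format and
# summary fields are hypothetical.
import json
import statistics

def summarize_readings(raw_lines):
    """Reduce a day of raw sensor readings to a handful of statistics."""
    values = [json.loads(line)["value"] for line in raw_lines]
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "max": max(values),
        "min": min(values),
    }

if __name__ == "__main__":
    # Stand-in for a local log file on the device.
    raw = [json.dumps({"value": v}) for v in (21.0, 22.5, 19.8, 23.1)]
    summary = summarize_readings(raw)
    # Only this ~100-byte summary, not the raw data, is sent centrally.
    print(json.dumps(summary))
```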
Launching The Bacalhau Project
For all these reasons and more, we have introduced the Bacalhau Project. Bacalhau is a new data processing system to help you manage your big data pipelines.
It's designed to process data locally or where it's created.
It lets you keep your existing tools (such as Python, JavaScript, R, and Rust) while also taking advantage of the latest technology, such as support for WASM and GPUs.
And it enables much more efficient parallel processing while still letting you treat distributed data as a single dataset.
Further, it is designed to be (mostly) self-managing. Whether you're running locally, on private clusters, in a data center, or across the distributed web, the system should feel the same. Just write the code to process the data, and let Bacalhau take care of the rest.
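Conceptually, the programming model is ordinary code applied to shards of a dataset in parallel, with the results combined as if everything were one dataset. The sketch below imitates that model using only Python's standard library; it is not Bacalhau's actual API, and word_count and the in-memory shards are stand-ins for jobs and data that would normally run near where each shard lives:

```python
# Conceptual illustration of the model: one function, applied in parallel to
# shards of a dataset, with the results combined as if it were one dataset.
# This uses only the Python standard library; it is NOT Bacalhau's API.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def word_count(shard: str) -> Counter:
    """The per-shard job: in Bacalhau, this would run near the data."""
    return Counter(shard.split())

if __name__ == "__main__":
    # Stand-ins for shards that would normally live on different machines.
    shards = ["to be or not to be", "that is the question", "or is it"]
    with ProcessPoolExecutor() as pool:
        partials = pool.map(word_count, shards)
    total = sum(partials, Counter())  # combine shard results into one dataset
    print(total.most_common(3))
```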
Long-Term Vision
Though we only started the project last February, we're already in live production. You can read more about how to use Bacalhau [in our documentation]. And we're just getting started. In the near term, we plan to deliver:
A simplified job dashboard that lets you see all your jobs in flight
A rich SDK for Python, JavaScript, and Rust
Job execution pipelines fully compatible with Airflow
A job zoo that enables you to pick up existing pipelines from the community
Automatic wrapping with metadata/lineage and transformation for known file types (columnar, video, audio, etc.)
An on-premises deployment option for private and custom hardware
Internode networking for multi-tier applications
A standard data store that automatically records data and lineage information of jobs
In time, our goal is to deliver a complete system that achieves the following:
A fully distributed data processing system that can run on any device, anywhere
A declarative pipeline that can both run the data processing and also record the lineage of the data
Secure and verifiable results that can be used to confirm the integrity and reproducibility of the results forever
But you tell us! We'd love to hear about new directions we may need to include.
How to Get Involved
We're looking for help in several areas. If you're interested in helping out, please reach out to us at any of the following locations:
Thanks for reading, and onward!
Your humble Bacalhau team