Super-Structured Data: Rethinking the Schema

2022-05-17
https://www.brimdata.io/blog/super-structured-data/

We all know why dealing with real-world data is so hard.
It’s a big, hairy mess.

It’s a cliche nowadays, but you’re no doubt familiar with the “80/20 rule”
in data analytics, and have probably even experienced it yourself:

80% of your time is spent gathering, cleansing, and storing data,
while 20% of your time is spent actually analyzing it and getting real work done.

You often end up stuck between the document model of JSON
and the relational model of SQL databases. Going back and forth between the
two worlds is such a big headache.

Thank goodness there’s a new and better way.
Let’s get schemas and messy JSON out of our way.

It’s called super-structured data.

Hold onto your hats.

The Authoritarian’s Way

The gold standard for data analytics is to “cleanse” your messy JSON data and
organize it all in a data warehouse, where data must conform to relational
schemas so everything fits neatly into tables.

In this world, data must conform to the “one true way” of the data warehouse.

Unanticipated data forms must be discarded or stored elsewhere until
changes can be made to the “ingest pipeline” and to the warehouse schemas to
accommodate any new shape of messy data.

A common trick is to make super wide tables with lots of “nulls”
that can hold all of the different shapes
of data that might show up — only to be foiled by a different form
of messy data that eventually doesn’t fit.

Somehow this approach to cleaning data feels a bit too forced.

Metaphorically speaking,
the relational model feels a lot like authoritarianism.

The Anarchist’s Way

Around 2010, the NoSQL movement arose in reaction to this
schema-rigid authoritarianism.

In this approach, the database is “schema-less” and
data of any shape can be stored anywhere in the database, typically structured
around the document model of JSON.

This anything-goes approach, however, often leads to quite a mess in real-world
deployments. It is easy and tempting to allow any data in the system as requirements
evolve, leading to a mishmash of JSON data shapes that have to be teased apart
through ever more complex application logic.

Extending our metaphor,
the document model feels a lot like anarchy.

How Did We Get Here?

The authoritarians like to call the anarchists’ mishmash of JSON data a “data swamp”,
while the anarchists insist that it’s so much easier to get up and running with
a document database that it’s well worth coping with the potential mess.

Anarchy or authoritarianism? Pick your poison.

You all know the history.

Back in the 1980s, the database wars came to an end
when SQL and the relational model emerged as the undeniable champions.

From there, SQL-based data warehouses appeared in the 1990s enabling the
new concept of business intelligence, while in the late 1990s,
the Internet and Web took off like a rocket.

Then, by the early 2000s, the predominance of Web-scale companies with tech stacks
built entirely from scratch led to a massive proliferation of messy data.
Unfortunately, the best warehouses
of that day simply couldn’t scale to the data demands of the Googles and the Yahoos.

Necessity is the mother of invention and those big Web companies soon developed
custom solutions for doing warehouse-style analytics across massive clusters
of commodity servers. In 2004, Google published their
influential paper on MapReduce, and Yahoo later released open-source software called
Hadoop based on Google’s MapReduce programming model.

A bit later, researchers at UC Berkeley improved upon the Hadoop design
with Spark.

It was the dawn of Big Data. 🤮

The Authoritarian Backlash

No good deed goes unpunished, and rest assured in 2008,
DeWitt and Stonebraker famously ranted
that MapReduce was

  • “a giant step backwards”,
  • “a poor implementation”,
  • “not novel at all”, and
  • “overlooked the lessons of 40 years of database technology”.

Of course, they were right.

But back then, the Web-scale anarchists couldn’t just go out
and purchase a sufficiently large authoritarian warehouse license to solve their ever-growing
problems with messy data. Those data warehouses didn’t mesh with the
fast-moving anarchy of the day and weren’t economically viable at
the massive scale required.

It would be another decade before the worlds of big data and relational warehouses
truly began to converge.

NoSQL: The Anarchist’s Database

In the meantime,
many application developers came to loathe the
object-relational mapping (ORM)
pattern that required a complex layer of moving parts between their
dynamic and often messy application data and the authoritarian relational model.

Why couldn’t apps just write JSON data straight into a database? That would be
so much easier.

So, along came the document-model database to the rescue.

Some of these systems like MongoDB
embraced a pure “NoSQL”
approach while others like Couchbase eventually
added a SQL-like query language based on
SQL++,
which extends SQL to operate over the document data model of JSON.

A Cautious Reaction

This NoSQL stuff got popular and the authoritarians spoke again.

A good seven years after the MapReduce rant,
Stonebraker softened his critique of the anarchists,
stating in the Red Book
that

[NoSQL systems] are easy for a programmer to get going and do something productive.
RDBMSs, in contrast, are very heavyweight, requiring a schema up front.

He concluded:

This is a wake-up call to the commercial vendors to make systems
that are easier to use.

So what’s happened in the half dozen years since these insights from the Red Book?

Well, both the document and relational models have continued to thrive and have been
firmly cemented into enterprise data stacks.
Just look at the market capitalizations
of all the companies involved (even after the bursting of Tech Bubble v2.0 in spring 2022).

While the anarchist NoSQL databases have managed to hold their own against the
authoritarian databases as the storage tier for many application deployments,
they’re not so hot at high-speed analytics and complex multi-dimensional
warehouse problems.

You may have seen a project or two moving data out of these systems and into
ClickHouse or a cloud warehouse when insurmountable scaling problems were hit.
Clearly, the schema-rigid relational model has won the battle of analytics,
and serves as the foundation for the modern cloud warehouse.

Schemas Go Viral

Given these trends, the schema concept has made a big move out of the database, and
has become a fundamental design element that shows up everywhere these days.

If your data is going to land in a SQL warehouse, why not push schema enforcement
as far upstream as possible? This way, the authoritarian data teams who
own the model definitions can impose constraints on the anarchist engineering teams to
prevent them from haphazardly creating messy data.

After all, those pesky engineers don’t understand the business value of data, right?
So best to put some handcuffs on them. Isn’t authoritarian control so sweet?

To this end, schemas
lie at the heart of popular data formats like
Avro
and
Parquet.
And, in the client-server realm,
Protocol Buffers
and Thrift configure the schema directly into
the compiled implementations of the communicating end points.

But there’s a cost to pushing the authoritarian model upstream:
a schema-rigid architecture leads to fragile and brittle interdependencies
and implementing change can be difficult and time consuming.

You want to make a change?
Okay, update all your schema definitions, recompile everything, and redeploy.
Someone makes a seemingly innocuous change to a client data structure used on
your mobile app and your mission-critical data pipeline comes to
a screeching halt. You know the fire drill.

While great for data modeling, schemas can really get in the way
when you’re just trying to move and store data.

It turns out central control of everything can make things hard.

Just Add Thrust

With enough thrust, pigs can fly, or so goes the saying.
So why not just throw more engineering at the schema problem?

And sure enough, a whole sub-industry has emerged to take your data from JSON cloud APIs
and put it into schemas.

The idea here is that
instead of manually creating schemas, what if the schemas were automatically
created for you? When something doesn’t fit in a table, how about automatically
adding columns for the missing fields?
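The automatic-column idea can be sketched in a few lines of Python. This is only a toy illustration, not any vendor's actual pipeline: the `widen` helper and the sample rows are hypothetical, and `None` stands in for a SQL NULL.

```python
def widen(rows):
    """Build one wide table whose columns are the union of all fields seen."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)  # "add a column" the first time a field appears
    # Any field a row lacks becomes None, standing in for SQL NULL.
    table = [[row.get(col) for col in columns] for row in rows]
    return columns, table

# Hypothetical input: three records with three different shapes.
rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "email": "bob@example.com"},
    {"host": "web1", "status": 200},
]

columns, table = widen(rows)
null_count = sum(cell is None for line in table for cell in line)

print(columns)
for line in table:
    print(line)
print(null_count, "of", len(columns) * len(table), "cells are null")
```

Even with three small records, most of the cells in the resulting table are nulls, which hints at why this approach needs more and more machinery as the shapes multiply.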

This schema-oriented way of thinking has led to a world where
schemas are a given and any impedance mismatch between
real-world, messy data and tabular schemas shall be solved with ever
more layers of software complexity and engineering.

Super-structured Data

We asked ourselves a crazy question: could it be that we’ve built everything
upon the wrong foundational primitives?

Maybe Stonebraker was right? Maybe it’s the schemas that are getting in our way?

We realized this schemas-are-everywhere way of thinking is
like putting a square peg (JSON) in a round hole (relational tables).
Yes, you can do it, but there’s nothing natural about it and having two distinct ways of
doing things creates friction and complexity that leads
to wasted time and increased cost.

Could mixing a little controlled anarchy into our authoritarianism perhaps be helpful?

After working on this problem for a couple years, we arrived upon the concept
of super-structured data guided by the following principle:

Instead of pre-defining schemas to which all values must conform,
data should instead be self-describing and organized around a deep type system,
allowing each value to freely express its structure through its explicit type.

With super-structured data, the mishmash of relational tables and semi-structured data
embedded in tables all turns into a well-defined set of values that all conform
to precisely defined super-structured types. Both JSON anarchy and
schema-rigid authoritarianism are just special cases of the super-structured model.

In other words, super-structured data is a superset of both JSON and relational tables.
All JSON documents are super-structured values and any relational table can be represented
with a super-structured type.

For example, the JSON value

{"s":"foo","a":[1,"bar"]}

would traditionally be called “schema-less” and in fact is said to have the vague type
“object” in the world of JavaScript or “dict” in the world of Python.
However, the super-structured interpretation of this value’s type is instead:

type record with field s of type string and field a
of type array of type union of types integer and string

We call the former style of typing a “shallow” type system and the latter
style of typing a “deep” type system. The hierarchy of a shallow-typed value
must be traversed to determine its structure whereas the structure of a
deeply-typed value is determined directly from its type.
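To make the shallow/deep distinction concrete, here is a minimal Python sketch that derives a deep type for a decoded JSON value. The `deep_type` function is hypothetical and the type notation is informal; treating every integer as `int64` and an empty array as `[null]` are assumptions made for the sketch.

```python
import json

def deep_type(value):
    """Return a string naming the full (deep) type of a decoded JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"           # assumption: all integers map to int64
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # Distinct element types; more than one distinct type means a union.
        elems = sorted({deep_type(v) for v in value})
        if not elems:
            return "[null]"      # assumption: empty arrays get a null element type
        inner = elems[0] if len(elems) == 1 else "(" + ",".join(elems) + ")"
        return "[" + inner + "]"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{deep_type(v)}" for k, v in value.items())
        return "{" + fields + "}"
    raise TypeError(f"unsupported value: {value!r}")

value = json.loads('{"s":"foo","a":[1,"bar"]}')
shallow = type(value).__name__   # what the host language reports
deep = deep_type(value)          # what a deep type system reports

print(shallow)  # dict
print(deep)     # {s:string,a:[(int64,string)]}
```

Note that `deep_type` has to walk the whole value to compute its answer. That walk is exactly the traversal a shallow type system forces on every reader, whereas a deeply-typed format carries the result along with the data.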

So given a deep type system,
when a sequence of values in fact conforms to, say, a uniform “record type”,
then such a collection of record values looks precisely like a relational table.
For example, the sequence of JSON values

{"id":1,"name":"Alice"}
{"id":2,"name":"Bob"}
{"id":3,"name":"Carlos"}

has a natural correspondence with a SQL table created by

CREATE TABLE contacts (
	id INTEGER,
	name TEXT
);

In this case, the rows of this table are typed as type record
with field id of type integer and field name of type string.

If we, in turn, employ named types as part of the super-structured
type system, we can instead create a type called “contacts” that looks just
like the SQL table:

type contacts {id:int64,name:string}

In the super-structured model, data is self-describing and we can employ
decorators to bind the name to the type as in

{id:1,name:"Alice"}(=contacts)
{id:2,name:"Bob"}(=contacts)
{id:3,name:"Carlos"}(=contacts)

Since the underlying type of contacts is implied by the value,
there is actually no need for an explicit type declaration.

Now the SQL statement

SELECT name FROM contacts WHERE id=2

could be interpreted either traditionally as a query for a row of a relational table
named contacts, or in terms of super-structured data,
as a query over a set of super-structured data where the FROM clause refers
to a first projection by type contacts and the SELECT clause refers to a second
projection of the column name.

In this way, anarchy and authoritarianism can live side by side with a single
data model and authoritarian tables can be projected from a pool of super-structured
data as a simple type query.
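As a rough illustration of this second interpretation, the Python sketch below runs the same query over a mixed pool of values. It is not how Zed evaluates queries. The `record_type` helper, the `CONTACTS` constant, and the sample pool are hypothetical, and Python type names stand in for real super-structured types.

```python
def record_type(value):
    """Describe a flat record by its field names and Python value types."""
    return tuple((k, type(v).__name__) for k, v in value.items())

# A heterogeneous pool: values of different shapes live side by side.
pool = [
    {"id": 1, "name": "Alice"},
    {"host": "web1", "status": 200},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carlos"},
    {"msg": "disk full"},
]

# Hypothetical stand-in for the named type "contacts".
CONTACTS = (("id", "int"), ("name", "str"))

def select_name_where_id(pool, wanted_id):
    # FROM contacts: first projection, keep only values whose type is contacts.
    contacts = [v for v in pool if record_type(v) == CONTACTS]
    # WHERE id = wanted_id, then SELECT name: second projection, one column.
    return [v["name"] for v in contacts if v["id"] == wanted_id]

result = select_name_where_id(pool, 2)
print(result)  # ['Bob']
```

The sketch compares types by exact match, including field order. That is a simplification, but it shows the point: the "table" is never declared anywhere and simply falls out of filtering the pool by type.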

Hasn’t This Been Done Before?

Surely, this concept of super-structured data isn’t rocket science.
Why don’t things already work this way?!
From a 10,000-foot view, these ideas feel familiar.

The EdgeDB project advocates for
“types not tables”,
which certainly rhymes with the
super-structured goal of using types instead of schemas to organize data.
And while EdgeDB’s type system is deeply typed, its storage layer
is just a traditional relational database. While
this approach masterfully solves some important and thorny problems
(all while strategically reusing mature relational database technology),
it does not solve the
underlying data representation problem. Instead, EdgeDB is essentially a new data silo
whose type system cannot be used to serialize data external to the system.

Okay, but can’t we get super-structured properties with other existing data formats?

Let’s have a look.

Even though JSON isn’t a candidate,
BSON and
Ion are efficient, binary cousins of JSON
and were created to provide a type-rich elaboration of the semi-structured model.
Unfortunately, both approaches have shallow type systems so they
are not candidates for super-structured data.

But what about Parquet, Avro, or the hugely popular Arrow format?

Indeed, these formats all have deep typing but are schema rigid:
an encoded sequence of values requires an up-front schema definition
and all of the values in the sequence must conform to that one schema.
Also, Parquet does not have union types, so mixed-type arrays and dictionaries
aren’t expressible, though this could be addressed
in a future version of the format.

In a nutshell,

  • JSON, BSON, and Ion are schema-less but have shallow typing, while
  • Parquet, Avro, and Arrow have deep typing but are schema rigid.

Super-structured data, on the other hand, provides the best of both worlds:

Super-structured data has deep types without schema rigidity.

Schema Registries to the Rescue

Wait a minute. Can’t you solve the schema rigidity problem with a schema registry?

Indeed, a number of years ago, developers wanting to transmit
diversely typed sequences of data over a Kafka queue
clearly tripped over the problems of schema-rigid formats.

They needed a solution: why not just use a
schema registry
to persist all the possible schemas in use?

In this approach, each transmitted value is tagged with a small-integer “schema ID”
and the schema registry provides a centralized service for mapping these IDs
to the intended schema.
Consequently, a heterogeneous sequence of Avro or Protocol Buffers values
can be transmitted over a Kafka topic by prepending the schema ID to each
encoded value. The receiver can then look up and cache each schema using the
ID and the schema registry.
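The pattern just described can be sketched in Python with an in-memory stand-in for the registry. Everything here is hypothetical (the `SchemaRegistry`, `encode`, and `Receiver` names, and the simplified schemas); it does not reproduce the wire format of Avro, Protocol Buffers, or any real registry service.

```python
import json

class SchemaRegistry:
    """In-memory stand-in for a centralized registry service (hypothetical)."""
    def __init__(self):
        self._by_id = {}
        self._ids = {}
        self.lookups = 0

    def register(self, schema):
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._ids) + 1
            self._ids[key] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[key]

    def fetch(self, schema_id):
        self.lookups += 1  # each call models a round trip to the service
        return self._by_id[schema_id]

def encode(registry, schema, record):
    # Prepend the schema ID; send only the values, in schema field order.
    schema_id = registry.register(schema)
    return (schema_id, [record[f] for f in schema["fields"]])

class Receiver:
    def __init__(self, registry):
        self.registry = registry
        self.cache = {}

    def decode(self, message):
        schema_id, values = message
        if schema_id not in self.cache:  # look up once, then reuse
            self.cache[schema_id] = self.registry.fetch(schema_id)
        return dict(zip(self.cache[schema_id]["fields"], values))

registry = SchemaRegistry()
contact = {"name": "contact", "fields": ["id", "name"]}
request = {"name": "request", "fields": ["host", "status"]}

# A heterogeneous "topic": two different schemas interleaved.
topic = [
    encode(registry, contact, {"id": 1, "name": "Alice"}),
    encode(registry, request, {"host": "web1", "status": 200}),
    encode(registry, contact, {"id": 2, "name": "Bob"}),
]

receiver = Receiver(registry)
decoded = [receiver.decode(m) for m in topic]
print(decoded)
print("registry lookups:", registry.lookups)
```

The receiver needs only one lookup per distinct schema, but it cannot decode a single message unless the registry is reachable at least once for each ID.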

When deployed with Avro, this schema-registry pattern begins to resemble
our model for super-structured data.
In particular, not only does Avro have a deep type system but it also
includes union types, which accommodate
multi-typed arrays and tuples. And it has a null type,
which when combined with a union type can represent optional values in
a record (or JSON object) just like optional values in a relational column.
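For readers who have not seen these Avro features, the sketch below writes such a schema as a Python dictionary mirroring Avro's JSON schema syntax. The `contact` record and its fields are made up for illustration, and `matches` is a hand-rolled checker covering only the few types used here, not a call into any Avro library.

```python
# An Avro-style schema, written as the Python equivalent of its JSON form.
contact_schema = {
    "type": "record",
    "name": "contact",
    "fields": [
        {"name": "id", "type": "long"},
        # A union of null and string makes the field optional.
        {"name": "email", "type": ["null", "string"], "default": None},
        # An array whose items are a union can hold mixed element types.
        {"name": "tags", "type": {"type": "array", "items": ["long", "string"]}},
    ],
}

def matches(typ, value):
    """Toy conformance check for the subset of Avro types used above."""
    if isinstance(typ, list):  # union: any branch may match
        return any(matches(t, value) for t in typ)
    if isinstance(typ, dict) and typ["type"] == "array":
        return isinstance(value, list) and all(
            matches(typ["items"], v) for v in value
        )
    if isinstance(typ, dict) and typ["type"] == "record":
        # A missing field falls back to its declared default, if any.
        return isinstance(value, dict) and all(
            matches(f["type"], value.get(f["name"], f.get("default")))
            for f in typ["fields"]
        )
    if typ == "null":
        return value is None
    if typ == "long":
        return isinstance(value, int) and not isinstance(value, bool)
    if typ == "string":
        return isinstance(value, str)
    return False

with_email = matches(contact_schema, {"id": 1, "email": "a@example.com", "tags": [1, "vip"]})
without_email = matches(contact_schema, {"id": 2, "tags": []})
bad_email = matches(contact_schema, {"id": 3, "email": 5, "tags": []})
print(with_email, without_email, bad_email)  # True True False
```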

Given all this, Avro with a schema registry comes closest to
our concept of super-structured data.
However, the schema-registry service not only creates operational overhead,
but makes the approach entirely unsuitable for a self-contained format
for data serialization.
Without live, online access to the schema registry, a client of this approach
cannot decode any Avro-encoded payload.

In short, a schema registry creates a parallel universe problem: everything
is organized around schemas so data in flight and data at rest must both
conform to the same set of schemas. When data at rest resides in a
relational database, we now have to keep the tables in the database
consistent with the schemas in the separate registry service.

The schemas are getting in the way again. What a mess!

The Zed Project

To tackle the myriad of challenges with schema-rigid authoritarianism
juxtaposed with JSON anarchy,
our small team at Brim Data
has been developing, iterating, and refining the ideas for super-structured
data under the umbrella of The Zed Project.

At the foundation of Zed, we’ve developed a family of super-structured formats
that all adhere to a common Zed data model.
The super-structured formats include

  • ZSON – a human-readable format based on Zed as a superset of JSON
  • ZNG – an efficient binary format based on Zed and analogous to Avro
  • ZST – an efficient columnar format based on Zed and analogous to Parquet

A novel advantage to this design is that one cohesive data model supports the
three important variations of serialized data:

  • a human readable form for easy interpretation,
  • an efficient sequence form for search, and
  • an efficient columnar form for vectorized analytics.

Zed is the first system to unite these three models with a unified set of
formats where converting between the various forms incurs no loss of information.

To crack the problem of efficiently representing super-structured types
across a sequence of values,
the ZNG and ZST formats utilize a concept called a
type context.
A type context allows us to replace a globally scoped schema registry
with locally scoped type definitions that are embedded within the data sequence itself.
Types need only be
defined once
and can then be reused. And the type context can always be
“reset”
within large files or data streams so they can be seekable or
fragmented into independently decodable chunks. Moreover, values can be moved
from one context to another with a fast and simple
table lookup.
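The sketch below shows the general idea of a type context in Python. It is a toy model and not the actual ZNG or ZST encoding: the message tuples, the `Encoder` class, and the use of field names as a stand-in for full types are all assumptions made for illustration.

```python
class Encoder:
    """Writes a stream in which type definitions travel with the data."""
    def __init__(self):
        self.context = {}  # type -> local ID, scoped to the current context
        self.stream = []

    def reset(self):
        # Forget all types so that what follows is decodable on its own.
        self.context = {}
        self.stream.append(("reset",))

    def write(self, record):
        typ = tuple(record)  # field names stand in for the full type
        if typ not in self.context:
            # First use in this context: define the type inline, once.
            self.context[typ] = len(self.context)
            self.stream.append(("typedef", self.context[typ], typ))
        self.stream.append(("value", self.context[typ], list(record.values())))

def decode(stream):
    context = {}
    out = []
    for msg in stream:
        if msg[0] == "reset":
            context = {}
        elif msg[0] == "typedef":
            context[msg[1]] = msg[2]
        else:
            out.append(dict(zip(context[msg[1]], msg[2])))
    return out

enc = Encoder()
enc.write({"id": 1, "name": "Alice"})
enc.write({"host": "web1", "status": 200})
enc.write({"id": 2, "name": "Bob"})     # reuses type 0, no new typedef
enc.reset()
mark = len(enc.stream)                  # a seekable boundary
enc.write({"id": 3, "name": "Carlos"})  # the type is defined again after the reset

typedefs = sum(1 for m in enc.stream if m[0] == "typedef")
whole = decode(enc.stream)
tail = decode(enc.stream[mark:])        # decode only the chunk after the reset
print(typedefs, len(whole), tail)
```

Unlike the registry approach, the decoder here needs nothing but the stream itself, and a reader can start at any reset boundary without having seen what came before.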

At this point,
you might wonder why create Zed and these formats
in the first place? Rest assured, we didn’t just set out to work on
super-structured data for its own sake.

Necessity is the mother of invention and our journey to super-structured data
started when we found it hard to
retain the rich and deeply typed event information from Zeek logs
without force fitting heterogeneous log data into warehouse tables
or dumbing down Zeek events into JSON for storage in document-oriented search systems.
We also realized that in order to do both search and analytics well, you had to
stand up two systems: search systems aren’t very good at analytics and warehouse
systems aren’t very good at search.

To this end, we began prototyping these ideas in a command-line tool called
zq,
which is like jq,
but of course operates upon super-structured Zed
data instead of just JSON and has easy-to-use search built in.
Also, since Zed is a superset of other data models,
we’ve included support in zq for reading and writing data in other formats
like JSON, CSV, and Parquet.

Our vision is that super-structured data should make it really easy to
scale down search and analytics
to your laptop, or scale up to a large-scale cloud deployment of a
data lake based on Zed, i.e., “Zed lake”. Thus, we’ve been developing a
lake format based on Zed,
which is managed and served by another command-line tool
simply called zed.

A Zed lake is sort of like a lakehouse but is based on super-structured data,
requires no schema definitions, and has a user-friendly, history-navigable
commit model like Git.
Our work on Zed lakes is less mature than zq and the Zed formats,
but the lake implementation has already proven robust enough to run in production
at a non-trivial scale by many of our community users.

To take advantage of the Zed data model, we have also developed a new search,
query and data-transformation language
that we simply call the “Zed language”.
The Zed language is the primary means to interact with zq and the
zed query commands.

To be honest,
we struggled a bit as to whether we should just embrace SQL as the query interface.
Does the world really need yet another query language?

Yet the problem with SQL for our use case is that it’s simply an awful
user experience for search.
Many of our community users use Zed in a lean-forward style of interactive keyword search
with a certain amount of lightweight analytics. Forcing these users to switch
to SQL would be a major step back for them.

In the end, we decided to continue to develop the Zed language and explore
the audacious goal of blending keyword search, warehouse-style analytics,
data exploration primitives, and data transformation logic all in one
unified language. This might sound a bit crazy, but we think we’re on to something here.

In the long run, we’ll no doubt support
a dedicated SQL query engine that can operate on virtualized SQL tables
projected from Zed types, but for now, our team is small and we’re exploring
how far we can go with the Zed language.

Finally, we’ve built a desktop application called
“Brim”
that provides an interactive search, analytics, and exploration experience
for Zed data. Through its integrations with Zeek and Suricata,
many of our community users rely upon Brim for threat hunting and
incident response. Other users have implemented ETL pipelines
in Zed and monitor and debug their pipelines using Brim. Some of our
other users leverage Brim for
exploratory data analysis
when trying to decipher large, complex JSON objects that were produced elsewhere
in their organization.

The Brim app utilizes the
Zealot library
to bring super-structured data and the Zed data model
to the JavaScript world.
We don’t aspire for Brim to be a notebook; rather, we have leveraged Zealot
to explore some initial integrations with notebook systems like
Observable.
In a future article, we’ll write about our Observable integration.

Try it out

If you’d like to try Zed and Brim, it’s all pretty easy.

We love working with all our users to help guide us to the best ways of solving
your real, everyday problems. Give us a holler and we look forward to chatting.

Wrapping Up

It’s hard to make data easy and the jury is certainly still out on Zed, but let’s
see how far we can get.

Let’s see if we can use the Zed type system to get schemas out of our way.

Let’s see if we can do better than shaving the hard edges off
JSON’s square peg to fit in the round hole of relational schemas and dataframes.

Maybe, just maybe,
by mixing a bit of controlled anarchy into the world of schema-rigid
authoritarianism, Zed can make data engineering much, much easier after all.

Acknowledgements

Noah Treuhaft
coined the awesomely perfect term super-structured data to describe
what we’ve been working on.

Garrison Hess
came up with the clever metaphor
of “anarchy vs. authoritarianism” as a reaction to the design motivation of Zed.