How Does Parquet Store JSON Objects?

Parquet is a columnar file format, but JSON objects are usually nested, irregular, and row-shaped. So how does Parquet store an object with fields like name.first, name.last, and tags?

The short version: Parquet flattens the JSON object into columns, then uses extra metadata to reconstruct the original nested structure later.

Each atomic (leaf) field becomes its own column. Instead of storing the whole object together, Parquet stores one column per leaf path, such as:

name.first
name.last
tags

That is what gives Parquet its columnar benefits: similar values sit next to each other, and queries can read only the fields they need.
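To make the flattening concrete, here is a minimal sketch in plain Python. It is not how a real Parquet writer works internally; the `shred` function and the sample records are made up for illustration, and it deliberately keeps only the values.

```python
from collections import defaultdict


def shred(records):
    """Collect leaf values into one list per dotted column path (hypothetical helper)."""
    columns = defaultdict(list)

    def walk(node, path):
        if isinstance(node, dict):
            # Descend into nested objects, extending the path.
            for key, child in node.items():
                walk(child, path + [key])
        elif isinstance(node, list):
            # List items share the path of the list itself.
            for item in node:
                walk(item, path)
        else:
            # Leaf value: append it to the column for this path.
            columns[".".join(path)].append(node)

    for record in records:
        walk(record, [])
    return dict(columns)


records = [
    {"name": {"first": "Ada", "last": "Lovelace"}, "tags": ["a", "b"]},
    {"name": {"first": "Alan"}, "tags": ["c"]},
]
columns = shred(records)
print(columns)
```

In the result, the tags column holds a, b, c with no record of where one list ends, and name.last holds a single value with no record of which row lacked it.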

The hard part is that flattening loses structure. If a field is missing, nullable, or repeated, Parquet needs a way to know what the original object looked like. That is where repetition levels and definition levels come in.

Repetition Level

The repetition level, usually written as R, tells Parquet whether a value starts a new record or continues a repeated structure. With nested lists, it also records which level of nesting is being continued.

This matters for lists.

For example, if a JSON object has a tags array like this:

{
  "tags": ["a", "b"]
}

Parquet stores the tag values in one column, with repetition and definition metadata beside each value:

Column	Value	R	D	Meaning
`tags`	`a`	0	1	Start a new record/list, and the tag value exists
`tags`	`b`	1	1	Continue the same repeated list, and the tag value exists

Here D = 1 is the maximum definition level for this simplified tags field, meaning the value is present. The exact maximum definition level depends on the Parquet schema and which fields are optional.

Without R, Parquet would know the column contains a and b, but it would not know whether they belong to one row’s list or two different rows.

That is what the repetition level encodes:

R = 0 means this value starts a new record.
R = 1 means this value continues the repeated list in the current record.

So a has R = 0 and b has R = 1, which groups them into one array. If a third value c followed with R = 0, it would begin the next record's list instead.
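As a rough sketch of how those levels could be produced, the snippet below assigns R and D to a top-level tags field. It assumes the same simplified schema as the table (a plain repeated string field, so the maximum R and D are both 1); real Parquet list schemas often have more levels. The `tags_levels` helper is hypothetical.

```python
def tags_levels(records):
    """Return (value, R, D) triples for a simplified repeated `tags` field."""
    triples = []
    for record in records:
        tags = record.get("tags") or []
        if not tags:
            # Empty or missing list: one placeholder entry with no value.
            # In this simplified schema the two cases look the same (D = 0).
            triples.append((None, 0, 0))
            continue
        for i, tag in enumerate(tags):
            r = 0 if i == 0 else 1  # the first item starts the record
            triples.append((tag, r, 1))
    return triples


records = [{"tags": ["a", "b"]}, {"tags": ["c"]}, {"tags": []}]
triples = tags_levels(records)
for triple in triples:
    print(triple)
```

The third record shows why a record with no tags still needs an entry: without the placeholder, the reader would lose count of the rows.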

Definition Level

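The definition level, usually written as D, tells Parquet how much of the path to a value exists. More precisely, it counts how many of the optional or repeated fields along that path are actually present.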

This matters for missing and nullable fields.

Take a field like name.first.

There are a few possible cases:

The name object does not exist.
The name object exists, but first does not.
Both name and first exist, so the value is present.

Parquet encodes those cases with definition levels.

For a simplified name.first example, where both name and first are optional:

D = 0 means name does not exist.
D = 1 means name exists, but first does not.
D = 2 means both name and first exist, so the value is present.

The maximum definition level means the full path exists. If D is less than the maximum, Parquet knows the value is missing somewhere along the path.
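The three cases can be sketched in a few lines of Python. This assumes both name and first are optional and that name, when present, is a dict; `name_first_level` is a hypothetical helper, not part of any Parquet library.

```python
MAX_D = 2  # both `name` and `first` are optional in this sketch


def name_first_level(record):
    """Return (value, D) for the optional path name.first."""
    name = record.get("name")
    if name is None:
        return (None, 0)  # `name` itself is missing
    first = name.get("first")
    if first is None:
        return (None, 1)  # `name` exists, `first` does not
    return (first, MAX_D)  # the full path exists


records = [{"name": {"first": "Ada"}}, {"name": {}}, {}]
entries = [name_first_level(r) for r in records]

# One level per row, but a value only where the path is fully defined.
levels = [d for _, d in entries]
values = [v for v, d in entries if d == MAX_D]
print(levels, values)
```

Only the fully defined value ends up in the value list; the missing ones are represented by their levels alone, which is roughly how Parquet avoids spending value storage on nulls.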

Reconstructing the JSON Object

Once Parquet has the flattened column values plus the R and D metadata, it can reconstruct the original nested object.

The reconstruction process is roughly:

Read the flattened column values.
Use definition levels to decide whether each path exists.
Use repetition levels to decide whether repeated values start a new list or continue an existing one.
Emit events or intermediate records that can be assembled back into the nested JSON shape.

So even though Parquet stores the data column by column, it still has enough information to rebuild the original nested records.
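The steps above can be sketched for the simplified tags column, where the maximum R and D are both 1. This is an illustration under that assumption, not a real Parquet reader; `assemble_tags` is a hypothetical name and the input is assumed to be well-formed (the first entry has R = 0).

```python
def assemble_tags(triples, max_d=1):
    """Rebuild one list of tags per record from (value, R, D) triples."""
    records = []
    for value, r, d in triples:
        if r == 0:
            # R = 0: this entry begins a new record.
            records.append([])
        if d == max_d:
            # The full path is defined, so there is a real value to keep.
            records[-1].append(value)
        # d < max_d: placeholder only, the record keeps an empty list.
    return records


triples = [("a", 0, 1), ("b", 1, 1), ("c", 0, 1), (None, 0, 0)]
rebuilt = assemble_tags(triples)
print(rebuilt)
```

The same idea extends to deeper nesting, where higher repetition levels say which inner list to continue, but the bookkeeping gets more involved.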

Why Store JSON This Way?

There are three main reasons.

First, compression is usually better. Similar values are stored next to each other because each field gets its own column, and the field names are not repeated in every record. That makes the data easier to compress than a row-based JSON representation.

Second, queries can read fewer bytes. If a query only needs name.first, it does not need to read the whole JSON object. It can read just the column for that field.

Third, the original object is still recoverable. Repetition and definition levels preserve the structural information needed to rebuild nested objects and arrays.
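The first reason can be checked roughly with the standard library. In this sketch, zlib stands in for Parquet's compression codecs and JSON-encoded lists stand in for column chunks, so it only shows the direction of the effect. The records are synthetic and flat to keep it short.

```python
import json
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
firsts = ["Ada", "Alan", "Grace", "Linus", "Edsger", "Barbara"]
lasts = ["Lovelace", "Turing", "Hopper", "Torvalds", "Dijkstra", "Liskov"]
records = [
    {"first": rng.choice(firsts), "last": rng.choice(lasts), "age": rng.randint(20, 60)}
    for _ in range(2000)
]

# Row layout: one JSON object per line, keys repeated in every record.
row_bytes = "\n".join(json.dumps(r) for r in records).encode()

# Column layout: all values of one field stored together.
col_bytes = b"".join(
    json.dumps([r[field] for r in records]).encode()
    for field in ("first", "last", "age")
)

row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(col_bytes))
print("row-oriented:", row_size, "bytes; column-oriented:", col_size, "bytes")
```

Real Parquet files typically go further by applying encodings such as dictionary and run-length encoding per column before general-purpose compression.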

That is the core trick: Parquet turns nested data into flat columns, then uses repetition and definition levels to remember what the nesting used to be.

P.S. If there are any mistakes, please let me know. I’m by no means an expert.
