
Hashing

The Xet protocol uses a few different hash types:

  • Chunk hashes - computed for each chunk from the chunk data.
  • Xorb hashes - computed for each xorb from its chunk hashes.
  • File hashes - computed for each file from its chunk hashes.
  • Term verification hashes - computed for each term in a reconstruction when serializing a shard, from the chunk hashes in the xorb that the term references.

All hashes referenced here are 32 bytes (256 bits) long.

Chunk Hashes

After cutting a chunk of data, the chunk hash is computed via a blake3 keyed hash with the following key (DATA_KEY):

DATA_KEY

[
  102, 151, 245, 119, 91, 149, 80, 222, 49, 53, 203, 172, 165, 151, 24, 28, 157, 228, 33, 16, 155, 235, 43, 88, 180, 208, 176, 75, 147, 173, 242, 41
]

reference implementation
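
As a concrete sketch, the chunk hash can be computed with the blake3 Python package as follows (compute_chunk_hash is the helper name also used in the examples below):

from blake3 import blake3

DATA_KEY = bytes([
    102, 151, 245, 119, 91, 149, 80, 222, 49, 53, 203, 172, 165, 151, 24, 28,
    157, 228, 33, 16, 155, 235, 43, 88, 180, 208, 176, 75, 147, 173, 242, 41,
])

def compute_chunk_hash(chunk: bytes) -> bytes:
    # keyed blake3 over the raw chunk data; digest() returns 32 bytes
    return blake3(chunk, key=DATA_KEY).digest()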

Xorb Hashes

Xorbs are composed of a series of chunks. Given the chunks that make up a xorb, the xorb hash (a MerkleHash) is computed by building a Merkle tree over those chunks using custom hashing functions; the xorb hash is the root node hash of that tree.

The leaf node hashes are the chunk hashes as described in the previous section.

The hash function used to compute internal node hashes is as follows:

  • Concatenate the children's hashes such that there is one line per chunk, in order, formatted like {chunk_hash:x} : {size}\n:
    • the hash first, in lowercase hex format (64 hex characters, e.g. a3f91d6e8b47c20ff9d84a1c77dcb8e5a91e6fbf2b2d483af6d3c1e90ac57843)
    • a space, a colon, a space ( : )
    • the chunk length in bytes, e.g. 64000
    • finally a newline character \n
  • Then take the bytes of this string and compute a blake3 keyed hash with the following key (INTERNAL_NODE_KEY):

reference implementation

INTERNAL_NODE_KEY

[
  1, 126, 197, 199, 165, 71, 41, 150, 253, 148, 102, 102, 180, 138, 2, 230, 93, 221, 83, 111, 55, 199, 109, 210, 248, 99, 82, 230, 74, 83, 113, 63
]

Example of data for internal node

Consider an internal node whose children are 4 chunks with the following pairs of hashes and lengths:

hash,length (bytes)
1f6a2b8e9d3c4075a2e8c5fd4f0b763e6f3c1d7a9b2e6487de3f91ab7c6d5401,10000
7c94fe2a38bdcf9b4d2a6f7e1e08ac35bc24a7903d6f5a0e7d1c2b93e5f748de,20000
cfd18a92e0743bb09e56dbf76ea2c34d99b5a0cf271f8d429b6cd148203df061,25000
e38d7c09a21b4cf8d0f92b3a85e6df19f7c20435e0b1c78a9d635f7b8c2e4da1,64000

Then, to form the buffer for computing the internal node hash, create this string (note the trailing \n newline):

"1f6a2b8e9d3c4075a2e8c5fd4f0b763e6f3c1d7a9b2e6487de3f91ab7c6d5401 : 10000
7c94fe2a38bdcf9b4d2a6f7e1e08ac35bc24a7903d6f5a0e7d1c2b93e5f748de : 20000
cfd18a92e0743bb09e56dbf76ea2c34d99b5a0cf271f8d429b6cd148203df061 : 25000
e38d7c09a21b4cf8d0f92b3a85e6df19f7c20435e0b1c78a9d635f7b8c2e4da1 : 64000
"

Then compute the blake3 keyed hash with INTERNAL_NODE_KEY to get the final hash.

Example Python code for the internal hash function

from blake3 import blake3

def internal_hash_function(node):
    # node is the list of child chunks; build one line per chunk:
    # "<64-char lowercase hex hash> : <size in bytes>\n"
    buffer = ""
    for chunk in node:
        size = len(chunk)
        chunk_hash = compute_chunk_hash(chunk)  # raw 32-byte chunk hash
        buffer += f"{chunk_hash.hex()} : {size}\n"

    # keyed blake3 over the UTF-8 bytes of the buffer
    return blake3(buffer.encode(), key=INTERNAL_NODE_KEY).digest()

File Hashes

After chunking a whole file, the file hash is computed by following the same procedure used for the xorb hash, then taking that final hash as data for a blake3 keyed hash whose key is all 0's.

That is, create a MerkleTree using the same hashing functions described in the previous section, take the root node's hash, and compute a blake3 keyed hash of it with the key being 32 zero-value bytes.

reference implementation
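
As a minimal sketch of that final step, assuming the Merkle tree root hash has already been computed as raw bytes:

from blake3 import blake3

def compute_file_hash(merkle_root_hash: bytes) -> bytes:
    # re-hash the Merkle tree root with a key of 32 zero bytes
    return blake3(merkle_root_hash, key=bytes(32)).digest()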

Term Verification Hashes

When uploading a shard, each term in each file info in the shard MUST have a matching FileVerificationEntry section that contains a hash.

To generate this hash, take the chunk hashes for the specific range of chunks that make up the term and:

  1. Concatenate the raw hash bytes: Take all the chunk hashes in the range (from chunk_index_start to chunk_index_end, end-exclusive, in the xorb specified in the term) and concatenate their raw 32-byte representations together in order.

  2. Apply keyed hash: Compute a blake3 keyed hash of the concatenated bytes using the following verification key (VERIFICATION_KEY):

VERIFICATION_KEY

[
  127, 24, 87, 214, 206, 86, 237, 102, 18, 127, 249, 19, 231, 165, 195, 243, 164, 205, 38, 213, 181, 219, 73, 230, 65, 36, 152, 127, 40, 251, 148, 195
]

The result of the blake3 keyed hash is the verification hash that MUST be used in the FileVerificationEntry for the term.

reference implementation

Example Python code for the verification hash

from blake3 import blake3

def verification_hash_function(term):
    buffer = bytearray()
    # note: chunk ranges are end-exclusive
    for chunk_hash in term.xorb.chunk_hashes[term.chunk_index_start : term.chunk_index_end]:
        buffer.extend(chunk_hash)  # each chunk_hash is its raw 32-byte value
    return blake3(bytes(buffer), key=VERIFICATION_KEY).digest()

Reference Files

Reference files are provided in the Hugging Face Dataset repository xet-team/xet-spec-reference-files.

This repository contains a number of samples that implementers can use to verify their hash computations.

Note that all hashes in these files are represented as strings. To get the raw value of a hash, invert the endianness of each 8-byte group in the hash string, reversing the procedure described in the api section.
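
A minimal sketch of that inversion, assuming the groups being reversed are the four 8-byte words of the hash (the helper name hash_string_to_raw is ours, not part of the protocol):

def hash_string_to_raw(hash_hex: str) -> bytes:
    # parse the 64-character hex string, then reverse the byte order
    # within each 8-byte group to recover the raw 32-byte value
    b = bytes.fromhex(hash_hex)
    return b"".join(b[i:i + 8][::-1] for i in range(0, 32, 8))

Since reversing each group twice is the identity, the same per-group reversal also converts a raw value back into string form.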

Chunk Hashes Sample

There are 3 chunk files; for each, the first 64 characters of the file name are the string form of the chunk hash of the data in that file.

File Hash Sample

The xet-team/xet-spec-reference-files repository contains the original file Electric_Vehicle_Population_Data_20250917.csv.

When processed through the Xet upload protocol, the chunks produced for this file are listed (formatted <hash> <length>) in the file Electric_Vehicle_Population_Data_20250917.csv.chunks.

Computing the file hash of the entire file from these chunks yields the hash stored in the file Electric_Vehicle_Population_Data_20250917.csv.xet-file-hash, whose raw value is 118a53328412787fee04011dcf82fdc4acf3a4a1eddec341c910d30a306aaf97.
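
A small sketch for reading such a .chunks file, assuming one <hash> <length> pair per line as described above (load_chunks_file is an illustrative name):

def load_chunks_file(path: str) -> list[tuple[str, int]]:
    # each line is "<64-char hex hash> <length in bytes>"
    with open(path) as f:
        return [(h, int(n)) for h, n in (line.split() for line in f if line.strip())]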

Xorb Hash Sample

All of the chunks of Electric_Vehicle_Population_Data_20250917.csv fit into a single xorb.

The xorb produced with all of the chunks in order for this file can be found serialized in file eea25d6ee393ccae385820daed127b96ef0ea034dfb7cf6da3a950ce334b7632.xorb.

The hash of this xorb is eea25d6ee393ccae385820daed127b96ef0ea034dfb7cf6da3a950ce334b7632, which is the value stored in Electric_Vehicle_Population_Data_20250917.csv.xet-xorb-hash.

The chunks that make up this xorb are listed in the file eea25d6ee393ccae385820daed127b96ef0ea034dfb7cf6da3a950ce334b7632.xorb.chunks; note that this file is equivalent to Electric_Vehicle_Population_Data_20250917.csv.chunks.

Range Hash Sample

In the reconstruction of Electric_Vehicle_Population_Data_20250917.csv with xorb eea25d6ee393ccae385820daed127b96ef0ea034dfb7cf6da3a950ce334b7632 there is 1 range that contains all 796 chunks.

The verification range hash for this range is the value in eea25d6ee393ccae385820daed127b96ef0ea034dfb7cf6da3a950ce334b7632.xorb.range-hash, which is d81c11b1fc9bc2a25587108c675bbfe65ca2e5d350b0cd92c58329fcc8444178.
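
Putting the pieces together, a hedged sketch of checking this sample against the reference files, using the illustrative helpers load_chunks_file and hash_string_to_raw from above and assuming the stored .range-hash value is in string form per the endianness note in the Reference Files introduction:

from blake3 import blake3

VERIFICATION_KEY = bytes([
    127, 24, 87, 214, 206, 86, 237, 102, 18, 127, 249, 19, 231, 165, 195, 243,
    164, 205, 38, 213, 181, 219, 73, 230, 65, 36, 152, 127, 40, 251, 148, 195,
])

chunks = load_chunks_file(
    "eea25d6ee393ccae385820daed127b96ef0ea034dfb7cf6da3a950ce334b7632.xorb.chunks")
# concatenate the raw 32-byte chunk hashes of the full range (all 796 chunks)
buffer = b"".join(hash_string_to_raw(h) for h, _ in chunks)
raw = blake3(buffer, key=VERIFICATION_KEY).digest()
# the stored value is in string form, so convert it to raw bytes to compare
assert raw == hash_string_to_raw(
    "d81c11b1fc9bc2a25587108c675bbfe65ca2e5d350b0cd92c58329fcc8444178")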
