Decentralized storage

In this module, you will learn about how we can build an architecture based on a peer-to-peer network that allows a community to store data forever, with all the benefits of decentralization.

Such an architecture is the first step towards building a blockchain.

Goal: storing data forever

Imagine you are a member of a community of authors of digital content. You would like any member to be able to store their new content so that it stays available for any member forever, or at least as long as this community exists in some form.

If you were in charge of solving this problem, and don't already have a tool built for this purpose, how would you do it? Take a couple of minutes to think about it.

Traditional approach

A common solution is to use a cloud storage service, hosted by some large multinational company.

What could be the issues with this approach?

This company may:

disappear or simply decide to stop offering this service
have technical issues that make your data unavailable
decide that your content is against its new terms of service and delete it
prevent some members or entire countries from accessing the service
etc.

Picking a company for your community's needs means having to trust that this company will continue offering the service to all your members without restrictions, for a very long time.

This is what we call a centralized approach: a single entity is in charge of everything, and users need to trust that it won't unilaterally stop offering the service as you expect. Unfortunately, you can never fully trust a single entity.

Decentralized approach

Avoiding the issues that come with centralization means using decentralization: splitting the responsibility of the service between many entities, so that no small group of entities may prevent any user from using the service.

How would you design a simple decentralized architecture to store the community's data and make it available?

First, let's assume that many members of the community each have a single computer that they dedicate to this service. The community will have to agree on what software to run on these computers. We will call each running instance of this software a node.

For a very basic architecture, we could have every node:

store all of the data from the community
be able to communicate with every other node (know each node's IP address).

To upload data, for example a file, members would simply send that file and its name to every other node, which would then store it.

To download data, a member could send a request for the file having a given name, to any of the nodes, and that node would send it to them. If for some reason, the node is not available or rejects the request, they can simply try with another node.

To join the community, a new user would just need to know the address of one node, request the list of all the other nodes, the list of all files, and start downloading their content.

As long as one node is still running the software, and a member is connected to the internet, they will be able to download the content.

Potential issues

This simple approach presented above has a few issues. Can you think of some?

Here are some of the main issues:

two members could use the same name for a file
as the community grows, the bandwidth required to send files to every node becomes prohibitive
when nodes are temporarily unavailable, they will miss some files and become desynchronized
bad members could:
- send the wrong data, for a given download request
- saturate the node by sending too much data or too many requests

Can you think of ways to improve our architecture to prevent these issues?

Ensuring that nodes store/send the right data

There is a risk that two members upload different files with the same name.

If they start uploading these files to nodes in different orders, some nodes will store the first member's file, while others will store the other one. Fixing this situation after the fact is complicated, so we need to prevent it.

Sending files to the nodes always in the same order would not work when some nodes are temporarily disconnected or have very slow connections.

We need to ensure members always use different names for different files.

One way to generate a unique name is to produce a random sequence of characters. If the sequence is long enough, two members will never generate the same name, unless one does it on purpose.

However, we also need to make sure the data sent by a node is the original file, exactly as intended by its author. We want to be able to detect any node's attempt to send invalid data.

Cryptography provides a solution: Identify the file using the hash of its content, instead of a name.

A hash has the property that in practice, two different files will always have different hashes, and after downloading a file, you can compute the hash of its content, then verify that it matches the one you requested.

Cryptographic hashes

Blockchains make heavy use of cryptography to ensure a high level of security. One cryptographic primitive often used in a blockchain, including as a key component of the chain structure itself, is cryptographic hashing.

A hash is a sequence of bytes that is the output of a hash function, where the input is an arbitrarily large piece of data.

There are many types of hash functions with different properties, but most share these three main ones:

the size of the hash is limited, typically from a few bytes to a couple of hundred bytes
changing a single bit of the input significantly changes the output, which makes it look random
given the same input (and sometimes for a given value of an internal seed), hash functions always produce the same hash (determinism)

In blockchains, we use cryptographic hash functions. Such hash functions have these additional properties:

collision resistance: it is unfeasible in practice, to find two different inputs that yield the same hash,
pre-image resistance: it is also unfeasible to find an input (pre-image) that outputs a given hash, with no better strategy than trying every possible input and computing its hash.

With these properties, the hash of some data can be seen as a unique fingerprint that identifies this data.

Tezos uses BLAKE2b, a cryptographic hash function that takes any sequence of bytes as input and produces 32 bytes (256 bits) hashes. Another well-known example is SHA256.

As an example, here is the hash of the small string "Cake", expressed using 64 hexadecimal digits:

cecef07bea32870c90c29b678f46b61b046b1ba1b16cc0b875661bfea0a10755

And here is the hash of a longer string: 05624b116c3e4f6dac5a74de5014e38d2bda5c8c5c26ab5e2f1a365c5d9e0c61

Note that the two hashes are completely different, but both are 64 digits long.

You have probably already used hashes without knowing it. Indeed, when you download a file on your computer, some browsers check the hash of the downloaded file and compare it to the hash announced by the source before download. If the two hashes match, it means that the file you have downloaded is a perfect match to the one intended to be sent by the source. If the hashes don't match, your download has been corrupted. You can try this out manually by downloading the latest release of Ubuntu, computing the hash of the downloaded file, and comparing it to the one announced on their website.

Reducing bandwidth needs

With a centralized approach, a member only has to upload new content once to the central entity, therefore minimizing bandwidth needs.

With our basic decentralized approach, as the community grows and the number of nodes increases, it becomes prohibitive for the author of new content to send their data to every single node in the network.

Limiting the number of nodes would limit the bandwidth for uploading, but increase the amount of bandwidth required for nodes to reply to download requests.

How can we solve this dilemma?

Assuming we still want every node to store all of the data, the total bandwidth used by the community can't be reduced. What we can reduce, however, is the bandwidth needed for a single user or node.

This means:

when uploading, a member should only send its data to a subset of the nodes.
when downloading, members shouldn't all send their requests to the same node.

If a user only sends their data to a subset of the nodes, this means these nodes must in turn send their data to the remaining nodes.

For this, we organize the nodes as a peer-to-peer network, where each node knows a subset of other nodes, called its neighbors. When a node receives a file it doesn't already have, it asks each of its neighbors if they already have that file (using the hash of its content as an identifier), and sends it to the ones that don't.

As long as the network is strongly connected, i.e. not split in two parts where no node of one part knows a node of the other part, then the file will quickly propagate through the entire network. Every node will then store a copy of the file. To ensure the network is connected, we make sure each node knows a large enough subset of the other nodes that is picked at random.

The peer-to-peer network can be called p2p for short. In the context of blockchains, we also call it the gossip network.

When downloading data, different members can simply connect to different nodes, to spread the load. If a node is missing the requested file, for example because it was disconnected during its propagation, it can use the p2p network to obtain it from another node, by sending the same request to one of its neighbors.

Avoiding DOS attacks

Our basic p2p network is vulnerable to Denial Of Service attacks (DOS for short), where a malicious member sends so much data to the network, or so many requests, that it saturates the network and makes it unavailable for other members.

One common approach is to limit the amount of data or requests that a given member, or a given IP address, can perform, then reject any requests beyond these limits.

However, if we keep the community open, an attacker can easily use a large number of machines with different IP addresses, or impersonate a large number of members, to go around such limits. Furthermore, strict limits may prevent legitimate intensive uses of the network.

Can you think of another solution?

One approach that works well consists in making attacks costly for anyone trying to attack the network.

There are multiple ways to make it costly:

require users to perform a significant amount of computing for any request. This implies spending computing resources and electricity. This approach called Proof of Work (PoW for short), has the unfortunate side effect that it wastes energy and is therefore bad for the environment. Its use should therefore be limited. We talk about PoW in more detail in the consensus mechanism section.
require users to pay for every new file they upload. The payment could be transferred to the nodes that do the work, which means this approach would double as an incentive for members to contribute their own nodes. This implies handling transfers of amounts of currency, which is one of the reasons why cryptocurrency is usually involved with decentralized architecture.

Conclusion

We have presented the issues that come with using a centralized storage service, and how we can build a peer-to-peer network of nodes that each store a copy of the data and communicate with their neighbors to propagate new data.

We have seen how cryptographic hashes can be used to uniquely identify files on such a network and help avoid inconsistencies between nodes.

Finally, we touched on two approaches to protect against DOS attacks: Proof of Work, and charging a fee when people upload new data.

Goal: storing data forever​

Traditional approach​

Decentralized approach​

Potential issues​

Ensuring that nodes store/send the right data​

Cryptographic hashes​

Reducing bandwidth needs​

Avoiding DOS attacks​

Conclusion​