Implementing a self-hosted S3 server

Some concerns

S3 gives you some peace of mind, since it solves the problem of storing data somewhere. But maybe you already have the space and just need the tool.

You are often told not to reimplement what already exists, but that may be bad advice that takes too much for granted.

You want to be able to run elastic object storage yourself, just as you have it with your cloud provider of choice. There is nothing wrong with being able to do the same thing on premises, with your own server resources, or on a remote machine you control.

Seaweed

SeaweedFS is a distributed filesystem. It has a lot of unique qualities, but we spoiled children of the cloud can just say that it is a self-hosted alternative to S3.

There are plenty of similar projects, but in my subjective opinion this one is the best of them. It's written in Go and has a lot of convincing features.

Among other amazing properties, it supports the S3 protocol.

Installation

Install the SeaweedFS binary by running the following commands (this assumes Git, Make and a Go toolchain are already installed):

git clone https://github.com/seaweedfs/seaweedfs.git
cd seaweedfs/weed && make install
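
Before going further, it can help to confirm that the freshly built binary is actually reachable. The following is a minimal sketch, not part of the official instructions: have_cmd and check_weed are hypothetical helper names, and the hint about the Go binary directory assumes that make install places weed where go install normally does.

```shell
# Return success if the given command is reachable through PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print the installed version, or a hint about where Go usually puts binaries.
check_weed() {
    if have_cmd weed; then
        weed version
    else
        echo "weed not found in PATH; is \$(go env GOPATH)/bin on it?" >&2
        return 1
    fi
}
```

Calling check_weed after the installation should print the version string if everything went well.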

Single node

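You can start a single node to experiment with. A script like the following will do:

#!/bin/bash -x

# Directory where SeaweedFS keeps its data
DATA_DIR="/some/dir/on/your/system"

# Create the data directory if it does not exist yet
mkdir -p "${DATA_DIR}"

# Start an all-in-one server with the S3 gateway enabled
weed server -dir="${DATA_DIR}" -s3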
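
The server takes a moment to come up, so a script that talks to it right away may fail. As a rough sketch, the loop below polls an endpoint until it answers. wait_for_s3 is a hypothetical helper, and the URL in the usage note assumes the S3 gateway listens on port 8333, the same port used in the examples further down.

```shell
# Poll the given URL until it answers or the attempts run out.
# Usage: wait_for_s3 URL [ATTEMPTS]
wait_for_s3() {
    local url="$1"
    local attempts="${2:-30}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        # Without -f, curl succeeds on any HTTP response, even an error
        # status, which is enough to know the port is being served.
        if curl -s -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}
```

For example: wait_for_s3 http://localhost:8333/ 30 || echo "S3 endpoint did not come up".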

Multiple nodes

The proper way to do it is to start multiple nodes, each on a different IP address. Each node can have a script that looks roughly like this:

#!/bin/bash -x

# Run from the directory that contains this script
cd "$(dirname "$0")"

DATA_DIR="/some/dir/where/to/store/data"

# Create the data directory if it does not exist yet
mkdir -p "${DATA_DIR}"

# Settings specific to this node
NODE_IP="192.168.0.2"
NODE_S3_PORT="8333"
NODE_DATACENTER="dc1"
NODE_RACK="rack1"
# Master addresses of all nodes in the cluster
PEERS="192.168.0.2:9333,192.168.0.3:9333,192.168.0.4:9333"

weed server -dir="${DATA_DIR}" -s3 -s3.port="${NODE_S3_PORT}" -master.peers="${PEERS}" \
	-ip="${NODE_IP}" -dataCenter="${NODE_DATACENTER}" -rack="${NODE_RACK}" -volume.max=100
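
The PEERS value is simply every node's master address joined with commas. If you keep the node IPs in one place, a small helper can assemble that string. This is only a sketch: build_peers is a hypothetical name, and 9333 is the master port taken from the example above.

```shell
# Join the given IPs into "ip:port,ip:port,..." for -master.peers.
# Usage: build_peers PORT IP [IP...]
build_peers() {
    local port="$1"
    shift
    local peers=""
    local ip
    for ip in "$@"; do
        # Prepend a comma only when the list already has an entry.
        peers="${peers:+${peers},}${ip}:${port}"
    done
    printf '%s\n' "$peers"
}
```

For example: PEERS="$(build_peers 9333 192.168.0.2 192.168.0.3 192.168.0.4)".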

Also, you probably want the option:

 -master.defaultReplication="010"

The replication policies are encoded as follows:

000: no replication
001: replicate once on the same rack
010: replicate once on a different rack, but same data center
100: replicate once on a different data center
200: replicate twice on two different data centers
110: replicate once on a different rack, and once on a different data center
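
Reading the list above, each digit counts extra copies at one level: other data centers, other racks, other servers on the same rack. The total number of copies is therefore one plus the sum of the three digits. The sketch below only illustrates that arithmetic; replication_copies is a hypothetical helper and not something SeaweedFS provides.

```shell
# Print the total number of copies implied by a three-digit replication code.
replication_copies() {
    local code="$1"
    case "$code" in
        [0-9][0-9][0-9]) ;;
        *)
            echo "invalid replication code: $code" >&2
            return 1
            ;;
    esac
    # Split the digits with parameter expansion, so that a code such as 010
    # is never interpreted as an octal number.
    local dc="${code%??}"
    local rest="${code#?}"
    local rack="${rest%?}"
    local server="${code#??}"
    echo $((1 + dc + rack + server))
}
```

With this reading, replication_copies 010 prints 2 and replication_copies 110 prints 3.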

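With that option added, and the settings moved to a configuration file, the script becomes:

#!/bin/bash -x

# Run from the directory that contains this script
cd "$(dirname "$0")"

# Load the node-specific settings
source ~/my-config.cfg

# Create the data directory if it does not exist yet
mkdir -p "${DATA_DIR}"

weed server -dir="${DATA_DIR}" -s3 -s3.port="${NODE_S3_PORT}" -master.peers="${PEERS}" \
	-master.defaultReplication="010" -ip="${NODE_IP}" -dataCenter="${NODE_DATACENTER}" \
	-rack="${NODE_RACK}" -volume.max=100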

This assumes that you first populate ~/my-config.cfg with suitable values for the following variables:

DATA_DIR=""
NODE_IP=""
NODE_S3_PORT=""
NODE_DATACENTER=""
NODE_RACK=""
PEERS=""
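
An empty or misspelled variable in the configuration file only shows up as a confusing weed error later. A small guard can catch it before the server starts. This is a sketch under the assumption that every variable used by the script must be non-empty; require_vars is a hypothetical helper name.

```shell
# Fail if any of the named variables is unset or empty.
# Usage: require_vars NAME [NAME...]
require_vars() {
    local name
    local value
    local missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${${name}:-}"
        if [ -z "$value" ]; then
            echo "missing value for ${name}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

In the start script, right after the source line, you could call: require_vars DATA_DIR NODE_IP NODE_S3_PORT NODE_DATACENTER NODE_RACK PEERS || exit 1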

Now you can start the process on each of the nodes (assuming you saved the script as start-node.sh):

./start-node.sh

Using the s3cmd client

I expect you already know s3cmd and what it does: it's an S3 client that creates buckets and uploads, lists and retrieves objects.

Since you are going to use it against your own endpoint rather than AWS, a few tweaks are necessary.

Here you'll find some useful information:

s3cmd with SeaweedFS

The gist is: create an s3cfg.seaweed file somewhere (the examples below use ~/s3cfg.seaweed) that contains:

# Setup endpoint
host_base = 192.168.0.2:8333
host_bucket = 192.168.0.2:8333
use_https = No
# Enable S3 v4 signature APIs
signature_v2 = False

In other words, the IP address and port of the endpoint.
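
If you have several endpoints, or you script the node setup anyway, the same file can be generated instead of typed. The sketch below writes exactly the settings shown above for a given host:port; write_s3cfg is a hypothetical helper name.

```shell
# Write an s3cmd config that points at the given endpoint.
# Usage: write_s3cfg HOST:PORT TARGET_FILE
write_s3cfg() {
    local endpoint="$1"
    local target="$2"
    cat > "$target" <<EOF
# Setup endpoint
host_base = ${endpoint}
host_bucket = ${endpoint}
use_https = No
# Enable S3 v4 signature APIs
signature_v2 = False
EOF
}
```

For example: write_s3cfg 192.168.0.2:8333 ~/s3cfg.seaweed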

Creating a bucket with s3cmd

You can use s3cmd to create new buckets:

s3cmd -c ~/s3cfg.seaweed mb s3://first-bucket
Bucket 's3://first-bucket/' created

Writing files to a bucket with s3cmd

You can save a file to a bucket like this:

s3cmd -c ~/s3cfg.seaweed put my-local-file s3://first-bucket
upload: 'my-local-file' -> 's3://first-bucket/my-local-file'  [1 of 1]
 6 of 6   100% in    0s    23.70 B/s  done

Retrieving files from a bucket with s3cmd

You can use s3cmd to retrieve your file:

s3cmd -c ~/s3cfg.seaweed get  s3://first-bucket/my-local-file
download: 's3://first-bucket/my-local-file' -> './my-local-file'  [1 of 1]
 6 of 6   100% in    0s     3.34 KB/s  done

Explore buckets with curl

There are a few operations that you can perform on buckets directly with an HTTP client such as curl.

Bucket contents with curl

If you call the endpoint with the name of the bucket, you will typically get an XML response like this:

curl http://192.168.0.2:8333/first-bucket

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>first-bucket</Name>
  <Prefix/>
  <Marker/>
  <MaxKeys>10000</MaxKeys>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>my-local-file</Key>
    <ETag>"c0710d6b4f15dfa88f600b0e6b624077"</ETag>
    <Size>6</Size>
    <Owner>
      <ID>3e8</ID>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
    <LastModified>2023-08-31T21:06:41Z</LastModified>
  </Contents>
</ListBucketResult>

where you can see the names of the stored objects, such as my-local-file.
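
If you only care about the object names, you can filter them out of the XML. The following is a quick sketch using grep and sed rather than a real XML parser, so keys containing escaped characters such as &amp; are printed as they appear in the XML. list_keys is a hypothetical helper name.

```shell
# Read a ListBucketResult document on stdin and print one object key per line.
list_keys() {
    grep -o '<Key>[^<]*</Key>' | sed -e 's/^<Key>//' -e 's/<\/Key>$//'
}
```

For example: curl -s http://192.168.0.2:8333/first-bucket | list_keys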

Retrieve files with curl

Once you know the names of the objects, you can retrieve them with a plain HTTP call:

curl http://192.168.0.2:8333/first-bucket/my-local-file
1
2
3

Security

As you might have noticed in the examples above, everything travels over plain HTTP, which is probably not acceptable in a production scenario.

Suffice it to say that everything here is a web service, so you can implement mTLS on top of it.

I might write down the details later on. For now, one idea I would like to suggest is using something like HAProxy (or Nginx as a reverse proxy) to implement TLS termination.
