A guide for creating a ClickHouse cluster from scratch
We will create a very minimal ClickHouse cluster using 3 VMs, 2 for ClickHouse Server and 1 for ClickHouse Keeper.
Configuring VMs to create the CH cluster
Install CH
DO THE FOLLOWING ON EACH VM
P.S. The keeper node won’t have a data disk, so create a clickhouse folder under the root directory and install CH there.
- SSH to the newly created VM
sudo su
- Create the XFS data disk and mount it (optional, only if you have added a data disk), for example:
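A minimal sketch, assuming the data disk shows up as /dev/sdb (the device name is an assumption, adjust it to your VM):
mkfs.xfs /dev/sdb                                      # format the data disk with XFS
mkdir -p /data                                         # mount point for the data disk
echo '/dev/sdb /data xfs defaults 0 0' >> /etc/fstab   # mount it on every boot
mount /data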
- Go inside the data directory, we will install ClickHouse inside this folder
cd /data
curl https://clickhouse.com/ | sh
./clickhouse install
this will create symlinks to be used by the OS.
./clickhouse server
this will create the files and directories needed for ClickHouse in the current directory.
ctrl+c
to stop the running instance, and verify the files were created by doing
ls
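A quick way to sanity-check the install is to look at the symlinks it created (assuming the default install paths) and print the version:
ls -l /usr/bin/clickhouse*    # clickhouse-server, clickhouse-client, etc. symlinked to the clickhouse binary
clickhouse-server --version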
Adding the ClickHouse installation to systemd
cd /usr/lib/systemd/system
touch clickhouse-server.service
vi clickhouse-server.service
- Insert this:
[Unit]
Description=ClickHouse Server (analytic DBMS for big data)
Requires=network-online.target
After=time-sync.target network-online.target
Wants=time-sync.target
[Service]
Type=simple
User=clickhouse
Group=clickhouse
Restart=always
RestartSec=30
RuntimeDirectory=clickhouse-server
ExecStart=/usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
# Minus means that this file is optional.
EnvironmentFile=-/etc/default/clickhouse
LimitCORE=infinity
LimitNOFILE=500000
CapabilityBoundingSet=CAP_NET_ADMIN CAP_IPC_LOCK CAP_SYS_NICE CAP_NET_BIND_SERVICE
[Install]
WantedBy=multi-user.target
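After saving the unit file, reload systemd so it picks up the new unit:
systemctl daemon-reload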
5. systemctl status clickhouse-server
this will show inactive
6. systemctl start clickhouse-server
this should start the service
7. Add this service to boot, so that CH starts when the VM starts:
systemctl enable clickhouse-server
8. Verify:
systemctl is-enabled clickhouse-server
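To double-check that the server actually answers queries, you can also run one through the client (this assumes the default user has no password; add --password otherwise):
clickhouse-client --query "SELECT version()"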
Creating the cluster
Configuring the cluster topology
THIS NEEDS TO BE DONE ON EACH SERVER NODE
Quick tip: in Vim you can press / to enter search mode; in nano you can press ctrl+w.
vi /etc/clickhouse-server/config.xml
This file is read-only, use :wq! to save and quit vim.
- Enable listen_host, this will allow the hosts to communicate with each other. Uncomment this line:
<listen_host>0.0.0.0</listen_host>
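Note that listen_host is only picked up when the server starts, so restart the service after saving the file:
systemctl restart clickhouse-server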
3. Go to the remote server config section (<remote_servers>).
4. Add the config for our cluster inside <remote_servers>; the default example entries can be left as they are:
<cluster1>
<shard>
<replica>
<host>x.x.x.x</host>
<port>9000</port>
</replica>
</shard>
<shard>
<replica>
<host>x.x.x.x</host>
<port>9000</port>
</replica>
</shard>
</cluster1>
In our setup we are defining only shards. If we go for replicas in the future, we can define them with additional <replica> entries inside each <shard>; this sets up data replication within that shard. Also, set internal_replication to true. The config will then look like this:
<cluster1>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>x.x.x.x</host>
<port>9000</port>
</replica>
<replica>
<host>replica_ip</host>
<port>9000</port>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>x.x.x.x</host>
<port>9000</port>
</replica>
<replica>
<host>replica_ip</host>
<port>9000</port>
</replica>
</shard>
</cluster1>
Verify this cluster inside ClickHouse
clickhouse-client
- List the clusters
SELECT cluster FROM system.clusters
3. It should list cluster1, and the count should equal the total number of replicas defined. In our case we have one replica per shard and 2 shards, so it should show cluster1 twice.
4. Get the details:
SELECT
cluster,
shard_num,
replica_num,
host_name,
port
FROM system.clusters
WHERE cluster = 'cluster1'
ORDER BY
shard_num ASC,
replica_num ASC
This should show the cluster topology as defined. In our case:
┌─cluster──┬─shard_num─┬─replica_num─┬─host_name───┬─port─┐
│ cluster1 │ 1 │ 1 │ x.x.x.x │ 9000 │
│ cluster1 │ 2 │ 1 │ x.x.x.x │ 9000 │
└──────────┴───────────┴─────────────┴─────────────┴──────┘
Looks good!
Configuring the keeper node
ClickHouse Keeper comes bundled with the ClickHouse server, so just installing ClickHouse is sufficient to run ClickHouse Keeper. Adding the keeper configuration to the clickhouse-server config enables it for you.
vi /etc/clickhouse-server/config.xml
- Enable listen_host, this will allow the hosts to communicate with each other. Uncomment this line:
<listen_host>0.0.0.0</listen_host>
3. We are using only one keeper node to handle coordination. There is no existing dummy entry for this, so add the config at the end: go to where the ClickHouse config ends (the closing </clickhouse> tag)
and add the following just before it.
- NOTE: We are using port 9234 rather than 9444 for the Raft port, so that our keeper is not affected if some other process is already using port 9444.
<keeper_server>
<tcp_port>2181</tcp_port>
<server_id>1</server_id>
<raft_configuration>
<server>
<id>1</id>
<hostname>x.x.x.x</hostname>
<port>9234</port>
</server>
</raft_configuration>
</keeper_server>
- In case we go for more than one keeper node, we can define the config as follows; remember that the server_id must be unique on each keeper node.
<keeper_server>
<tcp_port>2181</tcp_port>
<server_id>3</server_id>
<raft_configuration>
<server>
<id>1</id>
<hostname>x.x.x.x</hostname>
<port>9234</port>
</server>
<server>
<id>2</id>
<hostname>x.x.x.x</hostname>
<port>9234</port>
</server>
<server>
<id>3</id>
<hostname>x.x.x.x</hostname>
<port>9234</port>
</server>
</raft_configuration>
</keeper_server>
4. Restart ClickHouse:
systemctl restart clickhouse-server
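To check that the keeper is actually up, you can send it the ZooKeeper-style four-letter-word commands on the keeper port (2181 is the tcp_port we configured; this assumes nc is installed on the VM):
echo ruok | nc localhost 2181   # should answer: imok
echo mntr | nc localhost 2181   # prints basic keeper metrics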
Configuring the cluster
AlmaLinux tip: it has the firewall enabled by default, which will prevent the nodes from reaching each other. Disable SELinux enforcement and the firewall:
setenforce 0
systemctl stop firewalld
systemctl disable firewalld
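Alternatively, instead of disabling firewalld completely, you could open only the ports this cluster uses (a less drastic option; adjust if you changed any ports):
firewall-cmd --permanent --add-port=9000/tcp   # ClickHouse native protocol, used between the nodes
firewall-cmd --permanent --add-port=2181/tcp   # keeper client port (our tcp_port)
firewall-cmd --permanent --add-port=9234/tcp   # keeper Raft port
firewall-cmd --reload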
We need to tell the server nodes where the keeper is. Do the following on each of the server nodes.
1. vi /etc/clickhouse-server/config.xml
2. Add the zookeeper config:
<zookeeper>
<node index="1">
<host>x.x.x.x</host>
<port>2181</port>
</node>
</zookeeper>
- In the future, if we have more than one keeper instance, we can define the config as:
<zookeeper>
<node index="1">
<host>x.x.x.x</host>
<port>2181</port>
</node>
<node index="2">
<host>x.x.x.x</host>
<port>2181</port>
</node>
<node index="3">
<host>x.x.x.x</host>
<port>2181</port>
</node>
</zookeeper>
3. Restart ClickHouse:
systemctl restart clickhouse-server
4. Verify the ClickHouse Keeper config:
5. clickhouse-client
SELECT * FROM system.zookeeper WHERE path IN ('/', '/clickhouse')
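If that query errors out, the server log usually shows whether the connection to the keeper succeeded (the path below is the default unless you changed it in config.xml):
grep -i zookeeper /var/log/clickhouse-server/clickhouse-server.log | tail -n 20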
6. Add the macro definition to each node's config.xml:
<macros>
<shard>01</shard>
<replica>01</replica>
</macros>
Change the shard number to 02 on the node serving the second shard; if you add replicas later, the second replica of a shard similarly gets replica 02.
Verify using:
SELECT *
FROM system.macros
Verifying the cluster setup
We will create a database and tables on one node and verify that they get created on all of the other nodes.
- Create a database on the cluster:
CREATE DATABASE test1 ON CLUSTER cluster1
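You can already confirm the ON CLUSTER part worked by listing the databases on another node; <other-node-ip> below is a placeholder for whichever server node you did not run the statement on:
clickhouse-client --host <other-node-ip> --query "SHOW DATABASES"   # test1 should appear here too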
2. Create a replicated table on this cluster:
CREATE TABLE test1.table1 ON CLUSTER cluster1
(
`x` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/table1/{shard}', '{replica}')
PRIMARY KEY x
3. Create a Distributed table on top of the above, so that queries fan out across all the shards and replicas:
CREATE TABLE distributed_test1 ON CLUSTER cluster1 AS test1.table1
ENGINE = Distributed(cluster1, test1, table1)
All of this can be executed on any of the nodes and verified on any other node.
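As a final end-to-end check, you can insert a row into the replicated table on one node and read it back through the distributed table from another node (again assuming the default user, with <other-node-ip> as a placeholder):
clickhouse-client --query "INSERT INTO test1.table1 VALUES ('hello')"
clickhouse-client --host <other-node-ip> --query "SELECT * FROM distributed_test1"   # should return 'hello'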