Use Yandex ClickHouse for Analytics with Data from MySQL

Today I will discuss a rising star among analytics databases: ClickHouse, by Yandex. ClickHouse is quite new compared to other solutions, yet popular giants such as Yandex and Cloudflare already use it in production. In this blog entry, I will note the steps needed to set up ClickHouse for analytics with data streamed from MySQL.

The base system is Ubuntu 16.04 with Nginx and Percona Server for MySQL 5.7 installed, holding a table of ~160M records (~40 GB in InnoDB storage).

One point worth noting is that ClickHouse does not support DELETE / UPDATE statements. Several approaches to work around this will be covered in a future blog entry.

1. Install ClickHouse

  • Add the repo to sources.list (I choose to use a separate file):
    echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" > /etc/apt/sources.list.d/clickhouse.list
  • Install:
    apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4
    apt update -y
    apt install -y clickhouse-client clickhouse-server
  • The config file is at /etc/clickhouse-server/config.xml. Check this file before starting the service; in particular, make sure http_port and tcp_port do not clash with other services. For example, port 9000 is also used by PHP-FPM in some default configurations.
  • Start the ClickHouse service:
    service clickhouse-server start
  • Check with clickhouse-client:
    root@my-e5v2-server.tienle.com:/home/MyUSER# clickhouse-client
    ClickHouse client version 1.1.54380.
    Connecting to localhost:9000.
    Connected to ClickHouse server version 1.1.54380.
    
    my-e5v2-server.tienle.com :) SELECT 1;
    
    SELECT 1
    
    ┌─1─┐
    │ 1 │
    └───┘
    
    1 rows in set. Elapsed: 0.040 sec. 
    
  • Congratulations, it is done and working ;-)
  • Some beginner commands you might find useful:
    • You can use CREATE TABLE, CREATE DATABASE, SHOW TABLES, SHOW DATABASES, USE DBNAME, etc. in the clickhouse-client CLI.
    • To quickly access a database from the Linux command line, use clickhouse-client -d DBNAME.
    • Remember, so far we can only connect to ClickHouse from the local machine (over SSH).
    • If you want to import data from CSV, first create a table in ClickHouse, then use INSERT INTO tableName FORMAT CSV (or CSVWithNames if your file has a header). E.g.
      cat /tmp/my_csv_with_header.csv | clickhouse-client --query="INSERT INTO tientest.table_name FORMAT CSVWithNames";
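To make the CSV import concrete, here is a small Python sketch that builds a payload with a header row, as expected by FORMAT CSVWithNames. The table tientest.visits and its columns are hypothetical stand-ins for your own schema:

```python
import csv
import io

# Hypothetical table: tientest.visits(visit_date Date, user_id UInt32, revenue Float64).
rows = [
    {"visit_date": "2018-05-01", "user_id": 42, "revenue": 9.99},
    {"visit_date": "2018-05-02", "user_id": 43, "revenue": 0.0},
]

# Build a CSV payload whose first line is the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["visit_date", "user_id", "revenue"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue(), end="")

# Save the output to a file and pipe it in, e.g.:
#   cat visits.csv | clickhouse-client --query="INSERT INTO tientest.visits FORMAT CSVWithNames"
```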

2. Set up Tabix to work with ClickHouse

  1. First, we need to accept external connections in /etc/clickhouse-server/config.xml. Add <listen_host>::</listen_host> to this file (or modify the existing listen_host lines).
  2. Then, create a new user in /etc/clickhouse-server/users.xml:
    • Execute the following command to generate a random password together with its SHA-256 hash:
      PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'
    • Add a new user to the file with that password:
      <mych>
          <password_sha256_hex>c677b56940a55a2f083c165fc9601e05936f104295377a4f2a007e1d6c85854e</password_sha256_hex>
          <networks incl="networks" replace="replace">
              <ip>::/0</ip>
          </networks>

          <!-- Settings profile for user. -->
          <profile>default</profile>

          <!-- Quota for user. -->
          <quota>default</quota>
      </mych>
  3. Download the latest Tabix and set up an Nginx vhost as per the guide on the Tabix.IO website:
    mkdir /home/tabix
    cd /home/tabix
    wget https://github.com/tabixio/tabix/archive/18.04.1.zip
    unzip 18.04.1.zip
    cat > "/etc/nginx/conf.d/test-tabix.conf" <<'END'
    server {
        listen 80;
        server_name test-tabix.tienle.com;
        charset        utf-8;
        root /home/tabix/tabix-18.04.1/build;
        location / {
            if (!-f $request_filename) {
                rewrite ^(.*)$ /index.html last;
            }
            index  index.html index.htm;
        }
    }
    END
    systemctl restart nginx
  4. Open Tabix in your browser (the Nginx vhost above, e.g. http://test-tabix.tienle.com) and log in with the connection URL http://YOURIP:8123 and the mych user created above.
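The password_sha256_hex value in step 2 is simply the hex-encoded SHA-256 digest of the plaintext password. A quick Python sketch to reproduce it (the password here is a placeholder; use the random one generated above):

```python
import hashlib

# Placeholder password -- substitute the random one generated earlier.
password = "ClickHousePass"

# users.xml expects the hex-encoded SHA-256 of the plaintext password.
digest = hashlib.sha256(password.encode("utf-8")).hexdigest()
print(digest)  # paste into <password_sha256_hex>...</password_sha256_hex>
```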

3. Stream data from MySQL

  1. First, make sure you have Python 3.4 or later; on Ubuntu 16.04, check with python3 -V. We will also install python3-pip so that the pip3 command is available:
    apt install -y python3-pip
  2. Install clickhouse-mysql-data-reader using pip3
    1. Install using pip3:
      apt install -y libmysqlclient-dev
      pip3 install clickhouse-mysql
    2. You should now have the clickhouse-mysql command available:
      which clickhouse-mysql
    3. To finish installing the data reader, install its service files:
      clickhouse-mysql --install
  3. Configure MySQL to prepare for the connection
    1. First, add the following parameters to your [mysqld] section. If you are using Percona Server, the file is /etc/mysql/percona-server.conf.d/mysqld.cnf:
      server-id        = 11111 # Your server ID
      log_bin          = /var/lib/mysql/bin.log
      binlog-format    = row # very important if you want to receive write, update and delete row events
      # optional
      expire_logs_days = 30
      max_binlog_size  = 768M
      

      Then, restart the MySQL service with systemctl restart mysql.

    2. Next, add a reader user for ClickHouse to use later:
      CREATE USER 'reader'@'%' IDENTIFIED BY 'ClickHousePass';
      CREATE USER 'reader'@'127.0.0.1' IDENTIFIED BY 'ClickHousePass';
      CREATE USER 'reader'@'localhost' IDENTIFIED BY 'ClickHousePass';
      GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE, SUPER ON *.* TO 'reader'@'%';
      GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE, SUPER ON *.* TO 'reader'@'127.0.0.1';
      GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE, SUPER ON *.* TO 'reader'@'localhost';
      FLUSH PRIVILEGES;
  4. Finally, you can use the clickhouse-mysql command line to import data from MySQL into the ClickHouse server. Assume you need to import the big_table_here table from the MySQL database testdb. We will (i) create the table structure in ClickHouse, then (ii) migrate the existing MySQL data to ClickHouse, and (iii) set up the reader to listen to the MySQL binlog and perform INSERTs (only inserts; no DDL and no DELETE/UPDATE actions) whenever new data arrives in MySQL:
    1.  Create a table in ClickHouse:
      • Create the table template
        clickhouse-mysql \
            --src-host=127.0.0.1 \
            --src-user=reader \
            --src-password=ClickHousePass \
            --table-templates-with-create-database \
            --src-tables=testdb.big_table_here > create_clickhouse_table_template.sql
      • We will then need to edit create_clickhouse_table_template.sql to set the sharding field and primary key (these columns must not be Nullable; remember that the first argument to MergeTree must be a Date column, not DateTime or String). For example, edit it like this:
        ......
            `Year` UInt16,
        ......    
            `FlightDate` Date,
        ......    
            `Month` UInt8,
        ......
        ) ENGINE = MergeTree(FlightDate, (FlightDate, Year, Month), 8192)
      • Next, create the table with clickhouse-client:
        clickhouse-client -mn < create_clickhouse_table_template.sql
    2. Next, migrate the existing MySQL data to ClickHouse:
      clickhouse-mysql \
          --src-host=127.0.0.1 \
          --src-user=reader \
          --src-password=ClickHousePass \
          --table-migrate \
          --src-tables=testdb.big_table_here \
          --dst-host=127.0.0.1 \
          --dst-schema=tientest
    3. Finally, set up the reader to listen to the MySQL binlog and replicate new INSERT data:
      clickhouse-mysql \
          --src-server-id=11111 \
          --src-wait \
          --nice-pause=1 \
          --src-host=127.0.0.1 \
          --src-user=reader \
          --src-password=ClickHousePass \
          --src-tables=testdb.big_table_here \
          --dst-host=127.0.0.1 \
          --dst-schema=tientest \
          --dst-create-table \
          --migrate-table \
          --pump-data \
          --csvpool
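Conceptually, in --pump-data mode the reader turns each batch of binlog write-rows events into a ClickHouse INSERT. A simplified Python sketch of that transformation (the function, table, and column names are illustrative, not the tool's actual internals):

```python
import csv
import io

def rows_to_insert(table, columns, rows):
    """Render a batch of binlog row events as an INSERT ... FORMAT CSV payload."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    query = "INSERT INTO {} FORMAT CSV".format(table)
    return query, buf.getvalue()

# Example: two write-rows events captured from testdb.big_table_here.
events = [
    {"FlightDate": "2018-05-01", "Year": 2018, "Month": 5},
    {"FlightDate": "2018-05-02", "Year": 2018, "Month": 5},
]
query, payload = rows_to_insert(
    "tientest.big_table_here", ["FlightDate", "Year", "Month"], events
)
print(query)
print(payload, end="")
```

Batching matters here: ClickHouse prefers few large INSERTs over many small ones, which is why the reader pools rows (the --csvpool option) before flushing.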
