churro

main repo for the churro file/API processor

View on GitHub

churro Installation Guide

Development Host Setup

Here is a typical developer environment that we use to develop and test churro:

Operator System

Fedora 33

Kubernetes

Kubernetes 1.20

We tend to test locally with k8s installed via kubeadm.

kubeadm

For running on Fedora 33 you will need to disable swap as follows:

sudo touch /etc/systemd/zram-generator.conf
sudo touch /etc/systemd/zram-generator.conf

I tend to also disable selinux and the firewalld for local development:

systemctl disable firewalld
systemctl stop firewalld
vi /etc/sysconfig/selinux

Make sure your host can resolve to a single hostname and IP address that won’t change. Kubernetes needs to have your hostname resolve to an IP address that can be found in /etc/hosts for example.

sudo hostnamectl set-hostname yourhostname
sudo vi /etc/hosts (add yourhostname to your /etc/hosts file)

golang Compiler

go version 1.15.7

Kubernetes Storage Class

You need at least a single StorageClass for churro to use. We tend to use the hostpath provisioner found here: rimusz hostpath provisioner

helm repo add rimusz https://charts.rimusz.net
helm install my-hostpath-provisioner rimusz/hostpath-provisioner --version 0.2.10

WARNING!!! https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner/issues/25#issuecomment-742616668 You need to apply that bugfix with k8s 1.20 in order to get hostpath storage to work.

Other combinations might work as well, we tend to use the latest and greatest of the building blocks (kube, golang)

Source Code

Access to the churro source code is granted with a valid license which can be obtained by request with churrodata directly.

You would access the source as follows:

git clone git@gitlab.com:churro-group/churro

Install gRPC Tools

sudo dnf -y install protobuf-compiler
go get google.golang.org/protobuf/cmd/protoc-gen-go \
         google.golang.org/grpc/cmd/protoc-gen-go-grpc

Configure Docker

For running churro with cockroachdb, you will need to configure Docker to support a larger than default set of file limits:

sudo vi /etc/sysconfig/docker
Add OPTIONS=”--default-ulimit nofile=50000:50000” to the OPTIONS variable

sudo vi /etc/systemd/system.conf
Add DefaultLimitNOFILE=4096:524288

churro-download

TODO

Deploy churro

Deploy churro

This document describes the steps to set up churro on an Azure Kubernetes Cluster (AKS), the same steps will generally apply to other Kube cluster/environments. Users will want to adjust the various configuration steps below to match their particular environment.

Create Namespace

kubectl create namespace churro

Create Pull Secret

Since churro images are currently private, you will need to be able to authenticate to the churro container registry which will update your Docker config with credentials used for an image pull Secret, in the churro container manifests we’ll refer to that image pull Secret.

kubectl -n churro create secret generic regcred \
    --from-file=.dockerconfigjson=<path/to/.docker/config.json> \
    --type=kubernetes.io/dockerconfigjson

Deploy cockroachdb Operator

This operator is used only for running the cockroachdb instance used by the churro web console if you specify a databasetype value of cockroachdb in the churroui CR.

kubectl create -f deploy/cockroachdb/crd.yaml
kubectl -n churro create -f deploy/cockroachdb/rbac.yaml
kubectl -n churro create -f deploy/cockroachdb/operator.yaml

Wait for the cockroachdb operator to be in running status.

kubectl -n churro get pod

Deploy cockroachdb Instance

kubectl -n churro create -f deploy/cockroachdb/cr.yaml

Wait for the 3 cockroach pods to be in running status.

kubectl -n churro get pod
NAME                                  READY   STATUS    RESTARTS   AGE
cockroach-operator-5864f4f767-w5hgx   1/1     Running   0          14m
cockroachdb-0                         1/1     Running   0          2m
cockroachdb-1                         1/1     Running   0          73s
cockroachdb-2                         1/1     Running   0          41s

Customize Templates

The churro operator can read its templates from either a ConfigMap named churro-templates or from templates that are embedded in the churro-operator binary during at build time.

If the ConfigMap is not present then all templates will be used based on the embedded templates within the churro-operator binary. If the ConfigMap is found, then it takes precedent over the embedded version for a given template. This scheme allows you to modify a single churro template and store it within a ConfigMap but use the embedded templates for all other templates.

There is a Makefile target that demonstrates how to create a template ConfigMap:

make update-templates

Deploy churro operator

The following Makefile target can be used to create the churro-operator Deployment:

make deploy-churro-operator

Customize the churroui CR

You will want to customize the churroui CR before you create it to fit your installation environment. Here is an example of settings that are possible:

apiVersion: churro.project.io/uiv1alpha1
kind: Churroui
metadata:
  name: fuzzy
  labels:
    name: fuzzy
status:
  active: "true"
  standby:
    - "AAPL"
    - "AMZN"
spec:
  databasetype: cockroachdb
  servicetype: LoadBalancer
  storageclassname: default
  storagesize: 1G
  accessmode:  ReadWriteOnce
setting meaning default possible values
databasetype the web consoles database it will use cockroachdb  
servicetype the type of Service for the web console ClusterIP ClusterIP or LoadBalancer
storageclassname the storage class name for the web console app default storage class must be an existing storage class name
storagesize the size of volume to request for the web console app default size for storage class must be valid syntax for kube storage resource and storage class
accessmode the storage access mode to use for the web console app ReadWriteMany ReadWriteMany or ReadWriteOnce

Create the churroui CR

Note that you can specify which databases you want churro to support by editing the churro-ui.yaml Custom Resource as found in the following directory/file:

cmd/churro-operator/deploy/templates/churro-ui.yaml

This will only effect the user interface when a user wants to create a pipeline, it does not determine what database operators get installed. Database operators are installed manually by the churro administrator during installation.

kubectl -n churro create -f deploy/operator/churro-ui-cr.yaml

Verify the churro web console is running:

kubectl -n churro get pod
NAME                                  READY   STATUS    RESTARTS   AGE
churro-operator-744b879c67-2bjqq      1/1     Running   5          5m57s
churro-ui-98fdcf565-5xrq5             1/1     Running   0          7s
cockroach-operator-5864f4f767-w5hgx   1/1     Running   0          50m
cockroachdb-0                         1/1     Running   0          38m
cockroachdb-1                         1/1     Running   0          37m
cockroachdb-2                         1/1     Running   0          36m

Once the web console is running, it will create some database tables for churro to write state into, you can verify this by looking at the tables it creates:

make run-web-console-db-client
root@:26257/defaultdb> show tables from defaultdb;
  schema_name |     table_name     | type  | estimated_row_count
--------------+--------------------+-------+----------------------
  public      | authenticateduser  | table |                   0
  public      | userpipelineaccess | table |                   0
  public      | userprofile        | table |                   1
(4 rows)

Time: 30ms total (execution 22ms / network 9ms)
root@:26257/defaultdb> quit

By default, the web console has a ClusterIP Service type that requires you to port forward to gain access on your local development host.

The following make target will create a forwarded port (8080) on your local dev box:

make port-forward

You can then access it via the url https://:8080

The web console is bootstrapped with an administrator user ID of admin@admin.org and a password of admin.

Validate Installation

After you complete the churro installation, you should be able to create a Pipeline to further validate your installation as being complete.

Create Pipeline

Here are the steps you should be able to perform:

Once the status goes green it means you have a running pipeline. You can verify this by looking on your kube cluster at the namespace created for the pipeline:

kubectl -n pipeline1 get pod
NAME                        READY   STATUS    RESTARTS   AGE
churro-ctl                  1/1     Running   0          3m8s
churro-extractsource        1/1     Running   0          3m8s
cockroachdb-0               1/1     Running   0          3m25s
cockroachdb-1               1/1     Running   0          3m25s
cockroachdb-2               1/1     Running   0          3m25s
cockroachdb-client-secure   1/1     Running   0          3m8s

sftp-setup

What is sftp?

sftp is a fairly mainstream means to upload or transfer files to and from different hosts. In churro, we have included an sftp service within the churro-watch Pod to enable an easy means for users to upload data files directly into churro Data Source file folders.

How it works

Within the churro-watch Pod, we include a new container, this container will run the sftp server from text. We have configured this sftp server to mount the same file folder that crunchy-watch uses to process uploaded data source files (e.g. CSV, JSON, Excel, XML).

sftp Service

By default, churro doesn’t create a Service that exposes the sftp service. You can create one by running the following Makefile target:

make create-sftp-service

This service exposes the sftp port (2022) as well as the sftp web console port (8080). You can now port-forward to this service or modify it to be a LoadBalancer service depending on your environment.

Port Forward to the sftp Web Console

There is a Makefile target that you can use to port-forward to the sftp web console:

make port-forward-sftp-web

sftp User Configuration Example

The sftp service includes a web console that allows you to define sftp users. We use this console to add a churro user account that we will use to upload data files.

First, generate a unique key as follows:

ssh-keygen

This will create a public and private key such as /home/yourname/.ssh/[id_rsa, id_rsa.pub]

Within the sftp web console, we create a user that specifies the public key contents you generated. You will also want to set the data directory to be /data, this is where the churro-watch data source directory is mounted within the sftp container. This mounted volume is shared with the churro-watch container.

sftp Client Configuration

Next, we create a sftp config file (e.g. config.sftp) that can be used on the client host that a user would use to upload files to churro. The config file looks like this:

Host localhost
User test2
IdentityFile /home/jeffmc/.ssh/id_rsa

You can upload files using sftp by using a command as follows:

sftp -F /home/yourname/config.sftp -P 2022 test2@yourhost

After connecting, you can look around the file directory that churro-watch uses to find data files to process:

sftp> cd csvfiles
sftp> ls
ready                                   source1.csv.churro-processed
sftp> put source1.csv

The put sftp command uploads a file source1.csv from your current local client directory to the remote server directory. Effectively this will cause churro-watch to process the uploaded file.