Extract Sources
An extract source represents either a streaming API like a JSON feed or a type of file like CSV or Excel. A churro pipeline lets you define multiple extract sources which it will use to extract data from and load into the pipeline backend database.
Excel Files
An example of an extract source for an Excel file would look like the following:
extractrules:
- columnname: city
columnpath: "1"
columntype: TEXT
extractsourceid: c4p8hhhjiqcs73ee0lsg
id: c4p8hkpjiqcs73ee0lt0
matchvalues: ""
transformfunctionname: None
extractsources:
- cronexpression: 30s
id: c4p8hhhjiqcs73ee0lsg
name: my-xlsx-files
path: /churro/xlsxfiles
regex: '[a-z,0-9].(xlsx)$'
scheme: xlsx
skipheaders: 1
tablename: myxlsxtable
With Excel files, you can specify how many header rows there are to skip during processing. In the above example, we specify that we want to skip a single header row, store the processed rows into the myxlsxtable database backend table, and also process files from the /churro/xlsxfiles directory that have a .xlsx file extension.
A single extract rule is associated with the extract source, we reference a column we call city by specifying a path of 1. For Excel files, we define columns as starting from 0, so in this example the city value is in the second column of the spreadsheet file.
CSV Files
Comma Separated Files (CSV) are processed by churro. As with Excel files, we reference each CSV column value by an integer value starting with 0.
Here is an example of a CSV extract source definition:
extractrules:
- columnname: a
columnpath: "1"
columntype: TEXT
extractsourceid: c4p6rg1jiqcs73d90500
id: c4p6ri1jiqcs73d9050g
matchvalues: ""
transformfunctionname: None
extractsources:
- cronexpression: 30s
id: c4p6rg1jiqcs73d90500
name: my-csv-files
path: /churro/csvfiles
regex: '[a-z,0-9].(csv)$'
scheme: csv
tablename: mycsvtable
In the above example, we define an extract rule for a column we call a, that has a column path of 1. This value is the second CSV column value in each row.
We can tell churro to also skip header rows by specifying a skipheaders value.
XML Files
Files of XML content can be processed by churro with a definition similar to the following:
extractrules:
- columnname: author
columnpath: /library/book/author/name
columntype: TEXT
extractsourceid: c4qhjipjiqcs73a43460
id: c4qhkfhjiqcs73a4346g
matchvalues: ""
transformfunctionname: None
extractsources:
- cronexpression: 30s
id: c4qhjipjiqcs73a43460
name: my-xml-files
path: /churro/xmlfiles
regex: '[a-z,0-9].(xml)$'
scheme: xml
tablename: myxmltable
Notice that we are extract a column that we name author. We use an xmlpath expression to locate the value we want, e.g. /library/book/author/name.
JSON Path Message Files
churro can process files that contain a single JSON message, and you can extract values from the JSON message using jsonpath expressions.
Here is an example of an extract source for jsonpath processing:
extractrules:
- columnname: author
columnpath: $..book[*].author
columntype: TEXT
extractsourceid: c4qi3epjiqcs73a434c0
id: c4qi3p1jiqcs73a434cg
matchvalues: ""
transformfunctionname: transforms.MyUppercase
extractsources:
- cronexpression: 30s
id: c4qi3epjiqcs73a434c0
name: my-jsonpath-files
path: /churro/jsonpathfiles
regex: '[a-z,0-9].(jsonpath)$'
scheme: jsonpath
tablename: myjsonpathtable
In this example, we are extract an author field from the JSON message file using a jsonpath expression of $..book[].author. We also are performing a transformation function of *MyUppercase on the value before we insert it into the pipeline database table myjsonpathtable.
Raw JSON Files
churro can ingest a raw JSON file and store it within a JSONB database column without performing any field extractions. This type of JSON processing is meant for situations where you want to store the raw JSON directly into the database as a single database row with the JSON message stored in a single column.
In this extract source type, there is no transformation function capability offered by churro.
An example of an extract source definition for raw json files looks like:
extractrules:
extractsources:
- cronexpression: 30s
id: c4qinf9jiqcs73a434d0
name: my-json-files
path: /churro/jsonfiles
regex: '[a-z,0-9].(json)$'
scheme: json
tablename: myjsontable
Notice in this file type, that no extract rules are specified, the entire json message is extracted and loaded into the pipeline table myjsontable.
JSON API Stream
churro can continually process data from a JSON API and store it in the pipeline database. Here is an example that defines a SpaceX Starlink extract source:
extractrules:
- columnname: shipname
columnpath: $[*].spaceTrack.OBJECT_NAME
columntype: TEXT
extractsourceid: c4qi3epjiqcs73a434c0
id: c4qi3p1jiqcs73a434cg
matchvalues: ""
transformfunctionname: transforms.MyUppercase
- columnname: longitude
columnpath: $[*].longitude
columntype: DECIMAL
extractsourceid: c4qi3epjiqcs73a434c0
id: c4qi3p1jiqcs73a434ch
- columnname: latitude
columnpath: $[*].latitude
columntype: DECIMAL
extractsourceid: c4qi3epjiqcs73a434c0
id: c4qi3p1jiqcs73a434ci
extractsources:
- cronexpression: @every 1h
id: c4qi3epjiqcs73a434c0
name: my-starlink-feed
path: https://api.spacexdata.com/v4/starlink
scheme: api
tablename: mystarlinktable
This example causes the SpaceX feed to be polled every hour based on the cronexpression. There are 3 extract rules defined: shipname, latitude, longitude. The extracted values are stored in the mystarlinktable database table in the pipeline’s backend database.
http POST Messages
An example of an extract source for an Excel file would look like the following:
extractsources:
- cronexpression: '@every 1h'
encoding: urlencoded
id: c591rb1jiqcs73edkbvg
name: my-httppost-files
path: https://my-httppost-files.pipeline1.cluster.svc.local:10000/extractsourcepush
port: 10000
scheme: httppost
servicetype: ClusterIP
tablename: myhttpposttable
transport: https
With http POST messages, you have an extract source that runs when you start it and you can also stop it at anytime you want as well. The http POST extract service waits for http POST messages to receive essentially. So, any device that can perform an http POST can send messages to a churro pipeline.
In the above example, we specify that we want a service to write http POST messages into a table called myhttpposttable. The http POST messages are encoded using urlencoded values. You can also specify an encoding of json if you expect JSON messages to be posted.
Message producers would post to the url at https://my-httppost-files.pipeline1.cluster.svc.local:10000/extractsourcepush. A Service is built for http POST extract sources that allow you to specify a ClusterIP or a LoadBalancer service type.
You can test out an extract source like this using curl commands like the following:
curl -d "firstname=david" -X POST https://192.168.0.120:10000/extractsourcepost
curl -d '{"login":"my_login","password":"my_password"}' -H 'Content-Type: application/json' -X POST http://192.168.0.120:10000/extractsourcepost
You can also specify an extract source transport value of either http or https. In the case of specifying https, the extract source will create a TLS connection and use the service.crt and service.key keys that are mounted by churro into the extract source Pod.
Note that for json encoded messages, if you do not specify any extract rules, the entire JSON message will be stored within a jsonb column type in the back-end database.