Singer Spec
The Singer Specification is an open source standard for defining the format of data exchange. The standard is useful because it enables data professionals to move data between arbitrary systems, as long as the programs generating and ingesting the data can understand this format.
This documentation is our attempt to distill the canonical specification into a format that is easier to understand and follow for people who are new to the Singer community. The full specification is in the Singer project on GitHub.
Version
The current version of the spec is 0.3.0 and is versioned using Semantic Versioning.
Basics
Messages
The full specification for data exchange consists of three types of JSON-formatted messages: `schema`, `record`, and `state`. The `record` message contains the actual data being communicated, the `schema` message defines the structure of the data, and the `state` message keeps track of the progress of an extraction.
Here is an example of what these messages look like:
{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}
{"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Mike"}}
{"type": "SCHEMA", "stream": "locations", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "locations", "record": {"id": 1, "name": "Philadelphia"}}
{"type": "STATE", "value": {"users": 2, "locations": 1}}
Each `record` message contains a `stream` identifier which specifies the unique name of that data. For data coming from an API this can be thought of as the name of the endpoint. For data coming from a database this might be the table name. The `schema` message will have a matching `stream` identifier for the records it describes. The term "stream" will be used in the rest of the documentation to identify a set of data being extracted.
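As an illustration (not part of the spec), a minimal Python program that emits the example messages above could look like this:

import json
import sys

def emit(message):
    # Each Singer message is one JSON object written to stdout on its own line.
    sys.stdout.write(json.dumps(message) + "\n")

emit({"type": "SCHEMA", "stream": "users", "key_properties": ["id"],
      "schema": {"required": ["id"], "type": "object",
                 "properties": {"id": {"type": "integer"}}}})
emit({"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}})
emit({"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Mike"}})
emit({"type": "STATE", "value": {"users": 2}})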
Taps
These 3 message types are generated by programs called `taps`. A tap can be written in any programming language (note that Meltano will only run Python-based taps). Taps output the 3 messages to standard output, aka `stdout`.
Taps are required to accept 1 file, called a configuration (`config`) file, and can optionally accept 2 other files, called `state` and `catalog` files.
File | Description |
---|---|
`config.json` | JSON-formatted file containing any information needed for a tap to run. This can include authorization information such as username and password, date parameters to specify when to start extracting, and anything else useful for pulling a specific set of data. |
`state.json` | JSON-formatted file used to store information between runs of a tap. There is no specification for the format of a state file other than the JSON requirement. If a tap is able to accept a state file, it is expected to output state messages as well. |
`catalog.json` | JSON-formatted file that specifies which streams, and which entities within those streams (such as columns or fields), to extract. It can also define how streams are replicated and can include extra metadata about a particular stream. |
Targets
The 3 message types are consumed by programs called `targets`. A target can be written in any programming language (note that Meltano will only run Python-based targets). Targets ingest the 3 messages from standard input, aka `stdin`.
Targets can optionally accept a configuration file if the target system requires authentication information. For a simple target like a CSV file this is not required, but for a more complicated target like a SaaS database the config file would be required.
Taps | Targets
Since taps and targets are able to communicate with each other via the Singer spec, they can be used together to move data between systems. This can be done on the command line by sending the messages from a tap to a target using a Unix pipe, `|`. A pipe takes the `stdout` of one process, in this case a tap, and redirects it to the `stdin` of a second process, in this case a target. This means taps and targets can be composed as simply as `tap | target`.
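For example, with a hypothetical `tap-foo` and `target-bar` installed (the names here are placeholders, not real connectors), a complete extraction and load could be run as a single command:

tap-foo --config tap-config.json | target-bar --config target-config.json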
Details
Messages
Each of the messages has a defined schema with some required and optional fields. Note that while example messages are shown here on multiple lines, each message, when output from a tap, must be on its own line.
Records
Record messages contain the actual data being extracted from a source system. Every record message must have the following properties:
- `type` - this will always be `RECORD`
- `stream` - the unique identifier of the data stream
- `record` - a JSON object containing the data being extracted
Record messages can optionally have:
- `time_extracted` - the time the record was observed in the source. This should be an RFC3339 formatted date-time, like "2022-11-20T16:45:33.000Z"
Putting it together, a full record message looks like this:
{
  "type": "RECORD",
  "stream": "tools",
  "time_extracted": "2021-11-20T16:45:33.000Z",
  "record": {
    "id": 1,
    "name": "Meltano",
    "active": true,
    "updated_at": "2021-10-20T16:45:33.000Z"
  }
}
Note that in the above example the message was formatted for readability, but when output from a tap the entire message will be on a single line.
Schemas
Schema messages define the structure of the data sent in a record message. Every schema message must have the following properties:
- `type` - this will always be `SCHEMA`
- `stream` - the unique identifier of the data stream. This will match the `stream` property in record messages
- `schema` - a JSON Schema describing the `record` property of record messages for a given stream
- `key_properties` - a list of strings indicating which properties make up the primary key for this stream. Each item in the list must be the name of a top-level property defined in the schema. An empty list may be used to indicate there is no primary key for the stream
What is a JSON Schema?
A JSON Schema is a way to annotate and validate JSON objects. The data types available in raw JSON are limited compared to the variety of types available in many targets. Within the Singer Spec, JSON schema definitions are used to tell a target the exact data type to use when storing data.
Using the `record` example shown previously, the JSON schema for that record could be:
{
  "properties": {
    "id": {
      "type": "integer"
    },
    "name": {
      "type": "string"
    },
    "active": {
      "type": "boolean"
    },
    "updated_at": {
      "type": "string",
      "format": "date-time"
    }
  }
}
This definition now explicitly defines what kind of data is expected in a record and how to handle it when loading the data.
Also of note, there are several different versions of JSON Schema. The most common one is Draft 4, which both Meltano and the SDK support.
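As an illustration (not part of the spec), a Python program could use the `jsonschema` library to validate a record against a Draft 4 schema before loading it:

from jsonschema import Draft4Validator

schema = {
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "active": {"type": "boolean"},
        "updated_at": {"type": "string", "format": "date-time"},
    }
}

record = {"id": 1, "name": "Meltano", "active": True, "updated_at": "2021-10-20T16:45:33.000Z"}

# Raises jsonschema.exceptions.ValidationError if the record does not match.
Draft4Validator(schema).validate(record)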
Optional SCHEMA message properties
Schema messages can optionally have:
- `bookmark_properties` - a list of strings indicating which properties the tap is using as bookmarks. Each item in the list must be the name of a top-level property defined in the schema. This is discussed more in the bookmarks section.
Putting it together, a full schema message looks like this:
{
  "type": "SCHEMA",
  "stream": "tools",
  "schema": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "string"
      },
      "active": {
        "type": "boolean"
      },
      "updated_at": {
        "type": "string",
        "format": "date-time"
      }
    }
  },
  "key_properties": ["id"],
  "bookmark_properties": ["updated_at"]
}
Note that in the above example the message was formatted for readability, but when output from a tap the entire message will be on a single line.
Ordering of SCHEMA and RECORD Messages
Any record messages for a given data stream must be preceded by a schema message for that stream. If a record is output without a preceding schema message, the extraction will still work, but the record will be treated as schema-less and may be loaded in an unexpected manner.
State
State messages contain any information that a tap is designed to persist. These are used to inform the target of the current place in the extraction of a data stream. Each state message must have the following properties:
- `type` - this will always be `STATE`
- `value` - this is a JSON object of the state values to be stored
The structure of the `value` property is not defined by the spec and is determined by each tap independently. However, the following structure is recommended:
{
  "bookmarks": {
    "tools": {
      "updated_at": "2021-10-20T16:45:33.000Z"
    },
    "team": {
      "id": 123
    }
  }
}
The `bookmarks` key should be familiar from the optional `bookmark_properties` key in a schema message. Each property within the `bookmarks` JSON object is a data stream from a previously defined schema and record. Each stream maps to a JSON object storing the last data point seen in the extraction.
In the above example, the `tools` stream has extracted data up to the timestamp shown in the `updated_at` field. Similarly, the `team` stream has extracted up to `id` = 123.
Putting it together, a full state message looks like this:
{
  "type": "STATE",
  "value": {
    "bookmarks": {
      "tools": {
        "updated_at": "2021-10-20T16:45:33.000Z"
      },
      "team": {
        "id": 123
      }
    }
  }
}
Note that in the above example the message was formatted for readability, but when output from a tap the entire message will be on a single line.
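To make the flow concrete, here is a minimal sketch (assuming Python and a hypothetical `fetch_rows` helper that returns only rows updated after a given bookmark) of a tap resuming from a bookmark and emitting an updated state message:

import json
import sys

def emit(message):
    # Singer messages are written to stdout, one JSON object per line.
    sys.stdout.write(json.dumps(message) + "\n")

# Bookmark loaded from a state file (or a configured start date on the first run).
bookmark = "2021-10-20T16:45:33.000Z"

for row in fetch_rows("tools", updated_since=bookmark):  # hypothetical helper
    emit({"type": "RECORD", "stream": "tools", "record": row})
    bookmark = max(bookmark, row["updated_at"])

emit({"type": "STATE", "value": {"bookmarks": {"tools": {"updated_at": bookmark}}}})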
Taps
When taps are run, they can accept three files that provide information necessary for them to work properly: config, state, and catalog files. Taps are required to accept the config file and can optionally accept the state and catalog files.
Config Files
The config file contains the parameters required by the tap to successfully extract data from the source. Typically this will include credentials for an API or database connection.
There is no required specification, but it is recommended to have the following fields:
- `start_date` - this is used on the first sync to define how far back in time to pull data. Start dates should conform to the RFC3339 specification.
- `user_agent` - this should be an email address or other contact information should the API provider need to contact you for any reason
Putting this all together, a config file may look like:
# config.json
{
  "api_key": "asd23ayzxz80adf",
  "start_date": "2022-01-01T00:00:00Z",
  "user_agent": "your_email@domain"
}
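As a sketch of how a tap might consume this file (assuming Python, and that no state file was passed on the first run):

import json

with open("config.json") as f:
    config = json.load(f)

state = {}  # populated from state.json when one is passed to the tap

# On the first sync there is no saved bookmark, so fall back to the
# configured start_date to decide how far back to pull data.
bookmark = state.get("bookmarks", {}).get("tools", {}).get("updated_at", config["start_date"])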
State Files
Taps can optionally use a state file to start replication from a previous point in a data stream. The structure of the state file and the state message described previously should be nearly identical: the `value` property in the state message will be the contents of any state.json file passed to a tap.
Using the previous example, a state file would look like this:
# state.json
{
  "bookmarks": {
    "tools": {
      "updated_at": "2021-10-20T16:45:33.000Z"
    },
    "team": {
      "id": 123
    }
  }
}
Catalog Files
Catalog files define the structure of one or many data streams. Taps are capable of both using and generating catalog files.
The structure of a catalog file is a JSON object with a single top-level property:
- `streams` - this is a list containing information for each data stream that can be extracted
Each item within the `streams` list is another JSON object with the following required properties:
- `stream` - this is the primary identifier of the stream as it will be passed to the target (`tools`, `team`, etc.)
- `tap_stream_id` - this is the unique identifier of the stream, which can differ from the `stream` name since some sources may have multiple available streams with the same name
- `schema` - this is the JSON schema of the stream, which will be passed in a SCHEMA message to the target
Optional properties within the list are:
- `table_name` - this is only used for a database source and is the name of the table
- `metadata` - this is a list that defines extra information about items within a stream. This is discussed more in the Metadata section below
An example catalog with a single stream and no metadata is as follows:
{
  "streams": [
    {
      "stream": "tools",
      "tap_stream_id": "tools",
      "schema": {
        "type": ["null", "object"],
        "additionalProperties": false,
        "properties": {
          "id": {
            "type": ["string"]
          },
          "name": {
            "type": ["string"]
          },
          "updated_at": {
            "type": ["string"],
            "format": "date-time"
          }
        }
      }
    }
  ]
}
Discovery Mode
Discovery mode is how taps generate catalogs. When a tap is invoked with a `--discover` flag it will output the full catalog of streams available for extraction to `stdout`. This can then be saved to a `catalog.json` file.
tap --config config.json --discover > catalog.json
Note that some older taps use `properties.json` as the catalog file.
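A minimal sketch of how a tap might implement discovery mode (assuming Python and a hypothetical `build_streams` helper that introspects the source):

import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True)
parser.add_argument("--discover", action="store_true")
args = parser.parse_args()

if args.discover:
    # build_streams is a hypothetical helper that introspects the source
    # and returns the list of available streams with their schemas.
    catalog = {"streams": build_streams()}
    json.dump(catalog, sys.stdout, indent=2)
    sys.exit(0)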
Metadata
Metadata is the preferred method of associating extra information about streams and properties within a stream.
There are two kinds of metadata:
- `discoverable` - this metadata should be written and read by a tap
- `non-discoverable` - this metadata is written by other systems, such as Meltano, and should only be read by the tap

A tap is free to write any type of metadata it feels is useful for describing fields in the schema, although several reserved keywords exist. A tap that extracts data from a database should use additional metadata to describe the properties of the database.
Non-discoverable Metadata
Keyword | Tap Type | Description |
---|---|---|
`selected` | All | Either `true` or `false`. Indicates that this node in the schema has been selected by the user for replication. |
`replication-method` | All | Either `FULL_TABLE`, `INCREMENTAL`, or `LOG_BASED`. The replication method to use for a stream. See Data Integration for more details on the replication type. |
`replication-key` | All | The name of a property in the source to use as a bookmark. For example, this will often be an `updated_at` field or an auto-incrementing primary key (requires `replication-method`). |
`view-key-properties` | Database | List of key properties for a database view. |
Discoverable Metadata
Keyword | Tap Type | Description |
---|---|---|
`inclusion` | All | Either `available`, `automatic`, or `unsupported`. `available` means the field is available for selection, and the tap will only emit values for that field if it is marked with `"selected": true`. `automatic` means that the tap will emit values for the field. `unsupported` means that the field exists in the source data but the tap is unable to provide it. |
`selected-by-default` | All | Either `true` or `false`. Indicates if a node in the schema should be replicated when a user has not expressed any opinion on whether or not to replicate it. |
`valid-replication-keys` | All | List of the fields that could be used as replication keys. |
`forced-replication-method` | All | Used to force the replication method to either `FULL_TABLE` or `INCREMENTAL`. |
`table-key-properties` | All | List of key properties for a database table. |
`schema-name` | Database | Name of the schema. |
`is-view` | Database | Either `true` or `false`. Indicates whether a stream corresponds to a database view. |
`row-count` | Database | Number of rows in a database table/view. |
`database-name` | Database | Name of the database. |
`sql-datatype` | Database | Represents the datatype of a database column. |
Each piece of metadata has two primary keys:
- `metadata` - this is a JSON object containing all of the metadata for either the stream or a property of the stream
- `breadcrumb` - this identifies whether the metadata applies to the entire stream or a property of the stream. An empty list means the metadata applies to the stream. For specific properties within the stream, the breadcrumb will have the `properties` key followed by the name of the property being described.
An example of a valid metadata object is as follows:
"metadata": [
{
"metadata": {
"inclusion": "available",
"table-key-properties": ["id"],
"selected": true,
"valid-replication-keys": ["date_modified"],
"schema-name": "users",
},
"breadcrumb": []
},
{
"metadata": {
"inclusion": "automatic"
},
"breadcrumb": ["properties", "id"]
},
{
"metadata": {
"inclusion": "available",
"selected": true
},
"breadcrumb": ["properties", "name"]
},
{
"metadata": {
"inclusion": "automatic"
},
"breadcrumb": ["properties", "updated_at"]
}
]
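Consumers of a catalog typically convert this list into a map keyed by breadcrumb. The official singer-python library provides similar helpers; the following standalone Python sketch illustrates the idea:

metadata_list = [
    {"metadata": {"selected": True}, "breadcrumb": []},
    {"metadata": {"inclusion": "automatic"}, "breadcrumb": ["properties", "id"]},
]

def metadata_map(entries):
    # Key each metadata entry by its breadcrumb (as a tuple) so lookups are direct.
    return {tuple(e["breadcrumb"]): e["metadata"] for e in entries}

mdata = metadata_map(metadata_list)
print(mdata[()].get("selected", False))          # stream-level metadata -> True
print(mdata[("properties", "id")]["inclusion"])  # property-level metadata -> 'automatic'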
Putting it Together
Putting this all together, a complete catalog example looks like this:
{
  "streams": [
    {
      "stream": "tools",
      "tap_stream_id": "tools",
      "schema": {
        "type": ["null", "object"],
        "additionalProperties": false,
        "properties": {
          "id": {
            "type": ["string"]
          },
          "name": {
            "type": ["string"]
          },
          "updated_at": {
            "type": ["string"],
            "format": "date-time"
          }
        }
      },
      "metadata": [
        {
          "metadata": {
            "inclusion": "available",
            "table-key-properties": ["id"],
            "selected": true,
            "valid-replication-keys": ["date_modified"],
            "schema-name": "users"
          },
          "breadcrumb": []
        },
        {
          "metadata": {
            "inclusion": "automatic"
          },
          "breadcrumb": ["properties", "id"]
        },
        {
          "metadata": {
            "inclusion": "available",
            "selected": true
          },
          "breadcrumb": ["properties", "name"]
        },
        {
          "metadata": {
            "inclusion": "automatic"
          },
          "breadcrumb": ["properties", "updated_at"]
        }
      ]
    }
  ]
}
Metrics
A tap can periodically emit structured log messages containing metrics about read operations. Consumers of the tap logs can parse these metrics out of the logs for monitoring or analysis. Metrics appear in the log output with the following structure:
INFO METRIC: <metrics-json>
where `<metrics-json>` is a JSON object with the following keys:
Metric Key | Description |
---|---|
`type` | The type of the metric. Indicates how consumers of the data should interpret the `value` field. There are two types of metrics: `counter` - the value should be interpreted as a number that is added to a cumulative or running total; `timer` - the value is the duration in seconds of some operation. |
`metric` | The name of the metric. This should consist only of letters, numbers, underscore, and dash characters. For example, "http_request_duration". |
`value` | The value of the datapoint, either an integer or a float. For example, 1234 or 1.234. |
`tags` | Mapping of tags describing the data. The keys can be any strings consisting solely of letters, numbers, underscores, and dashes. For consistency's sake, we recommend using the following tags when they are relevant; note that for many metrics, many of these tags will not be relevant. `endpoint` - for a tap that pulls data from an HTTP API, a descriptive name for the endpoint, such as "users", "deals", or "orders". `http_status_code` - the HTTP status code, for example 200 or 500. `job_type` - for a process that is being timed, a description of the type of job; for example, if a tap does a POST to an HTTP API to generate a report and then polls with a GET until the report is done, it could use a job type of "run_report". `status` - either "succeeded" or "failed". |
Here are some examples of metrics and how those metrics should be interpreted.
Timer for Successful HTTP GET
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 1.23, "tags": {"endpoint": "orders", "http_status_code": 200, "status": "succeeded"}}
The following is what the object looks like expanded:
{
  "type": "timer",
  "metric": "http_request_duration",
  "value": 1.23,
  "tags": {
    "endpoint": "orders",
    "http_status_code": 200,
    "status": "succeeded"
  }
}
This can be interpreted as: an HTTP request to an "orders" endpoint was made that took 1.23 seconds and succeeded with a status code of 200.
Timer for Failed HTTP GET
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 30.01, "tags": {"endpoint": "orders", "http_status_code": 500, "status": "failed"}}
This can be interpreted as: an HTTP request to an "orders" endpoint was made that took 30.01 seconds and failed with a status code of 500.
Counter for Records
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 100, "tags": {"endpoint": "orders"}}
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 100, "tags": {"endpoint": "orders"}}
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 100, "tags": {"endpoint": "orders"}}
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 14, "tags": {"endpoint": "orders"}}
This can be interpreted as: a total of 314 records were fetched from an "orders" endpoint.
Log Output
Metric messages are interspersed with the 3 primary message types, so parsing them should be handled programmatically. This is an example of what a realistic log output might look like:
INFO Using API Token authentication.
INFO tickets: Skipping - not selected
{"type": "SCHEMA", "stream": "groups", "schema": {"properties": {"name": {"type": ["string"]}, "created_at": {"format": "date-time", "type": ["string"]}, "url": {"type": ["string"]}, "updated_at": {"format": "date-time", "type": ["string"]}, "deleted": {"type": ["boolean"]}, "id": {"type": ["integer"]}}, "type": ["object"]}, "key_properties": ["id"]}
INFO groups: Starting sync
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.6276309490203857, "tags": {"status": "succeeded"}}
{"type": "RECORD", "stream": "groups", "record": {"id": 360007960773, "updated_at": "2020-01-09T09:57:16.000000Z"}}
{"type": "STATE", "value": {"bookmarks": {"groups": {"updated_at": "2020-01-09T09:57:16Z"}}}}
Targets
When targets are run, they can accept a single config file that provides the information necessary for them to work properly.
Config Files
Similar to taps, targets take a configuration file. There is no specification for the structure of a config file beyond the requirement that it be valid JSON.
State Files
Unlike taps, targets do not take a state file. Targets are expected to read state messages from `stdin`, but typically they do nothing with them beyond echoing them to `stdout`. A state message should only be echoed once all data that appeared in the stream before that state message has been processed by the target.
Schema Files
Targets do not take a schema file. However, they are expected to read the schema messages from `stdin` and validate each incoming record against the schema provided for its stream.
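Putting these expectations together, here is a minimal sketch of a target's main loop (assuming Python, the `jsonschema` library, and a hypothetical `write_to_destination` helper):

import json
import sys

from jsonschema import Draft4Validator

validators = {}  # one validator per stream, built from SCHEMA messages

for line in sys.stdin:
    message = json.loads(line)
    if message["type"] == "SCHEMA":
        validators[message["stream"]] = Draft4Validator(message["schema"])
    elif message["type"] == "RECORD":
        validators[message["stream"]].validate(message["record"])
        write_to_destination(message["stream"], message["record"])  # hypothetical helper
    elif message["type"] == "STATE":
        # Echo state only after all preceding records have been processed.
        sys.stdout.write(json.dumps(message["value"]) + "\n")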