Azure Data Factory pipeline with Snowflake and DBT
A data pipeline is a series of data processing steps. If the data is not already loaded into the data platform, it is ingested at the beginning of the pipeline. Each step then delivers an output that serves as the input to the next step, and this continues until the pipeline is complete. Steps can also run in parallel to accomplish the desired goal.
Data pipelines typically consist of three key elements:
1. a source,
2. a processing step or steps,
3. a destination.
Data pipelines are the workhorses that move data from an application or other source to a data warehouse, from a data lake to an analytics database, or into an analytics application, for example.
Common steps involved in building a data pipeline include data transformation, augmentation, enrichment, filtering, grouping, aggregating, and running algorithms against that data; a single transformation step might filter, join, and aggregate raw records, as sketched below.
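As a rough illustration (the table and column names below are hypothetical, not taken from any specific architecture), such a transformation step could be expressed in SQL like this:

```sql
-- Hypothetical transformation step: filter, enrich via a join, and aggregate.
-- Table and column names are illustrative only.
SELECT
    c.customer_segment,
    COUNT(*)      AS order_count,
    SUM(o.amount) AS total_amount
FROM raw_orders o
JOIN customers c
  ON o.customer_id = c.customer_id        -- enrichment via join
WHERE o.status = 'COMPLETED'              -- filtering
GROUP BY c.customer_segment;              -- grouping / aggregating
```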
Data Pipeline vs. ETL
ETL refers to a specific type of data pipeline. ETL stands for “extract, transform, load.” It is the process of moving data from a source, such as an application, to a destination, usually a data warehouse. “Extract” refers to pulling data out of a source; “transform” is about modifying the data so that it can be loaded into the destination; and “load” is about inserting the data into the destination. Data pipelines have traditionally followed this ETL paradigm: data is first extracted into a staging area, business rules are applied to transform it, and the transformed data is loaded into the sink before it is made available for final consumption.
However, a newer paradigm has also emerged: ELT. Here, data is extracted into the staging area, loaded into the sink, and business rules are then applied to transform the data within the sink itself, as sketched below.
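As a minimal sketch of the ELT pattern (the schema and table names are assumptions for illustration), the transformation runs inside the warehouse after the raw data has already been loaded:

```sql
-- Raw data has already been extracted and loaded into a staging schema;
-- the transformation now runs entirely inside the warehouse (the sink).
CREATE OR REPLACE TABLE analytics.daily_sales AS
SELECT
    order_date,
    region,
    SUM(amount) AS total_sales
FROM staging.raw_sales
GROUP BY order_date, region;
```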
ELT in Azure Cloud
One such ELT architecture, which is highly scalable and effective, is presented below.
Orchestration Pipeline: This is the central piece of the architecture and is responsible for binding all of its components together. It should be able to extract data from different sources using available connectors, common protocols such as SFTP and ODBC, or a generic data extraction framework.
Source: The data source from which the pipeline fetches data. Data is extracted here and moved along the pipeline.
Storage area: A staging area where data lands after extraction.
DBT: The Data Build Tool covers the T in ELT, i.e. transformation. dbt is a command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. More information on this tool can be found in the dbt documentation.
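As a minimal sketch, a dbt model is simply a SELECT statement in a .sql file; dbt materializes it in the warehouse and resolves dependencies between models via ref(). The model, column, and upstream names below are assumptions for illustration:

```sql
-- models/daily_orders.sql (hypothetical dbt model)
{{ config(materialized='table') }}

SELECT
    order_date,
    COUNT(*)    AS order_count,
    SUM(amount) AS revenue
FROM {{ ref('stg_orders') }}   -- depends on an assumed upstream staging model
GROUP BY order_date
```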
Snowflake: A data warehouse offered in the cloud by Snowflake. A virtual warehouse, often referred to simply as a “warehouse”, is a cluster of compute resources in Snowflake. A warehouse provides the required resources, such as CPU, memory, and temporary storage, to perform the following operations in a Snowflake session (a short SQL sketch follows the list):
- Executing SQL SELECT statements that require compute resources (e.g. retrieving rows from tables and views).
- Performing DML operations, such as:
  - Loading data into tables (COPY INTO <table>).
  - Unloading data from tables (COPY INTO <location>).
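The operations above might look as follows in Snowflake SQL; the warehouse, stage, and table names are placeholders, not part of the architecture described here:

```sql
-- Create (or reuse) a virtual warehouse to supply compute for the session.
CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60      -- suspend after 60 seconds of inactivity
  AUTO_RESUME    = TRUE;

USE WAREHOUSE etl_wh;

-- Load data into a table from a named stage.
COPY INTO raw.orders
  FROM @raw_stage/orders/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Unload query results from a table back to a stage location.
COPY INTO @export_stage/orders_out/
  FROM (SELECT * FROM raw.orders WHERE order_date >= '2023-01-01');
```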