Technology implementations have evolved along with technology itself, and data sits at the center of all of them across the industry. In the earlier era, RDBMS/database systems were used to store and process data. As the need grew for storage and compute to handle huge data volumes and analytics, Data Warehouse implementations came in to complement, and in some cases replace, traditional database systems. In recent years it has become common for an enterprise to run both database and Data Warehouse systems within the same application setup: database systems serve OLTP implementations, while Data Warehouse systems serve OLAP implementations.
OLTP — Caters to real-time transactional processing in the system
OLAP — Caters to prescriptive and predictive analytics. Initially this was used mainly for prescriptive analytics on historical data in data warehouse systems; these days it is also used to run predictive analytics with machine learning models on top of DW systems.
To set up DB or DW systems, heterogeneous source and target systems need to be integrated. Data extraction/ingestion, transformation, storage in the DB/DW, and analytics/BI on top of the DB/DW are the stages common to most data implementations.
ETL is a commonly used pattern in both OLTP and OLAP implementations. It provides a single platform/framework to integrate heterogeneous source and target systems and to process data at run time, before it is stored in the DB/DW layer.
What is ETL?
ETL is an abbreviation of Extract, Transform and Load. Several tools are available in the market to serve the ETL needs of an organization. ETL can be implemented using one of these tools, or as a framework that integrates various technologies, tools and open-source offerings to serve the same purpose.
Informatica, DataStage, Ab Initio and Pentaho are some of the widely used ETL tools/platforms for integrating systems, processing and transforming data on the fly, and loading it to the DW/DB layer.
With an ETL implementation –
E — Extract — Connect to heterogeneous sources to read data from other DB/DW systems, file systems, real-time systems, and structured or semi-structured data extracts. This is referred to as source integration.
T — Transform — Once data is read from the source systems, run transformations such as filtering, sorting, aggregating, casting, derivation of new fields, lookups (reference or dimension data), joining, merging, SCD handling, sequence generation, composite key generation, etc. These are applied to the extracted data before it is passed to the target layer.
L — Load — Upon completion of processing and transformation, the data is written to the target systems. The load target can be a database, Data Warehouse, files, streaming feeds, or any other downstream application. This is referred to as target integration.
All of these stages/steps can be implemented in a single GUI-based tool such as Informatica or DataStage, or the layers can be built from open-source components, DB services, Hadoop tools, etc.
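The three stages above can be sketched as a minimal ETL job in Python. This is an illustrative sketch only: the CSV feed, field names and transformation rules are hypothetical, and an in-memory SQLite database stands in for the target DB/DW. Note that all transformation work happens in the ETL process itself, before anything is written to the target.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a (hypothetical) CSV source feed.
raw_feed = io.StringIO("id,name,amount\n1,alice,100\n2,bob,-5\n3,carol,250\n")
rows = list(csv.DictReader(raw_feed))

# Transform: filter invalid amounts, cast types, derive a new field --
# all done inside the ETL process, on the fly.
transformed = [
    {
        "id": int(r["id"]),
        "name": r["name"].title(),
        "amount": int(r["amount"]),
        "tier": "high" if int(r["amount"]) >= 200 else "low",
    }
    for r in rows
    if int(r["amount"]) > 0  # drop negative/invalid amounts
]

# Load: write the processed records to the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, amount INTEGER, tier TEXT)")
conn.executemany("INSERT INTO customers VALUES (:id, :name, :amount, :tier)", transformed)
print(conn.execute("SELECT name, tier FROM customers ORDER BY id").fetchall())
```

A GUI tool like Informatica expresses the same flow as a visual mapping, but the extract/transform/load split is the same.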
Pros and Cons of ETL –
· A single platform to implement Extract, Transform and Load; can be used as an enterprise-wide standard within an organization
· Processes and loads data at run time
· Heterogeneous integrations are possible from a single window/GUI while developing jobs
· Leverages the power of the ETL server's memory/cache to process data on the fly
· Heavy-load jobs take time to complete, resulting in long-running jobs
· Performance can take a hit for huge-volume processing
· License and maintenance cost depends on the tool and on the server sizing (cores, CPUs) chosen to set up the ETL tool
· Expensive for small/medium businesses
What is ELT?
ELT has been widely adopted in recent years, replacing many ETL implementations, and it overcomes some of the cons of ETL. It is a faster, more adaptable process in which the power of the DB/DW system itself is leveraged to transform data, often faster than an ETL server could. This pattern is very common in Data Warehouse and Data Lake implementations these days.
Data Lake — A data lake is a centralized repository that allows you to store all your structured, semi-structured and unstructured data at any scale. Data can be stored as-is, without having to structure it first, and you can run different types of analytics on it — from dashboards and visualizations to big data processing, real-time analytics, and machine learning — to guide better decisions.
Hadoop is one on-premises implementation of a Data Lake. Hadoop/Big Data platforms are widely used across the industry to store and process data and to derive insights from it.
E — Extract — The stage where heterogeneous sources are connected to read data from other DB/DW systems, file systems, real-time systems, and structured or semi-structured data extracts. This is referred to as source integration. It can also be viewed as an ingestion step, where raw data is pulled as-is from the source systems into the DW/DB/Data Lake layer.
L — Load — Once the data is read from the source systems, it is landed/loaded into the Data Lake or Data Warehouse. Most commonly, this step loads raw data into a Landing/Raw/Staging layer of the Data Warehouse or Data Lake.
T — Transform — Once the data is loaded in raw/as-is form, run transformations such as filtering, sorting, aggregating, casting, derivation of new fields, lookups (reference or dimension data), joining, merging, SCD handling, sequence generation, composite key generation, etc. These are applied to move the data from the raw layer to the target layer of the DW/Data Lake. The transformations run as SQL or via the native utilities of the DW or Data Lake.
With this approach, the transformation load is pushed down to the DW or Data Lake, leveraging its ability to process data in parallel and efficiently.
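The same hypothetical feed can illustrate the ELT ordering of the steps. In this sketch the raw data is landed untouched into a staging table first, and the filtering, casting and derivation are then pushed down to the database engine as SQL. An in-memory SQLite database again stands in for the DW/Data Lake, and all table and field names are illustrative.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw feed as-is into a staging table --
# no transformation happens before the data reaches the warehouse.
raw_feed = io.StringIO("id,name,amount\n1,alice,100\n2,bob,-5\n3,carol,250\n")
conn.execute("CREATE TABLE stg_customers (id TEXT, name TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO stg_customers VALUES (?, ?, ?)",
    [(r["id"], r["name"], r["amount"]) for r in csv.DictReader(raw_feed)],
)

# Transform: express the filtering, casting and derivation as SQL,
# so the database engine does the work instead of an ETL server.
conn.execute("""
    CREATE TABLE customers AS
    SELECT CAST(id AS INTEGER)                         AS id,
           UPPER(SUBSTR(name, 1, 1)) || SUBSTR(name, 2) AS name,
           CAST(amount AS INTEGER)                     AS amount,
           CASE WHEN CAST(amount AS INTEGER) >= 200
                THEN 'high' ELSE 'low' END             AS tier
    FROM stg_customers
    WHERE CAST(amount AS INTEGER) > 0
""")
print(conn.execute("SELECT name, tier FROM customers ORDER BY id").fetchall())
```

The output is identical to the ETL sketch; only where the transformation runs has changed, which is the essential difference between the two patterns.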
Pros & Cons of ELT –
· Easier and quicker to implement than ETL
· Cheaper in comparison with ETL, as DW/Data Lake resources can be scaled out on commodity hardware
· Performance-efficient and scalable, hence shorter job run times and faster time to market
· Raw data is made available in the DW layer, where it can be used for analytical implementations
· Maintenance/operational overhead can be somewhat higher if various open-source tools are used to implement ELT for a Data Lake
· Appropriate cleaning, archival or roll-ups must be implemented on the DW/Data Lake system to avoid preserving unwanted data, in order to reduce system cost and improve performance
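The last point above can be made concrete with a small retention sketch: roll old raw rows up into an aggregate, then purge them. The `events` table, cutoff date and aggregation are all hypothetical, and SQLite again stands in for the DW/Data Lake.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    ("2020-01-01", "old"),
    ("2020-01-01", "old-too"),
    ("2024-06-01", "recent"),
])

# Roll up: keep only a daily count of rows older than the retention
# cutoff, so the insight survives even after the raw rows are gone.
cutoff = "2023-01-01"
conn.execute("CREATE TABLE events_archive (event_date TEXT, n INTEGER)")
conn.execute(
    "INSERT INTO events_archive "
    "SELECT event_date, COUNT(*) FROM events WHERE event_date < ? GROUP BY event_date",
    (cutoff,),
)

# Purge: delete the raw rows to control storage cost and keep scans fast.
conn.execute("DELETE FROM events WHERE event_date < ?", (cutoff,))
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])
```

In a real DW the purge would typically drop whole partitions rather than delete row by row, but the roll-up-then-purge idea is the same.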
Follow my next blog to see how ETL and ELT can be implemented on the cloud using Google Cloud Platform!