Creating an ETL Data Pipeline on Google Cloud with Cloud Data Fusion & Airflow - Part 1

TechTrapture
Part 2 - https://youtu.be/kxV4_xDchCc Source Code - https://github.com/vishal-bulbule/etl-pipeline-d...
Video Transcript:
Hello all, welcome to TechTrapture. Today we are going to implement an amazing data engineering project in Google Cloud: we'll be creating an ETL data pipeline. What is ETL? Extracting the data, transforming the data, and then loading the data. But we are not only focusing on the ETL part; we are also covering reporting. Every data analytics pipeline has two sections, ETL and reporting: in ETL you extract, transform, and load the data, and in reporting you create dashboards and reports to visualize that data using different visualization tools. Together, that is a complete data analytics pipeline. So we are not building only the ETL part, we are building the BI part as well, which makes this a complete data pipeline and data analytics project rather than just a data engineering project.

In this project you will learn how to extract data, how to transform it, how to load it, and how to create and automate the whole process with a data pipeline. One good thing about this project is that we are implementing everything from scratch: I'll write the code from scratch and create every single piece of this data analytics project in the console in this video itself. Nothing is pre-built. If we face any issue, we'll troubleshoot it in the same video and move on, so you will also learn how to troubleshoot these kinds of issues. Whenever you build something real you will always hit some errors, so don't be afraid of that; we'll work through it together.

Let's first see what we are going to do. I'll follow the same flow you would follow on a real-time project: initially you get a requirement from the client, a problem statement, and then you work out how to approach it — creating a data flow diagram, identifying which resources and services to use, sketching a small architecture, implementing the pieces one by one, and finally joining all the pieces together. So let's first understand the client requirement for this project.

Here is the problem statement: you are tasked with creating a data pipeline to extract employee data from various sources, mask sensitive information within the data, and load it into BigQuery. Additionally, you are required to develop a dashboard to visualize the employee data. In other words, we have to extract employee-related data, and if there is any sensitive information — a PAN number, a salary, or any other personal information the client does not want displayed in the visualization — we mask that data before loading. Suppose there is a PAN number in the user data: before loading it into BigQuery we have to mask it. Think of your credit card bill — the full 16-digit card number is never shown in the PDF bill, only the last four digits are displayed. The same applies to an Aadhaar card. That data is never shared publicly across any platform.
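Just to make the masking idea concrete, here is a tiny sketch of the kind of "last four digits" masking being described. This is an illustration only — in this project the actual masking is configured later in Data Fusion, and the function name and PAN value below are made up:

```python
# Sketch: mask a PAN-style identifier so only the last 4 characters stay visible,
# similar to how a credit card bill prints only the last 4 digits of the card number.
def mask_value(value: str, visible: int = 4, mask_char: str = "X") -> str:
    """Replace all but the last `visible` characters with `mask_char`."""
    if len(value) <= visible:
        return value  # nothing worth hiding for very short values
    return mask_char * (len(value) - visible) + value[-visible:]

print(mask_value("ABCDE1234F"))        # XXXXXX234F
print(mask_value("4111111111111111"))  # XXXXXXXXXXXX1111
```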
Following the same security best practices, we'll mask the sensitive data before loading it into the BigQuery table. So that is the client requirement: extract the data, mask sensitive information, and then load the data into BigQuery. Additionally, we are required to create a visualization so the client can easily see how many employees there are, which employees belong to which department, and their joining dates — all of this they should be able to see on a dashboard. That is our problem statement.

Now, what are the requirements? First, data extraction: extract employee data from multiple sources. We'll see how we get the data — whether someone sends it to us, whether we pull it from some API, or whether we generate it ourselves; there are a lot of options for data extraction and ingestion, and we'll look at them. Second is data masking: will we write complex code to mask the data, use an API, or use a managed service? We'll explore those options as well. Third is data loading into BigQuery: once the data is available and masked, how do we load it — with Dataflow, with custom Python code, or directly with Airflow? There are multiple options, but we'll implement the solution that is easiest to build and maintain, one where you don't need to write a lot of complex code. For any solution there are multiple ways to get there; it depends on the client requirement, your budget, and what you are trying to achieve. Always remember, whether you work as a cloud architect, a data engineer, or a DevOps engineer: to achieve one target you will always have multiple options, and you have to pick the best-suited one. Fourth is data visualization: develop a web-based dashboard using an appropriate visualization tool — Looker Studio, Tableau, or a custom dashboard.

These are the client requirements. Since we are implementing this in Google Cloud, we'll naturally use a Google Cloud tech stack. Which services are we going to use? Python for data extraction, then Cloud Storage to store the raw data, and then Cloud Data Fusion for the data pipeline. We have already seen in earlier videos how to load data from Cloud Storage into BigQuery, but here we also have to apply transformations — the masking, and possibly other things such as changing a format, removing commas, keeping a full address, or joining two columns. We'll see how to do those transformations in Cloud Data Fusion without writing a single piece of code, because Cloud Data Fusion is a no-code solution for ETL.
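For comparison, the "custom Python" loading option mentioned above could look roughly like the sketch below, using the google-cloud-bigquery client. The project ID, dataset, table, and GCS path are placeholders, and in this project the load will instead be handled by Data Fusion without any code:

```python
# Sketch: load a CSV file from Cloud Storage into BigQuery with the Python client.
# Project, dataset, table, and bucket names are placeholders, not the ones used in the video.
from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,               # skip the CSV header row
    autodetect=True,                   # let BigQuery infer the schema
    write_disposition="WRITE_TRUNCATE",
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/employee_data.csv",
    "my-project-id.employee_dataset.employee_table",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes

table = client.get_table("my-project-id.employee_dataset.employee_table")
print(f"Loaded {table.num_rows} rows")
```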
So we'll be focusing on Cloud Data Fusion; then we load that data into BigQuery and visualize it in Looker Studio. This complete extract, transform, and load process will be orchestrated on Cloud Composer using Airflow — we'll see that as well — and at the end we'll visualize everything in Looker Studio. So if you implement this project you will learn some Python, how to store data in Cloud Storage, Data Fusion, Cloud Composer, BigQuery, and Looker Studio. That makes it a great project if you follow it end to end and try to implement it in your own environment.

I'll be using a Google Cloud free trial account. Cloud Data Fusion won't exactly cost you money, but it will burn through your credits quickly, so we'll create a small Data Fusion instance and delete it once we finish the project. Cloud Data Fusion and Cloud Composer are both a bit heavy, because Cloud Composer creates a GKE cluster at the back end, so we'll also see how to keep our cost, or credit usage, to a minimum.

That is our tech stack. Now that we understand it, the first thing to do is create a data flow diagram. I'll go to draw.io and sketch a small architecture — that is always your first task whenever you are designing a new architecture or setting up a new solution. So let me open a new tab and go to draw.io.
draw.io is a tool for creating architecture diagrams. I have already made a video on how to create effective architecture diagrams; if you haven't seen it, check my Google Cloud architecture playlist, where I explain several sites and apps for drawing them. Here we'll create a simple diagram for our data pipeline.

First we have to extract the data. There is no built-in icon for Python, so I'll just paste in a Python image. Using Python we'll extract the data with one library — I'll tell you which one shortly. There are multiple options here: you can extract data from some API, or someone can drop files directly into your storage bucket, but we'll generate our own dummy employee data — name, email, password, joining date, and the other common fields you'd expect in employee data — and store it in the storage bucket.

If you scroll down you'll see the GCP icons; if you don't, click "More shapes" at the bottom, scroll down, enable the Google Cloud Platform shapes, and apply — only then will the GCP icons appear. I'll use the Cloud Storage icon: with Python I'll pull the data and put it into a Cloud Storage bucket. The file in the bucket will then be picked up by Data Fusion, so I'll add the Cloud Data Fusion card (the card, not just the icon). The data goes to Cloud Data Fusion, some transformation is applied, and after the transformation the data goes to BigQuery, so I'll place that next. Then let me check whether there is a Looker icon — there is one for Data Studio, which has since been renamed Looker Studio — so we'll visualize the data in Looker Studio, and I'll also add a dashboard icon, because that's what we'll build there.

Now a few labels: from Python we extract the data and put the extracted data file into the storage bucket; Data Fusion picks the data up from the storage bucket, transforms it, and loads it into BigQuery — all the transformation and loading happens in Data Fusion, so I'll label that "transform and load data". Data Fusion will mask the sensitive data and then load it into BigQuery, and from BigQuery, using a query or a direct table connection, we can visualize the data in Looker Studio. You can also add some users to the diagram if you want to make it nicer — the users who consume the report, or to whom the report is mailed; that part is up to you.
So this is our simple diagram, and now we are going to implement this data pipeline. First we'll write a Python script to extract the data, then we'll put the extracted data into the Cloud Storage bucket, then we'll create a Data Fusion pipeline to pull the data from the bucket, apply the transformation, and load it into BigQuery, and finally we'll visualize the data in Looker Studio.

One more thing: I will use Airflow to orchestrate all of these steps — that is Cloud Composer, where Airflow runs as a managed service. My Airflow DAG will consist of one task to extract the data (the extract and the upload to the bucket happen in that one task) and another task to trigger the Data Fusion pipeline, which then loads into BigQuery. That is my Airflow setup. Why orchestration? Because I have to automate this complete flow: first the extraction happens in Python; once the extraction is completed, the data is loaded into the storage bucket; once the data is available in the bucket, only then should my Data Fusion pipeline be triggered; and once the Data Fusion pipeline — the transformation — is completed, the data is loaded into BigQuery and becomes available for visualization. A rough sketch of such a DAG is shown below.
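Here is that sketch — purely illustrative, not the DAG we build later: the task names, the run_extraction placeholder, and the instance and pipeline names are assumptions, and the Data Fusion operator comes from Airflow's Google provider package:

```python
# Sketch of an Airflow DAG with two tasks: extract data to GCS, then trigger the
# Data Fusion pipeline that masks the data and loads it into BigQuery.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)


def run_extraction():
    # Placeholder: generate the employee data and upload the CSV to the bucket.
    # In this project that logic lives in the Faker-based script shown later.
    pass


with DAG(
    dag_id="employee_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    extract_and_upload = PythonOperator(
        task_id="extract_and_upload",
        python_callable=run_extraction,
    )

    start_datafusion_pipeline = CloudDataFusionStartPipelineOperator(
        task_id="start_datafusion_pipeline",
        location="us-central1",                 # region of the Data Fusion instance
        instance_name="datafusion-dev",         # placeholder instance name
        pipeline_name="employee-etl-pipeline",  # placeholder pipeline name
    )

    # Extraction must finish before the Data Fusion pipeline is triggered.
    extract_and_upload >> start_datafusion_pipeline
```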
Now, two things will take time to create: the Composer environment and the Cloud Data Fusion instance. So I'll first go and create those two, and then move on to the Python script — by the time we write the script and generate the data, the two instances will be up. Remember, we are doing everything in this single video; nothing has been created beforehand. First I'll create the Composer environment, which takes around 30 minutes, and in a second tab I'll create the Data Fusion instance.

Let's create the Composer environment: click "Create environment" and choose Composer 2, which is more automated and supports autoscaling. I'll give the environment a name; it will ask to grant permissions to the service account, and I'll grant them from here itself. Keep the defaults — don't try to customize too much. We'll use a small environment, roughly 0.5 CPU, 2 GB memory, and 1 GB storage, which is fine for what we are doing; keep everything else as it is and create it. It will take around 20 to 30 minutes for the Cloud Composer environment to come up, and then we'll create Data Fusion. If you are not sure what Cloud Composer is, refer to my complete playlist on Cloud Composer, where I explain what it is and how to create DAGs — including multiple DAGs and tasks — on it. In this video we'll also create the DAG from scratch, so don't worry, but the playlist goes deeper.

I haven't used Data Fusion in this project before, so I have to enable its API first. Once the API is enabled, let's create a Data Fusion instance. Here too I'll just give it a name and won't customize much. The Composer environment is in us-central1, so we'll create the instance in us-central1 as well, keep everything else as it is, and keep the Basic edition. Here too it will ask for the required permissions to create a Data Fusion instance; I'll grant them and create it. This will also take some time, so while it is creating we can start working on the data extraction part.

Let me go to my desktop and create a folder where we can work on the data extraction and our Python script. I'll create a folder named "ETL project", go inside it, and open VS Code there. Now I'll create one file and name it extract.py. To generate the dummy data I'll use a Python library called Faker. It creates fake data, which is very useful when you need data for testing: it can generate names, addresses, emails — all kinds of data. I'll get the starting code directly from Google Gemini or ChatGPT, whichever you are comfortable with. So I'll ask: "Help me create a Python script to generate employee data using the Faker library, and also include some PII fields." Gemini produced code, but it says it is not able to provide PII data, so I ask it to add a password field and a salary field, and it adds those two fields. The exact values don't matter — it just needs to give us the general idea — so where it doesn't understand the requirement, for example the password, I simply ask for a random character field and it generates a random string. This is only a demo project, so let's cross-check with ChatGPT as well: "Help me add fields for a dummy salary and password." It creates a salary field with fake random numbers, and for the password I ask for random characters — do not tokenize or mask it, since it initially returned masked data; I just want a plain random string. Now the script will create the employee data. First let's check how many employees it will generate — currently the count is 10. The editor is showing warnings because we haven't installed the library yet, so we have to run pip install faker first. I install the Faker library, and that warning goes away. Okay, it's completed. Now let me execute the code: python extract.py.
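For reference, here is a minimal sketch of what such an extraction script can look like. The field names, the Faker methods, and the record count are my own illustrative choices, not the exact code the AI tools produced in the video:

```python
# Sketch: generate dummy employee records with the Faker library.
# Field names and the record count are illustrative, not the exact ones in the video.
import random

from faker import Faker

fake = Faker()


def generate_employees(count: int = 10) -> list[dict]:
    employees = []
    for _ in range(count):
        employees.append({
            "first_name": fake.first_name(),
            "last_name": fake.last_name(),
            "job_title": fake.job(),
            "department": fake.word(),           # Faker has no department provider, so any string works
            "email": fake.email(),
            "address": fake.address().replace("\n", ", "),
            "phone_number": fake.phone_number(),
            "salary": random.randint(30000, 150000),
            "password": fake.password(length=10),  # random characters; masked later in Data Fusion
            "joining_date": fake.date_between(start_date="-5y", end_date="today").isoformat(),
        })
    return employees


if __name__ == "__main__":
    for employee in generate_employees(10):
        print(employee)
```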
Running the generated script, we hit an error: 'Generator' object has no attribute 'company_department'. So we change the code based on what Gemini suggests for the department field — ChatGPT shows the same problem — and we simply ask the AI tools again which fields the Faker library actually provides, since this library is new to us. It turns out Faker only has something like job(), so we generate the department based on that — it's only dummy data, so any character string will do. Now the script creates data for 10 employees, and the password is generated as a random string.

Next, I want this data written to a CSV file and uploaded to a storage bucket, so I ask: "add the data to a CSV file and load it into a GCS bucket", and it provides the complete code. I'll copy it. Note that you have to install the Cloud Storage client library as well if you haven't already; I had it installed, which is why it isn't giving me an error.

Now I'll create the storage bucket in the same project — we are using a POC project. Meanwhile, the Composer and Data Fusion instances are still being created; Cloud Composer sometimes gives an error because of free-tier quota limits, so if you already have a GKE cluster or VM instances running, you may have to manage those. I'll create a bucket named bkt-employee-data, without any extra customization, and confirm. Then I copy the bucket name and set it in the script — find the variable used for the bucket name in the upload-to-GCS function and paste it there.

One more thing: here I am using my own credentials to execute this Python script and connect to Google Cloud Storage. If you want to use your own credentials instead of a service account, run gcloud auth application-default login. Because of this you don't need a service account key — by default your application will use your Google account credentials. Otherwise you would have to create a service account key and point the client at it. I have run that command, so the script won't give a permission error while connecting to the storage bucket, because it now has my account's permissions.

Let's run the code now and see whether the file gets uploaded or we hit an error, since this is the first run. It created the employee_data.csv file locally and is uploading it to the storage bucket — let's refresh, and yes, the file is there. So simply with the help of ChatGPT we created this Python extraction script: it generates dummy data and puts it into the storage bucket. Since everything is working, I'll switch from 10 records to 100. Before that, check the file size for 10 records: it's about 1.7 KB.
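Putting the pieces together, the CSV-and-upload part of the script could look roughly like the sketch below. It builds on the generate_employees helper from the earlier sketch (here imported from a hypothetical extract.py), the bucket name mirrors the one created above (adjust it to your own), and it assumes gcloud auth application-default login has been run locally:

```python
# Sketch: write the generated records to a CSV file and upload it to Cloud Storage.
# Assumes `gcloud auth application-default login` has been run locally, so the
# google-cloud-storage client picks up your user credentials automatically.
import csv

from google.cloud import storage

from extract import generate_employees  # the Faker helper from the previous sketch

BUCKET_NAME = "bkt-employee-data"  # bucket created in the console above
CSV_FILE = "employee_data.csv"


def write_csv(employees: list[dict], filename: str = CSV_FILE) -> None:
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=employees[0].keys())
        writer.writeheader()
        writer.writerows(employees)


def upload_to_gcs(filename: str, bucket_name: str = BUCKET_NAME) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(filename)       # object name inside the bucket
    blob.upload_from_filename(filename)
    print(f"Uploaded {filename} to gs://{bucket_name}/{filename}")


if __name__ == "__main__":
    employees = generate_employees(100)
    write_csv(employees)
    upload_to_gcs(CSV_FILE)
```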
Now let's see for 100 records — it will take a couple of seconds longer than it did for 10. Okay, so let's refresh now…