Don’t forget to like and subscribe to my YouTube channel for more videos like this
Video Transcript:
Interviewer: Okay, so tell me a little bit about your technical experience.

Candidate: I started off in 2018. I'm currently working as a data engineer with Quantify Analytics, and my expertise here lies with GCP BigQuery, GCP Cloud Storage, and Cloud Composer with Apache Airflow. It's been about two years with Quantify. Prior to that I was working with H&R Block, which is a U.S. tax filing company. There I started off as an ETL developer with experience in Informatica, and after everything migrated to the cloud I got to work on Azure Data Factory, Microsoft's service on the cloud. After that I joined Quantify. So basically I've been an ETL developer the whole time, close to 4.6 years of my career.
I started off with Informatica, then I migrated to the cloud, and now I'm doing both data analytics as well as data engineering.

Interviewer: Okay, all right. So do you have any experience with Google Cloud?

Candidate: Yeah, right, I have close to 1.8 years of experience with Google Cloud. With Quantify I'm working on Google Cloud, because Quantify is a partner with Google.

Interviewer: All right. And in terms of data engineering, which Google Cloud products and services are you using for data engineering activities?

Candidate: I've been using Google BigQuery, basically, since it's the data warehouse we have been using; for files I've been using Google Cloud Storage; and apart from that, Cloud Composer for task scheduling.

Interviewer: Okay, all right. So could you please tell me how we can optimize BigQuery performance?

Candidate: Optimizing BigQuery performance can be achieved by, for one thing, reducing the number of long subqueries; if we can go for a common table expression or a temporary table, that will be more of a solution for optimization. Avoiding unwanted ordering of data is also something that helps. Other than that, what comes to my mind is partitioning the data, that is, dividing the table into sub-tables, or partitioning the data inside the table itself based on a time or date column that is available. With that we can efficiently reduce cost as well as improve performance.

Interviewer: Okay. So at runtime, how can we check the performance in BigQuery?

Candidate: Before running a query, on the right-hand side at the top we can see how many MBs the query is going to consume, so we have a fairly clear idea of how much data it is going to use.

Interviewer: Okay. So could you please explain a little bit about the BigQuery architecture?

Candidate: Yeah, sure. The architecture has Colossus, Jupiter, and Dremel; these are the main three parts of the BigQuery architecture. Colossus is basically the distributed storage that is used by YouTube and every Google service, since it is commonly shared by all the Google products. Dremel is the engine, and Colossus is the storage. When we think about Google storage, the one thing that makes it stand out is that Colossus uses Capacitor-style storage, that is, if we want to fetch one column from a table, it doesn't have to fetch the whole table, because every column is stored as a different Capacitor file. That is one advantage of Google's architecture. Also, the CPU usage and the memory are kept separate, unlike in most systems, and Jupiter is the interface that interacts between Dremel and Colossus.

Interviewer: Okay, all right. So could you please explain what Shuffle is?

Candidate: Excuse me? Sorry, I didn't hear you.

Interviewer: I'm asking, what is Shuffle in BigQuery?

Candidate: Could you please spell that?

Interviewer: Sure: S-H-U-F-F-L-E, Shuffle.

Candidate: Yeah, Shuffle. I haven't used Shuffle, and I'm not quite sure what it would be.
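For illustration, here is a minimal sketch of the two points raised above, partitioning a table on a date column and using a dry run to see how many bytes a query would scan before it is executed. It uses the google-cloud-bigquery Python client; the project, dataset, table, and column names are hypothetical.

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client(project="my-sample-project")  # hypothetical project ID

# Date-partitioned table: queries that filter on event_date only scan the
# matching partitions instead of the whole table.
table = bigquery.Table(
    "my-sample-project.analytics.events",  # hypothetical dataset and table
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)
client.create_table(table, exists_ok=True)

# Dry run: BigQuery reports how many bytes the query would process without
# actually running it (the same estimate the console shows in the top-right).
sql = """
    SELECT user_id, SUM(amount) AS total
    FROM `my-sample-project.analytics.events`
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY user_id
"""
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print(f"Estimated bytes processed: {job.total_bytes_processed}")
```

Because the WHERE clause filters on the partitioning column, BigQuery can prune partitions, which is what lowers both the bytes scanned and the cost.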
Interviewer: Okay. Any idea about slots and shuffles?

Candidate: I think this has something to do with resource allocation in BigQuery.

Interviewer: And any idea about how pricing is applied in BigQuery? How is the price calculated, what criteria are used to calculate the pricing, what is free and what do we have to pay? Any idea about the pricing model?

Candidate: I make sure that the MB size is very small when we run a query; that is something I ensure. For the pricing model, I think if we're using, say, two CPUs and 8 GB of RAM, it's going to cost us around fifty or sixty dollars per month in GCP. There is going to be a base cost no matter what, and almost everything in GCP is billed per second of usage, so similarly everything is going to have a base price. Since I've been working on projects where the data size is multiple terabytes when we query it, what I do from my end to reduce the cost is partitioning, and the other thing is reducing unwanted sorting and similar operations. These are the things, basically, that I do to reduce query costs.

Interviewer: Okay. So, for example, I want to allow someone only to run BigQuery jobs. Which permission or role do we need to grant so they can only execute jobs?

Candidate: Hmm, a job execution role. In my current organization the roles are allocated by the platform engineer, not by me, but even so, I think the BigQuery Admin role is something that would cover most of the permissions, including job execution.

Interviewer: Yeah, with admin people can definitely do most things, but the thing is that I want people to be able to execute jobs only; I don't want to give any other admin-related permissions. Okay, all right. So, moving to another question: how does data compression work in BigQuery?

Candidate: I'm not sure I have done that in BigQuery. Basically, what I have done is data loading from a GCS bucket into BigQuery, and also streaming data into BigQuery. Apart from compression, I have mostly been on the analytics side of the data, providing solutions, answering questions based on the data, and helping the business with insights.
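On the pricing question, BigQuery's on-demand model charges for the data a query scans rather than for provisioned CPUs and RAM (the base-price-per-month pattern described above is closer to Compute Engine billing). A rough sketch of estimating a query's on-demand cost from a dry run is shown below; the per-TiB rate is an assumption for illustration only, since the actual rate depends on region and edition, and the table name is hypothetical.

```python
from google.cloud import bigquery

# Assumed on-demand rate, for illustration only; check the current GCP price
# list for the rate that applies to your region and edition.
PRICE_PER_TIB_USD = 6.25

client = bigquery.Client()  # uses default project credentials
sql = """
    SELECT user_id
    FROM `my-sample-project.analytics.events`   -- hypothetical table
    WHERE event_date = '2024-01-15'
"""
job = client.query(
    sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
)

tib_scanned = job.total_bytes_processed / 2**40
print(f"~{tib_scanned:.6f} TiB scanned, "
      f"estimated cost ${tib_scanned * PRICE_PER_TIB_USD:.4f}")
```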
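On the job-execution question above: the predefined role that only allows running jobs and queries, without admin or data-editing rights, is roles/bigquery.jobUser. Below is a possible sketch of granting it at the project level with the Python Resource Manager client; the project ID and principal are hypothetical, and the principal would still need dataset-level read access (for example roles/bigquery.dataViewer) to actually query tables.

```python
from google.cloud import resourcemanager_v3  # assumes google-cloud-resource-manager
from google.iam.v1 import policy_pb2

PROJECT = "projects/my-sample-project"  # hypothetical project
MEMBER = "user:analyst@example.com"     # hypothetical principal

client = resourcemanager_v3.ProjectsClient()
policy = client.get_iam_policy(request={"resource": PROJECT})

# roles/bigquery.jobUser lets the principal run jobs (including queries)
# in the project without granting broader BigQuery admin permissions.
policy.bindings.append(
    policy_pb2.Binding(role="roles/bigquery.jobUser", members=[MEMBER])
)
client.set_iam_policy(request={"resource": PROJECT, "policy": policy})
```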
Interviewer: Okay, all right. So could you please explain, you mentioned that you were involved in streaming pipelines and batch workloads; for streaming, what was the source and what was the destination for BigQuery? How had you set up your streaming pipeline, was it CDC, and what kind of data was traveling from source to destination? Could you please explain a little bit more about that?

Candidate: Sure. Actually, it was a separate project, not mine alone but a common project for us. The data was coming from an API endpoint called Feefo, which is more like a review platform that we use most of the time. What we do is hit the API endpoint for the reviews; there was a particular endpoint for the pages, and from there the data would be passed as comma-separated values. It was fetched using a Python script in a Cloud Composer Airflow DAG, and using that we wrote the data into BigQuery.

Interviewer: Okay. What data formats are supported by BigQuery?

Candidate: Multiple formats are supported, including Parquet and Avro.

Interviewer: Okay, all right. So apart from BigQuery, you also know Google Cloud Composer, I mean Airflow, right?

Candidate: Yeah, right.

Interviewer: So is there any other service Google provides that we could use instead of Composer? Suppose there is a requirement to create a pipeline, but the business doesn't want to use Airflow, I mean Composer. Instead of Airflow or Composer, what other Google service could we use to achieve that?

Candidate: To get the on-prem data to us, correct?

Interviewer: Yeah.

Candidate: What comes to my mind is that we have multiple options. If it's going to be an on-prem source, we can use the Storage Transfer Service provided by Google. If the source is going to be something live, like messages from an application from which we want to transfer live streaming data, then we have the Pub/Sub service for that. And if it's a file or something from which we want real-time syncing of the data, then we have Firestore.

Interviewer: Do you have any idea about BigQuery Omni?

Candidate: No, I do not.

Interviewer: And have you heard about Data Fusion? Okay, all right. So which programming and scripting language are you most comfortable with?

Candidate: I've been using SQL for the last 4.
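To make the pipeline described above concrete, here is a minimal sketch of a Composer/Airflow DAG with a Python task that pulls CSV data from an HTTP endpoint and loads it into BigQuery. The endpoint URL, table ID, and schedule are hypothetical, and a real DAG would add retries, incremental extraction, and error handling.

```python
import io
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import bigquery

REVIEWS_URL = "https://api.example.com/reviews.csv"   # hypothetical endpoint
TARGET_TABLE = "my-sample-project.analytics.reviews"  # hypothetical table


def load_reviews_to_bq(**_):
    # Fetch the CSV payload from the review API and append it to BigQuery.
    resp = requests.get(REVIEWS_URL, timeout=60)
    resp.raise_for_status()
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_file(
        io.BytesIO(resp.content), TARGET_TABLE, job_config=job_config
    )
    load_job.result()  # wait for the load job to finish


with DAG(
    dag_id="reviews_to_bigquery",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_reviews", python_callable=load_reviews_to_bq)
```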