Apache Spark is an open-source data analytics engine that can process massive streams of data from multiple sources, like an octopus juggling chainsaws. It was created in 2009 by Matei Zaharia at UC Berkeley's AMPLab. Around this time, the amount of data being collected on the internet was exploding from megabytes to petabytes, making it impossible to analyze on a single machine. But there was already a clever programming model called MapReduce: you map data into key-value pairs, shuffle and sort them into groups by key, then reduce each group to compute a final result. This allowed large datasets to be distributed across multiple machines, but there was still a huge bottleneck caused by disk I/O. Apache Spark fixed this by doing most of its work in memory instead of reading from disk, which can be up to 100 times faster, and that's a game changer for big data analytics and machine learning. It's used by Amazon to analyze e-commerce data, by NASA's Jet Propulsion Laboratory to analyze deep-space data, and by 80% of Fortune 500 companies to process all their data.
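The three MapReduce phases mentioned above can be sketched in plain Python on the classic word-count problem. This is just an illustration of the model, not the Spark API:

```python
from collections import defaultdict

docs = ["spark is fast", "spark is fun"]

# Map: emit (key, value) pairs.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle & sort: group the values by key.
groups = defaultdict(list)
for key, value in sorted(pairs):
    groups[key].append(value)

# Reduce: collapse each group into a final result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'fast': 1, 'fun': 1, 'is': 2, 'spark': 2}
```

In a real cluster, the map and reduce steps run in parallel on different machines, and the shuffle moves pairs with the same key to the same node.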
Despite its reputation for distributed big data processing, you can easily run Apache Spark locally on your own machine. It's written primarily in Scala and runs on the JVM, but its APIs can be used via wrappers for Python, SQL, and many other languages. To get started, install it, then let's imagine we have a CSV file with four columns: city, population, latitude, and longitude. Our boss wants us to find the city with the biggest population between the tropics. The first step is to initialize a session and load the data into memory. Spark will take the CSV and create a DataFrame, which turns the columns and rows into a collection of objects that can be processed across distributed nodes. From here, we can apply transformations to the DataFrame by chaining method calls. In this case, we want to filter the DataFrame to exclude cities outside of the tropics; that transformation happens in memory. Then we can order the results by population, and finally use first to grab the largest tropical city, which turns out to be Mexico City. Pretty cool. And if you're working with a SQL database, you can easily query that data with SQL directly instead of using the DataFrame API.
When working with massive datasets, Spark's cluster manager, or tools like Kubernetes, can scale this workload horizontally across a virtually unlimited number of machines. And when it comes to machine learning, Spark also has a secret weapon called MLlib. Let's build a predictive model by first bringing in VectorAssembler to merge multiple columns into a single vector column. Then we can split the data into training and testing DataFrames. From there, Spark has a wide variety of algorithms to handle classification, regression, clustering, and so on, all of which can be trained on a distributed system.
Congratulations, you can now train large-scale machine learning models. This has been Apache Spark in 100 seconds. But before one can truly harness the full potential of Spark, one must have a solid foundation in math and problem solving, and you can start building that foundation for free right now thanks to this video's sponsor, Brilliant. Brilliant's platform will introduce you to essential programming concepts, but most importantly, the hands-on exercises will train your brain to recognize and solve the complex problems that developers need to overcome on a daily basis. Best of all, every lesson is concise and rewarding, and
by investing just a few minutes each day, you'll develop habits that can level up your programming skills for the rest of your life. And you can do it anywhere, even from your phone. To try everything Brilliant has to offer free for 30 days, visit brilliant.org/fireship or scan this QR code for 20% off their premium annual subscription. Thanks for watching, and I will see you in the next one.