Exploring Quantum Machine Learning with Meltem Tolunay: Qiskit Summer School 2024


Qiskit

In this lecture, we discuss quantum machine learning, with a specific focus on near-term algorithms....

Video Transcript:

Hi everyone, welcome to today's lecture. My name is Meltem and I'm a research staff member at IBM Quantum. Today we're going to be talking about quantum machine learning and its applications on our path to quantum utility. Our goal will be to provide an overview of quantum machine learning applications that can be implemented on current and near-term quantum computers. Before we get started with the details, I would like to quickly mention what we will be covering in today's lecture. First, we will recap some preliminaries from machine learning; specifically, we will revisit

some concepts from classical machine learning. Second, we will talk about our first quantum machine learning model, which is a quantum classifier consisting of a variational circuit. Here we will also discuss some strategies to encode data onto quantum computers. The later part of the lecture will focus on two different quantum machine learning algorithms for near-term quantum computers. The first of these methods is quantum support vector machines, which utilize the idea of so-called quantum kernels, and the last method we will cover today is quantum neural networks. With this overview, we

are now ready to revisit some basic concepts from classical machine learning. So the first question is: what is machine learning? As a simple definition, machine learning is a collection of algorithms that analyze and draw inferences from patterns and relationships in data. This way, these algorithms learn to perform specific tasks without being explicitly told how to do it. Machine learning algorithms use statistical techniques to improve performance on a specific task over time and to generalize their performance to unseen data. And machine learning algorithms are used in a very wide variety of applications.

These applications range from image recognition, natural language processing, and recommendation systems to areas like medical diagnosis, finance, and autonomous vehicles. With all its applications, machine learning has truly revolutionized the technology industry and how we as humans interact with it. Now that we've broadly defined what machine learning is, we can look at what it means mathematically. From a mathematical perspective, machine learning can be thought of as a function approximation and optimization problem. In essence, what these algorithms try to do is approximate a true function, which we call g here, with a function

that we refer to here as f. The true function g takes in some parameters x, which we refer to as data or input features. For instance, g could be a function that takes in pictures as inputs and outputs plus one if the image contains flowers or minus one if it doesn't. In this example, the pixel values of an image would be the data features that we feed into the function. Here we assume that g captures the true relationship between an image and its corresponding label, which tells us

if it contains a flower or not. But in fact it is almost impossible to define such a function, one that can take in and label all flower pictures in the world by explicitly defining the true patterns in a picture that constitute a flower. Therefore we try to approximate it with another function f, which we see here, that takes in some subset of all possible data features. We denote the subset here as x hat. Most importantly, this function f also depends on some other parameters, which we see here as the

theta vector. These parameters are what the algorithm learns as it is exposed to training data. They are also referred to as weights that parameterize a space of functions, and the approximating function f is also called a hypothesis in the literature. For instance, h, which we see here as an example, defines the hypothesis class that spans the space of third-degree polynomials for input x. So in summary, our primary goal when designing a good machine learning algorithm is to pick a good mathematical model or hypothesis and to make sure that the theta

parameters learn very well from training data, with the goal that in the end they generalize well and can perform their task on data that they haven't seen before. There are also several different types of machine learning algorithms, as we see here. Broadly speaking, we can define three categories. The first one is called supervised learning, where the data that is used to train the model is labeled. In our previous example, we defined the task of identifying if an image contains a flower or not. In this example, the

model would be trained with images both containing and lacking flowers, and these images would be associated with the correct labels that place them into one of the possible categories. This is therefore an example of supervised learning, since the training data is labeled, and more specifically this is a classification task, because the model outputs one of the possible discrete classes. If we have an algorithm that predicts values from a continuous interval instead of discrete classes, then we refer to this as a regression algorithm. The second type of machine learning algorithm is called

unsupervised learning, where, in contrast to the previous type, the training data is not labeled. These algorithms usually try to find or group patterns and structures in data sets. Some algorithms in this class are dimensionality reduction and clustering algorithms. Some generative models, such as generative adversarial networks and variational autoencoders, are also in this category. Unlike discriminative models, which learn to predict labels corresponding to input data, generative models aim to learn the underlying distribution of the data itself, and they do this in order to generate new data

samples that exhibit similar characteristics to those in the training set. The third type of machine learning algorithm is called reinforcement learning, and this one is actually quite different from the first two categories. Here the algorithm learns to make decisions by interacting with an environment. In reinforcement learning, a so-called agent receives feedback in the form of rewards or penalties based on its actions, and the goal of the agent is to learn the optimal strategy to maximize the cumulative reward over time. With this categorization in mind, we will mainly focus on algorithms

and examples that are under the supervised learning setting in the following sections. We can now summarize the workflow of generic supervised learning models. First, we have our data set with labels, which will be used to train the model. We can denote this as a tuple containing the input features x_train and their true labels y. Now the model takes in the input features x_train and predicts a label, which is y hat. Recall that the goal of the model is to learn the optimal parameter values

denoted theta here. For the first step, these values are often sampled from a random probability distribution, though there can be more sophisticated initialization strategies for machine learning models to warm-start the training. Now that the model has made a prediction y hat, we first need to define a loss function, which is essentially a metric that tells the model how far away the prediction is from the true label. There are several different loss functions that we can define, and here we have an example of the mean

squared error loss, where we compute the differences between the predictions for the training inputs and the true labels, square them, and then sum those up. This gives an overall score of how well the model is performing on the training data set. Note that the better the predictions, the lower the cost function is, and vice versa. Therefore we use optimization techniques to update our parameters theta in a direction that lowers this cost function. To do this, as we see

here, we can use methods like gradient descent, which computes the gradients of the loss with respect to the model parameters and updates them in the direction that lowers the cost. After the parameter update, we repeat this whole procedure many times using the training data set at hand until the algorithm converges. A well-trained machine learning model will output a low cost-function value on unseen data, meaning data that it hasn't seen during training, and this is an indicator of how well the model generalizes. So in summary, how well a model generalizes is captured by

the following plot that we have here. This picture refers to the so-called bias-variance trade-off. If the model doesn't perform well on the training data, this means that we're in the underfitting regime. For instance, we may have chosen a hypothesis class that doesn't have enough parameters to capture all the complex patterns inside the training data set. In this regime the model has high bias and low variance. On the contrary, if the model works very well on the training data set but performs poorly

on the unseen test data set, this could mean that the model overfit to the training data and could not generalize enough to perform well on unseen data. In this regime the model has low bias but high variance. Ideally, our model should be in between these regimes, working at optimal capacity. So this was our recap of some machine learning preliminaries that encompass general concepts shared across both classical and quantum machine learning. In the following section, we build our first quantum machine learning model, and

this will be a quantum classifier that we build using a variational circuit. Now we're ready to start exploring the quantum counterpart of machine learning. First, if you search for quantum machine learning on the internet, you will most likely encounter this picture that we have on the left. This picture summarizes the types of data sets and processing devices that we can consider in machine learning, and their combinations. For instance, CC here means that we have a classical data set at hand, such as images, sounds, or text, that we can store on classical computers and

that we also use a classical computer to run a machine learning algorithm on this classical data set. This is precisely the classical machine learning setting that we discussed in the previous section. If we take a look at QQ, for instance, this means that we are using a quantum computer to process quantum data. Quantum data could be thought of as, for instance, a set of measurement outcomes obtained from a quantum device, or it could be some state that has been prepared on a quantum computer by another algorithm in the first place, and we

can do machine learning on the same quantum computer. However, when people talk about quantum machine learning, they usually refer to the CQ regime that we see here, where the data set is classical and the processing device executing the machine learning algorithm is quantum. For the supervised learning setting that we have on the right, this would mean that the model execution step, shown in a box here, would be executed on a quantum computer. This also means that we need efficient ways of uploading classical data sets onto quantum computers for these algorithms to work well.
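As a rough illustration of the supervised workflow referred to here, the sketch below runs the loop described earlier (random initialization, mean squared error loss, gradient descent updates) entirely classically. The linear `model`, the toy data, and the learning rate are assumptions made for this example and are not from the lecture. In the CQ regime, `model` is the step that would be replaced by a circuit evaluated on a quantum computer, while the loss and the parameter update stay classical.

```python
import random


def model(x, theta):
    """Hypothetical hypothesis f(x; theta): a straight line.
    In the CQ regime this is the step that would run on quantum hardware."""
    return theta[0] * x + theta[1]


def mse_loss(theta, xs, ys):
    """Mean squared error between predictions and true labels."""
    return sum((model(x, theta) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(theta, xs, ys):
    """Analytic gradient of the MSE loss for the linear model above."""
    n = len(xs)
    g_slope = sum(2 * (model(x, theta) - y) * x for x, y in zip(xs, ys)) / n
    g_offset = sum(2 * (model(x, theta) - y) for x, y in zip(xs, ys)) / n
    return [g_slope, g_offset]


def train(xs, ys, learning_rate=0.1, steps=500, seed=0):
    """Gradient descent: start from random parameters, then repeatedly
    step against the gradient of the loss."""
    rng = random.Random(seed)
    theta = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    for _ in range(steps):
        grad = mse_gradient(theta, xs, ys)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


# Toy labeled training set generated from y = 2x - 1 (an assumption for the demo).
x_train = [0.0, 0.5, 1.0, 1.5, 2.0]
y_train = [2.0 * x - 1.0 for x in x_train]

theta = train(x_train, y_train)
final_loss = mse_loss(theta, x_train, y_train)
print("learned parameters:", theta)
print("final training loss:", final_loss)
```

Keeping the model evaluation in its own function mirrors the boxed step on the slide: only that function changes when the hypothesis is computed by a parameterized quantum circuit.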

Another remark is that there are, broadly speaking, two different types of quantum machine learning algorithms: near-term and fault-tolerant. Fault-tolerant algorithms require error correction on quantum computers and are therefore beyond the capabilities of near-term hardware. Some examples in this area include the HHL algorithm for solving a system of linear equations and principal component analysis. On the other hand, near-term quantum machine learning algorithms can be implemented on near-term devices, which have a limited number of qubits with shorter coherence times and perform noisy

operations. These practical applications include quantum support vector machines and quantum neural networks, and on our path to quantum utility, these are the algorithms that we will explore in the second half of our lecture. Here is now the workflow of our first quantum classifier. We have the same outline that we set for the supervised learning setting. First, we have a classical data set that we train the model with, and we need efficient ways of encoding the data set onto our quantum computer, because now the model, which is mathematically our

approximation function f, or our hypothesis, will be computed on a quantum computer, and this will result in preparing some state on the quantum hardware itself. Next, we need to define a quantum circuit that also relies on some parameters theta. Similar to the classical setting, as the model trains, the parameters in our quantum circuit will update by learning from the training data. As you can see in this picture, defining a parameterized quantum circuit simply means inserting some unitary gates into our quantum circuit that depend on a set

of parameters. Note that in the literature such parameterized quantum circuits are also referred to as variational circuits or an ansatz, so these terms are used practically interchangeably. Finally, we make measurements on the quantum hardware, either to get a probability distribution over possible outcomes or to estimate the expectation value of an observable. We can then classically post-process these outcomes to decide what our model predicted, for instance which class label is predicted for a given input from the data set. Here is the task that we consider in this section.

The goal is to do a binary classification, meaning we would like the model to learn and predict which of the two classes a certain data point belongs to, and we label these two classes as plus one and minus one. In summary, first we encode the classical data into a quantum state and then apply a parameterized model. Finally, we measure the circuit to extract labels, which are fed into the cost function that the classical optimizer uses to update the model parameters until it converges. Now we will take a look at some standard methods

of uploading classical data to quantum computers. The first method that we have here is called basis encoding, and as the name suggests, in this method different data points are encoded as different computational basis states of the qubits. As a specific example, consider the 2×2 image that we have above and assume that it has the corresponding pixel values associated with it. First we construct the feature vector using these pixel values. We then convert these features to a binary representation. In this representation, every bit is associated with one qubit that is prepared

in the same computational basis state. To prepare the |1⟩ basis state, we can simply flip the qubit state with an X gate. So in the example that we see here, we would need eight qubits to encode our classical data. The second data encoding method that we can employ is called amplitude encoding, and here we encode the data directly into the amplitudes of the quantum state vector itself. This encoding strategy is more advantageous than the previous one in terms of qubit count, because n qubits live in a 2^n-dimensional Hilbert

space, and therefore we can encode data features using only a logarithmic number of qubits. In our example with four features, as you can see, we only need two qubits for this encoding method. But the downside of this method is that the circuit that prepares the state, the unitary U alpha that we see here, can get quite deep for general states that we want to encode. Looking at our specific example, to encode our features as amplitudes we first need to normalize the feature vector, since wave functions themselves are normalized.
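The classical preprocessing for amplitude encoding can be sketched as follows. This is a minimal illustration, assuming hypothetical pixel values and zero-padding when the number of features is not a power of two; it only computes the target amplitudes and the qubit count. Synthesizing the actual state preparation circuit, which is the expensive part mentioned above, is not shown.

```python
import math


def amplitude_encode(features):
    """Return (number of qubits, normalized amplitudes) for a feature vector.

    n qubits span a 2**n dimensional space, so only about log2(len(features))
    qubits are needed. Missing entries are zero-padded (an assumption made
    for this sketch).
    """
    n_qubits = max(1, (len(features) - 1).bit_length())
    dim = 2 ** n_qubits
    padded = list(features) + [0.0] * (dim - len(features))

    # Quantum states are normalized, so divide by the Euclidean norm.
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0.0:
        raise ValueError("cannot encode the all-zero vector")
    return n_qubits, [v / norm for v in padded]


# Hypothetical pixel values of a 2x2 image, flattened into four features.
pixels = [3.0, 1.0, 0.0, 2.0]
n_qubits, amplitudes = amplitude_encode(pixels)

print("qubits needed:", n_qubits)
print("amplitudes:", amplitudes)
print("sum of squared amplitudes:", sum(a * a for a in amplitudes))
```

With four features the function reports two qubits, matching the count given in the lecture, whereas basis encoding of the same image needed eight.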

Then we can use software functions like Qiskit's built-in StatePreparation class to synthesize the encoding unitary U that we see here. A third method that we can use for data encoding is called angle encoding, and here we encode the features directly into the angles of parametric gates. For instance, by assigning each feature one qubit, we can construct the following circuit that we see here. Here we choose to encode the features into the angles of the X rotation gates, so each qubit is initialized to one of the

superposition states seen below. The overall state after angle encoding is the tensor product state that we see in the first equation. There's also a variant of this method called dense angle encoding. This version encodes two features in a single qubit by placing the second feature into the relative phase of the qubit, as we see in the second equation, and therefore dense angle encoding requires half the qubits that regular angle encoding requires. Now we introduce a more sophisticated data encoding strategy referred to as higher-order encoding. We

also call this a feature map, and we will actually revisit this concept more in depth when we discuss quantum kernel based methods in the next section. For now, as a data encoding method, the reason why we call this higher-order encoding is that instead of only loading the features into the rotation gate angles, as in angle encoding, we also encode a higher-order function of the data features into the rotation gates. Here in this example, you can see that we start off with Hadamard gates bringing the qubits into superposition, followed by

Z rotation gates. Then we create entanglement between the qubits with the two-qubit CNOT gate. Finally, we add a rotation gate whose rotation angle is the product of the two input features. Due to this product, we effectively encode higher orders of our data. We can also repeat this data encoding block in the circuit, increasing the depth of the feature map, and the reason for this will become clearer in the next section. In addition, this particular feature map that we have here is called the ZZ feature map and has been implemented

in Qiskit as a function, if you would like to take a closer look. Here we summarize the different data encoding methods that we have discussed so far. The main takeaway from this slide is that different data encoding strategies require different resources in terms of the number of qubits and the circuit depth, or state preparation runtime. This trade-off is especially apparent if you take a look at amplitude encoding in this table, as it requires the fewest qubits among all methods listed here, but in turn it suffers in

runtime or circuit depth. So now, going back to our quantum classifier: so far we talked about different ways of constructing the U(x) in our variational model. Now the second step is to construct and synthesize a good ansatz W that will eventually learn to do the classification task after having updated its theta parameters. Designing a good ansatz for a specific problem is actually a very active research area and quite often might not have a very straightforward answer. One of the reasons for this is that we need to design variational

circuits in a way that is both expressive and hardware-efficient. An expressive ansatz effectively means that it contains enough degrees of freedom such that it can reach the optimal part of the solution space with its tunable parameters. While constructing such circuits, we're also constrained by the circuit depth we can have before noise fully accumulates, and we also need to consider the topology of the hardware. For instance, having too many two-qubit gates between qubits that are far away from each other results in an overhead of SWAP gates when we transpile

the circuits, and is therefore not desirable. A good ansatz in practice should therefore balance expressivity and hardware efficiency. In this particular example, for instance, the ansatz consists of single-qubit rotation gates with tunable parameters and two-qubit CNOT gates, which are laid out in a linear nearest-neighbor fashion to create entanglement. I also referenced the paper below if you would like to learn more about hardware-efficient ansatzes in the context of quantum machine learning. Okay, so now that we have our variational circuit, we need to make measurements to extract the

information from the model. For classification, the labels that the model predicts will later be used to compute the loss function, and eventually a classical optimizer will make updates to the parameters to minimize the cost. For our binary classification task, there are two simple ways we can use to make a binary decision on the label. In the first one, we sample the probability distribution over all possible bit strings and then use the parity function to make a binary decision from the sampled distribution. If, for instance, the probability of observing

even parity is greater than the probability of observing odd parity, then the model outputs the plus one label, and if not, it outputs the minus one label. In Qiskit, we can obtain the probability distribution using the Sampler primitive. Another option is to discard all qubits except for one and to measure the expectation value of, for instance, the Pauli Z observable on that qubit. Since this value is in the interval [-1, +1], we can use zero as the cutoff value and output the label plus one if the expectation is greater

than zero, and minus one otherwise. In Qiskit, we can get the expectation value using the Estimator primitive. After extracting labels, we use a classical optimizer to lower the cost function. There are several different types of optimizers that one can choose from, and broadly speaking we can categorize them as gradient-free or gradient-based optimizers. In the case that we use an optimizer which is gradient-based, we need a way to compute the partial derivatives of the model output with respect to the individual parameters that we

will update in the circuit, and we can do this by using the so-called parameter-shift rule. What the paper referenced below has shown is that the gradient with respect to a parameter can be computed by measuring two circuits. First we shift the parameter by an amount of π/2 and compute the output, and then we repeat this with a shift of -π/2 in the opposite direction. The difference of these two circuit estimations, divided by a factor of two, is the gradient with

respect to that particular parameter, and we actually need to repeat this for all parameters in our model. As a remark, there's also an optimizer called SPSA that is quite often favored in practice on quantum computers due to its lower computational cost and is known to perform well under noise. This method has its own way of approximating the gradients, which is less costly than the parameter-shift rule that we have here. Therefore I would also encourage you to explore SPSA for your own quantum machine learning tasks on near-term quantum computers. Up until

this point, we discussed the ingredients of a variational quantum classifier, and in this new section we will introduce the idea of quantum kernels and quantum support vector machines, which are based on quantum kernels. We first begin by revisiting the concept of support vector machines from a classical machine learning point of view. So again, let's assume that we have a binary classification task where we're trying to place the training data into two classes of labels, and this time let's take a look at this two-dimensional data that we see

on the left side of the plot. One thing that we can do when we're looking at this data is to try to find a line, or a linear decision boundary, that separates these two classes. Because we're in two-dimensional space right now, the word that I use is line, but if the data lives in a higher-dimensional space, in general what we're looking for in this setting is a separating hyperplane. And how can we do this? So going back to this two-dimensional example,

we're trying to find a line, right? But if you take a look, there are infinitely many lines that can separate the two classes of data that we have here. If we tilt this line a little bit in that direction, that is still a separating hyperplane, or a line that correctly separates these data points into their corresponding labels. But a particularly good way of thinking about this is to try to come up with the line such that, when it separates the two classes, it has the maximal distance

from itself to the data points that are nearest to the decision boundary. This is called the margin in the literature, and what we're trying to do in the support vector machine algorithm is to find the line that has the greatest margin to the points nearest to the separating line itself. These nearest points, after we run the algorithm, are called support vectors; that's where the algorithm gets its name from. So if we take a look at what we call

the primal formulation: in this setting, again because we're looking for a linear decision boundary, our hypothesis, or the model that we're trying to learn, has the form that we see here, which is a linear equation. In this model the goal would be to learn the theta parameters, and the function itself has this inner product form plus some b. If we do a little bit of mathematical manipulation to come up with another formulation of the same problem setting, we end up with the so-called dual formulation, and as

you can see here, the dual formulation this time includes the training data x_i and y_i. The x_i are the data points that we see on the left, and the y_i are the correct labels that correspond to the data points in the training set. So we slightly reformulated to the dual version of the equation, and it actually looks a little bit more complicated than the primal formulation itself, but the reason why we switch to this particular form of the support vector machine equation will become a little bit clearer when we

move on to the next slide. The slight difference between the primal and the dual formulation is also the parameters that we're learning out of the algorithm. In the primal formulation, our goal was to learn the correct theta parameters, but in the dual formulation that we have here, the model will learn the alpha parameters from the training data. Okay, so the idea of support vector machines is very well motivated by the case on this slide, which is when data is not linearly

separable. If we take a look at this picture on the left, we see two classes of two-dimensional data with input features x1 and x2, which are laid out on the two axes that we see on this plot. Again, our goal was to find a linear hyperplane, or a line, that separates these two classes, but as you can see very clearly, there is no such line if we constrain ourselves to the features that this data set currently has. But one trick

that we can do, if you take a look at the right side of this slide, is to increase the number of features that we have in our feature map: we can define the product of the two initial features x1 and x2 and add that to our original feature vector. Here we can see that we have now moved to three-dimensional space, but there we can immediately find a separating hyperplane that separates the two classes that we have in our data set. So this should

motivate why moving to higher-dimensional spaces can result in us being able to find linear decision boundaries that we're not able to find in the original lower-dimensional feature space. Okay, now that we know that using a feature map to get to a higher-dimensional space can allow us to successfully find a separating hyperplane, we can replace the original feature vector x that we have in these equations with the feature map vectors. This simply means that initially we had our original feature vector; in this example it

was x1 and x2, and then we replace it in both formulations with this φ(x), which is the feature map. Now these vectors include x1, x2, and their product, if we specifically constrain ourselves to the previous example. So we know that this is beneficial, and if we take a look at why we switched to the dual formulation in the first place, the answer lies here in the second column of equations. If we stick to the primal formulation, solving

this requires us to explicitly build inner products between the feature map and the theta parameters that we're optimizing over. But this feature map vector can be very high-dimensional, and in fact it should be, because we just saw the idea that we can find a separating hyperplane if we move to higher-dimensional spaces. If we take a closer look at the dual formulation, these explicit inner products between the optimizable parameters and the feature space are now replaced by terms that only rely on inner products between

the feature maps themselves. So there are cases where we are able to compute this value, this inner product between the feature map vectors, without ever explicitly computing the feature map itself. This is what's referred to as the kernel trick in the literature. In fact, these feature maps can be infinite-dimensional, so it's not computationally feasible to compute infinitely many features of an infinite-dimensional feature map, but we might still be able to compute these inner products. And as

you can see, these feature vectors take in different values from the data set, and all we need to do is plug different combinations of data values into this inner product, and we need to come up with an efficient way of computing the kernel without ever explicitly building the high-dimensional feature vector. So this was the idea of a classical kernel. Now we are ready to discuss why we would want to do this on a quantum computer and switch to a quantum counterpart of the kernel definition. So quantum

computers, or quantum states in general, allow for a very natural definition of a quantum kernel. If you recall the higher-order encoding method that we discussed when building a quantum classifier, a way of thinking about it is simply to interpret the encoding of an input x into a high-dimensional quantum state φ(x) as a feature map. As we will see on the next slide, we can use a quantum computer to estimate a kernel, and we can then run the support vector machine optimization classically on the quantum kernel itself. The important question

is: when can we expect to gain an advantage from computing the kernel on a quantum computer? The answer is not so simple, but what we know is that a necessary but not sufficient condition is that quantum kernels can only be expected to do better than classical kernels if they are hard to estimate classically. This needs to be satisfied for us to be able to expect an advantage from computing the kernel on a quantum computer instead of a classical one. Otherwise we can simply estimate the kernel classically, and we can't expect to gain any

more advantage from doing this on a quantum computer. It was actually shown that learning problems do exist for which learners with access to quantum kernel methods have a quantum advantage over all classical learners. So we do know of some data sets, with a specific type of group structure, that can definitely benefit from having access to a quantum kernel. To read more about these technical results, I would encourage you to check out this paper that I have listed here, Covariant quantum kernels

for data with group structure, and this is one example of where we can expect a quantum advantage using a quantum kernel. So after having defined the quantum kernel, we can now define the quantum support vector machine algorithm. The part of the support vector machine algorithm that moves onto the quantum computer is the kernel computation itself. So first we define the circuit, which is the quantum kernel estimator. If you recall, this kernel is an inner product between the feature maps that take in different data inputs,

data features as inputs, so they're indexed differently: x_j and x_i. We can even think of this as a matrix with values indexed as K_ij. We need to compute these values efficiently. If you take another look at this expression that we would like to obtain from a quantum computer, you can also see that if we start in the zero state and prepare the state that we see on this picture here, this also corresponds to measuring an all-zero bit string out of a quantum computer. So this

expression that we would like to compute translates to the probability of obtaining an all-zero bit string out of the quantum computer if we were to prepare this state. And how do we do that? We define our feature map, and first let it be a unitary that takes in x_i, one of the data points. Then we implement the adjoint of this feature map unitary and let it take in the value of x_j. That's the state preparation circuit for our quantum kernel

estimator, and then we make measurements. So how does the full algorithm work? For i, j in the training set, we prepare this state and, as I mentioned earlier, let K_ij be the probability of measuring an all-zero bit string. After we obtain the K_ij values, we simply plug them into the dual form. The rest of the computation is handled classically, and we can solve this, similar to what I had mentioned previously, using quadratic programming software or some other optimization software that is tailored towards solving
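
The kernel-estimation loop described here can be sketched with a small NumPy statevector simulation. This is an illustration under stated assumptions, not the circuit from the slides: the feature map `feature_map_unitary` (one RY rotation per qubit) and the training points `train_x` are hypothetical, and a real experiment would estimate the all-zero probability from hardware shots.

```python
import numpy as np

def ry(angle):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(angle / 2.0), np.sin(angle / 2.0)
    return np.array([[c, -s], [s, c]])

def feature_map_unitary(x):
    """Hypothetical feature map U(x): one RY rotation per qubit, one qubit per feature."""
    unitary = np.array([[1.0]])
    for value in x:
        unitary = np.kron(unitary, ry(value))
    return unitary

def kernel_entry(x_i, x_j, shots=None, rng=None):
    """Probability of the all-zero bit string after U(x_i), then U(x_j)^dagger, on |0...0>."""
    zero_state = np.zeros(2 ** len(x_i))
    zero_state[0] = 1.0
    final_state = feature_map_unitary(x_j).conj().T @ feature_map_unitary(x_i) @ zero_state
    prob_all_zero = float(np.clip(np.abs(final_state[0]) ** 2, 0.0, 1.0))
    if shots is None:
        return prob_all_zero  # exact value from the simulated state
    if rng is None:
        rng = np.random.default_rng()
    return rng.binomial(shots, prob_all_zero) / shots  # finite-shot estimate

def kernel_matrix(data, shots=None, rng=None):
    """Fill K_ij for every pair (i, j) of training points."""
    m = len(data)
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel_entry(data[i], data[j], shots, rng)
    return K

# Hypothetical training inputs with two features each.
train_x = np.array([[0.1, 0.5], [1.2, -0.3], [2.0, 0.9]])
K_exact = kernel_matrix(train_x)
K_shots = kernel_matrix(train_x, shots=1000, rng=np.random.default_rng(0))
print(np.round(K_exact, 3))
```

The resulting matrix is what gets handed to the classical solver. For example, scikit-learn's `SVC(kernel="precomputed")` accepts such a precomputed kernel matrix, although any solver for the SVM dual problem would do.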

SVMs. What this algorithm does is return these α_i values, because we're in the dual form, and now we can use the learned parameter values α_i to make a decision on the labels of new, unseen input features. We're now at the last section of our lecture, where we will explore quantum neural networks as an application of near-term quantum machine learning. First, we begin by reminding ourselves of the concept of classical feedforward neural networks. A neural network is a computational model which is loosely

inspired by the structure and function of neurons in a brain. These neurons, which are nodes as we see them here in this picture, are organized into layers and are connected through weights. They process signals that they receive from a previous layer and transmit them to the next one through connections. If we focus on one of these neurons, we have the building block of a neural network, which is called a perceptron. Mathematically speaking, a perceptron takes in an input vector x and computes its inner product with a trainable weight vector, plus

a bias. Very importantly, this perceptron applies a nonlinear activation function on top of this computation. These nonlinear activation functions are responsible for the great expressive power of neural networks. Another way to think about it is that, as we see here, if we stack these inner products together in this neural network picture, they are linear matrix multiplications. Suppose that we didn't have these nonlinear activation functions in between each layer, applied by the neurons themselves. Then we would be able to apply one linear function after

another, one matrix multiplication after another, and this whole giant model that we have here would be one very big matrix multiplication itself. So this would simply be a linear model that wouldn't be able to capture the complex patterns that deep neural networks can. Deep neural networks, I really want to emphasize, owe their great expressivity to these nonlinear activation functions that they have in between layers, which are therefore fundamental in neural networks. So the next thing that we're going to be discussing in this regard, when we make an
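
The argument about stacked linear layers can be checked numerically. In this minimal sketch the weight matrices `W1` and `W2`, the input `x` and the choice of `tanh` as activation are illustrative assumptions, and biases are left out of the stacked example for brevity.

```python
import numpy as np

def perceptron(x, w, b, activation=np.tanh):
    """One perceptron: inner product with the weights, plus a bias, then a nonlinearity."""
    return activation(np.dot(w, x) + b)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # first layer weights (hypothetical sizes)
W2 = rng.normal(size=(2, 4))  # second layer weights
x = rng.normal(size=3)

# Without activations, two layers collapse into a single matrix multiplication.
stacked_linear = W2 @ (W1 @ x)
single_matrix = (W2 @ W1) @ x

# With a nonlinear activation in between, no single matrix reproduces the map in general.
with_activation = W2 @ np.tanh(W1 @ x)

print(np.allclose(stacked_linear, single_matrix))   # collapse holds
print(np.allclose(with_activation, single_matrix))  # collapse fails
```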

attempt to build a quantum perceptron, is how to introduce this nonlinearity into our quantum circuits. This is because, without additional consideration, quantum circuits can only implement unitary operations, and unitary operations are simply linear operations, right? So we need to think of additional ways to implement nonlinearity if we would like to find the quantum counterpart of a perceptron or a neural network that has the expressivity to approximate a function like this true function g, which can capture complex patterns. And there are several

different ways that one can introduce nonlinearity to a quantum circuit. There have been several different attempts in this regard, and several sources of nonlinearity can be thought of, for instance making measurements. Measurements are nonlinear operations that we can perform on a quantum circuit. If we do these in between circuits, with a technique called mid-circuit measurements or dynamic circuits, and continue with the quantum gates after having made measurements in the circuit, this would be a source of nonlinearity in our
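
One simple sense in which unitaries are linear while measurement statistics are not can be sketched as follows. The Hadamard gate, the two basis states and the coefficients `a` and `b` are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate, a unitary
psi = np.array([1, 0], dtype=complex)         # |0>
phi = np.array([0, 1], dtype=complex)         # |1>
a, b = 0.6, 0.8                               # a^2 + b^2 = 1, so the combination is normalized
combo = a * psi + b * phi

def prob_zero(state):
    """Probability of measuring 0, quadratic in the amplitudes."""
    return float(np.abs(state[0]) ** 2)

# Unitary evolution respects superposition: U(a psi + b phi) = a U psi + b U phi.
linear_lhs = H @ combo
linear_rhs = a * (H @ psi) + b * (H @ phi)

# The measurement probability does not: p(a psi + b phi) != a p(psi) + b p(phi).
p_combo = prob_zero(H @ combo)
p_sum = a * prob_zero(H @ psi) + b * prob_zero(H @ phi)
print(np.allclose(linear_lhs, linear_rhs), p_combo, p_sum)
```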

quantum circuit. Also, one of the initial papers that proposed the quantum counterpart of a perceptron was utilizing this quantum Fourier transform based method, and this paper also discusses how we can introduce nonlinearity in our circuits. Again, there are several different ways, but these are examples of how we can introduce nonlinearity to our quantum circuits. So we can now look at one way of constructing the quantum counterpart of a neural network in this model here. So on

the left we again have the classical neural network setting, and the information flow that we have in these two settings is slightly different. In the classical setting that we see here, and this is a classical feedforward neural network, the information flow, when we start with the input and would like the model to output a label for us, goes from left to right. So the input goes in and goes through one layer, there are connections that take the signal from the first layer to the second one, and so

forth. As I mentioned, there are nonlinear activations in between, but the information flow in this particular structure goes from left to right when we ask the model to give us a label or an output. The only time when the information flow is actually in the reverse direction is during backpropagation, and that is when we're training our model. We compute gradients with respect to the trainable parameters, the weights and biases, and the backpropagation algorithm now reverses the information flow: it starts at the last layer and makes sure that gradients

are propagated through the network structure to update the trainable parameters. But when computing the output, the information flows from the input layer to the output layer, from left to right. In the quantum neural network setting, one particular construction that has the expressivity to resemble a classical neural network and to approximate a true function g in the general setting consists of the following idea, and it uses the so-called data re-uploading

strategy. What we do here, as opposed to the classical setting, is that you see these different unitary blocks that are encoded in different colors. This S(x) here is the data encoding block, but it repeats itself within the circuit. So we first load the data with this unitary, then we have our ansatz within the structure with the trainable set of parameters, this θ_1 vector. As we progress through the circuit, we repeat the data encoding multiple times, so it's not just as

the initial step, as we see it here. This data re-uploading strategy actually has interesting theoretical implications. One paper that I listed here shows that, with the help of multiple data re-uploading steps, even a single qubit provides sufficient computational capabilities to construct a universal quantum classifier when assisted with a classical subroutine. So being able to upload the data multiple times, as we do in this circuit, gives us enhanced expressive power as a universal function approximator, and this allows the quantum neural network to
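
The data re-uploading idea can be sketched for a single qubit with NumPy. The specific gates are assumptions made for illustration: the encoding block S(x) is taken to be RX(x) and the trainable block to be RY(θ), and the function name `reuploading_expectation` is hypothetical.

```python
import numpy as np

def rx(angle):
    """Single-qubit rotation about the X axis."""
    c, s = np.cos(angle / 2.0), np.sin(angle / 2.0)
    return np.array([[c, -1j * s], [-1j * s, c]])

def ry(angle):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(angle / 2.0), np.sin(angle / 2.0)
    return np.array([[c, -s], [s, c]], dtype=complex)

def reuploading_expectation(x, thetas):
    """<Z> after alternating the encoding S(x) = RX(x) with a trainable RY(theta) per layer."""
    state = np.array([1.0, 0.0], dtype=complex)
    for theta in thetas:
        state = rx(x) @ state      # data-encoding block, repeated in every layer
        state = ry(theta) @ state  # trainable block
    return float(np.abs(state[0]) ** 2 - np.abs(state[1]) ** 2)

x = 0.7
one_layer = reuploading_expectation(x, [0.0])
three_layers = reuploading_expectation(x, [0.0, 0.0, 0.0])
print(one_layer, three_layers)
```

With all trainable angles set to zero, one layer gives cos(x) and three layers give cos(3x), so each repetition of the encoding gives the model access to higher frequencies in x. Nonzero angles then mix these frequencies, which is one way to see where the added expressive power comes from.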

approximate complex functions. A specific type of quantum neural network that I would like to discuss next, because it is quite suitable for near-term hardware, is called a quantum convolutional neural network, which is inspired by a classical convolutional neural network. I'd like to spend just a few minutes going over this idea so that we can discuss its quantum counterpart. A convolutional neural network consists of alternating layers, which are called convolutional layers and pooling layers. In a convolutional layer we have matrices that slide

over the input features, or, if they occur within the neural network itself, they slide over the signals that they receive from a previous layer. Mathematically, they perform convolutions on the inputs that they're acting on, and they extract local information. If this is, for instance, the first step and we have an image recognition task, we can feed in the pixel values of an image, and our task will be to determine, in this case, whether the image contains a Qiskit logo

or not. So we take in the pixel values, put them together in layers, and a convolution matrix with trainable weights will slide over these pixel values to extract local information from them. What happens in a pooling layer is practically dimensionality reduction: we discard some of the information as we process it with the help of pooling layers. There are several different ways we can achieve this. For instance, we can use the maximum function, the averaging function, or the sum. Here, for instance, we would like to get from a 4x4

dimensional matrix to a 2x2 dimensional one, so we group the entries into these groups encoded in different colors, and here we apply, for instance, the maximum function. So we get the value three here and discard all of the values that are less than that. This helps reduce the dimension of these networks. In the end, there is usually a layer called the fully connected layer, with connections across all possible neurons, which then outputs a probability distribution over possible classes. Ideally, the
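
The pooling step from a 4x4 matrix to a 2x2 one can be sketched in a few lines of NumPy. The entries of `feature_map` are made up for the illustration and are not the values on the slide. The helper `pool2x2` is a hypothetical name.

```python
import numpy as np

def pool2x2(matrix, reduce=np.max):
    """Split the matrix into non-overlapping 2x2 groups and reduce each group to one value."""
    rows, cols = matrix.shape
    blocks = matrix.reshape(rows // 2, 2, cols // 2, 2)  # blocks[i, a, j, b] = matrix[2i+a, 2j+b]
    return reduce(blocks, axis=(1, 3))

# Hypothetical 4x4 input.
feature_map = np.array([[1, 3, 2, 0],
                        [0, 2, 1, 4],
                        [5, 1, 0, 2],
                        [2, 2, 3, 1]])

pooled_max = pool2x2(feature_map)           # keep the largest value in each group
pooled_mean = pool2x2(feature_map, np.mean) # averaging variant
pooled_sum = pool2x2(feature_map, np.sum)   # sum variant
print(pooled_max)
```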

class that labels this picture as having a Qiskit logo will have the greatest probability. This is how a convolutional network works in principle, in the classical machine learning sense. The quantum counterpart of a classical convolutional neural network, the so-called quantum convolutional neural network, is in a sense inspired by a similar methodology, but there are quite a few differences, so I think it's good to use the word "inspired" here, because we can't have sliding operations in the same sense that we have in

a classical computation. But similarly, you can see we have unitaries that repeat themselves as if they're sliding across qubits. These are of course fixed, as they're placed as fixed gates on the circuit itself, but you can see that the same unitary block repeats itself in a nearest-neighbor topology in alternating layers. So the unitaries that constitute our convolutional layer ansatz are also repeated throughout the architecture. But the question is, how do we perform pooling in a quantum convolutional neural network? There are mainly two

different ways that we can do this. Again, we have two goals: the pooling layer will reduce the dimensionality of the circuit, but it will also be responsible for introducing nonlinearity into our neural network. As we discussed in the first slide, introducing nonlinearity is of great importance to be able to capture complex relations as we use these models as function approximators. One way we can do this is by employing something called mid-circuit measurements: we can also have measurements in the middle of the circuit instead

of having them at the very end. What we do here is pair qubits, and for every pair of qubits we make a measurement on one of the qubits at the pooling layer and, depending on the outcome, apply the unitary V_1. This could be something like: if you measure one from this qubit, then apply the unitary V_1, and if you measure zero, then don't do anything. It could be a decision like that. So this is the mid-circuit measurement version of
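
The measure-then-conditionally-apply rule can be simulated for one pair of qubits. This is a sketch under assumptions: the function `measure_and_pool` is hypothetical, V_1 is taken to act on the qubit that is kept, and the Bell state with V_1 = X is chosen only because the outcome is easy to verify.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)  # stand-in for the conditional unitary V_1

def measure_and_pool(state, v1, rng):
    """Measure the second qubit of a pair. If the outcome is 1, apply v1 to the first qubit."""
    psi = state.reshape(2, 2)  # psi[a, b]: a = kept qubit, b = measured qubit
    prob_one = float(np.sum(np.abs(psi[:, 1]) ** 2))
    outcome = int(rng.random() < prob_one)  # sample the measurement result
    kept = psi[:, outcome] / np.linalg.norm(psi[:, outcome])  # collapsed state of the kept qubit
    if outcome == 1:
        kept = v1 @ kept
    return outcome, kept

bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
outcome, kept = measure_and_pool(bell, X, np.random.default_rng(0))
print(outcome, np.round(kept, 3))
```

For this particular input the kept qubit ends up in |0> whichever outcome is observed, which shows how the conditional unitary lets the remaining qubit carry on after its partner has been measured.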

doing pooling. The other way we can introduce nonlinearity in the circuit is by this tracing-out operation, which is a mathematical description for discarding qubits and not performing any operations on them. What we would do is let this qubit sit until the end of the circuit, but we wouldn't perform any measurements on it. Mathematically speaking, this corresponds to an operation that we call tracing the qubits out, and it introduces nonlinearity in the system. So at each pooling layer we
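
Tracing a qubit out can be written down directly for a two-qubit pure state. In this minimal sketch the helper names and the two example states are assumptions. It only illustrates that discarding a qubit is not a unitary operation on the one that remains: an entangled pair leaves behind a mixed state.

```python
import numpy as np

def trace_out_second_qubit(state):
    """Reduced density matrix of the first qubit of a two-qubit pure state."""
    psi = state.reshape(2, 2)  # psi[a, b] = amplitude of |a b>
    return psi @ psi.conj().T  # rho[a, a'] = sum over b of psi[a, b] * conj(psi[a', b])

def purity(rho):
    """Tr(rho^2): 1 for a pure state, 1/2 for a maximally mixed qubit."""
    return float(np.real(np.trace(rho @ rho)))

bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)                  # entangled pair
product = np.kron([1, 0], [1 / np.sqrt(2), 1 / np.sqrt(2)]).astype(complex)  # unentangled pair

rho_bell = trace_out_second_qubit(bell)
rho_product = trace_out_second_qubit(product)
print(purity(rho_bell), purity(rho_product))
```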

discard half of the qubits, and we continue with a convolutional layer and another pooling layer. We can repeat this as many times as desired for that particular application. Then occasionally, again, we have this block that we call the fully connected layer, but in a quantum setting this is an entangling unit that acts on the remaining qubits. Finally, we make a measurement on the qubits and use the outcomes to make a decision on which label is predicted for the data that we

provided in the first place. Quantum convolutional neural networks, as I had mentioned, are particularly near-term friendly algorithms due to their main properties. The first is that QCNNs have a logarithmic number of layers and parameters with respect to the number of qubits that they start the circuit with, and this results in shorter-depth circuits compared to some other algorithms that we have considered. The second very desirable property, which we will discuss in the next slide, is that they don't suffer from the problem of so-called barren plateaus, and

this becomes a very important issue in training quantum neural networks. It is a great theoretical result that these circuit architectures don't suffer from this problem, making them very friendly architectures for near-term quantum computing and quantum machine learning. So the last thing that I want to do in this lecture is to briefly summarize the issue of barren plateaus and to emphasize that quantum convolutional neural networks don't suffer from it. This issue of barren plateaus

becomes a problem with randomly initialized deep circuits when we try to train them. The phenomenon is that the gradients of the cost function vanish exponentially with the number of qubits under certain circumstances. This especially becomes a problem if we have a circuit that has a structure similar to this, like an alternating layered ansatz. Things become quite hopeless if the circuit is quite deep, and especially if the cost function is global,

meaning that the cost function is a function of almost the entirety of the qubits that we have in the circuit. For this type of ansatz, there is something that we can do: if we constrain our ansatz to consist of shorter-depth circuits, meaning we construct shallower circuits, and if we constrain our loss functions, as we train this algorithm, to be local rather than global, then there is hope for mitigating such issues. This is
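
The difference between global and local cost functions can be made concrete with a toy model that is commonly used for this purpose. It is not the circuit on the slide and is far simpler than a deep random ansatz. Assume n qubits, each rotated by RX(θ_k) from |0>. The global cost is one minus the probability that all qubits read zero, and the local cost is one minus the average single-qubit zero probability. All function names are hypothetical.

```python
import numpy as np

def global_cost_grad(thetas):
    """d/d(theta_0) of C_global = 1 - prod_k cos^2(theta_k / 2)."""
    return 0.5 * np.sin(thetas[0]) * np.prod(np.cos(thetas[1:] / 2.0) ** 2)

def local_cost_grad(thetas):
    """d/d(theta_0) of C_local = 1 - (1/n) * sum_k cos^2(theta_k / 2)."""
    return 0.5 * np.sin(thetas[0]) / len(thetas)

def sampled_grad_variance(grad_fn, n_qubits, samples, rng):
    """Variance of the gradient over uniformly random initial angles."""
    grads = [grad_fn(rng.uniform(-np.pi, np.pi, size=n_qubits)) for _ in range(samples)]
    return float(np.var(grads))

def exact_global_variance(n_qubits):
    """Closed form for this toy model: decays exponentially in the number of qubits."""
    return (1.0 / 8.0) * (3.0 / 8.0) ** (n_qubits - 1)

def exact_local_variance(n_qubits):
    """Closed form for this toy model: decays only polynomially."""
    return 1.0 / (8.0 * n_qubits ** 2)

rng = np.random.default_rng(0)
for n in (2, 4, 8):
    print(n,
          sampled_grad_variance(global_cost_grad, n, 20000, rng),
          exact_global_variance(n),
          exact_local_variance(n))
```

In this model the gradient of the global cost has a variance that shrinks by a constant factor with every added qubit, while the local cost only loses a factor polynomial in n, which is the intuition behind preferring local cost functions.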

a very active research area, and there are also different initialization strategies that one can try to employ to avoid this issue of barren plateaus. Because of the structure that we looked at for quantum convolutional neural networks, they inherently don't suffer from this problem. They discard qubits along the way and have a logarithmic number of parameters and logarithmic circuit depth, and because we discard qubits, the cost function intuitively becomes more local rather than global, relying on fewer qubits. So they have

been theoretically shown not to suffer from this problem, making them great candidates for near-term quantum machine learning applications. Our goal in this lecture was to provide an overview of quantum machine learning algorithms that can be implemented on near-term quantum computers on our path to quantum utility. If you would like to explore any of these concepts in more depth, I would also encourage you to revisit a previous Qiskit Global Summer School, which took place in 2021 and focused exclusively on quantum machine learning. I hope you find this information useful, and thank you very

much for joining today's lecture.