The term “reward” generally refers to the measurable merit of an activated action. The purpose of the reward function is to measure the effectiveness of a classifier in stabilizing the bicycle, i.e. bringing the bicycle back to its upright position from a near-fall position. The problem, however, is that in our case the reward of an activated action cannot be calculated immediately, as the calculation requires knowledge of the system’s response, which occurs with a delay. Consider, for example, the case with as input. If the roll angle and its derivative are both positive, then, viewing the bicycle body from behind (Fig. 3), its center of mass lies on the right side of the negative z axis, …
Note that since no controller is used in this case, the control signal in Case 1 is zero. Moreover, the gravity-induced torque, which is applied to the system as an external torque, is virtually constant because its moment arm (the distance from the pivot point to the point where the force acts) changes insignificantly during . This justifies the constant acceleration assumption.
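The constant acceleration assumption above means the uncontrolled roll angle at the end of the interval follows from elementary kinematics. The following is a minimal sketch of that calculation, not the authors' implementation; the function name and the parameters `phi0`, `phi_dot0`, `phi_ddot` and `dt` are hypothetical names introduced here for illustration, and the angular acceleration is taken as a given input rather than derived from the bicycle model.

```python
def uncontrolled_roll_angle(phi0, phi_dot0, phi_ddot, dt):
    """Roll angle after an interval dt, assuming constant angular acceleration.

    phi0      -- roll angle at the start of the interval (rad), hypothetical name
    phi_dot0  -- roll rate at the start of the interval (rad/s), hypothetical name
    phi_ddot  -- angular acceleration, assumed constant over dt (rad/s^2)
    dt        -- length of the interval (s)
    """
    # Standard constant-acceleration kinematics applied to the roll angle.
    return phi0 + phi_dot0 * dt + 0.5 * phi_ddot * dt ** 2
```

With zero control input, this value can serve as the baseline against which the controlled response is compared.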
Case 2) The bicycle is controlled by the control signal proposed by CCSDR: In this case, the control signal should be applied to the unmanned bicycle in a real-world environment and the resulting roll angle should be measured at the end of . Solving the governing equations (3-6) for by the fourth-order Runge-Kutta method gives the value of the roll angle at the end of . The aim of the controller is to stabilize the bicycle in the upright position. Calling the roll angle , the reward would be calculated using the following equation.
According to the above equation, if the roll angle of the bicycle is smaller in the controlled mode than in the uncontrolled mode (i.e. the bike is closer to the upright position), the action is deemed effective and the reward is assigned to it, otherwise no reward is …
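The comparison described above can be sketched as a simple binary rule. This is an illustrative sketch only: the function name, the `reward_value` parameter and the use of absolute values to express "closer to the upright position" are assumptions made here, since the original reward equation and its magnitude are not reproduced in this text.

```python
def reward(phi_controlled, phi_uncontrolled, reward_value=1.0):
    """Return a reward if the controlled roll angle is closer to upright.

    phi_controlled   -- roll angle at the end of the interval with control (rad)
    phi_uncontrolled -- roll angle at the end of the interval without control (rad)
    reward_value     -- hypothetical reward magnitude (assumption, not from the source)
    """
    # Assumption: "smaller" is read as smaller in magnitude, so that leaning
    # left or right by the same amount counts as equally far from upright.
    if abs(phi_controlled) < abs(phi_uncontrolled):
        return reward_value
    return 0.0
```

For example, `reward(0.05, 0.10)` yields the reward, while `reward(0.20, 0.10)` yields none.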
As usual, applying the GA consists of three phases: selection, crossover and mutation. In the selection phase, using roulette wheel selection, two classifiers (called parents) are chosen probabilistically in a ‘survival of the fittest’ manner, where classifiers with a higher fitness value are more likely to be selected than those with a lower one. The crossover operator is applied to the two selected parents at a predefined rate. Then, at another predefined rate, each bound (lower or upper) of the generated offspring may be mutated. The resulting offspring are inserted into the population and, in order to keep the population size constant, two other classifiers are deleted. The removed classifiers are low-fitness ones that have participated in a threshold number of experiments, that is, have had sufficient time for their parameters to be accurately
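The selection, crossover, mutation and deletion phases described above can be sketched as a single GA step. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Classifier` fields, the one-point crossover over the concatenated bounds, the uniform mutation step, the averaged offspring fitness, the bound repair, the fallback deletion rule and all default rates are hypothetical choices introduced here for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Classifier:
    lower: list          # lower bounds of the interval condition
    upper: list          # upper bounds of the interval condition
    fitness: float       # assumed strictly positive
    experience: int = 0  # number of experiments the classifier took part in


def roulette_select(population, rng):
    """Pick one classifier with probability proportional to its fitness."""
    total = sum(c.fitness for c in population)
    pick = rng.uniform(0.0, total)
    running = 0.0
    for c in population:
        running += c.fitness
        if running >= pick:
            return c
    return population[-1]  # guard against floating-point round-off


def make_child(genes, n, fitness):
    """Build a classifier from a gene list, repairing any lower > upper pair."""
    pairs = [sorted(p) for p in zip(genes[:n], genes[n:])]
    return Classifier([p[0] for p in pairs], [p[1] for p in pairs], fitness)


def ga_step(population, rng, p_cross=0.8, p_mut=0.05, step=0.1, exp_threshold=20):
    """One GA invocation: select, cross over, mutate, insert, delete two."""
    p1 = roulette_select(population, rng)
    p2 = roulette_select(population, rng)
    n = len(p1.lower)
    g1, g2 = p1.lower + p1.upper, p2.lower + p2.upper

    # Crossover at a predefined rate (one-point crossover is an assumption).
    if rng.random() < p_cross:
        point = rng.randint(1, 2 * n - 1)
        g1, g2 = g1[:point] + g2[point:], g2[:point] + g1[point:]

    # Assumption: offspring start with the mean fitness of their parents.
    child_fitness = 0.5 * (p1.fitness + p2.fitness)
    offspring = []
    for genes in (g1, g2):
        # Each bound is mutated independently at a predefined rate.
        genes = [g + rng.uniform(-step, step) if rng.random() < p_mut else g
                 for g in genes]
        offspring.append(make_child(genes, n, child_fitness))

    # Delete the two lowest-fitness classifiers among the experienced ones;
    # fall back to the whole population if too few are experienced (assumption).
    experienced = [c for c in population if c.experience >= exp_threshold]
    pool = experienced if len(experienced) >= 2 else population
    victim_ids = {id(c) for c in sorted(pool, key=lambda c: c.fitness)[:2]}
    survivors = [c for c in population if id(c) not in victim_ids]
    return survivors + offspring
```

Calling `ga_step(population, random.Random(0))` on a population of at least two classifiers returns a new population of the same size, with two fresh offspring replacing two experienced, low-fitness classifiers.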