Bayesian Physicist: a physicist who loves to explore things, especially machine learning and AI
<h2>How likely are Uber’s autonomous vehicles safer? (2018-03-25)</h2>
<p>Last week there was <a href="https://www.nytimes.com/2018/03/19/technology/uber-driverless-fatality.html">tragic news</a>
of a death caused by one of Uber’s self-driving cars. Based on this news and the fatality
report by the <a href="http://www.iihs.org/iihs/topics/t/general-statistics/fatalityfacts/state-by-state-overview">IIHS</a>,
<a href="http://faculty.washington.edu/dwhm/2018/03/19/are-ubers-autonomous-vehicles-safe/">some</a>
estimated the probability that the crash would have happened if Uber’s autonomous vehicles (AV)
were as safe as non-AV cars, using the negative exponential distribution. The answer
is around 3%, which could also be explained by bad luck.
Specifically, from the <a href="http://www.iihs.org/iihs/topics/t/general-statistics/fatalityfacts/state-by-state-overview">IIHS data</a>,
there was 1 fatal crash for every 93 million miles travelled by
non-AV cars (i.e. 34,439 fatal crashes in 3,220,667 million miles in the US).
The author also extrapolated from <a href="http://www.iihs.org/iihs/topics/t/general-statistics/fatalityfacts/state-by-state-overview">a report</a>
that, by the time the crash happened (i.e. last week), Uber’s AVs would have accumulated
3 million miles.</p>
<p>Using the same data, my question is slightly different: “how likely is it that Uber’s
AVs are safer than non-AV cars on average?” To answer this question, we can use
the <a href="https://en.wikipedia.org/wiki/Poisson_distribution">Poisson distribution</a>,</p>
<script type="math/tex; mode=display">\begin{equation}
\label{eq:poisson-distribution}
P(k | \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}
\end{equation}</script>
<p>where \(k\) is the number of occurrences and \(\lambda\) is the expected
number of occurrences. In 3 million miles travelled, the expected number of fatal
crashes for non-AV cars is \(\lambda_{nAV} \approx 3/93 \approx 0.0323\).
The Uber AV would be safer if \(\lambda_{AV} < \lambda_{nAV}\). Given the
information that there was \(k = 1\) fatal crash in 3 million miles for the Uber AV, we can
infer the expected number of occurrences with Bayesian inference,</p>
<script type="math/tex; mode=display">\begin{equation}
\label{eq:posterior-distribution}
P(\lambda_{AV} | k) = \frac{P(k | \lambda_{AV}) P(\lambda_{AV})}{\int_0^\infty P(k | \lambda_{AV}) P(\lambda_{AV})\ \mathrm{d}\lambda_{AV}}.
\end{equation}</script>
<p>The term \(P(k|\lambda_{AV})\) is the Poisson distribution given in the equation
\(\ref{eq:poisson-distribution}\). The prior distribution can take different
forms to capture our prior belief on how safe the AV is. As a general form, we can
take the prior distribution to be</p>
<script type="math/tex; mode=display">\begin{equation}
\label{eq:prior-lambda}
P(\lambda_{AV}) \propto \lambda_{AV}^p.
\end{equation}</script>
<p>Putting the equation \(\ref{eq:prior-lambda}\) to the equation
\(\ref{eq:posterior-distribution}\) with \(k = 1\) gives us</p>
<script type="math/tex; mode=display">\begin{equation}
\label{eq:posterior-distribution2}
P(\lambda_{AV} | k=1) = \frac{\lambda_{AV}^{1+p} e^{-\lambda_{AV}}}{\Gamma(p+2)}
\end{equation}</script>
<p>where \(\Gamma(z)\) is the <a href="https://en.wikipedia.org/wiki/Gamma_function">gamma function</a>.</p>
<p>Let’s consider 3 forms of prior distribution: (1) uniform, \(p=0\), (2) log-uniform,
\(p = -1\), and (3) the <a href="https://en.wikipedia.org/wiki/Jeffreys_prior">Jeffreys</a>
prior for the Poisson distribution, \(p=-0.5\).
The log-uniform and Jeffreys priors put a lot of belief on small \(\lambda_{AV}\),
i.e. they assume the AV tends to be safe. Here is a plot of all the prior distributions
mentioned.</p>
<div style="text-align:center"><img title="The prior distributions for uniform, log-uniform, and Jeffreys" src="/assets/av-prior-distributions.png" width="500" /></div>
<p>By substituting the values of \(p\) into the posterior distribution equation
\(\ref{eq:posterior-distribution2}\), we can plot the posterior distribution
of \(\lambda_{AV}\), as shown in the figure below.</p>
<div style="text-align:center"><img title="The posterior distributions for uniform, log-uniform, and Jeffreys" src="/assets/av-posterior-distributions.png" width="500" /></div>
<p>To calculate the probability that the Uber AV is safer, we can integrate
the area under the posterior curve for \(\lambda_{AV} < \lambda_{nAV}\) with
\(\lambda_{nAV}\approx 0.0323\), using equation \(\ref{eq:posterior-distribution2}\),
which gives us</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\mathcal{L}(\lambda_{AV} < \lambda_{nAV}) =
\int_0^{\lambda_{nAV}} P(\lambda_{AV} | k=1)\ \mathrm{d}\lambda_{AV} =
1 - \frac{\Gamma(p+2, \lambda_{nAV})}{\Gamma(p+2)}
\end{equation} %]]></script>
<p>where \(\Gamma(z,x)\) is the <a href="https://en.wikipedia.org/wiki/Incomplete_gamma_function">incomplete gamma function</a>.</p>
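This closed-form result can be sanity-checked numerically. A small sketch using SciPy, whose <code>gammainc</code> is the regularized lower incomplete gamma function, i.e. exactly \(1 - \Gamma(p+2, \lambda_{nAV})/\Gamma(p+2)\):

```python
from scipy.special import gammainc  # regularized lower incomplete gamma P(a, x)

lam_nav = 3 / 93  # expected fatal crashes of non-AV cars over 3 million miles

# P(lambda_AV < lambda_nAV | k=1) = 1 - Gamma(p+2, lam)/Gamma(p+2) = gammainc(p+2, lam)
priors = {"uniform": 0.0, "log-uniform": -1.0, "Jeffreys": -0.5}
probs = {name: gammainc(p + 2, lam_nav) for name, p in priors.items()}
for name, prob in probs.items():
    print(f"{name:12s}: {prob:.5f}")
```

The printed values reproduce the three probabilities quoted in the text below.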
<p>For the uniform (\(p=0\)), log-uniform (\(p=-1\)), and Jeffreys (\(p=-0.5\))
priors, the likelihood of the Uber AV being safer than non-AV cars is, respectively,
\(0.00051\), \(0.032\), and \(0.0043\). From these calculations, we can see that
even with a prior strongly favouring a safe Uber AV (i.e. the log-uniform prior),
there is still only a small chance, \(3.2\%\), that the Uber AV is safer than non-AV cars.
Personally, I prefer the Jeffreys prior as it is invariant under re-parameterisation,
so I believe there is only a minuscule chance, \(0.43\%\), that the Uber AV is safer, which means
non-AV cars are almost certainly safer than the Uber AV, for now. I believe
(and hope) Uber will improve and reduce its expected number of fatal crashes in
the future.</p>
<h2>Face interpolation with optimal transport (2018-03-11)</h2>
<p>I have been re-working on solving proton radiography and shadowgraphy lately.
The first time I did some work on this topic was two years ago (2016), when
there was a need to retrieve the magnetic field strength or refractive index
variation from an obtained proton radiogram or shadowgram (an inverse problem).
The techniques have been known to physicists for years (especially
shadowgraphy). However, people usually suggest using a Poisson equation solver to
solve the inverse problem. This only works for small deflections, not for the larger
deflections where most of the interesting cases are.</p>
<p>In 2016, some people in Oxford and Chicago and I realised that the shadowgraphy
and proton radiography inverse problem is actually the <strong>optimal transport</strong>
problem. The problem is stated more or less as: “<em>Given two density profiles
(source and target profiles), determine the best way to move the densities from
the source profile to form the target profile, so that the total distance
travelled by the densities is minimised.</em>” For simplicity, you can think of the
densities as piles of sand, as in the picture below.</p>
<p><img title="Source and target profiles" src="/assets/source-target-proton-radiography.png" width="350" /></p>
<p>The output of the problem stated above is, in this case, what I call
the <em>deflection potential</em>, \(\Phi\), which determines the displacement from
the source profile to the target profile as,
<script type="math/tex">\begin{equation}
\mathbf{r}_{target} = \mathbf{r}_{source} - \nabla \Phi
\end{equation}</script>
where \(\mathbf{r}\) is the position on the source or target profile.</p>
<p>There have been many algorithms for obtaining the <em>deflection potential</em> from
known source and target profiles. At the moment, my preference is the one from Sulman,
<em>et al.</em> (2011), which was also used in Bott, <em>et al.</em> (2017). With some simple
modifications of the algorithm, we can cut the run time down from approximately
4 minutes to approximately 2-4 seconds (see the implementation in my
<a href="https://github.com/mfkasim91/invert-shadowgraphy/tree/fast-inverse">GitHub repo</a>).</p>
<p>Beyond shadowgraphy and proton radiography, we can use the code for other
purposes. One of them is “<em>face interpolation</em>”. Given two face images,
we can regard one of them as the <em>source</em> profile and the other as the
<em>target</em> profile. Feeding them to the algorithm, we obtain the deflection
potential of the two faces, \(\Phi\). To interpolate between the faces, we can simply
multiply the deflection potential by some number and use the result as the
deflection potential, i.e. \(\Phi\rightarrow\eta\Phi\), where
\(\eta\in[0,1]\).</p>
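The repo demo is in MATLAB; purely to illustrate the \(\eta\)-scaling idea, here is a small Python sketch that warps a stand-in image by a synthetic (made-up) deflection potential. It samples backwards, as an approximation of the forward map \(\mathbf{r}_{target} = \mathbf{r}_{source} - \eta\nabla\Phi\):

```python
import numpy as np
from scipy.ndimage import map_coordinates

# hypothetical smooth deflection potential on a 64x64 grid (a Gaussian bump);
# the real potential would come from the optimal-transport solver
n = 64
y, x = np.mgrid[0:n, 0:n].astype(float)
phi = 20.0 * np.exp(-((x - n / 2) ** 2 + (y - n / 2) ** 2) / (2 * 8.0 ** 2))
gy, gx = np.gradient(phi)  # grad(phi), the displacement field

source = np.random.default_rng(0).random((n, n))  # stand-in for the source image

def interpolate(eta):
    """Warp the source by a fraction eta of the deflection potential.

    Backward sampling: each output pixel looks up its approximate
    pre-image under the forward map r_target = r_source - eta * grad(phi).
    """
    coords = np.array([y + eta * gy, x + eta * gx])
    return map_coordinates(source, coords, order=1, mode="nearest")

frames = [interpolate(eta) for eta in np.linspace(0.0, 1.0, 5)]
```

At \(\eta=0\) the warp is the identity, and at \(\eta=1\) it applies the full deflection.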
<p>Here are the two face images I got from the internet.</p>
<p><img title="The two faces to be interpolated" src="/assets/faces-interpolate.png" width="250" /></p>
<p>There is no special reason why I chose these faces. They were just picked
randomly from the internet. I don’t even know who they are.</p>
<p>As my implementation of the algorithm works best if the background is nonzero and
equal to the mean value of the interesting part, I changed the background to
gray. Feeding those two images into the algorithm, we obtain the
deflection potential. Multiplying the deflection potential by
\(\eta\in[0,1]\) and generating the corresponding faces, I obtained the animation below.</p>
<p><img title="Faces interpolation animation" src="/assets/faces-animation.gif" width="150" /></p>
<p>The face interpolation demo can be found in my
<a href="https://github.com/mfkasim91/invert-shadowgraphy/tree/fast-inverse">GitHub repo</a>
in the <em>demo_face_interpolation.m</em> file.</p>
<h2>Bayesian Inverse Problem (2017-04-21)</h2>
<p>Over the Easter holiday last week, I challenged myself to derive the posterior of a Bayesian inverse problem.
In easier terms, let’s say I have an unknown object enclosed inside a black box. Unfortunately, I can see the object from limited directions only (let’s say there are 3 small holes in the box).
From the observations, I want to know the full shape of the object. This is called an <em>inverse problem</em>.</p>
<p>There are a lot of tools for solving the inverse problem, including cases where the number of observations is limited (like our example case) with reasonable assumptions about the object (e.g. that the object is smooth).
However, these tools alone cannot answer some questions, like:</p>
<ul>
<li>How confident are you in your answer?</li>
<li>Which parts of the object are you most confident about, and which parts least confident?</li>
<li>If you are free to choose the hole locations on the box (but you can only make 3 small holes), where should you choose?</li>
</ul>
<p>Answering these questions needs an additional approach from Bayesian inference, hence the <em>Bayesian inverse problem</em>.</p>
<p>My first starting point was the Gaussian Process (GP). In a GP, it is assumed that interesting signals/functions are smooth to a certain extent. Let’s take a look at the figure below.
Most interesting signals/functions would have a shape similar to the red or green line. The function shown by the blue line is rarely interesting (unless you are interested in noise).</p>
<p><a href="/assets/gp-samples.png"><img src="/assets/gp-samples.png" /></a></p>
<p>Based on this assumption, we can say that the values of a function \(f(x)\) are correlated for nearby points. If \(f(0) = 1\), then \(f(0.01)\) is more likely to have a value close to 1.
The correlation of two function values at \(x\) and \(x’\) is expressed by a kernel, \(k(x, x’)\).
You can see common choices of kernel functions on the <a href="https://en.wikipedia.org/wiki/Gaussian_process#Usual_covariance_functions">Wikipedia page on Gaussian Processes</a>.
Statistically, the values of a function at several points, \(x_1, …, x_t \), form a multivariate random variable with distribution</p>
<script type="math/tex; mode=display">\begin{equation}
P(\mathbf{f}) = \mathcal{N}\left[0, \mathbf{K} \right] \propto \exp\left(-\tfrac{1}{2}\mathbf{f}^T\mathbf{K}^{-1}\mathbf{f}\right)
\end{equation}</script>
<p>where \(\mathbf{f}\) is just a vector of function values at several points,
\(\mathcal{N}\) is the <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution">Normal distribution</a> and \(\mathbf{K}\) is the covariance matrix with the element \(\mathbf{K}_{ij} = k(x_i, x_j) \).</p>
<p>Now let’s say we make an indirect observation, \(\mathbf{y} = \mathbf{Sf}\). We know our observation result, \(\mathbf{y}\), and our observation method, \(\mathbf{S}\), and we want to infer what \(\mathbf{f}\) looks like.
This is an inference problem where we want to know the posterior distribution of \(\mathbf{f}\) after observing \(\mathbf{y}\), i.e. \(P(\mathbf{f}|\mathbf{y})\). The posterior distribution can be derived from Bayes’ theorem,</p>
<script type="math/tex; mode=display">\begin{equation}
P(\mathbf{f}|\mathbf{y}) = \frac{P(\mathbf{f},\mathbf{y})}{P(\mathbf{y})}.
\end{equation}</script>
<p>In the equation above, we know neither \(P(\mathbf{f},\mathbf{y})\) nor \(P(\mathbf{y})\). Fortunately, the Normal distribution behaves very conveniently under linear transformations. If the vector \(\mathbf{f}\) is multiplied by a matrix
\(\mathbf{S}\), then the distribution becomes</p>
<script type="math/tex; mode=display">\begin{equation}
P(\mathbf{Sf}) = \mathcal{N}\left[0, \mathbf{SKS}^T \right].
\end{equation}</script>
<p>That gives us the probability over the observation, \(\mathbf{y}\). To calculate the joint probability distribution, we can multiply the vector \(\mathbf{f}\) with a transformation matrix, \(\mathbf{A} = [\mathbf{I}^T, \mathbf{S}^T]^T\),
i.e.</p>
<script type="math/tex; mode=display">\begin{equation}
\begin{pmatrix}
\mathbf{f}\\
\mathbf{y}
\end{pmatrix}
=
\mathbf{Af} =
\begin{pmatrix}
\mathbf{I}\\
\mathbf{S}
\end{pmatrix}
\mathbf{f}.
\end{equation}</script>
<p>Thus, the joint probability becomes,</p>
<script type="math/tex; mode=display">\begin{equation}
P(\mathbf{f}, \mathbf{y}) = \mathcal{N}\left[0, \mathbf{AKA}^T \right].
\end{equation}</script>
<p>Knowing \(P(\mathbf{f}, \mathbf{y})\) and \(P(\mathbf{y})\), we can now divide the two distributions to get the posterior probability, \(P(\mathbf{f} | \mathbf{y})\).
The next part is a bit messy, because we need to express the Normal distribution in exponential form, calculate the inverse of the matrix, and express the result back in the convenient \(\mathcal{N}\) form.
In inverting the matrix \(\mathbf{AKA}^T\), I used the block matrix inversion from <a href="https://en.wikipedia.org/wiki/Block_matrix#Block_matrix_inversion">Wikipedia</a> (the identity with the \((\mathbf{A} - \mathbf{BD}^{-1}\mathbf{C})^{-1}\) form in it).
Long story short, I ended up with the equation below,</p>
<script type="math/tex; mode=display">\begin{equation}
P(\mathbf{f}|\mathbf{y}) = \mathcal{N}\left[\mathbf{KS}^T(\mathbf{SKS}^T)^{-1}\mathbf{y}, \mathbf{K} - \mathbf{KS}^T(\mathbf{SKS}^T)^{-1}\mathbf{SK} \right].
\end{equation}</script>
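This posterior can be checked numerically on a toy 1-D problem. Everything in the sketch below (the RBF kernel, the 3 observation indices standing in for the "holes", the observed values, and the jitter term) is made up for illustration:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)  # RBF kernel matrix

# S observes the function at 3 grid points ("3 small holes in the box")
idx = [5, 25, 45]
S = np.zeros((3, x.size))
S[range(3), idx] = 1.0
y = np.array([1.0, -0.5, 0.3])   # made-up observed values

SKS = S @ K @ S.T + 1e-9 * np.eye(3)   # S K S^T, with a tiny jitter for stability
G = K @ S.T @ np.linalg.inv(SKS)       # K S^T (S K S^T)^{-1}
mean = G @ y                           # posterior mean of f
cov = K - G @ S @ K                    # posterior covariance of f

# sanity check: the posterior mean interpolates the observations, and the
# posterior variance (almost) vanishes at the observed points
```

The diagonal of <code>cov</code> answers the "where am I least confident" question directly: it is largest far from the observed points.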
<p>I was quite happy to end up with that equation.
With the posterior probability above, we can answer the questions posed before:</p>
<ul>
<li><em>How confident are you in your answer?</em> It can be calculated from the covariance in the equation above (\(\mathbf{K} - \mathbf{KS}^T(\mathbf{SKS}^T)^{-1}\mathbf{SK}\)).</li>
<li><em>Which parts of the object are you most confident about, and which parts least confident?</em> Same as above.</li>
<li><em>If you are free to choose the hole locations on the box (but you can only make 3 small holes), where should you choose?</em> This is the interesting part. We can choose how we observe the object (i.e. choose the observation matrix, \(\mathbf{S}\)) so that the
posterior covariance is minimised.</li>
</ul>
<h2>1010! Challenge Part 1: Implementation and Optimisation (2017-04-10)</h2>
<p>In October last year, I was challenged by my wife to beat the game <a href="http://1010ga.me/">1010!</a> with an AI. The score target she chose was 50,000.
In the game, there is a 10x10 board and batches of tiles to be placed on the board. One batch consists of 3 tiles.
The tiles can be placed anywhere on the board, but cannot be rotated. Once a row or a column is full, it disappears.
The game is similar to Tetris. An example play can be seen at this <a href="https://www.youtube.com/watch?v=x4tAyV16D_4">YouTube link</a>.</p>
<p><a href="/assets/snapshot-1010.png"><img src="/assets/snapshot-1010.png" width="400" /></a></p>
<p>This game has \(2^{100} \approx 1.27\times 10^{30}\) possible states, more than Othello or Backgammon.
The game-tree complexity is unbounded, as a player could, with enough luck, play the game infinitely long.
1010! is clearly more complex than <a href="http://2048game.com/">2048</a>, for which many people have made AIs to solve the game (including <a href="https://www.facebook.com/photo.php?fbid=10208247014366508&set=a.2105079480047.116338.1637320666&type=3&theater">myself</a>).</p>
<h3 id="implementation">Implementation</h3>
<p>I made a simple modification to the rules of the game to ease the implementation of the AI: the tiles come one by one instead of in a batch of three.
This makes the game harder to play, but easier to implement with tree search, as the branching factor is reduced.</p>
<p>My first implementation was in Python, using a binary representation of the board. As the board is 10x10, it needs 100 bits (zero if blank, one otherwise), which fit in a 128-bit integer.
The upper-left square is represented by the least significant bit (bit 0), the square to its right is bit 1, the leftmost square of the 2nd row is bit 10, and so on.
Every tile is encoded using the same bit representation, e.g. the tile with 5 squares in a row is represented by <code class="highlighter-rouge">0b11111</code>, and 2 squares in a column by <code class="highlighter-rouge">0b10000000001</code>.
When a tile is placed at some position on the board, the tile’s bits are shifted to the left by the corresponding amount.
Checking whether a tile can be placed at a certain position can be done with an <code class="highlighter-rouge">AND</code> operation, putting the tile on the board with an <code class="highlighter-rouge">OR</code> operation, and removing full rows and columns with an <code class="highlighter-rouge">XOR</code> operation.</p>
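These bit tricks can be sketched in Python, whose integers are arbitrary-precision, so a single integer holds the 100-bit board. The layout follows the description above; note a complete implementation must also reject placements whose bits would spill across the right edge into the next row:

```python
# 10x10 board packed into a single Python integer:
# bit (10*row + col) is 1 when that square is occupied
N = 10
FULL_ROW = (1 << N) - 1  # mask of one full row

def shifted(tile, row, col):
    """Move a tile bitmask (defined at the top-left corner) to (row, col)."""
    return tile << (N * row + col)

def can_place(board, tile, row, col):
    return board & shifted(tile, row, col) == 0   # AND: overlap test

def place(board, tile, row, col):
    return board | shifted(tile, row, col)        # OR: drop the tile

def clear_full_rows(board):
    for r in range(N):
        mask = FULL_ROW << (N * r)
        if board & mask == mask:
            board ^= mask                          # XOR: remove the full row
    return board

bar5 = 0b11111                           # 5 squares in a row
board = place(0, bar5, 0, 0)
assert not can_place(board, bar5, 0, 0)  # overlaps what we just placed
board = place(board, bar5, 0, 5)         # row 0 is now complete...
board = clear_full_rows(board)           # ...and disappears
```

Clearing full columns works the same way with a column mask (bits 0, 10, 20, ... shifted by the column index).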
<p>To check the performance, I ran some random plays with these steps: (1) list all valid moves, (2) choose a valid move randomly, (3) repeat from step (1) until no valid move is available.
With Python, the program runs 10,000 random plays (about 180,000 steps) in about 5.2 seconds.
This is not good enough, as some game-playing AIs can evaluate more than a million steps per second on a single core.</p>
<p>The next implementation is in C++ compiled with the <code class="highlighter-rouge">-O3</code> option.
In C++, I first implemented the game with a standard array: the board is represented by an array of 100 boolean elements.
This involves a lot of loops: checking whether squares are blank, putting a tile on the board, etc.
At first glance, this seems less efficient than the bit operations.
However, with the array implementation in C++, the program needs only 1.6 seconds for 100,000 random plays (about 1,800,000 steps).
Thinking it could be optimised further, I implemented the bit operations in C++ as well. Fortunately, my version of G++ has the <code class="highlighter-rouge">__int128</code> type for 128-bit integers.
In short, the bit-operation implementation in C++ outperforms all my other implementations: 100,000 random plays (about 1,800,000 steps) in 0.75 seconds.</p>
<table>
<thead>
<tr>
<th>Language </th>
<th>Implementation </th>
<th style="text-align: right">Plays </th>
<th style="text-align: right">Time (s) </th>
<th style="text-align: right">Time / play (µs) </th>
<th style="text-align: right">Speed up</th>
</tr>
</thead>
<tbody>
<tr>
<td>Python</td>
<td>Bit-operation</td>
<td style="text-align: right">10,000</td>
<td style="text-align: right">5.2</td>
<td style="text-align: right">520</td>
<td style="text-align: right">1</td>
</tr>
<tr>
<td>C++</td>
<td>Array</td>
<td style="text-align: right">100,000</td>
<td style="text-align: right">1.6</td>
<td style="text-align: right">16</td>
<td style="text-align: right">32.5</td>
</tr>
<tr>
<td>C++</td>
<td>Bit-operation</td>
<td style="text-align: right">100,000</td>
<td style="text-align: right">0.75</td>
<td style="text-align: right">7.5</td>
<td style="text-align: right"><strong>69</strong></td>
</tr>
</tbody>
</table>
<p>I was a bit surprised to see how slow a Python script can be. With the same implementation, i.e. using bit operations, my C++ version runs almost 70 times faster than my Python version.
This is also in line with the test performed <a href="http://blog.dhananjaynene.com/2008/07/performance-comparison-c-java-python-ruby-jython-jruby-groovy/">here</a>.
In conclusion: use C++, it is fast. The code can be found <a href="https://github.com/mfkasim91/mfkasim91.github.io/tree/master/assets/codes/1010">here</a>.</p>
<h2>My solution at ISCSO 2016 and meta optimisation (2016-12-15)</h2>
<p>Two years ago, I participated in <a href="http://www.brightoptimizer.com/">ISCSO</a> (International Student Competition in Structural Optimisation),
a competition on structural optimisation.
However, I didn’t submit my solution that time because I didn’t reach the target set by the organiser.
After reading some literature about optimisation, especially Bayesian Optimisation, I participated again in ISCSO this year and
<a href="http://www.brightoptimizer.com/winner-of-iscso-2016/">won</a>.
This post is about the approach I used in the competition.</p>
<h3 id="problem">Problem</h3>
<p>Even though the competition is about structural optimisation, no prior knowledge of structures or civil engineering is required.
The organiser provides a MATLAB function that does all the civil engineering calculations.
The participants just need to find the right parameters and read the output.</p>
<p>This year’s problem is about a cantilever. The topology of the cantilever is fixed, but the \(z\) positions of some points and the sizes of every truss member are to be optimised.
The topology is shown in the picture below (credit: <a href="http://www.brightoptimizer.com/">http://www.brightoptimizer.com/</a>).</p>
<p><a href="/assets/iscso-problem.png"><img src="/assets/iscso-problem.png" width="600" /></a></p>
<p>The objective is to minimise the weight of the total structure without violating two constraints (i.e. maximum displacement and maximum stress).
Fortunately, the organiser has provided the MATLAB function below.</p>
<div class="language-matlab highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">Weight</span><span class="p">,</span> <span class="n">Const_Vio_Stress</span><span class="p">,</span> <span class="n">Const_Vio_Disp</span><span class="p">]</span> <span class="o">=</span> <span class="n">ISCSO_2016</span><span class="p">(</span><span class="n">Sections</span><span class="p">,</span> <span class="n">Coordinates</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</code></pre></div></div>
<p><code class="highlighter-rouge">Sections</code> is a \(1 \times 117\) vector, with each element an integer from 1 to 37.
<code class="highlighter-rouge">Coordinates</code> is a \(1 \times 7\) integer vector with values between 1000 and 3500.
<code class="highlighter-rouge">Const_Vio_Disp</code> and <code class="highlighter-rouge">Const_Vio_Stress</code> are two constraint violations that have to be zero in the final design.
<code class="highlighter-rouge">Weight</code> is the weight of the structure. This is the output to be minimised in the optimisation.</p>
<h3 id="cma-es">CMA-ES</h3>
<p>There are 124 variables in total to be optimised, and they are all integers.
This means the optimisation algorithm needs to search for the optimum design in a 124-dimensional search space.
A lot of optimisation techniques are available, but for a hundred dimensions it seems that a lot of people use CMA-ES.
CMA-ES is also used within Bayesian Optimisation, to optimise the acquisition function built on the surrogate model.</p>
<p>In CMA-ES, the user specifies an initial position as the centre point in the search space and a standard deviation for each dimension.
The algorithm then generates a number of points around the centre point from a normal distribution with the corresponding standard deviations.
All generated points are evaluated, and the best half of the population is selected.
The centre point is then updated based on the mean position of the best half of the population and the previous centre point.
The algorithm also updates the covariance matrix, based on the covariance of the best half of the population, the previous covariance matrix, and the centre point.
With the new centre point and covariance matrix, the algorithm generates a new population around the new centre point
from a normal distribution with the corresponding covariance matrix.</p>
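The sample-select-update loop described above can be sketched in a few lines. This is not CMA-ES proper (no evolution paths, step-size control, or weighted recombination), just a crude illustration on a toy quadratic; the population size, blend weights, and initial covariance are arbitrary:

```python
import numpy as np

def f(z):
    """Toy objective to minimise: the sphere function."""
    return float(np.sum(z * z))

rng = np.random.default_rng(0)
pop = 20
m = np.array([3.0, 3.0])   # initial centre point
C = 9.0 * np.eye(2)        # initial covariance (standard deviation 3 per dimension)

for _ in range(80):
    # 1. sample a population around the centre from N(m, C)
    P = rng.multivariate_normal(m, C, size=pop)
    # 2. keep the best half of the population
    best = P[np.argsort([f(p) for p in P])[: pop // 2]]
    # 3. move the centre to the mean of the best half
    m = best.mean(axis=0)
    # 4. blend the covariance of the best half into the previous covariance
    dev = best - m
    C = 0.5 * C + 0.5 * (dev.T @ dev) / len(best)
```

The covariance shrinks and reorients as the population concentrates around the optimum, which is the qualitative behaviour the Wikipedia illustration below shows.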
<p>Here is a <a href="https://arxiv.org/pdf/1604.00772.pdf">tutorial</a> for CMA-ES written by N. Hansen, who came up with the idea of CMA-ES.
The MATLAB source code can also be found <a href="https://www.lri.fr/~hansen/cmaes_inmatlab.html">here</a>.
Below is a nice illustration of CMA-ES from <a href="https://en.wikipedia.org/wiki/CMA-ES">Wikipedia</a>.</p>
<p><a href="/assets/cma-es.png"><img src="/assets/cma-es.png" width="600" /></a></p>
<h3 id="loss-function">Loss function</h3>
<p>Back to ISCSO. The objective is to find the minimum weight without violating the two constraints.
My first step was to set up a loss function as</p>
<script type="math/tex; mode=display">L = w + \lambda_1 c_1 + \lambda_2 c_2 + \lambda_3 (c_1 + c_2 > 0),</script>
<p>where \(w\) is the weight, \(c_1\) and \(c_2\) are the first and the second constraint violations, respectively, and
\(\lambda_i\) are the coefficients of the penalty terms.
The fourth term on the right-hand side of the equation is necessary to make sure that both constraints are exactly zero.
Without this term, \(c_1\) and \(c_2\) could settle at very small nonzero values to reduce the penalty while keeping the weight low.</p>
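Written out, the loss is a one-liner. The function below is hypothetical glue around the outputs of the competition's evaluator, not part of the organiser's code:

```python
def loss(weight, c1, c2, lam1=1000.0, lam2=1000.0, lam3=1000.0):
    """Penalised objective: structure weight plus constraint penalties.

    The indicator term lam3 * (c1 + c2 > 0) charges a fixed cost for ANY
    nonzero violation, so tiny violations cannot buy a lower weight.
    """
    return weight + lam1 * c1 + lam2 * c2 + lam3 * float(c1 + c2 > 0)

print(loss(3000.0, 0.0, 0.0))    # feasible design: just the weight, 3000.0
print(loss(2900.0, 0.001, 0.0))  # tiny violation still pays the fixed cost: 3901.0
```

Without the indicator term, the second design would score 2901 and wrongly beat the feasible one.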
<h3 id="meta-optimisation">Meta-optimisation</h3>
<p>Setting \(\lambda_1 = \lambda_2 = \lambda_3 = 1000\) arbitrarily, I tried to minimise the loss function using CMA-ES.
CMA-ES has several tunable parameters, but the author <a href="https://arxiv.org/pdf/1604.00772.pdf">suggested</a> default values for every tunable parameter.
Using CMA-ES with these default values, I got weights around 3200-3400.
Running it longer did not improve things much.
However, there are still 11 CMA-ES parameters left to be tuned.</p>
<p>I was thinking of using Bayesian Optimisation, but Bayesian Optimisation works great mainly for moderate dimensions (fewer than about 10).
The other option is to use another CMA-ES to tune the first CMA-ES.
This process is called <strong>meta-optimisation</strong>: optimising the optimiser.</p>
<p>The result was surprising. The tuned optimiser now gets weights around 2800-2900 consistently.
Before optimising the optimiser, it hardly ever went below 3100; now it consistently reaches 2800-2900.
To get the best result, I ran it several times.
That gave me the best result shown on the <a href="http://www.brightoptimizer.com/winner-of-iscso-2016/">website</a> (i.e. a weight of 2816).</p>
<h2>How many papers were submitted to BayesOpt16? (2016-12-10)</h2>
<p>As I mentioned in my <a href="http://sp.mfkasim.com/2016/11/15/my-paper-at-nips-workshop-on-bayesian-optimization/">recent post</a>,
my paper was accepted for poster presentation at the Bayesian Optimisation Workshop at NIPS 2016.
The list of accepted papers has also appeared on the workshop’s <a href="https://bayesopt.github.io/accepted.html">website</a>.
There are 26 accepted papers in total.
At submission, my paper was given ID 12, and it appears 9th on the list.</p>
<p><a href="/assets/accepted-papers.png"><img title="9 accepted papers" src="/assets/accepted-papers.png" /></a></p>
<p>Since the list of accepted papers is ordered neither by title nor by author, I assume the list is ordered by
submission ID.
I also assume that the submission IDs were given in order of submission.
The question is: “<em>given the above information, how many papers were submitted to BayesOpt 2016?</em>”</p>
<p>Let’s denote by \(d=12\) that mine was the \(d\)-th paper submitted, by \(r=9\) that it is the \(r\)-th paper accepted, by \(a=26\) the total number of accepted papers,
and by \(s\) the total number of papers submitted. It is clear that \(s \geq a\), as it is impossible to have more accepted papers than submitted papers.
Given this information, we want to calculate the probability distribution of \(s\),</p>
<script type="math/tex; mode=display">\begin{equation}
P(s|a,d,r) = \frac{P(a|d,r,s)P(s)}{\sum_{s_i=a}^{\infty} P(a|d,r,s_i)P(s_i)}.
\label{eq:bayes}
\end{equation}</script>
<p>In order to calculate \(P(a|d,r,s)\) in the equation above, we introduce a new variable, \(\eta\), the acceptance rate in the large-sample limit.
Given the acceptance rate, \(\eta\), and the total number of submissions, \(s\), we can calculate the probability of having \(a\) accepted papers using the binomial distribution,</p>
<script type="math/tex; mode=display">\begin{equation}
\label{eq:a-s-eta}
P(a|s,\eta) = \left(\begin{array}{c} s \\ a \end{array}\right) \eta^a (1-\eta)^{s-a}.
\end{equation}</script>
<p>To get the probability distribution of \(\eta\), we can use the beta distribution with the information that \(r\) papers were accepted out of the first \(d\) submissions.
This takes the same form as the result in my <a href="http://sp.mfkasim.com/2016/10/21/what-is-the-chance-ahok-wins-the-election-in-one-round/">previous post</a>,</p>
<script type="math/tex; mode=display">\begin{equation}
\label{eq:eta-d-r}
P(\eta | d,r)\ \mathrm{d}\eta = \frac{\eta^r (1-\eta)^{d-r}}{B(r+1,d-r+1)}\ \mathrm{d}\eta,
\end{equation}</script>
<p>where \(B(\alpha, \beta)\) is the <a href="https://en.wikipedia.org/wiki/Beta_function">beta function</a>.</p>
<p>Now we can use equation \eqref{eq:a-s-eta} and \eqref{eq:eta-d-r} to obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\label{eq:a-d-r-s}
P(a|d,r,s) & = \int_0^1 P(a|s,\eta) P(\eta|d,r)\ \mathrm{d}\eta \nonumber \\
& = \left(\begin{array}{c} s \\ a \end{array}\right) \frac{1}{B(r+1,d-r+1)} \int_0^1 \eta^{a+r} (1-\eta)^{s-a+d-r}\ \mathrm{d}\eta \nonumber \\
& = \left(\begin{array}{c} s \\ a \end{array}\right) \frac{B(a+r+1, s-a+d-r+1)}{B(r+1, d-r+1)}.
\end{align} %]]></script>
<p>Having obtained \(P(a|d,r,s)\), we can use the Bayes theorem in equation \eqref{eq:bayes} to estimate the number of submissions.
Assume that the prior probability of the number of submissions, \(s\), is uniform from \(a\) to \(\infty\);
this is also the prior assumption used in the <a href="https://en.wikipedia.org/wiki/German_tank_problem">German tank problem</a>.
As this prior value is a very small constant, we can denote it as \(\Omega\).
Thus,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\label{eq:final-results}
P(s|a,d,r) & = \left(\begin{array}{c} s \\ a \end{array}\right) \frac{B(a+r+1, s-a+d-r+1)}{B(r+1, d-r+1)} \Omega
\left[\sum_{s_i=a}^{\infty} \left(\begin{array}{c} s_i \\ a \end{array}\right) \frac{B(a+r+1, s_i-a+d-r+1)}{B(r+1, d-r+1)} \Omega \right]^{-1} \nonumber \\
& = \left(\begin{array}{c} s \\ a \end{array}\right) B(a+r+1, s-a+d-r+1)
\left[\sum_{s_i=a}^{\infty} \left(\begin{array}{c} s_i \\ a \end{array}\right) B(a+r+1, s_i-a+d-r+1) \right]^{-1}.
\end{align} %]]></script>
<p>With the equation above, it is now possible to calculate and plot the probability distribution.
The probability distribution of the number of submissions is shown below.</p>
<p><a href="/assets/submissions-probability.png"><img src="/assets/submissions-probability.png" /></a></p>
<p>From the last equation, we can calculate the most probable number of submissions and the expected number of submissions, as well as its standard deviation.
The most probable number of submissions is \(31\), while the expected number of submissions is \(36.5 \pm 9.4\).</p>
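The final equation can also be evaluated numerically. The sketch below (standard-library Python only, not the code used for this post) truncates the infinite sum at \(s=500\); since the summand decays roughly like \(s^{-10}\) in the tail, the truncation error is negligible. As an unchecked reimplementation it may not reproduce the quoted figures to the digit.

```python
import math

d, r, a = 12, 9, 26  # my submission ID, my position on the accepted list, total accepted papers

def log_post_unnorm(s):
    # log of C(s, a) * B(a + r + 1, s - a + d - r + 1), the unnormalised posterior
    log_choose = math.lgamma(s + 1) - math.lgamma(a + 1) - math.lgamma(s - a + 1)
    log_beta = (math.lgamma(a + r + 1) + math.lgamma(s - a + d - r + 1)
                - math.lgamma(s + d + 2))
    return log_choose + log_beta

s_values = list(range(a, 500))      # truncate the infinite sum; the tail decays fast
w = [math.exp(log_post_unnorm(s)) for s in s_values]
z = sum(w)
post = [wi / z for wi in w]         # normalised posterior P(s | a, d, r)

mode = max(s_values, key=lambda s: post[s - a])
mean = sum(s * p for s, p in zip(s_values, post))
std = sum((s - mean) ** 2 * p for s, p in zip(s_values, post)) ** 0.5
```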
<p><strong>UPDATE</strong></p>
<p>In the BayesOpt16 workshop, the organiser mentioned that <strong>31</strong> papers were submitted to the workshop.
The prediction was correct!</p>
<p><a href="/assets/submitted-papers.jpg"><img src="/assets/submitted-papers.jpg" width="400" /></a></p>Stippling pictures with Lloyd’s algorithm2016-12-06T19:03:00+00:002016-12-06T19:03:00+00:00/2016/12/06/stippling-pictures-with-lloyds-algorithm<p>(The MATLAB code can be found <a href="https://github.com/mfkasim91/stippling-lloyds">here</a>)</p>
<p>When I was doing my project about <a href="https://arxiv.org/pdf/1607.04179.pdf">inverting proton radiograms and shadowgrams</a>, I discovered that the technique I employed can be used to make stippled pictures.
The stippling technique I explain below comes from <a href="https://www.cs.ubc.ca/labs/imager/tr/2002/secord2002b/secord.2002b.pdf">this paper</a>.</p>
<p><strong>What is stippling?</strong></p>
<p>Stippling is a technique to produce a picture using small dots. The picture shown at the top of this post is one example of stippling.
From afar, it looks like a common grey picture, but if you look at it closely, it is actually composed of small dots. Look at the picture below (click for the full-size picture).</p>
<p><a href="/assets/jokowi-stipple-bw.png"><img title="Jokowi" src="/assets/jokowi-stipple-bw.png" width="800" /></a></p>
<p>The picture above is a stippled picture of President Jokowi generated using the weighted <a href="https://en.wikipedia.org/wiki/Lloyd's_algorithm">Lloyd’s algorithm</a>.
The source image is from the <a href="http://www.kemendagri.go.id/">Kemendagri</a> website.
Before explaining Lloyd’s algorithm, it is better to introduce the Voronoi diagram first.</p>
<p><strong>Voronoi diagram</strong></p>
<p>Consider a 2D plane with several dots on it. Now we are going to determine, for <strong>all</strong> positions on the plane, which dot is the closest.
For example, in the picture below, point A is closest to dot 2 compared to the other dots. So in this case, point A belongs to dot 2.</p>
<p><a href="/assets/voronoi-example-01.png"><img title="Example" src="/assets/example-voronoi-01.png" width="300" /></a></p>
<p>In the construction of a Voronoi diagram, we don’t consider only one point; we consider all (continuous) points on the plane.
After determining which dot is closest to each point on the plane, we can draw borders between the dots to divide the plane into several regions, known as <em>Voronoi cells</em>.
All points inside one cell belong to the dot in the same cell.
The result is a Voronoi diagram.</p>
<p><a href="/assets/voronoi-example-02.png"><img title="Example" src="/assets/example-voronoi-02.png" width="300" /></a></p>
<p>There are libraries for various programming languages to construct Voronoi diagrams, so we don’t need to implement the algorithm ourselves.</p>
<p><strong>Lloyd’s algorithm</strong></p>
<p>For a bounded plane, Lloyd’s algorithm divides the plane into several regions of approximately the same size.
The algorithm is simple:</p>
<ol>
<li>We start by deploying several dots on the plane randomly.</li>
<li>Construct the Voronoi diagram inside the bounded plane.</li>
<li>Calculate the centroid of each cell.</li>
<li>Move each dot to its cell’s centroid.</li>
<li>Repeat from step 2 until a stopping condition is fulfilled.</li>
</ol>
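The steps above can be sketched with a discrete approximation in which the continuous plane is replaced by a grid of sample points; each dot’s “cell” is then the set of grid points nearest to it. This sidesteps the exact polygon-based Voronoi construction, so it is a simplification for illustration only (the function name is mine):

```python
import numpy as np

def lloyd_iterations(dots, n_iter=10, grid_n=50):
    """Approximate Lloyd's algorithm on the unit square, using a grid of
    sample points instead of exact Voronoi polygons."""
    g = (np.arange(grid_n) + 0.5) / grid_n
    points = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)  # (P, 2) sample points
    for _ in range(n_iter):
        # steps 2-3: assign each sample point to its nearest dot (discrete Voronoi cells)
        d2 = ((points[:, None, :] - dots[None, :, :]) ** 2).sum(-1)
        owner = d2.argmin(axis=1)
        # step 4: move each dot to the centroid of its cell (empty cells stay put)
        for k in range(len(dots)):
            cell = points[owner == k]
            if len(cell):
                dots[k] = cell.mean(axis=0)
    return dots
```

Starting from random dots, a few iterations already spread the dots roughly evenly; each iteration cannot increase the total squared distance from the sample points to their nearest dots.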
<p>Here is a nice illustration of Lloyd’s algorithm from <a href="https://en.wikipedia.org/wiki/Lloyd's_algorithm">Wikipedia</a>.</p>
<p><a href="/assets/lloyds-algorithm.png"><img title="Lloyd's algorithm" src="/assets/lloyds-algorithm.png" /></a></p>
<p>A Voronoi cell is always a convex polygon, so the formula for computing its centroid is quite straightforward. It is given by</p>
<script type="math/tex; mode=display">C_x = \frac{1}{6A} \sum_{i=0}^{n-1} \left(x_i + x_{i+1}\right)\left(x_i y_{i+1} - x_{i+1} y_i\right)</script>
<script type="math/tex; mode=display">C_y = \frac{1}{6A} \sum_{i=0}^{n-1} \left(y_i + y_{i+1}\right)\left(x_i y_{i+1} - x_{i+1} y_i\right)</script>
<p>where the area, \( A \) is</p>
<script type="math/tex; mode=display">A = \frac{1}{2} \sum_{i=0}^{n-1} \left(x_i y_{i+1} - x_{i+1} y_i\right).</script>
<p>The coordinates, \( (x_i, y_i) \), appear in counter-clockwise order and \( (x_n, y_n) = (x_0, y_0) \).
Multiplication of the centroid position and the area is called as the <em>first moment</em> of the area.</p>
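These formulas map directly onto code; a minimal Python sketch (the function name is my own):

```python
def polygon_centroid(xs, ys):
    """Centroid (Cx, Cy) and signed area A of a simple polygon.
    Vertices are given in counter-clockwise order; the wrap-around index
    reproduces the convention (x_n, y_n) = (x_0, y_0)."""
    A = Cx = Cy = 0.0
    n = len(xs)
    for i in range(n):
        j = (i + 1) % n
        cross = xs[i] * ys[j] - xs[j] * ys[i]
        A += cross
        Cx += (xs[i] + xs[j]) * cross
        Cy += (ys[i] + ys[j]) * cross
    A *= 0.5
    return Cx / (6 * A), Cy / (6 * A), A
```

For the unit square with vertices (0,0), (1,0), (1,1), (0,1), this returns the centroid (0.5, 0.5) and area 1.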
<p><strong>Weighted Lloyd’s algorithm</strong></p>
<p>So far, we have only considered a uniform plane, without any particular weights at particular positions.
What if the plane is not uniformly weighted, so that the dots prefer to move towards areas with higher weight?
In that case, the steps of Lloyd’s algorithm given above do not change.
The only thing that changes is how we calculate the centroid of each cell.</p>
<p>If the weight is given per pixel (i.e. uniform weight within a given square region), each cell is clipped by every pixel it touches.
Then we calculate the first moment and area of each clipped region.
To get the total first moment and area of the cell, we sum the first moments and areas from all clipped regions, respectively.
The centroid position is then simply the total first moment divided by the total area.</p>
<p>Let the superscript \(\ ^{(j)}\) denote a property of the \(j\)-th clipped region within a cell.
The weight of the clipped region is given by \(w^{(j)}\).
The first moment and area of the region are given by</p>
<script type="math/tex; mode=display">S_x^{(j)} = \frac{1}{6} w^{(j)} \sum_{i=0}^{n-1} \left(x_i^{(j)} + x_{i+1}^{(j)}\right)\left(x_i^{(j)} y_{i+1}^{(j)} - x_{i+1}^{(j)} y_i^{(j)}\right)</script>
<script type="math/tex; mode=display">S_y^{(j)} = \frac{1}{6} w^{(j)} \sum_{i=0}^{n-1} \left(y_i^{(j)} + y_{i+1}^{(j)}\right)\left(x_i^{(j)} y_{i+1}^{(j)} - x_{i+1}^{(j)} y_i^{(j)}\right)</script>
<script type="math/tex; mode=display">A^{(j)} = \frac{1}{2} w^{(j)} \sum_{i=0}^{n-1} \left(x_i^{(j)} y_{i+1}^{(j)} - x_{i+1}^{(j)} y_i^{(j)}\right).</script>
<p>Thus, the centroid of the cell is given by</p>
<script type="math/tex; mode=display">C_x = \frac{1}{A} \sum_{j=0}^{m-1} S_x^{(j)}</script>
<script type="math/tex; mode=display">C_y = \frac{1}{A} \sum_{j=0}^{m-1} S_y^{(j)}</script>
<p>where the total area is</p>
<script type="math/tex; mode=display">A = \sum_{j=0}^{m-1} A^{(j)}.</script>
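The sums above can be written directly; a sketch where each clipped region is supplied as its weight together with counter-clockwise vertex lists (the function name and input format are mine):

```python
def weighted_centroid(regions):
    """Centroid of a cell from its clipped sub-regions.
    regions: list of (w, xs, ys), where w is the region's weight and
    xs, ys are its vertices in counter-clockwise order."""
    Sx = Sy = A = 0.0
    for w, xs, ys in regions:
        n = len(xs)
        a2 = sx = sy = 0.0
        for i in range(n):
            j = (i + 1) % n
            cross = xs[i] * ys[j] - xs[j] * ys[i]
            a2 += cross
            sx += (xs[i] + xs[j]) * cross
            sy += (ys[i] + ys[j]) * cross
        A += w * a2 / 2.0    # weighted area A^(j)
        Sx += w * sx / 6.0   # weighted first moment S_x^(j)
        Sy += w * sy / 6.0   # weighted first moment S_y^(j)
    return Sx / A, Sy / A
```

As a check, two unit squares side by side with weights 1 and 3 give a centroid at x = 1.25, pulled towards the heavier square.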
<p>To clip the cell with each pixel efficiently, I use the <a href="https://en.wikipedia.org/wiki/Sutherland%E2%80%93Hodgman_algorithm">Sutherland-Hodgman algorithm</a>.</p>
<p><strong>Stippling with weighted Lloyd’s algorithm</strong></p>
<p>From here, things become relatively straightforward.
The first step in stippling an image is to convert it to a grayscale image.
One problem with the grayscale image is that darker regions have lower pixel values/weights and brighter regions have higher weights.
On the other hand, we want more dots in darker regions and fewer dots in brighter regions.
The solution is simply to take the complement of the image.</p>
<p><a href="/assets/jokowi-process.jpg"><img title="Pre-processing" src="/assets/jokowi-process.jpg" width="700" /></a></p>
<p>Once the complement is taken, we can deploy random dots on the image and perform the weighted Lloyd’s algorithm.
To make it converge faster, I used a simple rejection method when deploying the random dots:
dots deployed in higher-weight regions get more chance to be accepted,
and if a dot is rejected, it must be deployed somewhere else.
For most cases, repeating the iterations of the weighted Lloyd’s algorithm 50 times should be enough.</p>
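The rejection step for deploying the initial dots can be sketched as follows (a Python sketch, assuming the weight image is already the complement, so darker-in-the-original means higher weight; the function name is mine):

```python
import numpy as np

def deploy_dots(weight, n_dots, seed=0):
    """Deploy n_dots by rejection sampling: a uniformly drawn pixel is
    accepted with probability weight / max(weight), so heavier regions
    receive proportionally more dots."""
    rng = np.random.default_rng(seed)
    h, w = weight.shape
    wmax = weight.max()
    dots = []
    while len(dots) < n_dots:
        row, col = rng.integers(h), rng.integers(w)
        if rng.random() < weight[row, col] / wmax:
            dots.append((col + 0.5, row + 0.5))  # accepted: dot at the pixel centre
    return np.array(dots)
```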
<p>For those who want to try stippling their images, I have uploaded my MATLAB code to <a href="https://github.com/mfkasim91/stippling-lloyds">GitHub</a>.
Feel free to share your thoughts below.</p>Estimating how many people were in 212 prayer rally2016-12-02T22:14:12+00:002016-12-02T22:14:12+00:00/2016/12/02/predicting-how-many-people-in-212<p>Today (2nd December 2016) there was a prayer rally to protest against Jakarta’s Chinese and Christian governor, Ahok, who was accused of insulting Islam.
There have been many speculations about how many people participated in the Friday prayer rally. Some say 2 million people, some even say 7 million.
I am intrigued to estimate how many people participated in the rally based on photos available on the internet and simple geometry.</p>
<p>First, let us look at the satellite pictures of the venue where the rally took place, around the National Monument (Monas) in Jakarta.
This is a map of Monas compiled from several images from Google Maps.</p>
<p><img title="Monas map" src="/assets/monas.png" width="250" /></p>
<p>We are going to mark the areas on the map above that were filled with people, based on information from pictures on the internet.
Here are the first images.</p>
<p><img title="212 Rally - 01a" src="/assets/212-pics-01a.png" width="250" />
<img title="212 Rally - 01b" src="/assets/212-pics-01b.png" width="250" />
<img title="212 Rally - 01c" src="/assets/212-pics-01c.png" width="250" /></p>
<p>From these pictures, we can mark down some areas around the round fountain.</p>
<p><img title="Monas markdown 01" src="/assets/monas-markdown-01.png" width="250" /></p>
<p>And now here are some pictures around Monas.</p>
<p><img title="212 Rally - 02a" src="/assets/212-pics-02a.png" width="250" />
<img title="212 Rally - 02b" src="/assets/212-pics-02b.png" width="250" /></p>
<p>To make sense of the direction, it is reasonable to use the <a href="https://en.wikipedia.org/wiki/Istiqlal_Mosque,_Jakarta">Istiqlal Mosque</a> as a reference.
On the map, it is shown as the big white square to the northeast of Monas. With that, we can mark some areas that were occupied by the people.
From these pictures, it seems that the southeast part of Monas was relatively less dense. So let’s mark the area around Monas.</p>
<p><img title="Monas markdown 02" src="/assets/monas-markdown-02.png" width="250" /></p>
<p>Now, from the mask, we can estimate the area occupied by the people.
To estimate the area, we count how many pixels are covered by the mask below and later normalise it using the map scale.
Click <a href="/assets/monas-mask.png">here</a> for the full-size image.</p>
<p><img title="Monas mask" src="/assets/monas-mask.png" width="250" /></p>
<p>The code to calculate the mask area in pixels is as below.</p>
<figure class="highlight"><pre><code class="language-matlab" data-lang="matlab"><span class="n">img</span> <span class="o">=</span> <span class="nb">imread</span><span class="p">(</span><span class="s1">'monas-mask.png'</span><span class="p">);</span> <span class="c1">% read the image</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">img</span><span class="p">(:,:,</span><span class="mi">3</span><span class="p">);</span> <span class="c1">% read only the blue channel</span>
<span class="n">npixels</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">img</span><span class="p">(:)</span> <span class="o"><</span> <span class="nb">max</span><span class="p">(</span><span class="n">img</span><span class="p">(:)))</span> <span class="c1">% count the mask pixels (no semicolon, so the result is displayed)</span></code></pre></figure>
<p>The resulting mask covers 138495 pixels.
As for the scale, Google Maps shows 100 m spanning 84 pixels, so a 100 x 100 square metre area covers 7056 pixels.
I put the scale at bottom right of the map pictures.
With that scale, we can estimate the area to be \( A=138495 \times \frac{(100)(100)}{7056} = 1.9628\times10^5\ \mathrm{m}^2 \).</p>
<p>Knowing that the people were praying during the rally, a reasonable estimate is that one person occupied about 0.5 square metre.
Thus, the estimated number of people during the rally was about
<script type="math/tex">N \approx \frac{A}{0.5\ \mathrm{m}^2} \approx 392559.</script></p>
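The arithmetic above can be reproduced in a few lines (a Python sketch; the constants are the ones quoted in this post):

```python
# Numbers quoted in this post: mask pixel count, map scale, and area per person
mask_pixels = 138495                 # pixels covered by the crowd mask
px_per_area = 84 ** 2                # 100 m spans 84 px, so (100 m)^2 = 7056 px
area_m2 = mask_pixels * 100 * 100 / px_per_area  # occupied area in square metres
people = area_m2 / 0.5               # one person per 0.5 m^2
```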
<p>Based on my estimation, there were about 400,000 people during the rally. This is far below the claims of 2 million or even 7 million.</p>
<p>The number is estimated from the pictures shown in this blog. If you have additional picture to refine my estimate, feel free to contact me.</p>
<p><strong>UPDATE</strong></p>
<p>There have been many inputs from other people pointing out that several regions were not covered on the map.
So let’s take those inputs and update the estimate.
One of my friends said that there were also people praying in Tugu Tani (green arrow below).
Another friend said that he was praying in a place I didn’t mark, and he witnessed a long line still stretching behind him (red arrow below).
Let’s expand the mask.</p>
<p><img title="Monas markdown 03" src="/assets/monas-markdown-03.png" width="300" /></p>
<p><img title="Monas mask 02" src="/assets/monas-mask-02.png" width="300" /></p>
<p>Applying the same code as above (with appropriate file name), we obtain the pixels covered by the mask is 178244 pixels.
And for the space occupied by a person, there are some people say that it was really cramped during the prayer, and they estimated to be around 3 persons per square meter.
Let’s take that number as the upper bound estimate.
Therefore, with the same scale as before, with 3 persons per square meter, it is estimated that there were about <strong>757,840</strong> people during the rally.
If we are using 2 persons per square meter as before, there were about <strong>505,227</strong> people.</p>Today (2nd December 2016) there were prayer rally to protest against Jakarta’s chinese and christian governor, Ahok, that was accused of insulting Islam. There has been many speculations about how many people were participating in the Friday prayer rally. Some says 2 million people, some even says 7 million. I am intrigued to estimate how many people were participating the rally based on photos available on the internet and simple geometry.My Paper at NIPS Workshop on Bayesian Optimization2016-11-15T11:53:22+00:002016-11-15T11:53:22+00:00/2016/11/15/my-paper-at-nips-workshop-on-bayesian-optimization<p>I am really excited today getting an email from BayesOpt16 organiser mentioning that the paper I submitted is accepted for Bayesian Optimization Workshop at NIPS 2016. For students in machine learning field, maybe workshop paper is not as cool as the conference paper, but for me, this makes me really happy, as I learned the topic in my spare time.<!--more--></p>
<p>As a physics student studying laser-plasma interaction, I often think about how to choose the parameters of the laser and plasma to optimise the interaction (e.g. how to efficiently transfer energy from the laser to form a wave in the plasma, called a wakefield, so that it can accelerate electrons to high energy). We have a simulation program to simulate the interaction, but it takes quite a long time to perform one simulation and it is also very expensive (e.g. a 3D simulation can take half a day using 1024 cores). The optimisation methods that I had read about always need a lot of simulations to get the optimised parameters (e.g. genetic algorithms, CMA-ES). And some of the efficient ones (e.g. gradient descent) need the gradient with respect to each parameter, which is really hard to obtain in laser-plasma simulations.</p>
<p>I discovered Bayesian Optimisation when I read <a href="https://scholar.google.co.uk/citations?user=nzEluBwAAAAJ&hl=en&oi=ao">Nando de Freitas' publications</a>. The objective of Bayesian Optimisation matches exactly what I usually think about: optimise a black-box function with a minimum number of evaluations and without knowledge of the gradient. However, after trying Bayesian Optimisation methods several times, it seems that we need to choose the correct hyper-parameters for the algorithm to work correctly, even though most of the time it works really well. And then I read about <a href="http://papers.nips.cc/paper/4304-optimistic-optimization-of-a-deterministic-function-without-the-knowledge-of-its-smoothness.pdf">Simultaneous Optimistic Optimisation (SOO)</a>, which needs fewer hyper-parameters and is more reliable (but sometimes it needs more function evaluations).</p>
<p>Bayesian Optimisation is suitable for choosing the laser-plasma parameters in the simulations, but it can't be used to optimise shapes, e.g. what is the shape of the laser or the density profile of the plasma that makes the interaction efficient? Shape optimisation is an infinite-dimensional optimisation problem and can sometimes be solved using the calculus of variations. The famous example of shape optimisation is the brachistochrone problem. It states, "given two points in space with constant gravity, what is the shape of the path between the two points so that a bead can travel without friction from the higher point to the lower point in the shortest time possible?" It can be solved easily using the calculus of variations, as the Fréchet derivative for that case is easily obtained. But what if I change the case so that (1) the bead travels with friction, or (2) the gravity is not constant in space? The Fréchet derivative of the shape is then not easily obtained. This is the case for laser-plasma interaction, because the Fréchet derivatives of the laser shape and the density profile of the plasma are not easily obtained (i.e. they interact non-linearly).</p>
<p>To optimise the shape, I employed the SOO method. It is a tree-based optimisation method with very relaxed constraints. Details of the SOO method can be found in its paper (<a href="http://papers.nips.cc/paper/4304-optimistic-optimization-of-a-deterministic-function-without-the-knowledge-of-its-smoothness.pdf">link</a>). My idea for applying SOO to 1D shape optimisation is as follows. It starts by fixing the positions of the two end points, connected by a straight line. Then it places a new point in the middle of the end points and optimises the position of that middle point. The optimum position of the middle point is searched within a search area. Once it reaches a certain number of evaluations, new points in the middle of the existing points are placed for the next stage of the algorithm. The positions of all points, except the end points, are then optimised to give the maximum/minimum function value (e.g. the travel time of a bead). So in the first stage, the algorithm only optimises 1 dimension (i.e. the position of the middle point). In the second stage, it optimises 3 dimensions (i.e. the previous middle point and the two points between the middle point and the end points). And so on. So in the \( n \)-th stage, it optimises \( 2^n-1\) dimensions. The key to this idea is that the search area's width for a stage is 4 times smaller than the search area's width in the previous stage. This makes the algorithm work well for high-dimensional optimisation.</p>
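The coarse-to-fine idea can be illustrated with a toy sketch. This is not the SOO-based method from the paper: a plain accept-if-better random search stands in for SOO within each stage's search area, and the objective (distance to a hidden sine curve) is invented for the demo.

```python
import numpy as np

def objective(xs, ys):
    """Toy black-box score standing in for e.g. a bead's travel time:
    mean squared distance to a hidden target curve."""
    return float(((ys - np.sin(np.pi * xs)) ** 2).mean())

def coarse_to_fine(n_stages=4, evals_per_stage=200, width0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    xs, ys = np.array([0.0, 1.0]), np.array([0.0, 0.0])  # fixed end points
    width = width0
    best = objective(xs, ys)
    for stage in range(n_stages):
        # insert midpoints between existing points by linear interpolation
        new_xs = np.linspace(0.0, 1.0, 2 ** (stage + 1) + 1)
        ys = np.interp(new_xs, xs, ys)
        xs = new_xs
        # optimise the interior points by simple local random search
        best = objective(xs, ys)
        for _ in range(evals_per_stage):
            cand = ys.copy()
            cand[1:-1] += rng.uniform(-width, width, size=len(ys) - 2)
            score = objective(xs, cand)
            if score < best:
                best, ys = score, cand
        width /= 4.0  # the search width shrinks 4x per stage, as in the post
    return xs, ys, best
```

With 4 stages, the curve has \(2^4+1\) points (15 interior points plus the two fixed ends), and the fit improves rapidly as later stages make only small refinements.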
<p>I tested the algorithm on the brachistochrone and catenary cases, and in both cases the algorithm works very well. It can determine the optimum positions of 15 points (excluding the end points) in 1000 evaluations. It also outperforms other algorithms employing Bayesian Optimisation, even when they use a smaller number of dimensions, namely 7 (in most cases, higher-dimensional optimisation is harder than lower-dimensional optimisation). I also tested the brachistochrone-with-friction case (where the Fréchet derivative is hard to obtain) and it also works well. However, I didn't include the latter case in my paper because I haven't found the correct analytic solution as a benchmark (I believe there is one out there).</p>
<p>My plan for the algorithm is to look for a new idea to make it work with unknown constraints. If I can work out the case with unknown constraints, then I think new physics could be discovered using the method (even though it is not necessarily related to my project). And hopefully I can submit the paper to ICML 2017!</p>What is the chance Ahok wins the election in one round?2016-10-21T08:10:01+00:002016-10-21T08:10:01+00:00/2016/10/21/what-is-the-chance-ahok-wins-the-election-in-one-round<p>I got a text message earlier today from my brother.
He described some statistics from SMRC (a research organisation in Indonesia) about people's choices for the next Jakarta governor election in 2017. There are 3 candidates for the governor position, one of them being the incumbent, Ahok. Out of 648 people they surveyed, 45.4% chose Ahok, 22.4% chose Agus, 20.7% preferred Anies, and 11.5% chose not to disclose or had no choice (<a href="/assets/rilis-dki-oktober2016_final_REV.pdf">link</a>, in Bahasa Indonesia). The question from <a href="https://www.facebook.com/m.firdaus.kasim/posts/10210200236098831">my brother </a>is, "(if the election were held now, when the survey was conducted) <em>what is the chance that Ahok wins the election in one round</em>?" To win the election in one round, Ahok needs to get more than 50% of the votes.<!--more--></p>
<p>As I am now practicing my skills in Bayesian probability, I solved this problem using a naive Bayesian approach. An assumption in this problem is that random sampling was used in gathering the samples. In fact, they used multi-stage random sampling, but getting the details of their method is hard, so I think random sampling is a reasonable assumption.</p>
<p>The problem is posed as follows. There are \( n \) samples drawn from a population of \( N \). If \( a \) out of \( n \) choose Ahok, what is the proportion of the population that chooses Ahok? Assume that \( N \gg n\).</p>
<p>Denote the proportion in the population as \( \eta \); the probability that the proportion of people choosing Ahok in the population has the value \( \eta \) is</p>
<p>$$ P(\eta | a) = \frac{P(a | \eta) P(\eta)}{P(a)}. $$</p>
<p>\( P(a | \eta) \) denotes the probability of finding \( a \) samples out of \( n \) that choose Ahok if the proportion in the population is \( \eta\). Since the population size is much larger than the sample size, we can safely treat the sampling as sampling with replacement. Thus, from the binomial distribution,</p>
<p>$$ P(a | \eta) = \left(\begin{array}{c} n \\ a \end{array}\right) \eta^a (1-\eta)^{n-a}. $$</p>
<p>Now the prior distribution of \( \eta \) is just a continuous uniform distribution from 0 to 1, so</p>
<p>$$ P(\eta)\ \mathrm{d}\eta = \mathrm{d}\eta\ \mathrm{for}\ \eta\in[0,1]. $$</p>
<p>To find \( P(a) \), we can integrate \( P(a|\eta) \) over all values of \( \eta \) using the prior distribution of \( \eta \). Therefore,</p>
<p>$$ P(a) = \int_0^1 P(a | \eta)\ \mathrm{d}\eta = \left(\begin{array}{c} n \\ a \end{array}\right) \int_0^1 \eta^a (1-\eta)^{n-a}\ \mathrm{d}\eta. $$</p>
<p>The integral yields the <a href="https://en.wikipedia.org/wiki/Beta_function">Beta function</a>, giving \( P(a) = \left(\begin{array}{c} n \\ a \end{array}\right) B(a+1, n-a+1) \).</p>
<p>Having obtained all the prior distributions, we can now write the distribution of the proportion of the population that chooses Ahok, given that \( a \) out of \( n \) samples choose Ahok,</p>
<p>$$ P(\eta|a)\ \mathrm{d}\eta = \frac{1}{B(a+1, n-a+1)} \eta^a (1-\eta)^{n-a}\ \mathrm{d}\eta. $$</p>
<p>This is known as the <a href="https://en.wikipedia.org/wiki/Beta_distribution">Beta distribution</a>. For \( n \in \{10, 20, 50\}\) and \( a = n/2\), the probability density function (PDF) of the Beta distribution is shown below.</p>
<div align="center"><img align="middle" class="size-medium wp-image-22 aligncenter" src="/assets/pdf-beta-300x225.png" alt="PDF Beta Distribution" width="300" height="225" /></div>
<p>It is seen that with more samples, the variance of \( \eta \) becomes smaller. To get the margin of error at 95% confidence, we can search for the range of the distribution that covers an area of 0.95, as the area below each curve integrates to 1. Below is a plot of the margin of error at 95% confidence versus the number of samples, with \( a/n = 0.5\).</p>
<div align="center"><img class="size-medium wp-image-23 aligncenter" src="/assets/margin-of-error-300x225.png" alt="Margin of Error" width="300" height="225" /></div>
<p>The margin of error with 95% confidence can be approximated by \( \sim 100\%/\sqrt{n} \). This agrees with the numbers the research organisation provided. In their <a href="/assets/rilis-dki-oktober2016_final_REV.pdf">presentation (page 3)</a>, they report 648 respondents and a margin of error of about 3.9%, which matches our approximation, \( \sim 100\%/\sqrt{n} \approx 3.93\% \).</p>
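<p>Both claims above can be checked with a short sketch: the spread of the posterior shrinks with \( n \), and the 95% margin of error for \( n = 648 \) comes out close to the \( 100\%/\sqrt{n} \) rule of thumb:</p>

```python
# Check numerically: the posterior spread shrinks with n, and the 95%
# margin of error is roughly 100%/sqrt(n). n=648 is the survey size
# reported in the SMRC presentation; a/n = 0.5 as in the text.
from math import sqrt
from scipy.stats import beta

# Posterior Beta(a+1, n-a+1) standard deviation for a/n = 0.5.
for n in (10, 20, 50):
    a = n // 2
    print(n, beta(a + 1, n - a + 1).std())  # shrinks as n grows

# 95% equal-tailed interval for n=648, a=324.
n, a = 648, 324
lo, hi = beta.interval(0.95, a + 1, n - a + 1)
moe = 100 * (hi - lo) / 2
print(moe, 100 / sqrt(n))  # ~3.8% vs the ~3.9% rule of thumb
```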
<p>Back to the problem. To win the election in one round, Ahok needs \( \eta > 0.5 \). Thus, to calculate the probability of this, we integrate the Beta distribution from \( \eta = 0.5 \) to 1. This can be done with the <a href="https://en.wikipedia.org/wiki/Beta_function#Incomplete_beta_function">incomplete beta function</a>.</p>
<p>$$ P(\eta>0.5|a) = \int_{0.5}^1 \frac{\eta^a (1-\eta)^{n-a} d\eta}{B(a+1, n-a+1)}$$</p>
<p>$$ P(\eta>0.5|a) = 1-I(0.5; a+1, n-a+1), $$ where \( I \) is the regularized incomplete beta function.</p>
<p>Looking at the data, 573 respondents gave a definite choice. Out of these 573 people, 294 chose Ahok (51.3%) while the other 48.7% did not. Thus \( n=573 \) and \( a=294 \). Inserting these numbers into the equation above, we obtain</p>
<p>$$ P(\eta>0.5 | a) = 73.5\%. $$</p>
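<p>This number can be reproduced in a couple of lines with scipy, since <code>beta.cdf</code> is exactly the regularized incomplete beta function:</p>

```python
# Reproduce the final number: P(eta > 0.5 | a) = 1 - I_{0.5}(a+1, n-a+1),
# with n=573 decided respondents and a=294 choosing Ahok (from the survey).
from scipy.stats import beta

n, a = 573, 294
p_win = 1 - beta.cdf(0.5, a + 1, n - a + 1)
print(p_win)  # close to the ~73.5% quoted above
```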
<p>In conclusion, using the random sampling model, we find that the probability of Ahok winning the election in one round (if the election had been held when the survey was conducted) is <strong>73.5%</strong>.</p>