I've just done some very basic experiments using the Titanic dataset (which I found here). The dataset contains some very basic qualitative information on each of the 2201 souls aboard the RMS Titanic:
Class (1st, 2nd, 3rd or Crew)
Sex (Male, Female)
Age (Adult, Child)
Survivor (Yes, No)
I altered the dataset slightly and presented it to a simple neural network, to see if it could predict whether a person lived or died, based on their age, sex and class. The answer is, of course, that there’s no simple rule for predicting survival from such basic data, but the neural network did quite well, reaching an error rate of 0.2 (ish).
In order to make the dataset more neural network-friendly I expanded the “Class” column into four booleans (rather than using a number). This gives a dataset with six boolean inputs and one boolean output. You can download the dataset here: titanic.tdf
Inputs: 1st Class, 2nd Class, 3rd Class, Crew, Age (0 = Adult, 1 = Child), Sex (0 = Male, 1 = Female)
Output: Survived (0 = No, 1 = Yes)
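Here's a minimal sketch of that encoding in Python. The function name `encode` and the string labels are my own assumptions for illustration; the actual titanic.tdf file may lay things out differently.

```python
# Order of the four class booleans, matching the input list above.
CLASSES = ["1st", "2nd", "3rd", "Crew"]

def encode(passenger_class, age, sex):
    """Turn one record into six 0/1 inputs (hypothetical helper)."""
    # One boolean per class: exactly one of the four is set.
    one_hot = [1 if passenger_class == c else 0 for c in CLASSES]
    # Age: 0 = Adult, 1 = Child. Sex: 0 = Male, 1 = Female.
    return one_hot + [1 if age == "Child" else 0, 1 if sex == "Female" else 0]

print(encode("3rd", "Adult", "Female"))  # [0, 0, 1, 0, 0, 1]
```

Using four booleans rather than a single number avoids implying that "Crew" is somehow "more" than "3rd Class".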
During training (using a genetic algorithm) the network often reaches an initial solution where it simply outputs a zero. Most people on the Titanic died, so the network can do reasonably well by simply assuming that everyone died. After a while, the network draws a few more conclusions about the dataset and discovers some features which allow it to reach a lower error rate.
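To get a feel for how good the "everyone died" solution is, here's a rough sketch of its error. It assumes the commonly cited counts for the 2201-row version of the data (1490 died, 711 survived); if your copy differs, swap in your own labels. It also treats "error" as the fraction of wrong answers, which may not be exactly the measure behind the 0.2 figure above.

```python
def constant_zero_error(labels):
    """Fraction of rows misclassified by always predicting 0 (died)."""
    # Every survivor (label 1) is a mistake; every death (label 0) is correct.
    return sum(labels) / len(labels)

# Assumed counts: 1490 deaths and 711 survivors out of 2201.
labels = [0] * 1490 + [1] * 711
print(round(constant_zero_error(labels), 3))  # 0.323
```

So a network that has learned nothing except "output zero" is already wrong only about a third of the time, which is why it's such a tempting place to get stuck.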
The graph shows the fitness curve during training: the green series is the best fitness and the blue series is the average fitness of the population (for more info see the article on Evolutionary Algorithms). The intermediate plateau (the flat section halfway up) shows the period where the genetic algorithm became stuck in the local minimum associated with the “output a constant zero” solution. After a while a better solution was found and the fitness increased further.
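For anyone who wants to see where the "best" and "average" series come from, here's a toy sketch of a genetic algorithm that records both each generation. To be clear, this is not my actual training code: it evolves a single logistic neuron rather than a full network, and the population size, mutation size and tiny dataset are all made-up values for illustration.

```python
import math
import random

def predict(weights, row):
    # Single logistic neuron; the last weight acts as the bias.
    z = weights[-1] + sum(w * x for w, x in zip(weights, row))
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def fitness(weights, data):
    # Negative mean squared error, so higher is better.
    return -sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def evolve(data, n_weights, pop_size=20, generations=30, sigma=0.5, seed=0):
    rng = random.Random(seed)
    pop = [[rng.gauss(0, 1) for _ in range(n_weights)] for _ in range(pop_size)]
    history = []  # one (best fitness, average fitness) pair per generation
    for _ in range(generations):
        scored = sorted(pop, key=lambda w: fitness(w, data), reverse=True)
        fits = [fitness(w, data) for w in scored]
        history.append((fits[0], sum(fits) / len(fits)))
        parents = scored[: pop_size // 2]
        # Elitism: keep the best individual, fill the rest with mutated parents.
        pop = [scored[0]] + [
            [w + rng.gauss(0, sigma) for w in rng.choice(parents)]
            for _ in range(pop_size - 1)
        ]
    return scored[0], history

# Made-up rows in the six-input format described earlier (not real data).
toy_data = [
    ([0, 0, 1, 0, 0, 0], 0),
    ([0, 0, 0, 1, 0, 0], 0),
    ([1, 0, 0, 0, 0, 1], 1),
    ([0, 1, 0, 0, 1, 0], 1),
]
best, history = evolve(toy_data, n_weights=7)
```

Because the best individual is carried over unchanged, the best-fitness series can never go down, which is why plateaus show up as flat sections rather than dips.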
The really interesting thing about the neural network’s solution to the problem isn’t that it reached any kind of useful error rate (which it didn’t), but that its weights encode the relative importances of the various inputs. Using my importance-propagation algorithm (currently awaiting publication) we can draw the network like this…
In the picture, larger neurons have a higher importance (i.e. a higher impact on the output of the network). Weights are shown as lines connecting the neurons together; solid weights are positive and dashed weights are negative; the darker a weight, the higher its value. We can see that the most useful inputs for survival prediction were “3rd Class”, “Age”, and “Sex”.
We can deduce whether an input is excitatory or inhibitory on the overall output of the network (i.e. whether it has a positive or negative effect on the network’s output) by looking at the weights which connect Input to Hidden neurons and Hidden to Output neurons. A route from input to output whose two weights have the same sign (both positive or both negative) implies that the input is excitatory. Conversely, a route whose weights have different signs (one positive and one negative) implies that the input is inhibitory. Think of multiplication:
positive * positive = positive
positive * negative = negative
negative * positive = negative
negative * negative = positive
In our example, given that the weight connecting neuron I2 to H3 is large and negative and the weight connecting from H3 to the output neuron (H7) is large and positive we can deduce that input I2 — “3rd Class” — has an inhibitory effect on the network’s output. Using these rules we can make the following judgements about the inputs…
- 3rd Class passengers were likely to die
- Women and Children were likely to survive
- None of the other inputs make much of a difference
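The sign rule described above can be sketched in a few lines of Python. This is only the multiplication trick for a single input-to-hidden-to-output route, not my importance-propagation algorithm, and the weight values below are invented for illustration rather than taken from the trained network.

```python
def path_effect(w_input_hidden, w_hidden_output):
    """Classify one input -> hidden -> output route by the sign of its weights."""
    product = w_input_hidden * w_hidden_output
    if product > 0:
        return "excitatory"   # same signs: the input pushes the output up
    if product < 0:
        return "inhibitory"   # different signs: the input pushes the output down
    return "neutral"          # a zero weight blocks the route

# Hypothetical weights: large negative into the hidden neuron, large positive out of it.
print(path_effect(-2.5, 1.8))  # inhibitory
```

When an input reaches the output through several hidden neurons, the routes can disagree, so this rule is most trustworthy when one route clearly dominates.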
I haven’t used any other statistical techniques to verify these findings (I guess some information theory would be useful here), but the results are quite compelling. Perhaps the network is confirming that the crew did indeed shout “Women and children first!” and that James Cameron was right about steerage passengers being locked below decks? Or maybe not!