Objective: To understand and implement a Multi-Layer Perceptron (MLP), with a focus on architecture, activation functions, and training.

1. The XOR problem is an example of:
2. Why is a single-layer perceptron unable to solve the XOR problem?
3. Which activation function is commonly used in the output layer of an MLP for binary classification problems?
4. What is the primary purpose of a hidden layer in a Multi-Layer Perceptron (MLP)?
5. What is the purpose of the learning rate in the training of an MLP?
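As a companion to the questions above, here is a minimal illustrative sketch (not part of the original material) of a hand-wired two-layer network that solves XOR using step activations. The specific weights are one of many possible solutions; the point is that the hidden layer remaps the inputs so that a single linear boundary suffices, which a single-layer perceptron cannot achieve because XOR is not linearly separable.

```python
def step(x):
    # Heaviside step activation: fires when the weighted sum is non-negative
    return 1 if x >= 0 else 0

def neuron(inputs, weights, bias):
    # A single perceptron unit: weighted sum of inputs plus bias,
    # passed through the step activation
    return step(sum(w * i for w, i in zip(weights, inputs)) + bias)

def xor_mlp(x1, x2):
    # Hidden layer: two units computing OR and AND of the inputs.
    # This is the hidden layer's job: transform the inputs into a
    # representation where the classes become linearly separable.
    h_or  = neuron([x1, x2], [1, 1], -0.5)   # OR(x1, x2)
    h_and = neuron([x1, x2], [1, 1], -1.5)   # AND(x1, x2)
    # Output layer: "OR and not AND", i.e. exactly one input is 1
    return neuron([h_or, h_and], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_mlp(a, b))
# prints:
# 0 0 -> 0
# 0 1 -> 1
# 1 0 -> 1
# 1 1 -> 0
```

In a trained MLP the weights would instead be found by gradient descent, where the learning rate scales each weight update, and a sigmoid output (rather than a hard step) would give a probability for binary classification.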