Neural networks seem, in some respects, to undermine the traditional theory of machine learning, which leans heavily on probability theory and statistics. What is the secret of their success?
Researchers have shown that networks with an infinite number of neurons are mathematically equivalent to simpler machine learning models known as kernel methods. The striking success of real networks could be explained if this equivalence extends beyond such idealized, “perfect” neural networks.
ML models are generally thought to perform best when they have the right number of parameters. With too few parameters, a model may be too simple and fail to capture the nuances of its training data. With too many, it becomes so flexible that it memorizes fine details of that data and then cannot generalize to new examples. That is what is called overfitting.
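A toy example makes the trade-off concrete (this is an illustration, not part of the research described here): fit polynomials of increasing degree to a few noisy samples of a simple curve and compare the error on the training points with the error on fresh points.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a simple underlying curve.
x_train = np.linspace(-1, 1, 15)
y_train = np.sin(np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(np.pi * x_test)

for degree in (1, 4, 12):
    coeffs = np.polyfit(x_train, y_train, degree)            # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}  train MSE {train_mse:.3f}  test MSE {test_mse:.3f}")

# A degree-1 fit is too simple, so both errors stay high. A degree-12 fit drives the
# training error toward zero but typically does worse on the held-out curve because
# it has chased the noise. The middle-sized model tends to generalize best.
```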
“It’s a balance between learning too well from the data and not learning at all. You want to be in the middle,” says Mikhail Belkin, a machine learning researcher at the University of California, San Diego, who is excited by the new prospects.
Deep neural networks like VGG are widely believed to have far more parameters than they need, which means their predictions should suffer from overfitting. But this is not the case. On the contrary, such networks generalize to new data with surprising success. Why? No one knew, despite many attempts to find out.
Naftali Tishby, a computer scientist and neuroscientist at the Hebrew University of Jerusalem, argued that deep neural networks first learn to fit their data and then pass through an “information bottleneck,” discarding irrelevant information, and that this is what helps them generalize. Other scientists believe this does not happen in all networks.
The mathematical equivalence of kernel methods and idealized neural networks gives clues as to why and how networks with a huge number of parameters arrive at their solutions.
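One way to get a feel for this equivalence, in the very simplest setting, is the following numerical sketch (an illustration, not the researchers' own construction). A kernel is just a function that scores the similarity of two inputs. For a single layer of randomly initialized ReLU neurons, the average product of the activations produced by two inputs settles, as the layer gets wider, onto a fixed similarity function with a known closed form (proportional to the arc-cosine kernel), assuming standard Gaussian weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_kernel(x1, x2, width):
    """Average product of random-ReLU activations for two inputs, for a layer of the given width."""
    d = x1.shape[0]
    W = rng.normal(size=(width, d))                          # random weights ~ N(0, 1)
    h1, h2 = np.maximum(W @ x1, 0), np.maximum(W @ x2, 0)    # ReLU activations
    return h1 @ h2 / width

def infinite_width_kernel(x1, x2):
    """Closed-form limit for ReLU units with N(0, 1) weights (proportional to the arc-cosine kernel)."""
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    theta = np.arccos(np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0))
    return n1 * n2 * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

x1, x2 = rng.normal(size=3), rng.normal(size=3)
for width in (10, 1_000, 100_000):
    print(width, empirical_kernel(x1, x2, width), infinite_width_kernel(x1, x2))
# As the width grows, the random network's empirical similarity score converges
# to the fixed analytic kernel: the infinitely wide layer "is" a kernel.
```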
Kernel methods are algorithms that find patterns in data by projecting it into very high-dimensional spaces. By studying the more tractable kernel equivalents of idealized neural networks, researchers hope to learn why complex deep networks converge, during training, to solutions that generalize well to new data.
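A minimal sketch of a kernel method in action, assuming a standard Gaussian (RBF) kernel and kernel ridge regression rather than any specific method from the research: the high-dimensional projection is never built explicitly; the model only ever needs pairwise kernel values between data points.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_kernel(A, B, gamma=10.0):
    """Gaussian kernel: each entry compares two points via their squared distance."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Noisy 1-D training data.
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=30)

# Kernel ridge regression: fit a linear model in the kernel's implicit
# high-dimensional feature space by solving (K + lam * I) alpha = y.
lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict at new points using only kernel evaluations against the training set.
X_new = np.linspace(-1, 1, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_new, X) @ alpha
print(np.c_[X_new[:, 0], y_pred, np.sin(3 * X_new[:, 0])])   # input, prediction, true value
```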
“A neural network is a bit like a Rube Goldberg machine. It’s hard to tell which parts of it really matter,” Belkin says. “Kernel methods are not that complicated. I think reducing neural networks to kernel methods lets you isolate the driving force behind what’s going on.”