What it actually is

Despite the biological-sounding name, a neural network is a specific kind of mathematical function: a stack of layers, each of which transforms its input with a matrix multiplication followed by a simple nonlinearity.

Strip away the marketing, and a modern neural network is doing roughly this:

output = layer_N(layer_N-1(... layer_2(layer_1(input)) ...))

where each layer_i is:

output = nonlinear(weights × input + bias)

That's it. Weights and biases are the adjustable parameters; the nonlinearity (usually ReLU — "max(0, x)") gives the network enough flexibility to learn complicated patterns.
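
In numpy, one such layer is a few lines. This is a minimal sketch; the sizes and the random initialization are purely illustrative:

  import numpy as np

  def relu(x):
      return np.maximum(0, x)                    # the nonlinearity: max(0, x), elementwise

  rng = np.random.default_rng(0)
  weights = rng.normal(size=(128, 784)) * 0.01   # adjustable parameters: 128 outputs, 784 inputs
  bias = np.zeros(128)                           # adjustable parameters

  def layer(x):
      return relu(weights @ x + bias)            # nonlinear(weights × input + bias)

  x = rng.normal(size=784)                       # a stand-in input vector
  print(layer(x).shape)                          # (128,)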

Train this with enough data and enough layers, and it can recognize images, translate languages, generate text, predict structures of proteins, beat humans at games. The magic isn't in any layer; it's in the sheer scale of stacking simple operations and tuning millions or billions of parameters.

A concrete example

Consider an image classifier that decides which digit (0-9) a 28×28 grayscale image shows.

The input is a vector of 784 numbers (one per pixel). The output we want is 10 numbers: the probability of each digit being the right answer.

A simple network might look like:

  • Input layer: 784 numbers.
  • Hidden layer: 128 nodes, each connected to all 784 input nodes by weights. Apply ReLU to each result.
  • Hidden layer: 64 nodes, each connected to all 128 nodes of the previous layer. Apply ReLU.
  • Output layer: 10 nodes, each connected to all 64 previous nodes. Apply softmax (which turns the outputs into probabilities).

Total parameters: 784×128 + 128×64 + 64×10 ≈ 109,000 weights, plus a few hundred biases, for roughly 110,000 in all.
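
A minimal numpy sketch of that forward pass, with untrained random weights, just to make the shapes and the parameter count concrete (a real implementation would use a framework such as PyTorch):

  import numpy as np

  rng = np.random.default_rng(0)
  sizes = [784, 128, 64, 10]                     # input, two hidden layers, output

  # One (weights, bias) pair per layer
  params = [(rng.normal(size=(m, n)) * 0.01, np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

  def softmax(z):
      e = np.exp(z - z.max())                    # subtract the max for numerical stability
      return e / e.sum()

  def forward(x):
      for i, (W, b) in enumerate(params):
          z = W @ x + b
          x = np.maximum(0, z) if i < len(params) - 1 else softmax(z)   # ReLU, softmax at the end
      return x                                   # 10 probabilities

  image = rng.random(784)                        # stand-in for a flattened 28×28 image
  print(forward(image).sum())                    # 1.0 (the probabilities sum to one)

  print(sum(W.size + b.size for W, b in params)) # 109,386 parameters, weights plus biases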

Train on 60,000 labeled digit images, and this small network can correctly classify new digit images about 97% of the time. Add convolutional layers (specialized for images) and you can push to 99.5%.

That basic recipe — input, multiply by weights, apply nonlinearity, repeat, output probabilities — scales up. Modern image models have millions of parameters; modern language models have hundreds of billions.

Why nonlinearity matters

If the network had no nonlinearity — just stacked linear transformations — then no matter how many layers you stacked, the whole thing would collapse mathematically into a single linear transformation. You'd just be doing one big matrix multiply.
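
You can check the collapse directly. In this small numpy sketch (arbitrary sizes), two stacked linear layers give exactly the same answer as the single matrix that is their product:

  import numpy as np

  rng = np.random.default_rng(0)
  x  = rng.normal(size=100)
  W1 = rng.normal(size=(50, 100))                # first layer, no nonlinearity
  W2 = rng.normal(size=(10, 50))                 # second layer, no nonlinearity

  two_layers = W2 @ (W1 @ x)                     # a "deep" stack of two linear layers
  one_layer  = (W2 @ W1) @ x                     # one matrix multiply

  print(np.allclose(two_layers, one_layer))      # True: the extra layer bought nothing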

The nonlinearity (ReLU, sigmoid, tanh — choices vary) breaks this collapse. Each layer can do something genuinely different from the layers above and below. Stacking nonlinear layers gives the network exponentially growing expressive capacity.

This is why "deep" matters. Mathematically, a network with a single hidden layer can in principle approximate any continuous function (the universal approximation theorem), but that hidden layer may have to be impractically wide. Spreading the work across many layers makes the same approximation power practical.

What "training" looks like

Training a neural network is a long optimization loop:

  1. Initialize weights randomly. Output is meaningless.
  2. Run a batch of training examples through the network. Compare the outputs to the desired outputs. Compute the loss (a number summarizing how wrong the network is).
  3. Backpropagation. Use calculus (specifically, the chain rule) to compute, for each weight, how a small change in that weight would change the loss.
  4. Update weights. Nudge each weight slightly in the direction that reduces the loss. The size of the step is set by the "learning rate."
  5. Repeat. Process millions of batches.
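
A toy version of that loop, for a single linear layer and a synthetic regression task. The gradient in step 3 is written out by hand via the chain rule; real frameworks compute it automatically:

  import numpy as np

  rng = np.random.default_rng(0)
  W_true = rng.normal(size=(3, 5))               # the mapping we want the network to learn

  W = rng.normal(size=(3, 5)) * 0.1              # 1. initialize weights randomly
  lr = 0.1                                       # the learning rate

  for step in range(500):                        # 5. repeat over many batches
      X = rng.normal(size=(32, 5))               # a batch of 32 training examples
      Y = X @ W_true.T                            #    ...and their desired outputs
      pred = X @ W.T                             # 2. run the batch through the "network"
      err = pred - Y
      loss = (err ** 2).mean()                   #    the loss: mean squared error
      grad = 2 * err.T @ X / err.size            # 3. backpropagation (chain rule, by hand)
      W -= lr * grad                             # 4. nudge each weight against its gradient
      if step % 100 == 0:
          print(step, round(loss, 6))            # the loss falls toward zero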

A typical modern training run might be a few weeks to months on hundreds to thousands of specialized chips (GPUs or TPUs). The model improves gradually, with the loss falling quickly at first and then with steadily diminishing returns.

Backpropagation — the algorithm for computing parameter gradients in a stacked nonlinear function — was the breakthrough that made deep neural networks trainable in the 1980s. It had been independently invented multiple times, but Geoffrey Hinton's group popularized it.

Why "deep" beat alternatives

Through most of ML's history, neural networks were one of several competing approaches (support vector machines, decision trees, kernel methods). They weren't obviously the best — they were hard to train, required lots of data, and didn't always work.

The "deep" revolution started around 2012. Three things converged:

  • More data. The internet provided labeled datasets at unprecedented scale.
  • GPUs. Originally designed for video games, GPUs turned out to be excellent at the matrix multiplications neural networks need. A neural network that would have taken weeks to train on a CPU could train in days on a GPU.
  • Better techniques. Improvements like ReLU activations (which avoid the vanishing gradients that plague saturating sigmoids), better weight initialization, dropout for regularization, and batch normalization.

The result: in 2012, a deep convolutional neural network (AlexNet) crushed the ImageNet image-classification competition by a huge margin over previous methods. Within a few years, deep networks were dominant in image, speech, and language tasks.

Today, most production ML is deep neural networks. The competing approaches survive in specific niches but are no longer the mainstream.

What's in a trained network

You can sometimes interpret what an individual layer or neuron is doing. Early work on image networks showed:

  • Layer 1 filters detect simple things: edges, color blobs.
  • Layer 2 detects more complex patterns: corners, textures.
  • Layers 3-4: object parts (eyes, wheels, doors).
  • Deeper layers: whole objects (faces, cars).

So a network is doing genuine feature extraction — building increasingly abstract representations of the input. The features aren't manually designed; the network learns them from data.

For language models, similar hierarchical structure is observed: early layers handle syntax (word categories, sentence structure), middle layers semantics (meaning, references), later layers high-level reasoning patterns.

This interpretability is partial and often only suggestive. A modern large language model has so many parameters that nobody has a complete account of what each one does, but the hierarchical pattern is real and gives some insight.

Why they're so big now

Performance scales with size — bigger networks, more data, more compute generally produce better models. The scaling laws (Kaplan et al. 2020 and follow-ups) quantify this.

Concretely:

  • AlexNet (2012, image classification): ~60 million parameters.
  • BERT (2018, language): 340 million parameters.
  • GPT-3 (2020): 175 billion parameters.
  • Modern frontier models (2025): hundreds of billions to a few trillion parameters.

Each new generation pushes scale further, and training cost grows much faster than model size, because bigger models are also trained on more data for longer. The biggest models today reportedly cost over $100 million to train. Inference (running the model) is also expensive, but much cheaper per request than training.
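
A rough back-of-the-envelope shows where the cost comes from. A common approximation is about 6 floating-point operations per parameter per training token (covering the forward and backward passes); the figures below are illustrative, roughly matching public numbers for GPT-3:

  params = 175e9                                 # GPT-3-scale model
  tokens = 300e9                                 # roughly the reported GPT-3 training data size

  flops = 6 * params * tokens                    # ~3.2e23 floating-point operations
  print(f"{flops:.2e} FLOPs")

  # At a sustained 1e15 FLOP/s (one petaFLOP/s) per device:
  days = flops / 1e15 / 86400
  print(round(days), "device-days")              # ~3,600, hence thousands of chips in parallel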

Whether and when scaling will plateau is one of the field's hardest open questions.

Where neural networks fall short

Despite the success, they have specific weaknesses:

  • Sample inefficiency. They need vast amounts of training data, often more than humans need to learn similar tasks.
  • Brittleness. Adversarial inputs (carefully-crafted, often imperceptible to humans) can flip their outputs.
  • Distribution shift. Performance degrades when test data differs from training data.
  • No explicit reasoning. They pattern-match; they don't reason from first principles.
  • Compositional generalization. They sometimes fail to combine concepts in new ways that humans find natural.
  • Energy use. Training huge models costs millions of dollars in electricity.

Many of these are being worked on; some are fundamental to the approach. The field is gradually mixing in other techniques (retrieval-augmented generation, tool use, hybrid symbolic/neural systems) to compensate.

The takeaway

A neural network is a stack of mathematical layers, each doing a matrix multiplication followed by a simple nonlinearity, with millions to trillions of adjustable parameters. Train it on enough data with gradient descent, and it can approximate complex input-output mappings — recognizing images, translating language, generating text, predicting outcomes. The name is misleading (it's not really biology-inspired in any deep way), but the math has turned out to be one of the most successful ideas in computing history. Modern AI is mostly neural networks plus specific architectural tricks for specific tasks.