Oğuz Kaan Yüksel

Generalization

Classical generalization theory relies heavily on the assumption that data is independent and identically distributed (i.i.d.). However, this assumption does not hold for real-world sequential data, which exhibits temporal dependencies. For tasks such as natural language modeling, it is therefore crucial to establish sample complexity results in terms of the number of tokens rather than the number of sequences.

That is, instead of upper bounds of the form \(\mathrm{generalization\ error} \leq f(N)\), we want to derive bounds of the form \(\mathrm{generalization\ error} \leq f(N, T)\), where \(N\) is the number of sequences, \(T\) is the length of each sequence, and \(f\) is a function that decays in both of its arguments.
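
As a purely schematic illustration (not a result from the papers below, and ignoring constants and logarithmic factors), an i.i.d.-style bound for a hypothesis class of complexity \(C\) compared with the token-level bound one would like to obtain:

\[
\mathrm{generalization\ error} \;\lesssim\; \sqrt{\frac{C}{N}} \qquad \text{versus} \qquad \mathrm{generalization\ error} \;\lesssim\; \sqrt{\frac{C}{NT}}.
\]

The exact rates depend on the model class and on how strongly the tokens within a sequence depend on each other.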

Long-Context Linear System Identification in ICLR 2025

Establishes finite-sample complexity results for learning the simplest autoregressive model, the linear dynamical system.
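
For concreteness, a fully observed linear dynamical system (the notation here is illustrative, not taken from the paper) evolves as

\[
x_{t+1} = A^{\star} x_t + w_t,
\]

where \(w_t\) is process noise, and the identification task is to estimate the transition matrix \(A^{\star}\) from observed trajectories \((x_0, x_1, \dots, x_T)\).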

On the Sample Complexity of Next-Token Prediction in AISTATS 2025

Extension to discrete Markov chains that mix.
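
Mixing here refers to the standard notion of convergence to the stationary distribution in total variation; stated for reference (the paper's precise assumptions may differ), the mixing time of a chain with transition kernel \(P\) and stationary distribution \(\pi\) is

\[
t_{\mathrm{mix}}(\varepsilon) \;=\; \min\Big\{ t \;:\; \max_{x} \big\| P^{t}(x, \cdot) - \pi \big\|_{\mathrm{TV}} \leq \varepsilon \Big\}.
\]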

Generalization Bounds for Autoregressive Processes and In-Context Learning in submission

Extension to unbounded autoregressive processes and in-context learning.

Optimization

Generalization is not agnostic to optimization: the optimization algorithm introduces a crucial implicit bias that needs to be taken into account. This bias acts as a constraint on the training trajectory and hence effectively shrinks the complexity of the hypothesis class.
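
A classical example of this effect (included purely as an illustration, not a result from the papers below): gradient descent on an underdetermined least-squares problem \(\min_w \|Xw - y\|_2^2\), initialized at zero and run with a sufficiently small step size, converges to the minimum-norm interpolating solution

\[
w_{\infty} \;=\; X^{\dagger} y \;=\; \operatorname*{arg\,min}_{w \,:\, Xw = y} \|w\|_2,
\]

assuming the linear system \(Xw = y\) is consistent. Although the unconstrained class contains every interpolator, the trajectory never leaves the row space of \(X\), so the effective class is far smaller.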

Incremental Learning of Sparse Attention Patterns in Transformers in submission

Gradient flow analysis of multi-head single-layer attention.

First-order ANIL provably learns representations despite overparametrization in ICLR 2024

Gradient descent analysis of meta-learning in a two-layer linear network.

Data

Understanding the properties of the data is important for two main reasons. First, in non-i.i.d. settings, characterizing data dependencies is essential for deriving concentration bounds. Second, the data distribution directly affects the training trajectory of the model and hence the optimization path it follows.

Incremental Learning of Sparse Attention Patterns in Transformers in submission

Introduces a Markov chain whose temporal structure guides the learning order of the transformer.

Generalization Bounds for Autoregressive Processes and In-Context Learning in submission

On the Sample Complexity of Next-Token Prediction in AISTATS 2025

Introduces rephrasability, a looser version of mixing, to derive generalization bounds.

Feel free to reach out via email or social media if you have any questions!