High-Performance Language Technologies for Europe
Massive text collections for pre-training are the ‘crude oil’ of the large language model (LLM) era. The process of ‘refining’ high-quality datasets from web data at scale presupposes computational infrastructure and technological muscle that is often …