Marek Kozlowski

Sovereign AI in Poland: Language Adaptation, Local Control & Cost Advantages with Marek Kozlowski

Dec 6, 2025 · 90 min

AI Summary

→ WHAT IT COVERS

Marek Kozlowski, head of the AI Lab at Poland's National Information Processing Institute, explains how Project PLUM (Polish Large Language Models) builds locally controlled AI by performing language adaptation on Llama and Mistral base models using ~200 billion curated Polish tokens, targeting performance parity with models 10x larger on language and cultural tasks.

→ KEY INSIGHTS

- **Language Adaptation vs. Full Pretraining:** Training from scratch requires at least 1 trillion tokens for stable results. PLUM instead continues pretraining Llama and Mistral base models on ~200 billion curated Polish tokens. This "language adaptation" injects local linguistic and cultural knowledge while preserving existing multilingual capabilities, achieving competitive Polish-language performance without the compute cost of full pretraining runs.
- **Frontier Model Quality Degrades for Niche Languages Over Generations:** Results on Poland's PLCC (Polish Linguistic and Cultural Competency) benchmark show successive Claude and GPT releases declining in Polish language and cultural performance. As frontier labs prioritize coding and reasoning benchmarks, niche-language quality becomes a trade-off casualty, so organizations relying on cloud APIs risk worsening performance over time without warning or recourse.
- **Small Fine-Tuned Models Match Large Cloud Models for Specific Tasks:** When a business has 10–20 defined use cases and prepares at least 1,000 supervised fine-tuning (SFT) instructions per task, a smaller on-premise model can match the zero-shot or few-shot performance of large cloud LLMs. This approach reduces energy costs, eliminates cloud dependency, and enables deployment in regulated sectors where data cannot leave the organization's infrastructure.
- **Domain Adaptation Requires ~10 Billion Clean Tokens to Be Worthwhile:** PLUM's work with Central and Eastern Europe's largest bank shows that domain-specific continued pretraining delivers measurable quality gains, but only when the organization can supply roughly 10 billion tokens after deduplication and filtering. Because raw data shrinks by 3–4x through curation, organizations need 30–40 billion raw tokens, a threshold fewer than 100 European companies realistically meet.
- **EU Regulation Eliminates ~80% of Usable Training Data:** The EU AI Act, combined with local authorship rights legislation, creates constraints far stricter than any voluntary model constitution. These regulations prevent large-scale web scraping and require detailed model cards disclosing training data, compute, and security measures. PLUM compensates by securing bilateral agreements with publishers and libraries, and by building internal human annotation pipelines that produce organic instruction and preference datasets.
- **Organic Human-Annotated Data Drives Quality at the SFT Stage:** Instruction data generated synthetically by other LLMs degrades model output quality when the synthetic examples contain poor linguistic structure. PLUM employs dozens to hundreds of human annotators to create and review instructions and preference pairs manually. This organic data pipeline, combined with publishing dataset samples and a ~100-page technical cookbook on Hugging Face, differentiates PLUM from open-weight-only releases that offer no training-data transparency.

→ NOTABLE MOMENT

Kozlowski reveals that when his team evaluated successive Claude model releases on the PLCC benchmark, Polish cultural and linguistic performance measurably declined across versions. Organizations that deeply integrate a cloud LLM into Polish-language workflows could therefore find that their vendor's next release quietly performs worse on their core use case, with no rollback option available.
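The data budgets quoted in the key insights reduce to simple arithmetic. The following is a minimal sketch, not PLUM's own tooling: the function names `raw_tokens_needed` and `sft_examples_needed` are hypothetical, and the only inputs are the figures stated in the summary (a ~10 billion clean-token target, 3–4x shrinkage during curation, and at least 1,000 SFT instructions for each of 10–20 tasks).

```python
def raw_tokens_needed(clean_tokens: int, shrink_factor: float) -> float:
    """Raw tokens required so that `clean_tokens` remain after curation.

    `shrink_factor` is how many times smaller the corpus becomes after
    deduplication and filtering (the summary quotes 3-4x).
    """
    if shrink_factor < 1:
        raise ValueError("shrink_factor must be >= 1")
    return clean_tokens * shrink_factor


def sft_examples_needed(num_tasks: int, per_task: int = 1_000) -> int:
    """Minimum number of SFT instructions across all defined use cases."""
    return num_tasks * per_task


if __name__ == "__main__":
    clean_target = 10_000_000_000  # ~10 billion clean tokens (from the summary)

    # Raw-corpus range implied by 3-4x shrinkage during curation.
    low = raw_tokens_needed(clean_target, 3.0)
    high = raw_tokens_needed(clean_target, 4.0)
    print(f"Raw tokens needed: {low / 1e9:.0f}-{high / 1e9:.0f} billion")

    # SFT budget for the 10-20 use cases mentioned in the summary.
    for tasks in (10, 20):
        print(f"{tasks} tasks -> at least {sft_examples_needed(tasks):,} instructions")
```

Under these assumptions, an organization would need roughly 30–40 billion raw tokens for domain adaptation, whereas task-specific fine-tuning needs only on the order of 10,000–20,000 curated instructions, which is why the summary treats the second path as accessible to far more organizations.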
💼 SPONSORS: Google DeepMind / AI Studio (https://ai.studio/build), Tasklet (https://tasklet.ai), Framer (https://framer.com/design), Shopify (https://shopify.com/cognitive)

🏷️ Sovereign AI, Language Model Adaptation, EU AI Regulation, Small Language Models, On-Premise AI Deployment, Polish NLP


Featured On 1 Podcast

Cognitive Revolution

Top resources Marek Kozlowski mentions

PLCC (Polish Linguistic and Cultural Competency benchmark)

