Unstructured data analysis with LLMs: from clinical notes to actionable insights
Why Look at Unstructured Data?
Before transformers, data science focused almost entirely on numeric data: structured tables, metrics, KPIs. But some of the most valuable information in healthcare lives in text, not numbers:
- Clinical notes in electronic health records (EHRs)
- Open-ended survey and feedback responses
- Social media posts and transcripts from patient conversations
- Meeting and consultation notes
- Academic literature and policy documents
- Financial and operational reports
Each of these contains meaning, patterns, and signals that are invisible to conventional analytics.
From Free Text to Structured Insight
Large language models (LLMs) of all sizes can now perform classification, summarisation, and semantic search on text. Frontier models like Gemini or ChatGPT can be accessed via APIs to process text at scale: identifying patterns, extracting variables, or finding “needles in haystacks” within your data lake.
If data privacy or cost is a concern, inference frameworks such as Ollama or vLLM make it possible to deploy smaller models locally. A quantised 12B-parameter Gemma model can run on a laptop with 16 GB of RAM; a mid-range GPU desktop can host even more powerful models that rival frontier performance on specialised tasks.
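As a minimal sketch of the local route (assuming the `ollama` Python client, a local Ollama server with a Gemma model already pulled, and an illustrative triage label set), classifying a note takes only a few lines:

```python
# pip install ollama -- assumes a local Ollama server with a Gemma model already pulled
import ollama

note = "Patient reports intermittent chest pain on exertion; no follow-up arranged."

response = ollama.chat(
    model="gemma3:12b",  # illustrative tag; any locally pulled model works the same way
    messages=[{
        "role": "user",
        "content": "Classify this clinical note as ROUTINE, REVIEW or URGENT. "
                   "Answer with the label only.\n\n" + note,
    }],
)
print(response["message"]["content"])  # e.g. "URGENT"
```

The same call shape works whether the model sits on a laptop or a departmental GPU server, which is what makes the local option attractive for sensitive data.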
Fine-Tuning and In-House Models
For smaller or sensitive datasets, fine-tuning compact models like Hugging Face’s SmolLM (135M–1.7B) or Google’s Gemma 270M can yield fast, lightweight classifiers that stay entirely in-house. Tools like Axolotl and Modal make fine-tuning and deployment easier than ever, with no massive infrastructure required.
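For illustration, here is a minimal fine-tuning sketch using the Hugging Face Trainer rather than an Axolotl recipe; the checkpoint, label set, and toy dataset below are assumptions, not a production setup:

```python
# pip install transformers datasets -- minimal fine-tuning sketch, not a production recipe
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "HuggingFaceTB/SmolLM-135M"  # illustrative compact checkpoint
labels = ["routine", "review", "urgent"]

# Toy labelled examples; in practice these would be audited clinical notes.
train = Dataset.from_dict({
    "text": ["BP stable, continue current medication.",
             "Chest pain at rest, ambulance called."],
    "label": [0, 2],
})

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # compact causal LMs often lack a pad token

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=len(labels))
model.config.pad_token_id = tokenizer.pad_token_id

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="note-classifier", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train.map(tokenize, batched=True),
)
trainer.train()
```

A model this size trains in minutes on a single GPU and can be served on ordinary CPU hardware, which is often the deciding factor for in-house deployment.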
LLM-Powered ETL and Multimodal Processing
Frameworks such as DocETL extend the traditional ETL concept to unstructured data, letting teams build and orchestrate LLM-based pipelines at production scale. Its companion UI, DocWrangler, helps analysts prototype transformations interactively.
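The core idea is an LLM-driven "map" step over a collection of documents. The sketch below is plain Python using the OpenAI client, not DocETL's own pipeline syntax; the model name and field schema are illustrative, and it assumes an `OPENAI_API_KEY` in the environment:

```python
# pip install openai -- a plain-Python "map" step, not DocETL's pipeline syntax
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_fields(note: str) -> dict:
    """Pull a small fixed schema out of one free-text note as JSON."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": 'Return JSON with keys "symptoms" (list of strings) and '
                       '"follow_up_arranged" (boolean) for this note:\n\n' + note,
        }],
    )
    return json.loads(response.choices[0].message.content)

notes = ["Chest pain on exertion. BP 182/121. No follow-up arranged."]
structured = [extract_fields(n) for n in notes]  # rows ready to load into a table
```

Frameworks like DocETL add the orchestration around steps like this: batching, retries, validation, and cost tracking at production scale.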
And because many documents blend text with images and spatial layout (think PDFs with charts, forms, or handwritten annotations), multimodal models are critical. They can reason over text and visuals, capturing relationships between paragraphs and figures that would break conventional parsers.
ModernVBERT (250M) is a standout here: a compact vision-language model optimised for document retrieval that often outperforms much larger architectures.
Retrieval-Augmented Generation: Context Matters
Retrieval-augmented generation (RAG) combines search and generation, retrieving relevant context from a knowledge base before answering a query. Instead of relying solely on keywords, modern RAG pipelines use semantic search to find meaning in context.
They draw on a mix of:
- Dense embeddings (numerical representations of meaning),
- Cross-encoder reranking (early interaction for precision),
- Late-interaction encoders (scalable retrieval), and
- Hybrid search (combining lexical BM25 with semantic proximity).
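As a rough sketch of the hybrid idea (assuming the `rank_bm25` and `sentence-transformers` packages, a toy guideline corpus, and an equal weighting chosen purely for illustration; real systems normalise and tune these scores, and typically add reranking):

```python
# pip install rank_bm25 sentence-transformers -- toy hybrid (lexical + dense) retrieval
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

guidelines = [
    "Arrange follow-up within 7 days for blood pressure above 180/120.",
    "Refer immediately if chest pain is accompanied by shortness of breath.",
    "Document smoking status at every consultation.",
]
query = "very high blood pressure recorded, what follow-up is needed?"

# Lexical scores: BM25 over whitespace tokens.
bm25 = BM25Okapi([g.lower().split() for g in guidelines])
lexical = bm25.get_scores(query.lower().split())

# Semantic scores: cosine similarity between dense embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
semantic = util.cos_sim(encoder.encode(query), encoder.encode(guidelines))[0]

# Blend the two signals (equal weights here for illustration) and rank the guidelines.
ranked = sorted(range(len(guidelines)),
                key=lambda i: 0.5 * lexical[i] + 0.5 * float(semantic[i]),
                reverse=True)
print(guidelines[ranked[0]])
```
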
For healthcare organisations, this means LLMs that understand the intent behind a question, not just the words, and can find the right guidance or precedent in seconds.
Use Case: Clinical Quality Assurance and Emergency Flagging
Imagine a national healthcare system capturing millions of free-text clinical notes daily in its EHR. Each note records symptoms, assessments, and plans, but the notes are unstructured and have historically been reviewed only through slow, manual audits.
An LLM-powered RAG pipeline can automate and scale this entire process:
- Ingestion and Pre-Processing: New clinical notes are extracted daily into a data lake. Using LLMs and VLMs, each note is cleaned, chunked, and embedded into a vector store for semantic retrieval.
- Guideline Comparison: The pipeline retrieves relevant clinical guidelines (e.g. WHO or national protocols) ranked by semantic similarity. The LLM evaluates each note against these standards for completeness, adherence, and safety (e.g. “no follow-up arranged after high blood pressure”).
- Emergency Flagging: Language patterns like “chest pain,” “shortness of breath,” or “altered consciousness” trigger alerts routed to clinical teams in real time (a simple keyword-screen sketch follows this list).
- Scoring and Feedback: Each note receives a structured scorecard: adherence percentages can be computed, missing items flagged, and problem phrases counted and systematically reviewed.
- Analytics and Governance: Results feed into a live data dashboard, surfacing quality metrics by clinic, provider, and timeframe. Clinical leads can drill down into examples, see which rules were violated, and close the feedback loop.
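To make the flagging and scoring steps concrete, here is a deliberately naive keyword-screen sketch; in the pipeline described above the LLM itself judges adherence and confirms flags, and the patterns, required items, and example note below are purely illustrative:

```python
# Deliberately naive screening sketch; the LLM handles the real adherence judgement.
import re
from dataclasses import dataclass, field

EMERGENCY_PATTERNS = ["chest pain", "shortness of breath", "altered consciousness"]
REQUIRED_ITEMS = {"follow_up": r"follow[- ]?up", "blood_pressure": r"\bBP\b|blood pressure"}

@dataclass
class Scorecard:
    note_id: str
    emergency_flags: list[str] = field(default_factory=list)
    missing_items: list[str] = field(default_factory=list)
    adherence: float = 0.0

def review(note_id: str, text: str) -> Scorecard:
    card = Scorecard(note_id)
    card.emergency_flags = [p for p in EMERGENCY_PATTERNS if p in text.lower()]
    card.missing_items = [item for item, pattern in REQUIRED_ITEMS.items()
                          if not re.search(pattern, text, flags=re.IGNORECASE)]
    card.adherence = 1 - len(card.missing_items) / len(REQUIRED_ITEMS)
    return card

card = review("note-001", "Chest pain on exertion. BP 182/121. Review in clinic if worse.")
# card.emergency_flags == ["chest pain"]: in production this routes an alert to the on-call team,
# and the scorecards for all notes feed the dashboard described in the analytics step.
```
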
The outcome: a continuously learning healthcare system that monitors documentation quality in real time, detects risk early, and empowers clinical governance teams with actionable evidence, all powered by language models.
What This Means for Healthcare Organisations
Unstructured data is no longer dark data. With the right pipelines, hospitals, health networks, and research programmes can turn clinical text into structured intelligence, improving safety, compliance, and operational insight.