In every AI project there’s an unglamorous step that quietly decides the quality–price ratio of your results: getting documents into a clean, consistent format before they hit a model. Microsoft’s new Python utility MarkItDown aims squarely at that chokepoint. It converts dozens of file types—from PDFs and Office documents to HTML, CSV/JSON/XML, images (with OCR), audio (with transcription), ZIPs, EPubs and even YouTube URLs—into tidy Markdown, preserving headings, lists, tables, links, and metadata.
The premise is simple but consequential: Markdown is the “native language” of mainstream LLMs. It’s close to plain text, token-efficient, and expressive enough to retain meaningful structure. If your pipeline consistently feeds Markdown instead of half-parsed PDFs or brittle HTML, you spend fewer tokens on noise and more on content, and you make downstream chunking, search and RAG dramatically easier.
Below is a practical look at what MarkItDown does today, what changed in the latest release, and how to put it to work without tripping over dependencies.
Why Markdown, and why now?
Markdown gives you the best of both worlds:
- Minimal markup, maximum structure. Headings, lists, tables and links are represented with concise syntax that LLMs reliably “understand.”
- Token efficiency. Compared with verbose HTML, Markdown carries the same intent with fewer bytes—this directly reduces inference cost in token-metered environments.
- Predictable parsing. Sectioning by #/## headers, extracting links, or flattening tables is far less error-prone than scraping varied office or page formats.
Put bluntly: if you normalize to Markdown first, almost everything else in your AI pipeline (chunking, embeddings, prompts, retrieval) becomes easier and cheaper.
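To make the "predictable parsing" point concrete, here is a minimal sketch of sectioning a Markdown document by its # headings using only the standard library. The function name split_by_headings is hypothetical, and the sketch assumes no lines starting with # appear inside fenced code blocks; a production pipeline would likely use a real Markdown parser.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "## Section" (levels 1-6).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def split_by_headings(markdown: str) -> List[Tuple[int, str, str]]:
    """Return (level, title, body) tuples; text before the first heading gets level 0."""
    sections: List[Tuple[int, str, str]] = []
    level, title, body = 0, "", []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the section collected so far, unless it is an empty preamble.
            if title or any(part.strip() for part in body):
                sections.append((level, title, "\n".join(body).strip()))
            level, title, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    # Flush the final section.
    if title or any(part.strip() for part in body):
        sections.append((level, title, "\n".join(body).strip()))
    return sections


if __name__ == "__main__":
    sample = "intro\n\n# A\ntext a\n\n## B\ntext b\n"
    for lvl, name, text in split_by_headings(sample):
        print(lvl, repr(name), repr(text))
```

Each tuple carries the heading level, so downstream code can rebuild a breadcrumb ("A > B") to attach to a chunk as retrieval context.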
What MarkItDown converts (and what it keeps)
MarkItDown already supports a wide swath of formats, with an emphasis on retaining document intent and structure rather than pixel-perfect appearance:
- Office & PDF: PDF, Word (DOCX), Excel (XLS/XLSX), PowerPoint (PPTX).
- Web & data: HTML, CSV, JSON, XML.
- Multimedia: images (EXIF + OCR text), audio (EXIF + speech transcription for WAV/MP3).
- Containers & links: ZIP (iterates contents), EPUB, YouTube URLs (fetches transcript).
The output is usually readable to humans, but the design goal is machine consumption: Markdown that’s faithful to the source’s sections, lists, tables and links—ideal for indexing, embeddings and RAG.
Important breaking changes in the latest release
Between v0.0.1 and v0.1.0, MarkItDown introduced several backwards-incompatible tweaks you should know about:
- Optional dependency groups. Extras are now organized by feature (e.g., [pdf], [docx], [pptx], [audio-transcription], etc.). For the “kitchen sink” behavior, install:
pip install 'markitdown[all]'
- convert_stream() now requires binary streams. Feed it a file-like binary object (e.g., a file opened in rb mode or an io.BytesIO). It no longer accepts text streams like io.StringIO.
- DocumentConverter now reads from streams, not file paths. No temporary files are created. If you maintain a plugin or custom converter, you may need code changes. If you use the high-level MarkItDown class or the CLI, you likely don’t.
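If existing code still holds content in an io.StringIO, one way to adapt to the binary-stream requirement is to re-encode it before calling convert_stream(). The sketch below is an illustration, not the project's recommended migration path: the helper names as_binary_stream and convert_html_text are hypothetical, and passing a StreamInfo extension hint is an assumption about how best to identify the content type.

```python
import io


def as_binary_stream(text_stream: io.StringIO, encoding: str = "utf-8") -> io.BytesIO:
    """Re-encode an in-memory text stream into a binary stream."""
    return io.BytesIO(text_stream.getvalue().encode(encoding))


def convert_html_text(html: str) -> str:
    """Convert an HTML string to Markdown through convert_stream()."""
    # Imported lazily so the helper above stays usable without markitdown installed.
    from markitdown import MarkItDown, StreamInfo

    md = MarkItDown(enable_plugins=False)
    stream = as_binary_stream(io.StringIO(html))
    # Assumption: an extension hint helps the library pick the HTML converter.
    result = md.convert_stream(stream, stream_info=StreamInfo(extension=".html"))
    return result.text_content
```

For files on disk, the same idea reduces to opening them with mode "rb" instead of "r".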
Installation: all-in, or à la carte
MarkItDown requires Python 3.10+. A virtual environment is recommended.
- All extras (easiest):
pip install 'markitdown[all]'
- Selective extras (smaller footprint):
pip install 'markitdown[pdf, docx, pptx]'
Available extras include: [pptx], [docx], [xlsx], [xls], [pdf], [outlook], [az-doc-intel], [audio-transcription], [youtube-transcription], and [all].
You can also install from source:
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
How to use it (CLI, Python, Docker)
Command line
- Convert a file to stdout (redirected to a file here):
markitdown path/to/file.pdf > document.md
- Or specify output:
markitdown path/to/file.pdf -o document.md
- Or pipe data:
cat path/to/file.pdf | markitdown
Python API
- Basic conversion:
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False)  # True to enable 3rd-party plugins
result = md.convert("test.xlsx")
print(result.text_content)
- With Azure Document Intelligence for PDFs:
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
- LLM-assisted image descriptions (currently for PPTX and images):
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="optional custom prompt")
result = md.convert("example.jpg")
print(result.text_content)
Docker
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
Plugins and MCP: built for pipelines, not demos
MarkItDown supports third-party plugins (disabled by default). List them with:
markitdown --list-plugins
Enable with:
markitdown --use-plugins path/to/file.pdf
Look for #markitdown-plugin on GitHub, and see the markitdown-sample-plugin package to build your own.
The project also ships an MCP (Model Context Protocol) server for easy integration with LLM apps like Claude Desktop. In practice, that makes it simpler to slot Markdown conversion into a broader, model-aware workflow.
Azure Document Intelligence, OCR and LLM descriptions
- PDFs that fight back (scans, forms, invoices) often need stronger extraction. MarkItDown can call Azure Document Intelligence via CLI (-d with -e <endpoint>) or by passing docintel_endpoint in Python.
- Images and slides can be described with an LLM to provide useful textual context for retrieval. This is optional (and cost-sensitive), but powerful for image-heavy PPTX decks and photo archives.
What MarkItDown is not (and why that’s fine)
- It does not aim for high-fidelity conversions meant for human readers. If you need pixel-accurate layout for publishing, use a dedicated converter.
- OCR/transcription quality depends on the underlying models and extras you install; test with representative samples.
- For large volumes, you should batch, parallelize and monitor memory/CPU—especially if OCR or external services are involved.
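For the batching advice above, a minimal sketch of a folder-to-Markdown job might look like the following. The names convert_tree and markitdown_convert are hypothetical. The converter is passed in as a plain callable so the loop can be exercised without MarkItDown installed. Threads are an assumption that suits I/O-bound work such as calls to external services; CPU-heavy OCR may be better served by separate processes. The sketch also assumes the output directory is not inside the source directory.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Optional


def markitdown_convert(path: Path) -> str:
    """Default converter: one MarkItDown instance per call (simple, not optimized)."""
    from markitdown import MarkItDown  # lazy import; requires the package to be installed

    return MarkItDown(enable_plugins=False).convert(str(path)).text_content


def convert_tree(
    src: Path,
    dst: Path,
    convert: Callable[[Path], str] = markitdown_convert,
    max_workers: int = 4,
) -> Dict[Path, Optional[str]]:
    """Convert every file under src into dst/<relative path>.md.

    Returns a mapping of source path to an error message, or None on success.
    """
    files = [p for p in sorted(src.rglob("*")) if p.is_file()]

    def convert_one(path: Path) -> Optional[str]:
        try:
            text = convert(path)
        except Exception as exc:  # keep the batch going and report the failure
            return f"{type(exc).__name__}: {exc}"
        target = dst / path.relative_to(src)
        target = target.with_name(target.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(files, pool.map(convert_one, files)))
```

The returned report makes it easy to log or retry only the files that failed, rather than aborting the whole run on the first unreadable document.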
Where it shines: from chaotic folders to LLM-ready corpora
- Bulk ingestion: PDFs + Office + web + ZIPs → a unified Markdown corpus.
- Multimedia enrichment: OCR and transcription turn pictures and audio into searchable text.
- Token-wise preprocessing: You cut HTML noise, reduce tokens, and improve cost/performance.
- Embedding & RAG: Clean headings, links, and tables simplify chunking and boost retrieval quality.
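As a rough illustration of the token-wise point, the sketch below compares the size of one small table written as HTML and as Markdown. The sample table and the size_ratio helper are made up for this example, and byte counts are only a crude proxy for tokens; measure with your model's actual tokenizer before drawing cost conclusions.

```python
# The same two-row table, once as HTML and once as Markdown (illustrative data).
html = (
    "<table><thead><tr><th>Name</th><th>Qty</th></tr></thead>"
    "<tbody><tr><td>Apples</td><td>3</td></tr>"
    "<tr><td>Pears</td><td>5</td></tr></tbody></table>"
)
markdown = (
    "| Name | Qty |\n"
    "| --- | --- |\n"
    "| Apples | 3 |\n"
    "| Pears | 5 |\n"
)


def size_ratio(before: str, after: str) -> float:
    """Return the UTF-8 byte length of `after` divided by that of `before`."""
    return len(after.encode("utf-8")) / len(before.encode("utf-8"))


ratio = size_ratio(html, markdown)
print(f"Markdown is {ratio:.0%} of the HTML size")
```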
If you’ve ever stitched together Textract-style outputs with regexes and prayer, you’ll appreciate the “ready-to-chunk” Markdown that lands here.
Getting started without tripping
- Create a virtual environment (venv, uv, or Conda) with Python 3.10+.
- Start with [all] for prototyping; in production, install only the extras you need.
- Test the CLI on both “digital” PDFs and scanned PDFs (OCR path).
- Decide if Azure Document Intelligence is worth it for your PDF backlog.
- If helpful, wire LLM image descriptions for PPTX/JPG ingestion—measure cost/latency.
- Define a chunking strategy (by headings, token size) before embedding.
- Automate in your ETL or CI/CD (Docker image helps standardize builds).
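The chunking step in the checklist can be sketched as a greedy packer that merges consecutive sections until a size budget is reached. Both function names are hypothetical, and the four-characters-per-token estimate is an assumption that should be replaced with a real tokenizer for the model you target. A single section larger than the budget is kept whole rather than split.

```python
from typing import Iterable, List


def approx_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer for accuracy."""
    return max(1, len(text) // 4)


def pack_sections(sections: Iterable[str], max_tokens: int = 500) -> List[str]:
    """Greedily merge consecutive sections into chunks of at most max_tokens (estimated)."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for section in sections:
        cost = approx_tokens(section)
        # Start a new chunk when adding this section would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(section)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Feeding it sections that were already split on headings keeps each chunk aligned with the document's own structure instead of cutting mid-paragraph.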
Contributing and quality
The repo welcomes contributions (issues, PRs, plugins) and uses Microsoft’s Open Source Code of Conduct plus a Contributor License Agreement (handled by a bot at PR time). Tests run via hatch and pre-commit hooks to keep quality steady.
Bottom line
MarkItDown fills a very 2025-shaped gap: taking the messy reality of PDFs, Office files, HTML, images and audio and normalizing it into Markdown—the structure LLMs parse best and cheapest. With plugins, an MCP server, optional Azure Document Intelligence, and LLM-assisted descriptions, it’s not about flashy fidelity; it’s about consistent structure, token efficiency, and pipeline fit.
If you build RAG, semantic search, summarizers or classifiers, this utility can be the first link that makes the rest of your chain stronger.
FAQs
How do I convert PDF to Markdown while keeping links and tables?
Use the CLI: markitdown file.pdf -o out.md. For scans or complex layouts, enable OCR or call Azure Document Intelligence (-d -e <endpoint>) to improve extraction fidelity.
Can I extract text from images and audio into Markdown (OCR + transcription)?
Yes. MarkItDown reads EXIF, performs OCR for images, and transcribes WAV/MP3 (install the relevant extras). The output folds into one Markdown document with the extracted text and metadata.
Should I install markitdown[all] or pick extras?
For prototyping, markitdown[all] is fastest. In production, install just what you need—e.g., pip install 'markitdown[pdf, docx, pptx]'—to shrink dependency footprint and attack surface.
How do I slot MarkItDown into a RAG or embeddings pipeline?
Convert to Markdown, segment by headings and token size, preserve links/tables where relevant, and send chunks to your vector DB. For scanned PDFs, enable OCR; for slides/images, consider LLM descriptions to get retrieval-friendly text.

