How to Extract Text From a PDF (Three Methods)
Scanned PDFs, protected PDFs, and native PDFs all require different approaches. Here's how to pull text out of each type.
Not all PDFs are the same. A PDF created by exporting from Word contains actual text objects you can select and copy. A PDF created by scanning a paper document contains images of text. The methods for extracting text from each are different.
Method 1: Native PDF Text (The Easy Case)
Open in any PDF viewer, select all text (Ctrl+A or Cmd+A), copy and paste. If this works, you're done. Most PDFs created from digital sources (Word exports, InDesign exports, HTML-to-PDF) are native and allow text selection. If the text selects but looks garbled when pasted, the PDF may have custom font encoding — in that case, try a specialized extractor.
Method 2: Scanned PDFs — OCR Required
If clicking in the PDF selects page regions rather than individual words, it's image-based. Google Docs is the quickest free solution: upload the file to Google Drive, then right-click > Open with > Google Docs. It runs OCR and creates an editable document with the recognized text. Accuracy depends on the quality of the scan.
Method 3: Batch Extraction for Developers
PyMuPDF (Python) is fast and handles most PDFs: after `import fitz`, a one-liner such as `text = "".join(page.get_text() for page in fitz.open("file.pdf"))` collects the text of every page. For scanned PDFs, combine it with Tesseract: pdf2image converts pages to images, then pytesseract runs OCR on each image. For larger-scale extraction, Adobe's PDF Extract API handles both native and scanned PDFs with high accuracy.
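Here is a rough sketch of how the PyMuPDF and Tesseract pieces might fit together in one script. It assumes PyMuPDF, pdf2image, and pytesseract are installed, along with the Tesseract and Poppler binaries they rely on. The `needs_ocr` helper, its 20-character threshold, and the `extract_pdf_text` name are illustrative choices, not part of any library.

```python
def needs_ocr(page_text, min_chars=20):
    """Heuristic (an assumption): a page with almost no text layer is probably a scan."""
    return len(page_text.strip()) < min_chars


def extract_pdf_text(path, dpi=300):
    """Return the text of every page, falling back to OCR for image-only pages."""
    # Third-party imports live here so needs_ocr() can be used without them.
    import fitz  # PyMuPDF
    import pytesseract
    from pdf2image import convert_from_path

    pages = []
    with fitz.open(path) as doc:
        for index, page in enumerate(doc):
            text = page.get_text()
            if needs_ocr(text):
                # Render just this page (pdf2image page numbers are 1-based) and OCR it.
                image = convert_from_path(
                    path, dpi=dpi, first_page=index + 1, last_page=index + 1
                )[0]
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return "\n".join(pages)
```

Calling `extract_pdf_text("file.pdf")` would return one string for the whole document. Checking each page separately means a mostly native PDF with a few scanned pages only pays the OCR cost on those pages.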
When Text Looks Wrong After Extraction
Column-based documents (newspapers, academic papers) often extract in the wrong reading order: the text runs straight across the page, mixing columns instead of reading each column top to bottom. Two-column PDFs are a known challenge, and layout-aware tools do better here. With pdfplumber, for example, you can crop each column with `page.crop()` and extract it separately, while `page.extract_text(x_tolerance=5)` adjusts how wide a horizontal gap must be before it counts as a space. Tables similarly require specialized extraction to maintain structure.
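One way to handle a multi-column page is to split it into vertical strips and extract each strip in turn. The sketch below assumes pdfplumber is installed and that the columns are equal in width, which real layouts often are not. `column_boxes` and `extract_by_column` are hypothetical helper names.

```python
def column_boxes(bbox, n_columns=2):
    """Split a page bounding box (x0, top, x1, bottom) into equal vertical strips."""
    x0, top, x1, bottom = bbox
    width = (x1 - x0) / n_columns
    # Reuse x1 for the final edge so rounding never pushes a strip past the page.
    edges = [x0 + i * width for i in range(n_columns)] + [x1]
    return [(edges[i], top, edges[i + 1], bottom) for i in range(n_columns)]


def extract_by_column(path, n_columns=2):
    """Read each page column by column, left to right."""
    import pdfplumber  # third-party; imported here so column_boxes() stands alone

    parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for box in column_boxes(page.bbox, n_columns):
                # extract_text() may return None on an empty region in some versions.
                parts.append(page.crop(box).extract_text() or "")
    return "\n".join(parts)
```

If the gutter between columns is off-center, pass hand-measured boxes to `page.crop()` instead of relying on the equal split.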
Scanned document tip
Scan quality matters enormously for OCR. Aim for at least 300 DPI, a clean original, no rotation, and good contrast. A 600 DPI scan of a clean document can OCR at 99%+ accuracy. A 150 DPI photo taken at an angle might reach only around 80%, which means significant manual correction.
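To check whether an existing scan meets the 300 DPI guideline, divide its pixel width by the physical width of the page. The sketch below assumes a US Letter page (8.5 inches wide) by default. Both function names are illustrative.

```python
def effective_dpi(pixel_width, page_width_inches=8.5):
    """Dots per inch implied by an image's pixel width and the paper's physical width."""
    return pixel_width / page_width_inches


def meets_ocr_minimum(pixel_width, page_width_inches=8.5, minimum_dpi=300):
    """True if the scan is at or above the minimum resolution suggested for OCR."""
    return effective_dpi(pixel_width, page_width_inches) >= minimum_dpi


print(effective_dpi(1275))      # 150.0, below the guideline
print(meets_ocr_minimum(2550))  # True: 2550 px across 8.5 in is exactly 300 DPI
```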
Frequently Asked Questions
Why can't I copy text from a PDF?
What is OCR and how accurate is it?
What's the best free OCR tool?
Can I extract text from a protected PDF?
FreeToolKit Team
We build free browser-based tools and write practical guides that skip the fluff.