Extract embedded text from uploaded PDFs and store it as post meta so the document’s contents become searchable in WordPress.
How it works
PDF text extraction runs when a PDF is uploaded to the media library. ClassifAI hands the file to Azure AI Vision’s Read API, which returns the embedded text page by page; for PDFs generated as scans of printed pages, the API also runs OCR. The extracted text is stored as post meta on the attachment, where meta queries, REST requests, and search plugins such as ElasticPress that index post meta can reach it. Extraction runs as a background job because the Read API is asynchronous: the PDF appears in the library immediately, and the text fills in once the API has finished.
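The submit-then-poll flow described above can be sketched roughly as follows. This is a minimal illustration, not ClassifAI's implementation (the plugin itself is PHP) and not the Azure SDK: `FakeReadApi`, `extract_pdf_text`, the `pdf_text` meta key, and the status strings are all hypothetical names chosen for the example, and a plain dictionary stands in for the attachment's post meta.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FakeReadApi:
    """Hypothetical stand-in for an asynchronous read service (not the Azure SDK)."""

    polls_until_done: int = 2
    _polls: dict = field(default_factory=dict)

    def submit(self, pdf_bytes: bytes) -> str:
        # An asynchronous API returns an operation handle right away, not the text.
        operation_id = f"op-{len(self._polls) + 1}"
        self._polls[operation_id] = 0
        return operation_id

    def get_result(self, operation_id: str) -> dict:
        # Report "running" until the configured number of polls has been reached.
        self._polls[operation_id] += 1
        if self._polls[operation_id] < self.polls_until_done:
            return {"status": "running"}
        return {"status": "succeeded", "pages": ["Page one text", "Page two text"]}


def extract_pdf_text(api, pdf_bytes, attachment_meta, meta_key="pdf_text",
                     max_polls=10, delay_seconds=0.0):
    """Submit a PDF, poll for the result, and store the text under meta_key.

    attachment_meta is a plain dict standing in for the attachment's post meta.
    Returns True if text was stored, False on failure or timeout.
    """
    operation_id = api.submit(pdf_bytes)
    for _ in range(max_polls):
        result = api.get_result(operation_id)
        if result["status"] == "succeeded":
            # Join the per-page text into one value for the meta field.
            attachment_meta[meta_key] = "\n".join(result["pages"])
            return True
        if result["status"] == "failed":
            return False
        time.sleep(delay_seconds)
    return False  # Gave up; the meta field stays empty for now.
```

In this sketch the upload itself never waits on the loop: the attachment exists as soon as it is saved, and the meta value only appears once polling reports success, which mirrors why the text "fills in" later.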
Configuration
- Meta key used to store the extracted text on the attachment.
- Allowed roles and an allowed-users list for granular access control.
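As a rough illustration of how these settings might fit together, the sketch below models them as a small settings object with an access check. Everything here is an assumption made for the example: the `FeatureSettings` and `user_can_use_feature` names, the `pdf_text` default key, and in particular the rule that a user qualifies by being on the allowed-users list or by holding any allowed role. The source does not say how the two lists combine.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSettings:
    """Hypothetical container for the feature's settings."""

    meta_key: str = "pdf_text"  # assumed default, not taken from the plugin
    allowed_roles: set = field(default_factory=set)
    allowed_user_ids: set = field(default_factory=set)


def user_can_use_feature(settings, user_id, user_roles):
    """Return True if the user is explicitly allowed or holds an allowed role.

    Assumption: the two lists are combined with OR.
    """
    if user_id in settings.allowed_user_ids:
        return True
    return bool(settings.allowed_roles & set(user_roles))


settings = FeatureSettings(allowed_roles={"editor"}, allowed_user_ids={42})
print(user_can_use_feature(settings, 7, ["editor"]))       # role match
print(user_can_use_feature(settings, 42, ["subscriber"]))  # explicit user match
print(user_can_use_feature(settings, 7, ["subscriber"]))   # neither list matches
```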
Providers
PDF text extraction is the only ClassifAI feature dedicated to documents (rather than images, audio, or post content). It currently has a single supported provider:
- Microsoft Azure AI Vision: the only supported backend exposing a production-grade Read API for PDFs.
Use cases
- Whitepaper and report libraries where the document is the publication.
- Public-records sites surfacing meeting minutes, budgets, and filings.
- Academic and research archives where searchable full text is the difference between findable and lost.
