Extract data from PDF and all Microsoft Office files in Python

1 min read · Oct 10, 2017

We will see how to extract text from PDF and all Microsoft Office files.

Extracting text from PDF:

A quick way to extract text from PDFs in Python is the “slate” library. Slate is a Python package, built on top of PDFMiner, that simplifies the process of extracting text from PDF files.

Installation:

$ pip install slate
$ pip install pdfminer

Usage:

import slate

# Open the file in binary mode; slate.PDF expects a file-like object.
with open('sample.pdf', 'rb') as f:
    pdf_text = slate.PDF(f)

print(pdf_text)

Output: ['Sample text...', '......', '......']

* The PDF class of slate takes a file-like object and extracts all the text from the PDF file. It returns the output as a list of strings (one for each page).

* NOTE: If the PDF file is password-protected, pass the password as the second parameter.

Example:

import slate

# The second argument is the password of the protected PDF.
with open('test_doc.pdf', 'rb') as f:
    pdf_text = slate.PDF(f, "pass the PDF file password here")

print(pdf_text)

Output: ['Sample text...', '......', '......']
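
Since slate returns one string per page, you will often want a single cleaned-up string instead. The snippet below is a minimal sketch of that step, not part of slate itself: the helper name pages_to_text and the sample_pages list are assumptions made for illustration, with sample_pages standing in for the result of slate.PDF(f) so the example runs without a PDF file.

```python
import re


def pages_to_text(pages):
    """Join per-page strings into one string (hypothetical helper, not a slate API)."""
    cleaned = []
    for page in pages:
        # Collapse runs of whitespace (spaces, newlines, form feeds) into one space.
        text = re.sub(r'\s+', ' ', page).strip()
        # Skip pages that contain no text at all.
        if text:
            cleaned.append(text)
    # Separate pages with a blank line.
    return '\n\n'.join(cleaned)


# Stand-in for the list of page strings returned by slate.PDF(f).
sample_pages = ['Sample  text\non page one.\x0c', '   ', 'Page two.']
full_text = pages_to_text(sample_pages)
print(full_text)
```

With real data, you would pass the pdf_text object from the examples above in place of sample_pages.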

The article was originally published at MicroPyramid blog.

Written by MicroPyramid
