Extract data from PDF and all Microsoft Office files in Python

MicroPyramid
1 min read · Oct 10, 2017

We will see how to extract text from PDF and all Microsoft Office files.

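Extracting text from PDF:

A quick way to extract text from PDFs in Python is the library “slate”. Slate is a Python package that simplifies the process of extracting text from PDF files. It depends on the PDFMiner package and reads the text already embedded in the PDF; it does not perform OCR on scanned images.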

Installation:

$ pip install slate
$ pip install pdfminer

Usage:

import slate

# Open the file in binary mode; slate.PDF expects a file-like object
with open('sample.pdf', 'rb') as f:
    pdf_text = slate.PDF(f)
print(pdf_text)
# Output: ['Sample text...', '......', '......']

* The PDF class of slate takes a file-like object and extracts all the text from the PDF file. It returns the output as a list of strings (one for each page).

* NOTE: If the PDF file is password-protected, pass the password as the second parameter.

Example:

import slate

# The second argument is the password of the protected PDF
with open('test_doc.pdf', 'rb') as f:
    pdf_text = slate.PDF(f, "pass the PDF file password here")
print(pdf_text)
# Output: ['Sample text...', '......', '......']
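
Since the result is a list with one string per page, it can be processed with ordinary list operations. The following is a minimal sketch of that idea, not part of slate itself: the helper names find_pages_containing and join_pages are hypothetical, and they assume only that they receive a list of page strings like the one slate.PDF returns.

```python
def find_pages_containing(pages, keyword):
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [number for number, text in enumerate(pages, start=1)
            if needle in text.lower()]


def join_pages(pages, separator="\n\n"):
    """Combine the per-page strings into one document string."""
    return separator.join(text.strip() for text in pages)


# Stand-in for the list returned by slate.PDF(f); illustrative data only
pages = ["Invoice summary ", "Line items", "Invoice total"]

matching_pages = find_pages_containing(pages, "invoice")
full_text = join_pages(pages)
```

Here matching_pages holds the page numbers that mention the keyword, and full_text is the whole document as a single string, which is convenient for searching or saving to a text file.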

The article was originally published on the MicroPyramid blog.

