Extract data from PDF and all Microsoft Office files in Python

We will see how to extract text from PDF and all Microsoft Office files.

Extracting text from PDF:

A quick way to extract text from PDFs in Python is the “slate” library. Slate is a Python package built on top of pdfminer that simplifies the process of extracting text.


$ pip install slate
$ pip install pdfminer


import slate

# Open the PDF in binary mode; slate.PDF reads from a file-like object.
with open('sample.pdf', 'rb') as f:
    pdf_text = slate.PDF(f)

print(pdf_text)

Output: ['Sample text...', '......', '......']

* The PDF class of slate takes a file-like object and extracts all the text from the PDF file. It provides the output as a list of strings (one for each page).

* NOTE: If the PDF file is password-protected, pass the password as the second parameter.


import slate

# The second argument is the password of the protected PDF.
with open('test_doc.pdf', 'rb') as f:
    pdf_text = slate.PDF(f, "pass the PDF file password here")

print(pdf_text)

Output: ['Sample text...', '......', '......']
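
Since slate returns one string per page, you will often want to clean those strings up and join them into a single block of text. The snippet below is a minimal sketch of that idea, not part of slate itself. The helper names normalize_page, join_pages and extract_text are made up for illustration, and the pdf_class argument is a hypothetical hook that lets you try the helper without slate installed. When pdf_class is omitted, it falls back to slate.PDF, called as in the examples above.

```python
import re


def normalize_page(page):
    """Collapse runs of whitespace in one page of text and trim the ends."""
    return re.sub(r'\s+', ' ', page).strip()


def join_pages(pages, separator='\n\n'):
    """Join per-page strings into one string, skipping pages with no text."""
    cleaned = [normalize_page(p) for p in pages]
    return separator.join(p for p in cleaned if p)


def extract_text(pdf_file, password=None, pdf_class=None):
    """Return the text of an open PDF file as a single string.

    pdf_file is a file-like object opened in binary mode.
    pdf_class is a hypothetical hook (an assumption of this sketch) for
    swapping in a stand-in; by default slate.PDF is used.
    """
    if pdf_class is None:
        import slate  # imported lazily so the helpers above work without it
        pdf_class = slate.PDF
    if password is None:
        pages = pdf_class(pdf_file)
    else:
        # The password goes in as the second parameter, as noted above.
        pages = pdf_class(pdf_file, password)
    return join_pages(pages)
```

To use it, open the file as before, for example with open('sample.pdf', 'rb') as f, and call extract_text(f), or extract_text(f, 'your password') for a protected file. Pages that contain only whitespace are dropped, and the remaining pages are separated by a blank line.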
