PDF Barcode Reader

The script is long, so I will split it into parts and show the full script in one piece at the end. It is all about PDF files that are scanned and transferred to a server. On the server, a cronjob starts the script under Python 2. I first wrote a Bash script that did the job, but after a while I made a proof of concept in Python. The Python script is much faster than the Bash one. So let's start with the modules ....

The Barcode

LN=195177&KN=201178&BN=774&DL=06052019

  • LN -> [L]ieferschein [N]ummer = DeliveryNumber
  • KN -> [K]unden [N]ummer = CustomerNumber
  • BN -> [B]austellen [N]ummer = ProjectNumber
  • DL ->[D]atum [L]ieferung = DeliveryDate
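
The barcode content has the same shape as a URL query string, so the standard library can split it. The following is only a minimal sketch, not part of the script: parse_barcode is a name I made up for illustration, and it assumes that DL is always written as DDMMYYYY, as in the sample above.

```python
try:
    from urllib.parse import parse_qsl  # Python 3
except ImportError:
    from urlparse import parse_qsl      # Python 2

from datetime import datetime


def parse_barcode(payload):
    """Split 'LN=..&KN=..&BN=..&DL=..' into a dict and parse DL as a date."""
    # strict_parsing=True raises ValueError on tokens without '='
    fields = dict(parse_qsl(payload, strict_parsing=True))
    # Assumption: DL is DDMMYYYY; strptime raises ValueError otherwise
    fields["DL"] = datetime.strptime(fields["DL"], "%d%m%Y").date()
    return fields


info = parse_barcode("LN=195177&KN=201178&BN=774&DL=06052019")
print("%s delivered on %s" % (info["LN"], info["DL"].isoformat()))
```

Parsing the date early means a misread barcode fails loudly here instead of producing an odd folder name later.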

The import[ance]

It goes without saying that we have to tell Python which modules the script uses. The goal is to read a PDF file (with any name) from a folder, convert it to a JPEG with Ghostscript, read the barcode inside the resulting image, and rename the file. Renaming is not all we are going to do: we also create folders to structure the storage.

  • ghostscript
  • pyzbar
  • PIL
  • pyPdf
  • [optional] daemonize
import sys
import glob
import os
import locale
import ghostscript
import random
import logging

from pyzbar.pyzbar import decode
from PIL import Image
from pyPdf import PdfFileReader
from time import sleep
from daemonize import Daemonize

The Object

Every script needs an argument to know what it should work on. In our case it is the path where the PDF files are placed, however many there are. Ghostscript converts each PDF into a JPEG image and hands it over to ZBar. If no second argument is given, this intermediate image is written to /tmp/pdf2jpeg.jpg.

m0r4k@bash:~ pyscan.py ./source ./destination
                           ^           ^
                         argv[1]     argv[2]

The random number makes every filename unique, because the same document may be scanned twice (e.g. one signed original and one duplicate). Both scans carry the same barcode content and would otherwise get the same filename, so the random suffix keeps the second file from overwriting the first in practice.

class QrObj:
    def __init__(self, inputFile=''):
        self.LN = ''
        self.BN = ''
        self.KN = ''
        self.DL = ''
        self.nodata = 0
        self.newFile = ''
        self.random = random.randint(1, 9999999999)
        self.prefix = '.'
        self.inputFile = inputFile
        self.outputFile = ''
        self.pnum = ''

        try:
            sys.argv[1]
        except IndexError:
            print('no file is given')
            exit()
        else:
            if inputFile == '':
                if os.path.exists(sys.argv[1]):
                    self.inputFile = sys.argv[1]
                else:
                    print('File not Found')
                    exit()
            else:
                self.inputFile = inputFile

        try:
            sys.argv[2]
        except IndexError:
            self.outputFile = '/tmp/pdf2jpeg.jpg'
        else:
            self.outputFile = sys.argv[2]
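
The try/except checks on sys.argv can also be expressed with argparse from the standard library. This is just a sketch of an alternative, not what the script does: parse_args and the argument names source and output are my own choices, and only the default /tmp/pdf2jpeg.jpg is taken from the class above.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means 'use sys.argv[1:]'."""
    parser = argparse.ArgumentParser(
        description="Rename scanned PDFs by their barcode")
    # Hypothetical names for argv[1] and argv[2]
    parser.add_argument("source", help="PDF to process")
    parser.add_argument("output", nargs="?", default="/tmp/pdf2jpeg.jpg",
                        help="temporary JPEG written by Ghostscript")
    return parser.parse_args(argv)


args = parse_args(["scan.pdf"])
print("%s -> %s" % (args.source, args.output))
```

A side effect is that argparse prints a usage message and exits by itself when the first argument is missing.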

The Ghost

Ghostscript is a suite of software based on an interpreter for Adobe Systems' PostScript and Portable Document Format (PDF) page description languages. [source:wikipedia]

REMEMBER: this is the PDF tuning part. If ZBar does not recognize the barcode, you have to tune the input (scanner settings) or the output (Ghostscript arguments).

def ghost(objx):
    args = [
        "pdf2jpg",  # actual value doesn't matter
        "-q",
        "-sDEVICE=jpeg",
        "-dBATCH",
        "-dNOPAUSE",
        "-dFirstPage=1",
        "-dLastPage=" + str(objx.pnum),
        "-r200",  # every option must be its own list element
        "-sPAPERSIZE=a4",
        "-sOutputFile=" + str(objx.outputFile),
        str(objx.inputFile)
    ]

    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in args]
    pn = ghostscript.Ghostscript(*args)
    pn.exit()
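
For tuning, it helps to build the argument list in one place and change only the resolution. The helper below is a sketch under my own assumptions: build_gs_args and its dpi parameter are not part of the script, and it only builds the list with each option as its own element, it does not call Ghostscript.

```python
def build_gs_args(input_pdf, output_jpg, last_page, dpi=200):
    """Return the Ghostscript argument list for a PDF to JPEG conversion."""
    return [
        "pdf2jpg",                       # placeholder program name
        "-q",                            # quiet start-up
        "-sDEVICE=jpeg",                 # JPEG output device
        "-dBATCH",                       # exit after the last file
        "-dNOPAUSE",                     # no prompt between pages
        "-dFirstPage=1",
        "-dLastPage=%d" % last_page,
        "-r%d" % dpi,                    # resolution in dpi, the main tuning knob
        "-sPAPERSIZE=a4",
        "-sOutputFile=%s" % output_jpg,
        input_pdf,
    ]


# Print the argument lists for a few resolutions to compare them
for dpi in (200, 300, 400):
    print(" ".join(build_gs_args("scan.pdf", "/tmp/pdf2jpeg.jpg", 1, dpi)))
```

If ZBar fails at 200 dpi, trying a higher value here is usually cheaper than rescanning.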

ZBar reader

ZBar reads the barcode from an image, in our case a JPEG. The better the image file, the better the results! [source:ZBar]

  • Scan EAN/UPC codes and link to web sites that perform product searches.
  • Scan QR codes containing URLs and link directly to the web site.
  • Scan QR codes containing an E-mail address and send mail.
  • Scan a UPS or FedEx tracking number and link to the package tracking web page.
  • Maintain a simple list of scanned barcodes.
  • Send an E-mail containing a single barcode or your entire list of scanned barcodes.
  • Cut and paste barcode data into other apps

A fork from 2017 exists, but the original ZBar dates from 2009. [Never touch a running system]

def zbarDecode(objx):
    decoded = decode(Image.open(objx.outputFile))

    if not decoded:
        objx.nodata = 1
        return 0

    else:
        decode_list = str(decoded[0][0]).strip().replace("&", " ")
        decode_split = decode_list.split()

        for i in decode_split:
            split = i.split('=')
            setattr(objx, split[0], split[1])
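
Because setattr takes whatever key the barcode contains, a foreign or misread barcode could overwrite attributes such as prefix. A more defensive variant could accept only the four known keys. This is a sketch, not the script's behaviour: ALLOWED_KEYS, Fields and apply_fields are names I invented, and Fields is only a stand-in for QrObj. It also decodes bytes first, on the assumption that the decoder hands the payload over as bytes.

```python
ALLOWED_KEYS = ("LN", "KN", "BN", "DL")


class Fields(object):
    """Hypothetical stand-in for QrObj, holding only the barcode fields."""
    def __init__(self):
        for key in ALLOWED_KEYS:
            setattr(self, key, "")


def apply_fields(obj, payload):
    """Copy known key=value pairs onto obj and return the ignored tokens."""
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8")
    ignored = []
    for token in payload.strip().split("&"):
        key, sep, value = token.partition("=")
        if sep and key in ALLOWED_KEYS:
            setattr(obj, key, value)
        else:
            ignored.append(token)  # unknown key or token without '='
    return ignored


doc = Fields()
leftover = apply_fields(doc, b"LN=195177&KN=201178&BN=774&DL=06052019&prefix=/etc")
print("LN=%s, ignored=%s" % (doc.LN, leftover))
```

The returned list makes it easy to log what was thrown away.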

Files and Folders

The script is made for scan & save. You have a PDF with a barcode, and you want to put it into a folder and rename it so that every operating system can find it by its key values. You can extend the script to store the barcode information in a database or to build a web interface that searches by filename. This is the part where you can do whatever you want with the information from the barcode.

def renameFile(objx):
    LN = 'L' + objx.LN
    BN = 'B' + objx.BN
    KN = 'K' + objx.KN
    DL = 'D' + objx.DL
    RAND = objx.random

    errorDir = objx.prefix + '/error/'

    if objx.nodata == 0:
        newFileName = LN + "_" + KN + "_" + BN + "_" + DL + "_" + str(RAND) + ".pdf"
        objx.newFile = newFileName
        createFolder(objx)

    else:
        if not os.path.exists(errorDir):
            os.makedirs(errorDir)

        os.rename(objx.inputFile, errorDir + objx.inputFile)

    os.remove(objx.outputFile)


def createFolder(objx):
    day = objx.DL[:2]
    month = objx.DL[2:4]
    year = objx.DL[4:8]
    date = day + '.' + month + '.' + year
    directory = objx.prefix + '/' + year + '/' + month + '/' + day

    if not os.path.exists(directory):
        os.makedirs(directory)

    os.rename(objx.inputFile, directory + '/' + objx.newFile)
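
The filename and the year/month/day folder can also be computed in one step before anything is moved. The function below is a sketch with assumed names (target_path, unique). It validates DL with datetime.strptime and joins the parts with os.path.join instead of string concatenation, and it does not touch the file system.

```python
import os
from datetime import datetime


def target_path(prefix, fields, unique):
    """Return prefix/YYYY/MM/DD/L.._K.._B.._D.._unique.pdf for the fields."""
    # Assumption: DL is DDMMYYYY; a misread date raises ValueError here
    delivered = datetime.strptime(fields["DL"], "%d%m%Y")
    name = "L{LN}_K{KN}_B{BN}_D{DL}_{unique}.pdf".format(unique=unique, **fields)
    return os.path.join(prefix,
                        delivered.strftime("%Y"),
                        delivered.strftime("%m"),
                        delivered.strftime("%d"),
                        name)


fields = {"LN": "195177", "KN": "201178", "BN": "774", "DL": "06052019"}
print(target_path(".", fields, 4711))
```

A ValueError from a bad date is a natural signal to move the file into the error folder instead.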

Putting all together

It's time to put all the parts together. Just remember to install all requirements, for example with pip. For the daemonized version, uncomment the last two lines.

pyscan.py

#!/usr/bin/python2.7

import sys
import glob
import os
import locale
import ghostscript
import random
import logging

from pyzbar.pyzbar import decode
from PIL import Image
from pyPdf import PdfFileReader
from time import sleep
from daemonize import Daemonize

filepath = os.getcwd()

class QrObj:
    def __init__(self, inputFile=''):
        self.LN = ''
        self.BN = ''
        self.KN = ''
        self.DL = ''
        self.nodata = 0
        self.newFile = ''
        self.random = random.randint(1, 9999999999)
        self.prefix = '.'
        self.inputFile = inputFile
        self.outputFile = ''
        self.pnum = ''

        try:
            sys.argv[1]
        except IndexError:
            print('no file is given')
            exit()
        else:
            if inputFile == '':
                if os.path.exists(sys.argv[1]):
                    self.inputFile = sys.argv[1]
                else:
                    print('File not Found')
                    exit()
            else:
                self.inputFile = inputFile

        try:
            sys.argv[2]
        except IndexError:
            self.outputFile = '/tmp/pdf2jpeg.jpg'
        else:
            self.outputFile = sys.argv[2]


def pagenum(objx):
    # Open in binary mode and close the handle once the page count is read
    with open(objx.inputFile, 'rb') as pdf_file:
        objx.pnum = PdfFileReader(pdf_file).getNumPages()


def ghost(objx):
    args = [
        "pdf2jpg",  # actual value doesn't matter
        "-q",
        "-sDEVICE=jpeg",
        "-dBATCH",
        "-dNOPAUSE",
        "-dFirstPage=1",
        "-dLastPage=" + str(objx.pnum),
        "-r200",  # every option must be its own list element
        "-sPAPERSIZE=a4",
        "-sOutputFile=" + str(objx.outputFile),
        str(objx.inputFile)
    ]

    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in args]
    pn = ghostscript.Ghostscript(*args)
    pn.exit()


def zbarDecode(objx):
    decoded = decode(Image.open(objx.outputFile))

    if not decoded:
        objx.nodata = 1
        return 0

    else:
        decode_list = str(decoded[0][0]).strip().replace("&", " ")
        decode_split = decode_list.split()

        for i in decode_split:
            split = i.split('=')
            setattr(objx, split[0], split[1])


def renameFile(objx):
    LN = 'L' + objx.LN
    BN = 'B' + objx.BN
    KN = 'K' + objx.KN
    DL = 'D' + objx.DL
    RAND = objx.random

    errorDir = objx.prefix + '/error/'

    if objx.nodata == 0:
        newFileName = LN + "_" + KN + "_" + BN + "_" + DL + "_" + str(RAND) + ".pdf"
        objx.newFile = newFileName
        createFolder(objx)

    else:
        if not os.path.exists(errorDir):
            os.makedirs(errorDir)

        os.rename(objx.inputFile, errorDir + objx.inputFile)

    os.remove(objx.outputFile)


def createFolder(objx):
    day = objx.DL[:2]
    month = objx.DL[2:4]
    year = objx.DL[4:8]
    date = day + '.' + month + '.' + year
    directory = objx.prefix + '/' + year + '/' + month + '/' + day

    if not os.path.exists(directory):
        os.makedirs(directory)

    os.rename(objx.inputFile, directory + '/' + objx.newFile)


def initial(files=''):
    parx = QrObj(inputFile=files)
    pagenum(parx)
    ghost(parx)
    zbarDecode(parx)
    renameFile(parx)
    print('job is done for ' + parx.newFile)



pid = "/tmp/test.pid"
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.propagate = False
fh = logging.FileHandler("/tmp/test.log","w")
fh.setLevel(logging.DEBUG)
logger.addHandler(fh)
keep_fds = [fh.stream.fileno()]

logger.debug(os.getcwd())

def main():
    if sys.argv[1] == '-d':
        os.chdir(filepath)
        while True:
            files = glob.glob('*.pdf')
            if files:
                for file in files:
                    initial(file)

            files = []
            sleep(60)



#daemon = Daemonize(app="test_app",pid=pid, action=main, keep_fds=keep_fds)
#daemon.start()
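
One pass of the polling loop can be pulled into its own function so that a single broken PDF does not stop the daemon. This is a sketch under assumptions: process_once and its handler callback are hypothetical names, and the demo passes a plain list append instead of the real work so that it runs without Ghostscript or ZBar.

```python
import glob
import os
import tempfile


def process_once(directory, handler):
    """Call handler(path) for every *.pdf in directory; return handled paths."""
    handled = []
    for path in sorted(glob.glob(os.path.join(directory, "*.pdf"))):
        try:
            handler(path)
        except Exception as exc:  # keep going if one file fails
            print("failed on %s: %s" % (path, exc))
        else:
            handled.append(path)
    return handled


# Demo with a throw-away folder: only the PDF is picked up
demo_dir = tempfile.mkdtemp()
for name in ("a.pdf", "b.txt"):
    open(os.path.join(demo_dir, name), "w").close()

seen = []
handled = process_once(demo_dir, seen.append)
print("handled %d file(s)" % len(handled))
```

In a daemon, such a function would be called inside the while loop, followed by the sleep. Note that the handler receives the path including the directory, not just the bare filename.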