Digitize scientific plots with Python

Frequently, scientific data are published as plots correlating one quantity against another. Hovewer, it is quite difficult for someone that wishes it, to get access to the original raw data from which the plot was created. The latter is even worse in the case of plots appearing in old books or printed papers. Instead of using a ruler to extract the data with limited accuracy, I will show you how you can use Python to digitize the curves in the plot and extract the data as arrays. You can find the complete code of the curve digitizer tool here.

Let's assume that you have a plot available as an image in .jpg or .png format. Firstly, we will read and display this image with Python. To do this, we will use tkinter and matplotlib.

from tkinter import Tk, filedialog
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# open the dialog box
# first hide the root window
root = Tk()

# open the dialog
filein = filedialog.askopenfilename(
    title = "Select image to digitize",
    filetypes = (
        ("jpeg files","*.jpg"),
        ("png files","*.png"))
# read and show the image    
img = mpimg.imread(filein)
_, ax = plt.subplots()
ax.axis('off')  # clear x-axis and y-axis

Here, I prefered to select the image with a dialog box instead of typing the complete path of the image file.

The loaded image has a "coordinate system" whose origin is the upper left corner as illustrated in Fig. 1. The position of every pixel in the image is given in terms of number of pixel columns to the right x number of pixel rows downwards. Of course, this system is different to the coordinate system of the plot in the image. We must correlate these two systems in order to be able to digitize the plot. To do this, we define two reference lengths in \(x\) and \(y\) directions, respectively. From these two lengths, we will calculate a scaling factor in each direction that expresses how many units correspond to one pixel in that direction. The units may be anything depending on your chart (e.g. meters, millimeters, etc). Also, the units may be either positive or negative quantities. This scaling factor is calculated from the expression:

\[ \text{scaling factor} = \frac{\text{reference length}}{\text{ending pixel}-\text{starting pixel}} \]

In the above scaling method, rotation or skewness of the chart due to bad scanning are not taken under consideration. These would require extra corrections/manipulations. You should try to get an image where the two axes of the plot are as parallel as possible with its edges.

Plot digitization method and reference lengths
Figure 1: Image illustrating a plot that will be digitized. Image coordinate system, examples of reference lengths and sampling points.

Since we will calculate the scaling factor twice, one for the horizontal and one for the vertical direction, we can write one Python function that we will call two times. The desired direction (\(x\) or \(y\)) will be given as input to the function. The starting and ending pixels of the reference length will be captured by clicking on the image with the mouse. Depending on the requested direction of the reference length, we will utilize either the horizontal or the vertical coordinates of the captured pixels. The complete function is the following:

from tkinter import messagebox, simpledialog
def getReferenceLength(index):
    Get the reference length in the requested direction

    USAGE: factor = getReferenceLength(index)
    index = 0 for x-direction or 1 for y-direction

    # define a 'direction' string
    direction = 'x' if index == 0 else 'y'

    # get the reference length
    reply = False
    while not reply:
        messagebox.showinfo("Select reference length",
            "Use the mouse to select the reference length in {:s} direction.".format(direction) +
            "Click the start and the end of the reference length."
        coord = plt.ginput(
            ) # capture only two points
        # ask for a valid length
        validLength = False
        while not validLength:
            reflength = simpledialog.askfloat("Enter reference length",
                 "Enter the reference length in {:s} direction".format(direction))
            if isinstance(reflength, float):
                validLength = True
                messagebox.showerror("Error","Please provide a valid length.")
        # calculate scaling factor

        reply = messagebox.askyesno("Length confirmation",
            "You selected {:4.0f} pixels in {:s} direction corresponding to {:4.4f} units. Is this correct?".format(deltaref, direction, reflength)
    return factor

When using the code, you should aim for a large reference length at both directions. This will reduce the error when we will convert the pixel coordinates of the digitized curves into the real coordinates of the plot. 

Now that we have a method to calculate the scaling factors, we can use the mouse to digitize the curve. We will set the origin to the first selected point, thus the coordinates of the subsequent points will be relative to the first point. The expressions that calculate the coordinates are given by:

\[ \begin{array}{l} x=\left(x-x_0\right)\cdot x_\text{scaling factor}\\ y=\left(y-y_0\right)\cdot y_\text{scaling factor} \end{array} \]

where \(x_0\) and \(y_0\) are the pixel coordinates of the first point. At the end of the procedure you will get a \(n\)-by-\(2\) matrix (numpy array) containing the coordinates. The first column is the \(x\) coordinate and the second is the \(y\) coordinate, respectively. If you wish, you can manually add an offset to translate the curve on the \(x-y\) plane. The Python code is the following:

from numpy import asarray

# get reference length in x direction
xfactor = getReferenceLength(0)

# get the reference length in y direction
yfactor = getReferenceLength(1)

messagebox.showinfo("Digitize curve",
            "Please digitize the curve. The first point is the origin." +
            "Left click: select point; Right click: undo; Middle click: finish"

# get the curve points
x = plt.ginput(
x = asarray(x)

# convert the curve points from pixels to coordinates
x[:,0] = (x[:,0]-x[0,0]) * xfactor
x[:,1] = (x[:,1]-x[0,1]) * yfactor

Subsequently, you can process the digitized curve as you like or you can save it in a text file for future use. If your image contains more plots that you wish to digitize, you may repeat the above procedure without redefining the reference lengths.

You can find the complete curve digitizer Python code here. The program supports the digitization of multiple curves in a plot as well as the storage of the data into text files.

Related articles

Leave a comment