Paulikas and Blake revisited (Reeves et al. 2011)

This case study reproduces the figures of Reeves et al. (2011), “On the relationship between relativistic electron flux and solar wind velocity: Paulikas and Blake revisited” (doi:10.1029/2010JA015735).

Setup

Create a directory to hold files for this case study. Within this directory, create subdirectories code, data, and plots. (Using version control on the code directory is recommended; the SpacePy team uses git.)

Obtaining energetic particle data

We require the 1.8-3.5 MeV electron flux from the LANL-GEO ESP detector, available in the paper’s auxiliary material (scroll down to “Supporting information” on the paper’s page. The ESP data are in Data Set S1. Save this file to the data directory; the filename is assumed to be jgra20797-sup-0003-ds01.txt.

The data file was corrupted on upload to AGU, and the code to fix it is non-trivial, so this is a good chance to learn how to run someone else’s code. (Appendix: Fixing the ESP data file has step-by-step information on each portion of this process.) Copy all of the following and paste it into a file called fix_esp_data.py in the code directory.

import os.path


datadir = os.path.join('..', 'data')
in_name = os.path.join(datadir, 'jgra20797-sup-0003-ds01.txt')
out_name = os.path.join(datadir, 'jgra20797-sup-0003-ds01_FIXED.txt')
infile = open(in_name, 'r')
outfile = open(out_name, 'w')
data = infile.read()
infile.close()

data = data.replace('\r', '\n')
data = data.replace('\n\n', '\n')
data = data.split('\n')

for i in range(15):
    outfile.write(data.pop(0) + '\n')
oldline = None
for line in data:
    if line[0:2] in ['19', '20', '2']:
        if not oldline is None:
            outfile.write(oldline + '\n')
        oldline = line
    else:
        oldline += line
outfile.write(oldline + '\n')
outfile.close()

Now this script can be run with python fix_esp_data.py. It should create a file called jgra20797-sup-0003-ds01_FIXED.txt in the data directory.

File fixed, we can load and begin examining the data. Change to the code directory and start your Python interpreter. (IPython is recommended, but not required.)

In the following examples, do not type the leading >>>; this is the Python interpreter prompt. IPython has a different prompt that looks like In [1].

>>> import os.path
>>> datadir = os.path.join('..', 'data')
>>> print(datadir)
../data

The first line imports the os.path module from the Python standard library. Python has a huge standard library. To keep this code organized, it is divided into many modules, and a module must be imported before it can be used. (The Python module of the week is a great way to explore the standard library.)

The second line makes a variable, datadir, which will contain the path of the data directory. The os.path.join() function provides a portable way of “gluing” together directories in a path, and will use backslashes on Windows and forward slashes on Unix. The third line then prints out the value of this variable for confirmation; note this is a Unix system.

Note that string constants in Python can use single or double quotes; we could just as well have written:

>>> datadir = os.path.join("..", "data")

or even:

>>> datadir = os.path.join('..', "data")

The full path can also be used (and this is a better case for using a variable.) For example, I am preparing this example in a directory reeves_morley_friedel_2011 in my home directory, so I could use:

>>> datadir = os.path.join('home', 'jniehof', 'reeves_morley_friedel_2011',
...                        'data')

This very long line can be typed across two lines in Python, and because the line break happens within parentheses, a line continuation character is not required.

Returning to reading the ESP data file:

>>> fname = os.path.join(datadir, 'jgra20797-sup-0003-ds01_FIXED.txt')

creates a variable holding the full path to the fixed file.

>>> import numpy

The import statement imports any installed module, just as if it were in the standard library. Here we import the very useful numpy module, which is a prerequisite for SpacePy and useful in its own right.

>>> esp_fluxes = numpy.loadtxt(fname, skiprows=14, usecols=[1])

loadtxt() makes it easy to load data from a file into a numpy ndarray, a very useful data container. skiprows skips the header information, and specifying only column 1 (first column is column 0) with usecols will only load the fluxes for 1.8-3.5MeV. We only load the fluxes at this point because they can be represented as floats, which numpy arrays store very efficiently.

>>> import datetime

The datetime module provides Python objects which can manipulate dates and times and have some understanding of the meanings of dates, making for easy comparisons between dates, date arithmetic, and other useful features.

>>> convert = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d')

This line sets up a converter to be used later. strptime() creates a datetime from a string, given a format definition (here specified as year-month-day). So:

>>> print(datetime.datetime.strptime('2010-01-02', '%Y-%m-%d'))
2010-01-02 00:00:00

lambda is a simple shortcut for a one-liner function; wherever convert(x) is used after the definition, it functions like datetime.datetime.strptime(x, '%Y-%m-%d'). This makes it easier to parse a date string without specifying the format all the time:

>>> print(convert('2010-01-02'))

This converter can be used with loadtxt():

>>> esp_times = numpy.loadtxt(fname, skiprows=14, usecols=[0],
...                           converters={0: convert}, dtype=numpy.object)

The converters option takes a Python dictionary. The default dtype is float, which cannot store datetimes; using numpy.object allows storage of any Python object.

Since it would be useful to be able to load the data without typing so many lines, create a file called common.py in the code directory with the following contents:

import datetime
import os.path

import numpy


datadir = os.path.join('..', 'data')

def load_esp():
    fname = os.path.join(datadir, 'jgra20797-sup-0003-ds01_FIXED.txt')
    esp_fluxes = numpy.loadtxt(fname, skiprows=14, usecols=[1])
    convert = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d')
    esp_times = numpy.loadtxt(fname, skiprows=14, usecols=[0],
                              converters={0: convert}, dtype=numpy.object)
    return (esp_times, esp_fluxes)

All needed imports are at the top of the file, with one blank line between standard library imports and other imports and two blank lines after them. datadir is defined as a global variable, outside of the function (but notice that it is available to the load_esp function.)

The rest of the file defines a function which returns the dates and fluxes in a tuple. The next section shows how to use this function.

Solar Wind data and averaging

The top panel of figure 1 shows the ESP fluxes overplotted with the solar wind velocity. Fortunately, the omni module of SpacePy provides an interface to the hourly solar wind dataset, OMNI. get_omni() returns data for a particular set of times. In this case, we want hourly data, covering 1989 through 2010 (we’ll cut it down to size later). tickrange() allows us to specify a start time, stop time, and time step.

>>> import spacepy.omni
>>> import spacepy.time
>>> times = spacepy.time.tickrange('1989-01-01', '2011-01-01',
...                                datetime.timedelta(hours=1))
>>> d = spacepy.omni.get_omni(times)
>>> vsw = d['velo']
>>> vsw_times = d['UTC']

We’ll also load the esp data:

>>> import common
>>> esp_times, esp_flux = common.load_esp()

Even though we have not installed common.py, the import statement finds it because it is in the current directory.

load_esp returns a tuple, which can be unpacked into separate variables.

Now we need to produce 27-day running averages of both the flux and the solar wind speed. Fortunately there are no gaps in the time series:

>>> import numpy
>>> d = numpy.diff(vsw_times)
>>> print(d.min())
1:00:00
>>> print(d.max())
1:00:00
>>> d = numpy.diff(esp_times)
>>> print(d.min())
1 day, 0:00:00
>>> print(d.max())
1 day, 0:00:00

numpy.diff() returns the difference between every element of an array and the previous element. min() and max() do exactly what they sound like. So this code confirms that every time in the vsw data is on a continuous one hour cadence, and the ESP data is on a continuous one day cadence.

>>> import scipy.stats
>>> esp_flux_av = numpy.empty(shape=esp_flux.shape, dtype=esp_flux.dtype)
>>> for i in range(len(esp_flux_av)):
...     esp_flux_av[i] = scipy.stats.nanmean(esp_flux[max(i - 13, 0):i + 14])

numpy.empty() creates an empty array, taking the shape and dtype from the esp_flux array. empty does not initialize the data in the array, so it is essentially random junk; use zeros() to create an array filled with zeros.

len() returns the length of an array, and range then iterates over each number from 0 to length minus 1, i.e. the entire array. Each element is then set to a 27-day average: from 13 days before a day’s measurement through 13 days after. (Python slices do not include the last element listed; they are half-open). Note that these slices can happily run off the end of the esp_flux array, but we use max() to ensure the first index does not go negative. (Negative indices have special meaning in Python.)

scipy.stats.nanmean (deprecated in scipy 0.18.0) takes the mean of a numpy array, but skips any elements with a value of “not a number” (nan), which is often used for fill. (This is our first exposure to the scipy module.)

For the solar wind averaging, the times need to cover the 24 * 13.5 = 324 hours previous, and 324 hours following (non-inclusive). There is also a more efficient way than using an explicit loop:

>>> vsw_av = numpy.fromiter((scipy.stats.nanmean(vsw[max(0, i - 324):i + 324])
...                         for i in range(len(vsw))),
...                         count=len(vsw), dtype=vsw.dtype)

fromiter() makes a numpy array from an iterator, which is like a list except that it holds information on generating each element in a sequence rather than creating the entire sequence. count provides numpy with the number of elements in the output (so it can make the entire array at once); dtype here is just copied from the input.

The type of iterator used here is a generator expression, closely related to a list comprehension. These are among the most powerful and most difficult to understand concepts in Python. An illustrative, although not useful, example:

>>> for i in (x + 1 for x in range(10)):
...     print(i)

Here (x + 1 for x in range(10)) is a generator expression that creates an iterator, which will return the numbers 1 through 10. At no point is the complete list of all numbers constructed, saving memory.

In our calculation of esp_flux_av, we created an explicit loop in Python. The generator expression used to compute vsw_av has no explicit loop, and the actual looping is handled in (much faster) compiled C code.

Making Figure 1

To actually plot, we need access to the pyplot module:

>>> import matplotlib.pyplot as plt
>>> plt.ion()

This alternate form of the import statement shouldn’t be overused (it can make code harder to read by masking the origin of functions), but is conventional for matplotlib.

ion() turns on interactive mode so plots appear and are updated as they’re created.

>>> plt.semilogy(esp_times, 10 ** esp_flux_av, 'b')
>>> plt.draw()
>>> plt.draw()

semilogy() creates a semilog plot, log on the Y axis. The first two arguments are a list of X and Y values; after that there are many options to specify formatting (such as the color, used here.)

The ESP fluxes are stored as the log of the flux; ** is the exponentiation operator so the (geometric!) average is plotted properly.

draw() draws the updated plot; sometimes it needs to be called repeatedly. Use it whenever you want the plot updated; it will not be included from here on.

>>> plt.xlabel('Year', weight='bold')
>>> plt.ylabel('Electron Flux\n1.8-3.5 MeV', color='blue', weight='bold')
>>> plt.ylim(1e-2, 10)
(0.01, 10)

xlabel() and ylabel() set the labels for the axes. Note the newline (\n) in the string for the Y label. ylim() sets the lower and upper limits for the Y axis; there is, of course, xlim() as well.

These are the simplest, although not most flexible, ways to work with plots. To produce the full Figure 1, we’ll move out of interactive mode:

>>> plt.ioff()
>>> plt.show()

ioff() turns off interactive mode. Once interactive mode is off, show() displays the full plot, including controls for panning, zooming, etc. Until the plot is closed, nothing further can happen in the Python window.

>>> fig = plt.figure(figsize=[11, 8.5])

figure() creates a new Figure; the size specified here is US-letter paper, landscape orientation.

>>> ax = fig.add_subplot(111)

add_subplot() creates an Axes object, which can contain an actual plot. 111 here means that the figure will have 1 subplot and the new subplot should be in position (1, 1); more on this later.

>>> fluxline = ax.plot(esp_times, 10 ** esp_flux_av, 'b')

plot() puts the relevant data into the plot; again specifying a blue line. It returns a list of Line2D objects, which we save for later use.

>>> ax.set_yscale('log')

set_yscale() switches the Y axis between log and linear (set_xscale() for the X axis).

>>> ax.set_ylim(1e-2, 10)
>>> ax.set_xlabel('Year', weight='bold')
>>> ax.set_ylabel('Electron Flux\n1.8-3.5 MeV', color='b', weight='bold')

set_ylim() (and set_xlim()), set_xlabel(), and set_ylabel() function much as above, but operate on a particular Axes object.

>>> ax2 = ax.twinx()

twinx() establishes a second Y axis (two values twinned on one X axis) on the same plot.

>>> vswline = ax2.plot(vsw_times, vsw_av, 'r')
>>> ax2.set_ylim(300, 650)
>>> ax2.set_ylabel('Solar Wind Speed', color='r', rotation=270, weight='bold')

The resulting Axes object has all the methods that we’ve used before. Note rotation on set_ylabel() to make the text run top-to-bottom rather than bottom-to-top.

>>> ax.set_xlim(esp_times[0], esp_times[-1])

Since the solar wind data extends beyond the ESP data, this sets the X axis to match the ESP data. Note -1 to refer to the last element of the array.

>>> leg = ax.legend([fluxline[0], vswline[0]], ['Flux', 'Vsw'],
...                 loc='upper left', frameon=False)

legend(), as may be expected, creates a Legend on the axes. The first parameter is a list of the matplotlib objects to make a legend for; since the plotting commands return these, we can pass them back in. Each plotting command returns a list. In this case we just take the 0th element of each list since we know there’s only one line from each plotting command. The second parameter is the text used to annotate each line.

>>> fluxtext, vswtext = leg.get_texts()
>>> fluxtext.set_color(fluxline[0].get_color())
>>> vswtext.set_color(vswline[0].get_color())

The default text color is black, so we use get_texts() to get the Text objects for the annotations. Again, we know there are two (we just created the legend). Then set_color() sets the color based on the the existing color for each line (get_color()).

To see the results:

>>> plt.show()

Close the window when done. Now we want to save the output:

>>> fig_fname = os.path.join('..', 'plots', 'fig1a.eps')
>>> fig.savefig(fig_fname)

savefig() saves the figure, in this case as an encapsulated PostScript file (to the plots directory).

Let’s tweak a few things. For one, there’s a lot of padding around the figure, which can make it difficult to properly scale for publication. The way around this is to specify a Bbox (bounding box), basically the lower left and upper right corners (in inches) to include in the saved figure. Getting this right tends to be a matter of trial and error. (get_tightbbox() is supposed to help with this, but it doesn’t quite work yet.)

>>> import matplotlib.transforms
>>> bob = matplotlib.transforms.Bbox([[0.52, 0.35], [10.5, 7.95]])
>>> fig.savefig(fig_fname, bbox_inches=bob, pad_inches=0.0)

Better, but all the text is awfully small. Once the figure is fit in the paper it’ll be really small. And the font isn’t that great.

>>> import matplotlib
>>> matplotlib.rcParams['axes.unicode_minus'] = False
>>> matplotlib.rcParams['text.usetex']= True
>>> matplotlib.rcParams['font.family'] = 'serif'
>>> matplotlib.rcParams['font.size'] = 14
>>> bob = matplotlib.transforms.Bbox([[0.4, 0.35], [10.7, 7.95]])
>>> fig.savefig(fig_fname, bbox_inches=bob, pad_inches=0.0)

Now the font is bigger and it’s rendered using TeX, which should match the body of the paper better (assuming the paper is in LaTeX). The larger font means tweaking the bounding box. unicode_minus fixes a problem where negative numbers on the axis don’t render properly in TeX. Matplotlib has many more options for customization.

The end result is a nice figure that can be printed full-size, put in a PDF, or included directly in a paper.

Now we need the bottom half of Figure 1. From SIDC, download the “Monthly mean total sunspot number” (monthssn.dat). Put it in the data directory.

>>> import bisect
>>> import datetime
>>> monthfile = os.path.join(common.datadir, 'monthssn.dat')
>>> convert = lambda x: datetime.datetime.strptime(x, '%Y%m')
>>> ssn_data = numpy.genfromtxt(monthfile, skip_header=2400, usecols=[0, 2, 3],
...                             converters={0: convert}, dtype=numpy.object,
...                             skip_footer=24)
>>> idx = bisect.bisect_left(ssn_data[:, 0], datetime.datetime(1989, 1, 1))
>>> ssn_data = ssn_data[idx:]
>>> ssn_times = ssn_data[:, 0]
>>> ssn = numpy.asarray(ssn_data[:, 1], dtype=numpy.float64)
>>> smooth_ssn = numpy.asarray(ssn_data[:, 2], dtype=numpy.float64)
>>> ssn_times += datetime.timedelta(days=15)

Much of this should be familiar. genfromtxt() is a little more flexible than loadtxt(); here it allows the skipping of lines at the end as well as the beginning (skipping 200 years at the start, 2 at the end, where data are provisional.) Here we load both times and the sunspot numbers in the same command so that if any lines don’t load, they will not wind up in any of the arrays.

bisect provides fast functions for searching in sorted data; bisect_left() is roughly a find-the-position-of function. Having found the position of the start of 1989, we then keep times from then on (specifying a start index without a stop index in Python means “from start to end of the list.”) Note that, although bisect is meant to work on lists, it also works fine on numpy arrays; this is a common feature of Python known as duck typing.

We then use asarray() to convert the ssn and smooth_ssn columns to float arrays. Note the slice notation: [:, 0] means take all indices of the first dimension (line number) and only the 0th index of the second dimension (column in the line). Finally, we use timedelta to shift the date associated with a month from the beginning to roughly the middle of the month. Adding a scalar to an array does an element-wise addition.

>>> import matplotlib.figure
>>> fig = plt.figure(figsize=[11, 8.5],
...                  subplotpars=matplotlib.figure.SubplotParams(hspace=0.1))
>>> ax = fig.add_subplot(211)

When creating the figure this time, we use matplotlib.figure.SubplotParams to choose a slightly smaller vertical spacing between adjacent subplots. Tweaking SubplotParams also provides an alternative to tweaking bounding boxes.

Then we create a subplot with the information that there will be 2 rows, 1 column, and this is the first subplot. Now everything acting on ax, above, can be repeated, although we skip setting the xlabel since only the bottom axis will be labeled.

>>> fluxline = ax.plot(esp_times, 10 ** esp_flux_av, 'b')
>>> ax.set_yscale('log')
>>> ax.set_ylim(1e-2, 10)
>>> ax.set_ylabel('Electron Flux\n1.8-3.5 MeV', color='b', weight='bold')
>>> ax2 = ax.twinx()
>>> vswline = ax2.plot(vsw_times, vsw_av, 'r')
>>> ax2.set_ylim(300, 650)
>>> ax2.set_ylabel('Solar Wind Speed', color='r', rotation=270, weight='bold')
>>> ax.set_xlim(esp_times[0], esp_times[-1])
>>> leg = ax.legend([fluxline[0], vswline[0]], ['Flux', 'Vsw'],
...                 loc='upper left', frameon=False)
>>> fluxtext, vswtext = leg.get_texts()
>>> fluxtext.set_color(fluxline[0].get_color())
>>> vswtext.set_color(vswline[0].get_color())

Then we move on to adding the solar wind:

>>> ax3 = fig.add_subplot(212, sharex=ax)

This adds another subplot, the second in the 2x1 array. Its x axis is shared with the existing ax. (This is poorly documented; see this example)

>>> plt.setp(ax.get_xticklabels(), visible=False)
>>> plt.setp(ax2.get_xticklabels(), visible=False)

setp() sets a property. get_xticklabels() returns all the tick labels (Text) for the x axis; setp then sets visible to False for all of them. This hides the labeling on the axis for the upper subfigure.

>>> ax3.set_xlabel('Year', weight='bold')
>>> ax3.set_ylabel('Sunspot Number', weight='bold')
>>> smoothline = ax3.plot(ssn_times, smooth_ssn, lw=2.0, color='k')
>>> ssnline = ax3.plot(ssn_times, ssn, color='k', linestyle='dotted')

There is nothing new here except for the specifications of linewidth and linestyle; see plot() for details. Note k as the abbreviation for black (to avoid confusion with blue.)

>>> leg2 = ax3.legend([ssnline[0], smoothline[0]],
...                   ['Sunspot Number', 'Smoothed SSN'],
...                   loc='upper right', frameon=False)
>>> ax3.set_ylim(0, 200)
>>> ax3.set_xlim(esp_times[0], esp_times[-1])
>>> fig_fname = os.path.join('..', 'plots', 'fig1.eps')
>>> fig.savefig(fig_fname, bbox_inches=bob, pad_inches=0.0)

All of this has been seen for the top half of figure 1.

Following is the complete code to reproduce Figure 1.

import bisect
import datetime
import os.path

import common
import matplotlib
import matplotlib.figure
import matplotlib.pyplot as plt
import matplotlib.transforms
import numpy
import scipy
import scipy.stats
import spacepy.omni
import spacepy.time


matplotlib.rcParams['axes.unicode_minus'] = False
matplotlib.rcParams['text.usetex']= True
matplotlib.rcParams['font.family'] = 'serif'
matplotlib.rcParams['font.size'] = 14
bob = matplotlib.transforms.Bbox([[0.4, 0.35], [10.7, 7.95]])

times = spacepy.time.tickrange('1989-01-01', '2011-01-01',
                               datetime.timedelta(hours=1))
d = spacepy.omni.get_omni(times)
vsw = d['velo']
vsw_times = d['UTC']
esp_times, esp_flux = common.load_esp()
esp_flux_av = numpy.empty(shape=esp_flux.shape, dtype=esp_flux.dtype)
for i in range(len(esp_flux_av)):
    esp_flux_av[i] = scipy.stats.nanmean(esp_flux[max(i - 13, 0):i + 14])
vsw_av = numpy.fromiter((scipy.stats.nanmean(vsw[max(0, i - 324):i + 324])
                         for i in range(len(vsw))),
                         count=len(vsw), dtype=vsw.dtype)
monthfile = os.path.join(common.datadir, 'monthssn.dat')
convert = lambda x: datetime.datetime.strptime(x, '%Y%m')
ssn_data = numpy.genfromtxt(monthfile, skip_header=2400, usecols=[0, 2, 3],
                            converters={0: convert}, dtype=numpy.object,
                            skip_footer=24)
idx = bisect.bisect_left(ssn_data[:, 0], datetime.datetime(1989, 1, 1))
ssn_data = ssn_data[idx:]
ssn_times = ssn_data[:, 0]
ssn = numpy.asarray(ssn_data[:, 1], dtype=numpy.float64)
smooth_ssn = numpy.asarray(ssn_data[:, 2], dtype=numpy.float64)
ssn_times += datetime.timedelta(days=15)

fig = plt.figure(figsize=[11, 8.5],
                 subplotpars=matplotlib.figure.SubplotParams(hspace=0.1))
ax = fig.add_subplot(211)
fluxline = ax.plot(esp_times, 10 ** esp_flux_av, 'b')
ax.set_yscale('log')
ax.set_ylim(1e-2, 10)
ax.set_ylabel('Electron Flux\n1.8-3.5 MeV', color='b', weight='bold')
ax2 = ax.twinx()
vswline = ax2.plot(vsw_times, vsw_av, 'r')
ax2.set_ylim(300, 650)
ax2.set_ylabel('Solar Wind Speed', color='r', rotation=270, weight='bold')
ax.set_xlim(esp_times[0], esp_times[-1])
leg = ax.legend([fluxline[0], vswline[0]], ['Flux', 'Vsw'],
                loc='upper left', frameon=False)
fluxtext, vswtext = leg.get_texts()
fluxtext.set_color(fluxline[0].get_color())
vswtext.set_color(vswline[0].get_color())

ax3 = fig.add_subplot(212, sharex=ax)
plt.setp(ax.get_xticklabels(), visible=False)
plt.setp(ax2.get_xticklabels(), visible=False)
ax3.set_xlabel('Year', weight='bold')
ax3.set_ylabel('Sunspot Number', weight='bold')
smoothline = ax3.plot(ssn_times, smooth_ssn, lw=2.0, color='k')
ssnline = ax3.plot(ssn_times, ssn, color='k', linestyle='dotted')
leg2 = ax3.legend([ssnline[0], smoothline[0]],
                  ['Sunspot Number', 'Smoothed SSN'],
                  loc='upper right', frameon=False)
ax3.set_ylim(0, 200)
ax3.set_xlim(esp_times[0], esp_times[-1])

fig_fname = os.path.join('..', 'plots', 'fig1.eps')
fig.savefig(fig_fname, bbox_inches=bob, pad_inches=0.0)

Appendix: Fixing the ESP data file

This appendix provides a detailed explanation of the script that fixes the ESP data file.

First set up a variable to hold the location of the data, as above:

>>> import os.path
>>> datadir = os.path.join('..', 'data')

Examining the data file, it is clear that something is odd: lines appear to have been broken inappropriately; for example, the data for 1989-10-12 are split across two lines. So the first task is to fix this file, first opening the original (broken) file and an output (fixed) file:

>>> in_name = os.path.join(datadir, 'jgra20797-sup-0003-ds01.txt')
>>> out_name = os.path.join(datadir, 'jgra20797-sup-0003-ds01_FIXED.txt')
>>> infile = open(in_name, 'r')
>>> outfile = open(out_name, 'w')

These lines open() the original file for reading (r), and a new file for writing (w). Note that opening a file for writing will destroy any existing contents.

The file happens to contain a mixture of carriage returns and proper newlines, so to begin all the carriage returns need to be rewritten as newlines:

>>> data = infile.read()
>>> infile.close()
>>> data = data.replace('\r', '\n')
>>> data = data.replace('\n\n', '\n')

file.read() reads all data from the file at once, so this is not recommended for large files. In this case it makes things easier. Once the data are read, file.close() the file. Calling the replace() method on data replaces all instances of the first parameter ('\r') with the second ('\n'). \r is the special code indicating a carriage return; \n, a newline. For a literal backslash, use \\. Once the carriage returns have been replaced with newlines, a second round of replacement eliminates duplicates.

Now that the line endings have been cleaned up, it’s time to rejoin the erroneously split lines. First copy over the 15 lines of header verbatim:

>>> data = data.split('\n')
>>> for i in range(15):
...     outfile.write(data.pop(0) + '\n')

split() splits a string into a list, with the split between elements happening wherever the provided parameter occurs. A simple example:

>>> foo = 'a.b.c'.split('.')
>>> print(foo)
['a', 'b', 'c']

The splitting character is not present in the output.

The advantage of a list is that it makes it easy to access individual elements: >>> print(foo[1]) b

The first element of a Python list is numbered zero.

range returns a list of numbers, starting from 0, with the parameter specifying how many elements are in the list:

>>> print(range(5))
[0, 1, 2, 3, 4]

The last number is 4 (not 5 as might be expected), but there are 5 elements in the list.

The for executes the following indented statement once for every element in the in list:

>>> for i in ['a', 'b', 'c']:
...     print i
a
b
c

Indentation is significant in Python! Normally indents are four spaces and the tab key will do the job. (In the above example, you may need to hit enter twice after the print statement, the second to terminate the indentation.)

pop returns one element from a list, and deletes it from the list. Using 0 pops off the first element, and file.write() writes a string to a file. + can be used to concatenate two strings together. Since split() removed the newlines, they need to be readded.

So this little block of code splits the data into a list on newlines and, repeating fifteen times, takes the first element of that list and writes it, with a newline, to the output. Now data contains only the actual lines of data.

>>> oldline = None
>>> for line in data:
...     if line[0:2] in ['19', '20', '2']:
...         if not oldline is None:
...             outfile.write(oldline + '\n')
...         oldline = line
...     else:
...         oldline += line
>>> outfile.write(oldline + '\n')
>>> outfile.close()

None is a special Python value specifically indicating nothing; it’s used here to mark the first time around the loop.

line[0:2] gets the first two characters in the string line, and the in operator compares the resulting string to see if it is present in the following list. This will return True if the line begins with 19 or 20. The if statement executes the following indented block if the condition is True. So, if this is True, the previous line probably ended properly and it can be written out. First there is an additional check that this isn’t the first time around the loop, and then the previous line (which we know ended cleanly) is written out. The currently-read line then becomes the new “previous” line.

The 2 is a special case: if the line is less than two characters long, line[0:2] will return the entire line, and it so happens that these cases always correspond to the previous line being whole.

If this test fails, everything under else is executed. Here the assumption is that the previous line didn’t end cleanly and the current line is actually a continuation of it, so the current line is appended to the previous. a += b is a shortcut for a = a + b.

Once the loop terminates, the last line is written out, and the file closed.