Tuesday, November 12, 2013

A Python program to FTP a mix of text and binary files

When using FTP to download files from a server to your local computer, you must of course be sensitive to line endings. My server is a Linux box, so line endings are just LF. My laptop runs Windows 7, with CRLF line endings.

I recently wrote a Python program to automate the backup of files from the server to my laptop, and had to take this difference into account.

To make matters worse, some of the text files on the server have CRLF, even though it's a Linux box. The files came from a variety of sources -- uploaded by different developers, created in WordPress admin, etc -- and therefore don't have consistent line endings.

And there's another wrinkle: Some of the files are UTF-8 encoded and contain characters that can't be represented as ASCII.

At first, I thought I would simply use the Python ftplib module's retrbinary() method for binary files and retrlines() for text files. That didn't work out. The biggest obstacle was the errors raised for the UTF-8 files when reading a line of text that contained a non-ASCII character and then writing it to a file. Two lesser hurdles were (1) figuring out when it was really necessary to add a CR, vs when the CR was already present, and (2) the possibility that the last line of the file doesn't end with a newline of any sort.
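The exact error isn't reproduced here. As a minimal sketch of this kind of failure, assuming a text-mode pipeline that ends up encoding with an ASCII-only codec (an assumption for illustration, not a confirmed diagnosis), a single non-ASCII character is enough to raise an exception, whereas raw bytes pass through untouched:

```python
# Sample data: a UTF-8 encoded line containing a non-ASCII character (e with acute accent).
raw_line = 'caf\u00e9\n'.encode('utf-8')

# Text route: decode the line, then re-encode it with an ASCII-only codec (assumed here).
text_line = raw_line.decode('utf-8')
try:
    text_line.encode('ascii')
    text_route_failed = False
except UnicodeEncodeError:
    text_route_failed = True

# Binary route: the bytes are never decoded, so there is no codec to fail.
copied = bytes(raw_line)

print(text_route_failed)
print(copied == raw_line)
```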

Eventually, I hit on a simple approach that is almost (but not quite) 100% effective. It's good enough for my requirement, which is just to create a backup that's usable in case of disaster. My program uses retrbinary() for ALL files. If the file extension indicates a binary file, no additional action is taken. If the file extension indicates a text file, we replace every occurrence of the byte 0x0a (LF) with the two bytes 0x0d and 0x0a (CRLF) -- unless the 0x0a was already preceded by 0x0d, in which case we leave it alone.
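The replacement rule can be sketched as a pure function on a bytes object. This is only an illustration, not the program's actual code: the name add_missing_crs is made up for the example, and it uses a regular expression with a negative lookbehind instead of a byte-by-byte loop.

```python
import re

# Match an LF (0x0a) that is NOT immediately preceded by a CR (0x0d).
_BARE_LF = re.compile(rb'(?<!\r)\n')

def add_missing_crs(data):
    """Hypothetical helper: replace every bare LF with CRLF, leaving existing CRLFs alone."""
    return _BARE_LF.sub(b'\r\n', data)

sample = b'unix line\nwindows line\r\nlast line without newline'
converted = add_missing_crs(sample)
print(converted)
```

Because the function never adds anything after the final byte, a last line that has no newline at all is left as it is.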

Why is this only ALMOST 100% effective? Because it doesn't handle the case where an LF is the first byte in a buffer read by retrbinary(), and was preceded by a CR in the PREVIOUS buffer. The callback starts every buffer with no memory of the byte that came before it, so in this case my program inserts an extraneous CR (the local file ends up with CR CR LF). It wouldn't be terribly hard to fix, but it's not worth the trouble for my purpose.
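For anyone who does need the fix, one possible approach is to keep the last byte seen between calls instead of resetting it for every buffer. The sketch below is an assumption about how that could look, not code from the program: CrlfWriter is a hypothetical name, and an in-memory io.BytesIO stands in for the local file.

```python
import io

class CrlfWriter:
    """Hypothetical callback object that remembers the last byte across buffers."""

    def __init__(self, out):
        self.out = out            # any object with a write(bytes) method
        self.previous_byte = 0    # persists between calls, unlike a local variable

    def __call__(self, buffer):
        converted = bytearray()
        for b in buffer:
            # Add a CR only before an LF that did not follow a CR,
            # even when that CR arrived at the end of the previous buffer.
            if b == 0x0a and self.previous_byte != 0x0d:
                converted.append(0x0d)
            converted.append(b)
            self.previous_byte = b
        self.out.write(converted)

out = io.BytesIO()
writer = CrlfWriter(out)
writer(b'one\r')      # this buffer ends with a CR ...
writer(b'\ntwo\n')    # ... and the next one starts with the matching LF
result = out.getvalue()
print(result)
```

Since retrbinary() accepts any callable as its callback, an instance could be passed directly, along the lines of ftp.retrbinary('RETR ' + fname, CrlfWriter(f)).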

Here's a simplified excerpt of the Python code:

from ftplib import FTP

# This is the callback function for retrbinary().
def processBytes(buffer):
 # Use the file that was opened in retrieve()
 global f

 previousByte = 0

 # Create an empty byte array.
 buffer2 = bytearray()

 # Loop over all the bytes that were read by retrbinary()
 for b in buffer:

  # Is the byte a LF? Is it NOT preceded by a CR?
  if b == 0x0a and previousByte != 0x0d:

   # Prepend a CR to the LF.
   buffer2.append(0x0d)
  buffer2.append(b)
  previousByte = b

 # Write the modified byte array to the local file.
 f.write(buffer2)

# Retrieve the specified file from the FTP server.
def retrieve(fname):
 global f
 global ftp

 print(fname)

 # Open the local file for writing in binary mode.
 f = open(fname, 'wb')

 # Determine whether the file is binary or text. In this simplified example, a file is
 # binary if and only if the file extension is GIF, JPG or PNG.
 name = fname.lower()
 if name.endswith('.gif') or name.endswith('.jpg') or name.endswith('.png'):

  # Binary file. Don't modify the bytes. Just write them to the local file.
  ftp.retrbinary('RETR ' + fname, f.write)
 else:

  # Text file. Insert CR's as needed before writing to the local file.
  ftp.retrbinary('RETR ' + fname, processBytes)

def main():
 global ftp

 print("STARTING...")

 # Connect to the FTP server. Replace the arguments with your URL, username and password.
 ftp = FTP('ftp.example.com', 'someuser', 'somepassword')

 # List all the files in the current directory, including the file type in the results.
 files = ftp.mlsd('', ['type'])
 for file in files:

  # If it's a file, not a directory, then download it.
  if file[1]['type'] == 'file':
   retrieve(file[0])
 ftp.quit()
 print("DONE!")

if __name__ == '__main__':
 main()
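The extension check is the part most likely to need tailoring for a real site. As a hedged sketch, it could be factored into a small helper driven by a set of extensions. The names is_binary and BINARY_EXTENSIONS are hypothetical, and the set holds only the three extensions from the simplified excerpt.

```python
import os.path

# Illustrative set: just the three extensions used in the simplified excerpt.
BINARY_EXTENSIONS = {'.gif', '.jpg', '.png'}

def is_binary(fname):
    """Return True if fname's extension (compared case-insensitively) is in the set."""
    extension = os.path.splitext(fname)[1].lower()
    return extension in BINARY_EXTENSIONS

for name in ('logo.PNG', 'index.php', 'README'):
    print(name, is_binary(name))
```

Adding a file type then means adding one entry to the set rather than extending a chain of endswith() calls.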
