Translating Listserv archive to mbox format

November 30, 2008

We wanted to import our old mailing list entries from the OpenMRS mailing list Listserv archives into Nabble.  No problem.  Finding the GET listname FILELIST and GET listname file1, GET listname file2, … commands was easy enough.  A quick search of Nabble support made it clear that I needed to send them mbox files.  So, I set out in search of a Listserv to mbox converter.  I found a couple of scripts: one in Perl and another in PHP.  But trying them out made it clear that I was going to have to do some tweaking.  After a few near misses, I thought: “I could do this easier in Groovy.”  So, I ended up with this script.  Basically, it came down to leaving the messages and their headers alone and just adding a From_ line in front of each message.  Otherwise, the only tricky part was getting the dates right: the From_ line wants the time in GMT with no timezone specified, and the Date in the message header needed some reshuffling of its format.

Both for future me and anyone else who might benefit, here’s the script I ended up with:

import java.text.SimpleDateFormat

delim = '=' * 73	// LISTSERV separates messages with a bar of equal signs
foundDelim = false	// we skip all content until first delimiter
inHeader = true		// true when processing header data
def header = ""		// holds current header data

dfListserv = new SimpleDateFormat("E, d MMM yyyy HH:mm:ss z")
dfHeader = new SimpleDateFormat("E MMM dd HH:mm:ss yyyy z")
dfMbox = new SimpleDateFormat("E MMM dd HH:mm:ss yyyy")
dfMbox.timeZone = TimeZone.getTimeZone("GMT")	// for mbox, convert to GMT and drop timezone reference
cal = Calendar.instance

// Process input line by line from stdin
System.in.eachLine() { line ->
  if (!foundDelim)
    foundDelim = (line == delim)	// skip until we find first delim
  else if (inHeader) {
    // within header
    if (line =~ /^\s*$/) {
      // empty line signals end of header
      
      // fetch Date from header and reformat it for output
      m1 = header =~ /(?ms)^Date:\s+(.*?)\s*$/
      date = dfListserv.parse(m1[0][1])
      cal.time = date
      mboxDate = dfMbox.format(cal.time)
      headerDate = dfHeader.format(cal.time)

      // fetch From from header
      m2 = header =~ /(?ms)^From:\s+(.*?)\s*$/
      fromHeader = m2[0][1]
      leftBracket = fromHeader.indexOf('<')
      rightBracket = fromHeader.indexOf('>')
      if (leftBracket >= 0 && rightBracket > leftBracket)	// >= 0 so a bare "<address>" is handled too
        from = fromHeader.substring(leftBracket+1, rightBracket)
      else
        from = fromHeader

      // output header with mbox-required From_ line up front and reformatted date
      header = "From $from $mboxDate\n" + header.replaceAll(/(?m)^Date:\s+.*$/, "Date: $headerDate")
      println "$header\n"

      inHeader = false	// no longer in header
      header = ""	// clear for next message
    } else {
      header += "$line\n"	// accumulate full header data
    }
  } else if (line == delim) {
    // if we find a delim, begin processing next line as header
    print "\n\n"
    inHeader = true
  } else {
    // within a message, just send it through untouched
    println line
  }
}
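
Since the date handling was the fiddly part, here is that one step pulled out on its own as a minimal sketch. The helper name mboxFromLine, the sample address, and the sample date are made up for illustration, and passing Locale.ENGLISH is my assumption (the script above relies on the JVM's default locale), added so that day and month names parse the same way everywhere.

```groovy
import java.text.SimpleDateFormat

// Hypothetical helper: build the mbox From_ line for one message
// from the sender address and the value of its Date header.
String mboxFromLine(String sender, String dateHeader) {
    // Same patterns as the script; Locale.ENGLISH is an assumption (see above)
    def parser = new SimpleDateFormat("E, d MMM yyyy HH:mm:ss z", Locale.ENGLISH)
    def formatter = new SimpleDateFormat("E MMM dd HH:mm:ss yyyy", Locale.ENGLISH)
    // mbox wants the time in GMT with no timezone written out
    formatter.timeZone = TimeZone.getTimeZone("GMT")
    "From $sender ${formatter.format(parser.parse(dateHeader))}"
}

// Made-up sample values, not taken from the real archive
println mboxFromLine("someone@example.org", "Sun, 30 Nov 2008 14:05:09 -0500")
// prints: From someone@example.org Sun Nov 30 19:05:09 2008
```

The 14:05:09 -0500 in the header becomes 19:05:09 in the From_ line, with the offset dropped.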