2008-08-31 16:33

Using GetLive and mb2md to escape from Hotmail

People are sometimes surprised to hear that I have a Hotmail account, but I have two answers ready for them. Firstly I tell them that when I signed up, it wasn’t owned by Microsoft (which, looking at the dates, seems unlikely), but perhaps it would be true to say that it wasn’t running on Microsoft technology, as Microsoft famously had to dogfood the servers from FreeBSD to Windows. The other thing I say is that I only have the account so that I can chat to people using MSN, although I do have a moratorium on adding new MSN contacts and am slowly trying to move the rest of my contacts over to Jabber/XMPP-based services like Google Talk.

Whatever the excuse, and no matter how excusable it is to have an account with them, what is not excusable is to have the continued availability (not to mention the privacy) of a collection of emails, controlled by the whims of Microsoft. This is not mere paranoia, though. I’m sure at one point Microsoft introduced a new, lower, inactivity-timeout period within which you were required to log in or forfeit your email backlog, and it was just after this change that they started advertising their paid accounts as not having this artificial limitation.

GetLive

The first step to getting free from “Windows Live Hotmail” is getting GetLive, a Free software program that is available in Debian or through its SourceForge project page. The program is designed to log in to your Hotmail account by making HTTP requests using curl and screenscraping the text of each email from the pages it downloads. It has many modes of operation, but I configured it for what I thought was the simplest “do one thing and do it well” approach of just dumping the email out to a single file. The manual even gives the instructions to do this, but it is in the manual that things start to get tricky.
I will detail below the steps I found most challenging, as it appears the precise method I chose to use is not particularly well supported. Rather than using lengthy command line arguments, the program reads a config file which you have to create, and the manual guides you through this. Before it even gets to the complications of how to specify the output method, though, there are some gotchas.

For instance, while the Hotmail login page suggests that your email address is your username for the service, GetLive uses the term “UserName” to refer specifically to the part of your email address before the “@”, using the separate “Domain” field of the config file to specify what goes after the “@”. This was presumably obvious to the writer of the software / documentation, but if someone accidentally thinks like a user instead of a programmer while trying to enter this information, they may not realise that they are required to split their email address across these two fields.

The next thing worth pointing out is that it asks you to put your Hotmail account password in plain text in this configuration file, so make sure the file isn’t world readable. I personally tend to put a fake password in the first time, just in case something unexpected (or malicious) happens, but of course now I’ve said that, I have to use a fake password the first two times, to stay one step ahead of the crackers.

Another little trick I used was to specify that I just wanted the program to extract the messages in my Inbox to make the auditing easier. This was only an instinctive decision, though, as I didn’t think at the time that an audit would turn up any problems. I happily then went on to specify the Processor directive, using the first example from the manual:

*) “/bin/cat - >> FetchedMail” might be another interesting one to drop directly in a mbox file.
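The two config-file gotchas above (splitting the address across “UserName” and “Domain”, and keeping the plain-text password away from other users) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the “UserName”, “Domain” and “Processor” names come from the post, while the “Password” key name, the `Key = value` layout of every line, and the helper names `split_address` and `write_config` are assumptions made for the example, not something taken from the GetLive manual.

```python
import os
import stat
import tempfile


def split_address(address):
    """Split 'name@domain' into the parts GetLive calls UserName and Domain."""
    user, sep, domain = address.partition("@")
    if not sep or not user or not domain:
        raise ValueError("expected an address of the form name@domain")
    return user, domain


def write_config(path, address, password, processor):
    """Write a GetLive-style config file that only its owner can read."""
    user, domain = split_address(address)
    lines = [
        "UserName = " + user,        # field name taken from the post
        "Domain = " + domain,        # field name taken from the post
        "Password = " + password,    # ASSUMPTION: key name not confirmed by the post
        "Processor = " + processor,  # written as given, with no added quoting
    ]
    # O_EXCL refuses to reuse an existing file, so the 0600 mode always applies.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "getlive.conf")
    write_config(demo_path, "myusername@hotmail.com", "a-fake-password",
                 "/bin/cat - >> FetchedMail")
    with open(demo_path) as handle:
        print(handle.read())
    # Show the permission bits so they can be checked by eye.
    print(oct(stat.S_IMODE(os.stat(demo_path).st_mode)))
```

Creating the file with restrictive permissions from the start, rather than writing it and running chmod afterwards, avoids a short window in which the password would be readable by other users.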
What the manual didn’t say, though, was that when you specify the processor, the quotes are not needed, despite it being a multi-word value. Again, this wasn’t a showstopper, just an unintuitive way of doing things which required a bit more effort on the part of the user. Fortunately this was the last issue with the config file, and running getlive on the commandline then proceeded to download all the emails and put them into a file, as instructed.

mb2md and Mailutils

Now it should just have been a matter of running mb2md (“Mailbox to Maildir”), another Perl script available in Debian, to convert this file into a directory full of emails. Indeed, I did run mb2md -s /tmp/getlive.mbox -d /tmp/getLiveDir as guided by the manual, and despite a few messages of the form Use of uninitialized value $t in utime at /usr/bin/mb2md line 1039, <MBOX> line {some 5 digit number}., the target directory was indeed full of hundreds of valid email files.

It was only when I was about to delete the emails on my Hotmail account that I thought to go back and do some sort of check that the whole process had been as successful as it seemed at first glance. Notice that earlier I said that GetLive put my emails “into a file”, not “into a correctly formatted Mbox file”. I don’t want to make accusations without the full facts, but it appears that either GetLive was not creating the Mbox file correctly, or mb2md was not reading that file correctly.

The simple audit which alerted me to the problem was to compare the number of emails that Hotmail listed for my Inbox with the number of emails actually extracted: 473 against 250. I decided I needed a second opinion, so I rushed ahead with a rather naive grep of the form grep -c "To: myusername@hotmail.com" getlive.mbox which returned the answer 175, because mailing list posts and the like aren’t always individually addressed.
Rather than try to develop my own heuristic for counting emails and introduce a whole new potential source of errors, I went for a well-established tool with the Debian stamp of approval. The package GNU Mailutils seemed to be just what I was looking for, and came with at least two programs which could analyse the situation for me. I first tried the messages utility, which is described as being able to “count the number of messages in a mailbox.” just by invoking the command on the specified Mbox file. When I did this, I got the answer 470, very close to the 473 number I was expecting, which told me that the emails I wanted were probably in the Mbox file somewhere.

In order to confirm this number, I then ran another of the Mailutils called from, using a command like from -f getlive.mbox | wc -l which gave the answer 250. It was now clear that there was something not right about my Mbox file, but by looking at which “From:” addresses the from command found, and thus which emails it could see, I had a simple tool with which to test hypotheses about how to fix the Mbox file. Of course, I could have kept making changes to the Mbox file and rerunning mb2md on it, but this seemed to introduce too many new variables and it didn’t have a --dry-run option to produce just a single number output.

Actually it didn’t take many hypothesis tests to realise that the emails that were getting detected had new lines before the “From:” header lines, and that by adding such a new line I increased the output of from from 250 to 251. Let’s say then that I used some clever multi-line awk script to insert such a new line where necessary, unless you think that such a delicate task as editing an Mbox should be done by hand with human oversight, in which case I did that. Running from again I got to 471 senders, and running mb2md on this file produced the console output: 471 messages..
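The blank-line hypothesis above can be sketched in Python (used here in place of the awk script the post only alludes to). Several things are assumptions: that the separator in question is the mbox-style line beginning with `From ` followed by a space (the post writes “From:”, so this is a guess about what was actually in the file), that a strict reader only accepts a separator preceded by a blank line, and the helper names themselves. It does not describe how the Mailutils tools or mb2md parse a mailbox, and a message body containing an unescaped line starting with `From ` would be miscounted.

```python
def count_separators(text, require_blank_before):
    """Count 'From ' lines, optionally only those that follow a blank line."""
    count = 0
    previous = ""  # treat the start of the file like a blank line
    for line in text.splitlines():
        if line.startswith("From ") and (not require_blank_before or previous == ""):
            count += 1
        previous = line
    return count


def add_missing_blank_lines(text):
    """Insert a blank line before every 'From ' line that lacks one."""
    repaired = []
    previous = ""
    for line in text.splitlines():
        if line.startswith("From ") and previous != "":
            repaired.append("")
        repaired.append(line)
        previous = line
    return "\n".join(repaired) + "\n"


# Made-up sample data: the second message has no blank line before it.
SAMPLE = (
    "From alice@example.org Sun Aug 31 16:33:00 2008\n"
    "Subject: one\n"
    "\n"
    "first body\n"
    "From bob@example.org Sun Aug 31 16:34:00 2008\n"
    "Subject: two\n"
    "\n"
    "second body\n"
    "\n"
    "From carol@example.org Sun Aug 31 16:35:00 2008\n"
    "Subject: three\n"
    "\n"
    "third body\n"
)

if __name__ == "__main__":
    print(count_separators(SAMPLE, require_blank_before=False))  # lenient count
    print(count_separators(SAMPLE, require_blank_before=True))   # strict count
    fixed = add_missing_blank_lines(SAMPLE)
    print(count_separators(fixed, require_blank_before=True))    # strict count after repair
```

On the sample, the lenient and strict counts differ by one before the repair and agree afterwards, which is the same shape as the 470 versus 250 discrepancy described above. Running such a script on a copy of the file and comparing it with the original would keep the human oversight the post mentions.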
Admittedly the “471 messages” line was preceded by the same uninitialized value warnings, and it was a few emails short of the 473 that Hotmail listed, but it was enough for me to move on to the next phase of the import. Unfortunately, though, it was here that I had second thoughts about my hand-editing approach, so I decided to go back and try a different GetLive configuration. This time, my Processor line was: Processor = /bin/cat - >> /tmp/getlive.mbox ; echo "" >> /tmp/getlive.mbox ; sleep 5 with the sleep command being used because I thought that Hotmail might be suspicious of me hammering their pages.

Also, before I ran the command again, I made sure that all my Hotmail emails were marked as Read in the web interface, as the emails which didn’t get imported the first time seemed to include a disproportionate number of those (40 or so) unread ones. I can’t now remember whether this method produced results as good as the hand-edited way, but it didn’t produce an Mbox that was any better, so I had to resort to manually checking the imported set of emails against the source by eye.

It turned out that, looking at the files in date order, there were some emails that existed only in the Hotmail interface, and, bizarrely, some that only existed in the imported set. I counted 6 differences like this (a little over 1% of the total number of emails) which fell into 3 categories as follows:
In the case of this last email, GetLive had given the output “GetLive died with message: ‘Unable to download email message. at /usr/bin/getlive line 876.” during the import process, and had then abruptly finished (as this was the last email in its reverse-chronological importing order) giving the impression that this error was fatal and it had abandoned the rest of the import. Ignoring that slight usability confusion, I would say that GetLive is only really to blame for missing the 2 mailing list emails.

One might think that as a Hotmail importing program it should be able to deal with emails sent by Hotmail Member Services, but actually the Welcome email wasn’t even available through the Hotmail interface, which indicates to me the relative programming skill and culpability of the two pieces of software. The Hotmail error page, by the way, also visible in the debug output of GetLive, said “Windows Live Hotmail wasn’t able to complete this request. Microsoft may contact you about any issues you report.”

As a final clean up, then, I manually edited the three emails with the missing timestamps, using a text editor on the file in my home Maildir folder. In fact, while this did change the “Date:” field value as it appeared in the “Message Pane” of Thunderbird (Icedove), it didn’t change the value in the “Date” column of the list of emails. For a while I searched for a way to make Thunderbird refresh its view of the IMAP headers before realising that the solution was simply to drag the emails out of the folder and then back in again.

I was then in possession of an almost verbatim, and much freer, copy of my Hotmail inbox, on my local machine. What are you going to do when the cloud shuts down?