Jan 05 2010
Filesystems and data recovery (an explanation of sorts)
I posted a short guide yesterday to recovering photographs from corrupted memory cards. There were a few interesting technical points in that post that I glossed over for the sake of the explanation, but discussion on Facebook has prompted me to try filling in those gaps. Let’s have a look.
File systems
How is stuff written to and read from a disk, and why would it suddenly stop working?
Reading and writing disks
Picture a disk as a big sheet of grid paper. Each little square in the grid can hold a single letter, and each square has a location, like a grid reference. We can read and write a single letter at a time by knowing its grid reference. Most things which are written on the paper will take up many little squares, because most useful information is more than one letter long: words, sentences, paragraphs and so on are stored as long sequences of individual letters. This doesn’t really pose a problem because our sheet of paper is easily big enough to hold all the sentences we could ever write. It is also possible to store many different sequences of text on one sheet of paper, so that books sit alongside essays alongside love letters alongside formal complaints. The only problem this does leave us with is organising and retrieving the information, since a gridded sheet where each box contains a single letter will start to look more like a wordsearch after a while.
If we have been sensible we will write our texts onto the great sheet of paper in a consistent manner, such as left-to-right, and starting a new row one line down whenever we fill up the previous row. The system of organisation we use is arbitrary — we could just as easily use right-to-left, write vertically or move up a row instead of down at the end of each line. It doesn’t really matter as long as we are consistent. (This should be obvious from the history of written language!) We can then identify the beginning of a piece of text by knowing the grid reference of the first letter; and we can retrieve the whole text if we know how long the text is, since it was written in a consistent direction. If we do this we should easily be able to organise all the information we have by keeping track of the starting location and length of all the sequences of letters we write.
The disk, like the gridded paper, is full of little slots where data can be stored; and each slot has a location called its “address”. Data can be stored on disk in long sequences of adjacent slots. To get the sense of the original data we only need to know the starting address and to read the correct number of entries from that point onwards. All we need to extract a file from disk is two numbers — the address and the length. From this we can reconstruct the original data.
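Here is a minimal sketch of that idea in Python, treating a run of bytes as the disk. The names disk and read_file are invented for illustration, and real disks hand out data in larger blocks rather than one letter at a time.

```python
# A toy "disk": one long run of slots, each holding a single byte.
disk = bytearray(b"....HELLO WORLD....dear sir, i wish to complain....")

def read_file(disk, address, length):
    """Return `length` bytes starting at slot `address`."""
    return bytes(disk[address:address + length])

# The two numbers (address 4, length 11) are all we need to get the text back.
greeting = read_file(disk, 4, 11)
print(greeting)
```

Notice that nothing in the disk itself marks where one text stops and the next begins; the pair of numbers carries all of that knowledge.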
Now we have many long essays and books and letters stored on the hypothetical sheet of paper, and these long sequences of text can be reached via the much shorter pairs of numbers we mentioned, representing address and length. But where are these numbers kept? The one location that doesn’t require any intelligence to find on a grid of paper is the start, which we shall call address 0 (zero). How can we use this to our advantage? We could write a list, starting at address 0, which contains the pairs of numbers needed to find all the other sequences of letters! It will never be possible to “lose” the sequence at the top corner of the sheet of paper so we will always be able to access this contents listing for the rest of the page.
But think again: if the sheet of paper is covered in many sequences of letters, it’s very conceivable that one of those sequences will contain map references or temperatures or other lists of numbers that could be mistaken for the pairs of numbers which we use to find the texts themselves. If a sequence of pairs of numbers starts at address 0 how do we know when it has ended and a different sequence of numbers has started? We know the numbered pairs start at 0 but we don’t know how long they continue after that. The solution is to enter the length of the first sequence at address zero itself. It then becomes possible to start at the top of the page and read off a list of starting locations for every piece of information on that page, without fear of accidentally reading data which isn’t relevant.
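A rough sketch of such a contents listing, again in Python. The layout is my own assumption rather than any real file system: a 4-byte count of entries at address 0, followed by one 8-byte (address, length) pair per text, followed by the texts themselves. The function names are hypothetical too.

```python
import struct

def build_disk(texts, size=256):
    """Lay out a toy disk: entry count at address 0, then the pairs, then the texts."""
    disk = bytearray(size)
    table_size = 4 + 8 * len(texts)      # 4-byte count plus 8 bytes per pair
    cursor = table_size                  # first free slot after the listing
    struct.pack_into("<I", disk, 0, len(texts))
    for i, text in enumerate(texts):
        struct.pack_into("<II", disk, 4 + 8 * i, cursor, len(text))
        disk[cursor:cursor + len(text)] = text
        cursor += len(text)
    return disk

def list_files(disk):
    """Read the count at address 0, then exactly that many (address, length) pairs."""
    (count,) = struct.unpack_from("<I", disk, 0)
    return [struct.unpack_from("<II", disk, 4 + 8 * i) for i in range(count)]

def read_all(disk):
    """Follow every pair in the listing and return the text it points to."""
    return [bytes(disk[a:a + n]) for a, n in list_files(disk)]

disk = build_disk([b"a book", b"a love letter"])
print(list_files(disk))
print(read_all(disk))
```

Because list_files reads the count first, it stops after the last genuine pair and never wanders into the texts, however much they might look like numbers.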
The organisation of real disks and memory cards is done along these lines, and the system of addresses and references that implements it is called a file system.
(Bonus question: Imagine that one of the sequences of text buried somewhere in our piece of paper was a list of location/length pairs just like the one at address zero. The location/length of this secondary list would be mentioned by the primary one, but the pairs of numbers it contains would not be mentioned. What use would this be? Can you think of a common instance of this sublist in computer file systems?)
Disk corruption and file recovery
In order to reuse a disk people often “format” it, making a clean slate. A quick format generally doesn’t touch the data on disk, only the contents listings which help to make sense of it. We can format our grid of paper by altering the contents listing which we store as the first sequence. It’s not difficult: all we have to do is change the number at address zero, the one that represents the length of the contents listing, to a zero. From now on, anyone reading that sheet of paper will immediately conclude that it contains no data, even though every letter is still where it was!
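To make that concrete, here is a small Python sketch of formatting the paper-style disk. It assumes a made-up layout (a 4-byte entry count at address 0, then one address/length pair, then the data), and quick_format and file_count are names I have invented for the example.

```python
import struct

# Toy disk: a 4-byte entry count at address 0, one (address, length) pair, then the data.
disk = bytearray(64)
struct.pack_into("<I", disk, 0, 1)        # the listing holds one entry
struct.pack_into("<II", disk, 4, 12, 9)   # that entry: address 12, length 9
disk[12:21] = b"holiday 1"

def file_count(disk):
    """Read the number of entries recorded at address 0."""
    return struct.unpack_from("<I", disk, 0)[0]

def quick_format(disk):
    """Zero the entry count; nothing else on the disk is touched."""
    struct.pack_into("<I", disk, 0, 0)

quick_format(disk)
print(file_count(disk))        # the listing now claims the disk is empty
print(bytes(disk[12:21]))      # yet the old bytes are still in place
```

Only four bytes changed, which is why formatting a large card can be so quick and why the photos on it can usually still be dug out afterwards.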
Quick and reliable access to the text written on the vast hypothetical sheet of paper hinges on the reliability of the contents listing at the top of the page. It allows us to make sense of the lines and lines of individual letters that follow. If this gets corrupted or erased in any way, our sheet of paper is turned from a useful store into a tedious wordsearch. And as you can see with the formatting example, it takes only a tiny change to render the file listing “wrong”.
So, if through chance or misadventure your disk becomes unreadable in this way, you will not be able to read the data straight off. But it’s all still there, somewhere. The process of file recovery is one of trawling through the disk looking for interesting sequences of text, like a word search. Thankfully many files such as images or videos are organised like a file system in miniature: they contain useful header and file length info which aids identification. For example, JPEG files, the format digital cameras most commonly use for their images, all start with the same two bytes (FF D8 in hexadecimal). File recovery programs can trawl through the disk looking for known sequences such as this and attempt to pull useful data from the mess. In many cases it works very well.
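A very simplified sketch of that kind of trawl, in Python. It is not how any particular recovery program works; it only assumes that a JPEG begins with the bytes FF D8 FF and ends with FF D9, and that each image sits on the disk in one unbroken run. The function name carve_jpegs is my own.

```python
JPEG_START = b"\xff\xd8\xff"   # start-of-image marker plus the first byte of the next marker
JPEG_END = b"\xff\xd9"         # end-of-image marker

def carve_jpegs(raw):
    """Return every byte run in `raw` that starts and ends like a JPEG."""
    found = []
    pos = 0
    while True:
        start = raw.find(JPEG_START, pos)
        if start == -1:
            break                              # no more candidate images
        end = raw.find(JPEG_END, start + len(JPEG_START))
        if end == -1:
            break                              # a start with no end: give up
        found.append(raw[start:end + len(JPEG_END)])
        pos = end + len(JPEG_END)              # carry on after this image
    return found

# Usage: a fake disk image with two tiny JPEG-shaped runs buried in junk.
raw = b"junk" + b"\xff\xd8\xff\xe0AAAA\xff\xd9" + b"more junk" + b"\xff\xd8\xff\xe1BB\xff\xd9"
for image in carve_jpegs(raw):
    print(len(image), "bytes")
```

Real photos complicate this: an embedded thumbnail carries its own end marker, and a file may be scattered across the disk in pieces, so a sketch like this would cut some images short or stitch in rubbish.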
I have used PhotoRec to recover files from a damaged file system for friends in the past, where their hard drive had become damaged and the partition appeared to need ‘formatting’ by XP.
What I found though is that because it doesn’t take into account which files are deleted (for obvious reasons) it can turn up … files the user maybe hoped were not on the disk any more?
That was exactly the case with us — we ended up recovering images going back to about May which had been long since deleted but clearly hadn’t been overwritten. It looks like images which are deleted on camera are deleted (or, over-written) immediately with the next photo, but this is not the case if you delete photos using iPhoto or whatever.
I’ve heard of people who are made to delete photos by police/military/etc carefully not taking any more pictures, then doing a simple FAT undelete when they get home.
It’d be good to have an ‘immediately transfer the image to a secure host via 3G’ option on a camera to prevent such shenanigans.
nice idea… sadly only phones with cameras could do this; the better camera has no such option and no one appears inclined to build one, it would seem.
you’re absolutely right Rob… I was mortified when I realised Dougal could see photos of mine I thought I’d safely deleted: pictures with lousy composition; pics with under-exposure; those one or two pictures where I’d been foolish enough to use flash, with shameful consequences…. I can’t bring myself to look at him now.