Archive for the 'Computing' Category

Aug 23 2010

Update on SICP: two chapters done

Earlier today our little SICP study group agreed to start on Chapter 3 for our next meeting. There are five chapters in total so we are past the halfway mark and it’s been pretty interesting so far. To recap the recent stuff:

  • Representing data structures as functions, and data as functions. Specifically, calculating with Church numerals and simulating cons, car and cdr without built-in operations.
  • Manipulation of code as data, including symbolic differentiation of simple algebraic formulae. Converting “3x²” into “6x” with relevant simplification and grouping of symbols.
  • Some straight-up data structure work — ordered lists, trees, sets — and using them to implement Huffman encoding.
  • Organising multiple interacting systems, in this case a generic arithmetic package which works (almost) seamlessly on integers, rationals, reals, complex numbers and finally polynomial terms. Using this as a way of discussing functional and object-oriented design and introducing the expression problem.

I can definitely say that this last section, building a multi-module (for some definition of module) system made me really miss static types. Not being able to tell where functions were defined and called with different arities, or were using different symbols, until the problem manifested at runtime, was infuriating. The book is very interesting but the language is not to my taste. :-)

For comparison, here’s Yaron Minsky talking about using Ocaml for large scale programming:

Effective ML from Yaron Minsky on Vimeo.

No responses yet

Aug 06 2010

Why my unorthodox solution to the Cyber Security Challenge worked

In case you didn’t see his comment, I just wanted to promote Andrew’s answer to my question to the front page. I had wondered why my dividing-by-two techniques had worked to find the answer to the second part of the Cyber Security Challenge. The explanation is quite obvious when written in full.

The input sequence was a series of 8 bits which had been permuted in some way. According to the published solution they had been rearranged so that the right-most bits had been looped around and pushed onto the left side:

6 7 8 1 2 3 4 5

The recommended solution was of course to take those three and move them back round to the right hand side:

1 2 3 4 5 6 7 8

My accidental solution relied on two features of the problem. First, as I said previously, the hexadecimal encoding could be reordered before decoding. Since each hex number represents 4 bits, we’re going to reorder around the middle:

6 7 8 1 : 2 3 4 5

If we swap these 4-bit sequences first, we get something which is very close to the desired sequence:

2 3 4 5 : 6 7 8 1

The next step I performed was division-by-two. In binary this is represented by shifting all the numbers one place to the right. (Think of dividing by ten normally — you just shift the decimal point one place to the left.) When we do this right-shift the right-most digit disappears, and a left-most digit is created which is always zero in our case. I’ll mark this place with a # symbol.

# 2 3 4 5 6 7 8

We’ve almost got a full understanding of the solution here, but one problem remains. My method and the official method are nearly identical, except for the left-most bit:

1 2 3 4 5 6 7 8 <-- official solution
# 2 3 4 5 6 7 8 <-- my solution

We don’t know what any of the bits 1–8 represents, but we know that # is always a zero. So my solution “works” when bit number 1 is zero. And it just so happens that ASCII, the standard way of encoding simple characters in binary, only uses the 7 bits on the right. The left bit, number 1, is always 0 for simple text encoding.

If the message had been encoded in some extended ASCII format that uses all 8 bits my method would have failed. For the general case, bit 1 is only zero half the time!

No responses yet

Aug 02 2010

Part 2 of Cyber Security Challenge: Hexadecimal gibberish

This second part of the problem was more difficult, though the solution is much shorter to present. The input message looks like this, in part:

68edcdec4e2c8eae8d2c8e2dedcd6e04d2042fedae52ceac
04ccedaecd8c042ccd8c046cedad0e8dac8eac8c048e0dac

Under the assumption that each pair of characters represents a byte, we split the stream into pairs.

splitStream = splitEvery 2

Of course each pair isn’t actually a hexadecimal number, though it looks like one. It’s a string of characters, so we need to turn it into something the computer will recognise as a number.

Now, there’s a robust, I’m-a-serious-engineer way of doing this and then there’s the way I chose, which is to prefix all the strings with “0x” and use the read routine. The Read typeclass is not meant to represent a robust parsing mechanism but since this is a one-shot thing I think we’ll both just ignore it, yeah?

hexToInt :: String -> Int
hexToInt = read . ("0x" ++) -- ouch!

Having read in our number I tried to do what I did in the previous exercise and print it out, which produced complete nonsense. Then I decided that reversing the string before converting it to numbers would be helpful. I did that, but the result was still nonsense, just of a different shade.

Then I decided that my resulting numbers were a bit large. The number which represent capital A is 65, and I was getting numbers in the 200 range. Then I realised, “all these numbers are even, aren’t they?” So I divided the whole lot by two.

Outputting that message gave me a backwards English message. Aha! It’s close but there’s something wrong. I threw in another “reverse” statement and got an answer. At this point I noticed I had two reverse statements — one at the beginning and one at the end. This seemed like it would be the cause of my problems so I removed them both and, Robert’s your father’s brother, I was back to gibberish again.

What happened? What I had failed to realise was that when I had reverse my original message I had done so before splitting into pairs of characters. So ABCD becomes DCBA. After splitting this is [DC,BA]. If each pair is converted into a printable character, represented by f(XY), then the result is [f(DC),f(BA)]. I then reversed the message to get [f(BA),f(DC)].

If I didn’t do any reversing but went through the same splitting and conversion process, I got [f(AB),f(CD)]. Look — f(AB) is not the same as f(BA)! By reversing the list before splitting and again after conversion, I was implicitly reversing the order of the characters in each pair.

Obviously I had to reverse only the pairs after they’d been split, rather than the whole list at the beginning and end. If I do this then print the result I get an answer which is close to right.

byteToChar = chr . flip div 2 . hexToInt . reverse
message = map byteToChar . splitStream
 
loadfile = (head . lines) `fmap` readFile "hexstring.txt"
main = loadfile >>= putStrLn . message

The resulting message is:

Congratulations youve found and completed the REAL challenge. Your win code is cyb3r=s3cur1ty*ch@ll3nge+26-07-2010.

Please email this code to our team at media@cybersecuritychallenge.org.uk. If youre the first person to do so, and can prove you meet the eligibility criteria (British citizen currently resident in the UK) we will be in touch to advise how to claim your prize. Well done and good luck in the Cyber Security Challenge competitions taking place throughout the rest of the year.

There are a few infelicities which are explained by the official solution. My divide-by-two technique is not the proper approach so I think some of the characters get lost in the decryption process. But as the saying goes, close enough for government work!

The solution states:

The challenge was based on a bitshift operation applied to a string, here each byte’s “3 least significant bits” have been added to the left side of the byte (making them the most significant bits respectively)

Assuming we have bits one to eight:

1 2 3 4 5 6 7 8

what I did was divide everything by two, which we do in binary by shifting everything to one side so the least significant bit disappears. We fill the opposite edge with a zero here:

0 1 2 3 4 5 6 7

What the solution demands is to rotate the bits like so:

6 7 8 1 2 3 4 5

I am still not sure how it is that my solution matches so neatly with their own. Answers on a postcard.

If I hadn’t stumbled on a solution like this, the sensible approach might have been to produce a histogram of common characters. In this case, the most commonly used character encoded the space, followed closely by “e”, the most common letter in standard English texts. From there it would have been a bit of a slog to produce the result but it could be done.

The two other solutions I have found both use the official method without problems. I suspect it’s got something to do with default word lengths (8 bit versus 32 bit) and signed integers, though I’ve not thought deeply on the issue.

3 responses so far

Jul 31 2010

Part 1 of Cyber Security Challenge: Decoding the image

The first part of the Cyber Security Challenge requires you to decode a long stream of letters, numbers and symbols into something coherent. This was by far the easiest part of the challenge. The symbols look like this:

/9j/4AAQSkZJRgABAQEAYABgAAD/4QBaRXhpZgAATU0AKgAAAAgABQMBAAUAAAABAAAASgMDAAEA
AAABAAAAAFEQAAEAAAABAQAAAFERAAQAAAABAAAOxFESAAQAAAABAAAOxAAAAAAAAYagAACxj//b
AEMACAYGBwYFCAcHBwkJCAoMFA0MCwsMGRITDxQdGh8eHRocHCAkLicgIiwjHBwoNyksMDE0NDQf

This is only the first three lines — there’s 362 lines of this. The most important bit is that the last line ends in a = sign. This symbol is used for padding out messages when they are encoded in Base 64 format, so if any string has an equals symbols on the end it’s worth seeing what happens if you decode it as if it were Base64. (The truth of the matter is that by the time my mate Rich had told me about this challenge he’d already decoded the image so I didn’t do any of this stuff!)

Thankfully there are plenty of programs that will do the decoding for you, and the resulting file is a JPEG file. Brilliant:

This is XKCD comic number 538, with a subtle difference. Round the outside of the image there is an uneven dotted border. This isn’t in the original comic, and it looks irregular which suggests that it’s not just a pretty pattern but that it encodes some further information.

For the next stage I converted the image to PNG format because it was the easiest format to load and process.

The first part we need to do is load the image file and extract all the bytes from the actual image part, ignoring any metadata. The PNG file is represents a 24 bit colour image, with each component (red, green and blue) stored as an 8 bit number from 0–255. We just return the whole thing as a long list of bytes, starting at the top left and scanning left to right, top to bottom. It’s not efficient but it’s very simple to reason about because we don’t need to think about array locations.

getimage :: IO [Word8]
getimage = do
  (Right img) <- loadPNGFile "decode.png"
  getElems (imageData img)

The long list may be conceptually simple but it’s not representative of the structure of the image. Since each pixel is represented by 3 consecutive numbers we want to gather those numbers together, so we can start dealing with them as pixels. Thus we convert [r,g,b,r,g,b,r,g …] into [[r,g,b],[r,g,b],…]

splitIntoBytes = splitEvery 3

Even though each pixel can represent many colours we’re really only dealing with black and white here so we can represent each component as fully-on or fully-off. Each pixel can really be stored as boolean values like [True, False, False]. Since it’s possible that some values are not absolutely 0 or absolutely 255 I’ve divided them up the middle — anything darker than 128 is 0, and anything lighter is white.

Once each pixel is a list of booleans we can collapse that down into a single black/white value by taking the conjunction. If all values are True then the pixel is True (ie, black) otherwise False (white).

normalise = and . map (< 128)

At this stage our image is no longer a list of integers but a list of booleans. The original list had height-times-width-times-3 elements. Since we’ve collapsed all those three elements into a single value representing each pixel we now have height-times-width elements. The image is 350x175 pixels so we can split it into rows by cutting the list every 350 elements. This gives us 175 rows.

tomatrix = splitEvery 350 . map normalise . splitIntoBytes

The next stage is, I think, the most beautiful aspect of representing the image as a list of lists. We have a full image but we only want the wavy lines of pixels which make up the border. How do we extract those pixels?

The first thing to notice is that the border is not continuous. The top and bottom edges go to the edge of the screen at both sides, but the two sides don’t meet at the top or bottom. To illustrate:

#################
 
#               #
#               #
#               #
 
#################

So the first question, “where do we start and end?” is answered for us. Each segment is discrete. It was my colleague Rich which pointed out the slight gaps in the border which makes this simplification possible. If the border were complete we’d have to determine whether the corner pieces were part of both the horizontal and the vertical (like a crossword where Across and Down share letters) or not.

The second question we ask is, “which direction does the sequence go in?”. Do we go clockwise from top left, following this alphabetical sequence?

a b c d e
n       f
m       g
l k j i h

Or do we maybe scan left-to-right, ignoring gaps?

a b c d e
f       g
h       i
j k l m n

What about anticlockwise? A mixture of the above? I took the approach which seemed obvious to me, clockwise from top left, and it turned out to be correct. But interestingly the final message is repeated in part so choosing the wrong start point would not have mattered, and it would have been obvious that the output was almost right.

Each row is a list, stored in order from top to bottom. This means the top border is just the first list:

topedge = head

The bottom border is the last list, but because we’ve chosen clockwise we need to reverse the list too.

bottomedge = reverse . last

The right edge is the last element of each list, ignoring the first 3 rows and the last three rows, because as mentioned above, the side patterns are shorter. These two look slightly complicated but it’s mostly dealing with trimming the top and bottom edges off.

rightedge = reverse . drop 3  . reverse . map last . drop 3
leftedge = drop 3 . reverse . map head . drop 3

Now we can extract each edge of the border we can join them all together into our encoded message. This converts our list of lists, which is a complete picture, into a list of bits encoded in the border.

msgstream :: [[Bool]] -> [Bool]
msgstream bits = concatMap ($ bits) [topedge, rightedge, bottomedge, leftedge]

Now we have a long list of bits. And it just so happens the number of bits we have is divisible by eight! This is a good sign because binary information is typically grouped into sets of eight. For the next stage we group our list into sets of 8 bits and turn each 8 into a single number.

One of the tricky aspects about any number is that you can’t tell in advance which end to start from. You and I know that 12 is “twelve” not “twenty-one” but that’s because we know to read numbers from left to right. Computer formats have used both in the past so I wasn’t sure which one would be important to me — is the biggest number the first digit or the last digit? The actual question is, “is the leading digit the most significant bit or the least significant bit?”. I calculated both to see what would happen, and it turned out that most-significant-bit was the way to go. Thankfully, MSB and LSB are just the reverse of each other, so by implementing one we get the other for free!

msb,lsb :: [Bool] -> Int
msb = lsb . reverse
lsb = bitsToInt . map (fromIntegral . fromEnum)

We calculate least significant bit by converting all the True values to 1 and all the False values to zero, and then multiplying element-wise by a stream of powers of two. To illustrate:

  1 2 4 8 16
* 0 1 1 0 1
= 0 2 4 0 16

Then we just add that list up, so “0 1 1 0 1” is “0+2+4+0+16” or 22.

bitsToInt = sum . zipWith (*) (iterate (2*) 1)

After all that prelude we can put this segment together, dividing the image up into its matrix, extracting the bits from the border of the image, splicing them up into sets of eight and converting each 8 bits into a single number:

msgbits = map msb . splitEvery 8 . msgstream . tomatrix

Now what do we do with our numbers? Readable characters are represented internally as numbers, so it is a simple thing to convert between number and printable characters. If we convert each number to a character we get nonsense, but consistent and tantalising nonsense:

Cyrnfr sbyybj guvf yvax: uggcf://plorefrphevglpunyyratr.bet.hx/834wgc.ugzy uggcf://plorefrphevglpunyyratr.bet.hx/834wgc.ugzy

Look at those sequences “uggcf://” repeated twice. That looks so much like a web address, “https://”. If we assume that it is a web address how has it been altered? Each letter in a pair, h/u, t/g, c/p, is 13 characters apart from its partner. But the punctuation symbols aren’t any different.

This looks like the encoding called “rot13” where each character is shifted 13 characters along in the alphabet, and if we reach the end we wrap back to the start. Also, since the first letter of the nonsense message is a capital and none of the rest are it seems like case is being preserved by this encoding.

We convert back to sensible words by rot13 encoding again, since 13+13=26 so any two applications of this transformation will undo each other.

Uppercase and lowercase are treated separately but by the same process. We’re processing some character which we call c and we want to know how many characters it is away from the start of the alphabet. The letter ‘a’ (or ‘A’) is 0 characters away from the start, ‘b’ is 1 character and so on.

If we assume an infinite stream of letters “a b c … x y z a b c…” repeating the alphabet, then the Nth letter in that stream is the one which is N away from the start. The 0th letter is ‘a’, the first letter ‘b’. But if we chop the first 13 characters off this stream, so it’s “n o p … y z a b c …” then the 0th letter is ‘n’, the 1st letter ‘o’ and so on. Each offset now directly maps one letter onto its partner letter. And because the sequences of letters is endless we don’t have to worry about falling off the end. The mapping just loops back on itself:

0 1 2 3 4 5 6 ...
a b c d e f g ...
n o p q r s t ...

Thus we can find the alternate character in our pair by checking first of all how many characters our current letter is from the start, and then looking for the equivalent character in the list which has had its head chopped off. We do the same for the upper case characters and just pass through unchanged anything which isn’t upper or lower case — all the punctuation and spaces.

rot13 :: Char -> Char
rot13 c | isLower c = lowercase!!(ord c - ord 'a')
        | isUpper c = uppercase!!(ord c - ord 'A')
        | otherwise = c
  where lowercase = drop 13 $ cycle ['a'..'z']
        uppercase = map toUpper lowercase

By now I think you’re dying to know what the message says, so let’s finish up here. We decipher a message by converting each integer to a character and performing a rot13 transformation — then we print it.

decipher = putStrLn . map (rot13 . chr)
main = getimage >>= decipher . msgbits

The whole thing is pulled together so we can run it from the command line and we receive the sensible output of:

Please follow this link:
https://cybersecuritychallenge.org.uk/834jtp.html https://cybersecuritychallenge.org.uk/834jtp.html

Next time I’ll look at what happens when we follow that link, and how we complete the next phase of the challenge.

If any of this doesn’t make sense or is confusing please ask questions in the comments. If you want to look at the code in full you can read the solutions for part 1 and part 2 online at http://www.dougalstanton.net/code/cybersecurity/.

As I mentioned in my introductory post I was incredibly impressed by the easy exploratory power of the Haskell code I wrote. Writing little segments to splice, decode, convert and transform made it simple to try out different ways of getting sensible output. I had no idea when I started which way the answer would take me, but putting them together in different combinations in the interpreter was easy and provided instant feedback. (That being said, Rich kinda floored me with his ability to wrangle Excel of all things into solving this problem, though that doesn’t mean it’s the right tool for the job!) Tune in next time for more exciting cryptographic games!

One response so far

Jul 30 2010

Cyber security teaser challenge

Published by Dougal under Computing, Programming

During the week a friend from work pointed me towards the Cyber Security Challenge, which is being touted as a popular method to promote the field of computer security in the UK.

They had released a teaser challenge as a sample of the main competition which seemed like it was worth trying. It took us a couple of days to get to the answer, not least of all because we actually had work to do!

Since the teaser challenge is now well over and the “official” method has been published on the challenge page (move your mouse over the big white space to see the solution) I thought it would be interesting to go through my solution and see (a) what the puzzle-setters had done (b) how the answer was stumbled upon (c) how the answer could have been derived and (d) where my method differed from the official approach, despite reaching basically the same answer. It’s been a while since I thought about the practical techniques of cryptanalysis so my methods are neither the best nor the most sensible, but at least I can be honest. Helen’s just finished reading Cryptonomicon so will probably tell where I went wrong!

If you want to try out your skills the original problem is still available and will probably be there for some time. At least you can see what the starting point was.

I’ll write up my results and methods in the next few days, posting each section as it goes. I’ll also publish the code I used to extract the solution, which was written in Haskell. I’ve never done anything like this before but the Haskell list-munging functions and point-free style made it simple. The effortless compositional style of creating pipelines of functions meant that I could play with little sections, joining and splicing them as necessary, until I found something which seemed like a good result. But there will be more details in the coming days.

No responses yet

Jul 24 2010

Networking and redundancy (double meanings)

Published by Dougal under Humour, Networking, Work

For the past three and a half years I have been working on router redundancy protocols. When your router (or its upstream connection) dies for some reason you want to minimise the loss on people using the network. Ideally users should never notice loss of connection, though in the real world there will be some time delay before things are working again. The work I’ve been doing relies on having a second router which has its own connection to the local network and to the wider world. It acts as a redundant backup so that when the first one dies the second can step into its place within some short period.

When the primary router is working normally the secondary doesn’t do much. Its only role is to monitor the liveliness of the primary machine. The redundant router can often be used for other things when the primary is operating — and many times the primary acts as a redundant router for the secondary’s clients. Each provides backup for the other.

So when I found out recently that I was being made redundant I thought “great! I’ll just sit and watch other people working and take over if they burst into flames”. But it turns out that when people are redundant it’s totally different from when routers are redundant. Instead of being relied on for backup in case of failure, it means “no longer working”. Strange but true! I can see why it wouldn’t catch on very well in networking.

My last day at Cisco is this Friday (30 July). It’s been an interesting few years and provided novel experiences, silly conversations about Star Trek and given me a bit more confidence. I’m sad to be going, and though there will always be loose ends to tie up and the promise of interesting projects on the horizon, the team I’m leaving behind seems to have a glut of these at the moment. I’m also disappointed that the study group at work will continue reading SICP without me. Obviously I can read it alone but the discussion and peer support/pressure was a useful part of it.

Meanwhile, the job hunt continues. Recruitment agencies make this process at least ten times harder by hiding the employer, the industry and the specifics of the job for their own ends. I have had a few friends pass on job details, and had some telephone discussions, but no success yet. Watch this space, or one very much like it.

One response so far

Jul 23 2010

What does your IP address mean?

Published by Dougal under Networking

The internet protocol (IP) address of my laptop is 192.168.0.3/24. What does this mean?

The first and obvious question is, What’s an IP address? It’s an identifier that your computer uses to talk to other computers on the network. It bears a lot of resemblance in form and function to a telephone number. There are prefixes which are shared by every address in the area, and then there’s a bit specific to you.

ethernet cable

An IP address has two important pieces of information embedded in it. The first is the host ID — the identifier of my specific computer. The second is the network address — which is the number of the network my computer is found in, and is analogous to a telephone area code.

Just to make things difficult though, the two numbers are joined together so you can’t tell where one part ends and the other begins. So when talking about addresses we need another piece of information to tell us which part is network and which part is host.

First, let’s write out 192.168.0.3 in binary. This makes a very long number but it will make everything much clearer from here on. Each number between the dots is converted separately. This number is not the same as 19,216,803.

192.168.0.3 = 11000000 10101000 00000000 00000011

I’ve left spaces where the dots were previously. Each section is 8 binary digits long, so each section can represent a maximum number of 11111111 — which is 255 in decimal.

The next bit we come to is choosing a point on that line so that all the digits on the left represent the network address and all the ones on the right show the host address. Looking back up at the address I gave at the top you’ll see a “/24” sitting at the end. This is called the network mask and it works just like a piece of card with a hole in it. You write out your address and then align 24 bits underneath: everything with a 0 underneath is masked out, leaving the network address. We are getting the result of 1 whenever both the address and the mask is 1.

192.168.0.3  = 11000000 10101000 00000000 00000011
mask 24 bits = 11111111 11111111 11111111 00000000
result       = 11000000 10101000 00000000 00000000

Back in decimal land, that network address is 192.168.0.0. Since the network mask is also a binary number it is often written like an IP address, as four decimal numbers separated by dots. The same address can be written as 192.168.0.3/24 or 192.168.0.3/255.255.255.0.

We can invert the network mask to give a host mask and use the same procedure to find out the host ID, which turns out to be 00000011, otherwise known as 3.

192.168.0.3 = 11000000 10101000 00000000 00000011
mask 8 bits = 00000000 00000000 00000000 11111111
result      = 00000000 00000000 00000000 00000011

You probably knew that from inspection but computers ain’t so clever!

Another useful number we can learn from the address and mask is the broadcast address. This can be used to send messages to everyone on the network. We calculate this by inverting the mask again, so we’ve got 8 bits on the right instead of 24 bits on the left, and taking the result to be 1 wherever the address or the mask is 1.

192.168.0.3 = 11000000 10101000 00000000 00000011
mask 8 bits = 00000000 00000000 00000000 11111111
result      = 11000000 10101000 00000000 11111111

The result can then be written as 192.168.0.255 in dotted-decimal format. Packets sent to this address will be examined by every host on the local network. This is useful if you don’t know the address of your recipient!

Photo is The World’s Network by saschaaa.

No responses yet

Jun 22 2010

A simple lab book

Published by Dougal under Computing

At work I keep all my different tasks in separate screen sessions. Each session has a friendly name which makes sense to me, but also has a variable which captures the bug ID. There are various command line tools I interact with that require the ID so it’s easier if I set it once and don’t need to copy-paste it in each time.

After seeing this post on keeping a “lab book” I realised I could make extra use of this environment variable that I already used.

I save this file in ~/bin/note and I can write quick scribbles to myself in whichever window I’m using and it will go to the correct “book”:

#!/bin/bash
 
LOGDIR=$HOME/labbook
 
if [ "" == "$BUGID" ]
then LOGFILE=$LOGDIR/GENERAL.log
else LOGFILE=$LOGDIR/$BUGID.log
fi
 
if [ "" == "$*" ]
then
    echo "No log written: message empty"
else
    DATE=$( date +%Y%m%d-%H:%M )
    mkdir -p $LOGDIR
    echo $DATE $* >> $LOGFILE
fi

It’s not complex but I think it will do wonders for how much I remember to document about my work from day to day.


Update: I learned today the delight of this one-liner which prints everything I’ve done today in order. It’s quite pleasing.

$ grep -h `date +%Y%m%d` labbook/* | sort

No responses yet

May 10 2010

Reading at work

Published by Dougal under Computing, Work

At work there’s a tentative conversation about a book group, to read and discuss (and presumably work through) books like SICP. I’m tempted to join in but I wonder if I’d actually get anything done. I haven’t been to the Science and Society Reading Group for several sessions because I just haven’t had time to read the books in question.

(This is why our other book group is better because reading books is optional and finishing books is not necessary.)

But the combination of SICP and discussion with some of the engaged programmers at work might be really useful. I will put my name down and see what happens. The starting book suggested was the aforementioned SICP but after that they’re going to look at something Agile/XP related which is generally an area I know nothing about except by hyped reputation. I’d also be interested in reading research papers so I might suggest that too.

No responses yet

Jan 05 2010

Filesystems and data recovery (an explanation of sorts)

Published by Dougal under Computing

I posted a short guide yesterday to recovering photographs from corrupted memory cards. There were a few interesting technical points in that post that I glossed over for the sake of the explanation, but discussion on Facebook has prompted me to try filling in those gaps. Let’s have a look.

File systems

How is stuff written to and read from a disk, and why would it suddenly stop working?

Reading and writing disks

Picture a disk as a big sheet of grid paper. Each little square in the grid can hold a single letter, and that little square has a location, like the grid reference. We can read and write a single letter at a time by knowing its grid reference. Most things which are written on the paper will take up many little squares, because most useful information is more than one letter long — words, sentences, paragraphs and so on are stored as long sequences of individual letters. This doesn’t really pose a problem because our sheet of paper is easily big enough to hold all the sentences we could ever write. It is also possible to store many different sequences of text on one sheet of paper, so that books sit alongside essays alongside love letters alongside formal complaints. The only problem this does leave us with is organising and retrieving the information, since a gridded sheet where each box contains a single letter will start to look more like a wordsearch after a while.

If we have been sensible we will write our texts onto the great sheet of paper in a consistent manner, such as left-to-right, and starting a new row one line down whenever we fill up the previous row. The system of organisation we use is arbitrary — we could just as easily use right-to-left, write vertically or move up a row instead of down at the end of each line. It doesn’t really matter as long as we are consistent. (This should be obvious from the history of written language!) We can then identify the beginning of a piece of text by knowing the grid reference of the first letter; and we can retrieve the whole text if we know how long the text is, since it was written in a consistent direction. If we do this we should easily be able to organise all the information we have by keeping track of the starting location and length of all the sequences of letters we write.

The disk, like the gridded paper, is full of little slots where data can be stored; and each slot has a location called its “address”. Data can be stored on disk in long sequences of adjacent slots. To get the sense of the original data we only need to know the starting address and to read the correct number of entries from that point onwards. All we need to extract a file from disk is two numbers — the address and the length. From this we can reconstruct the original data.

Now we have many long essays and books and letters stored on the hypothetical sheet of paper, and these long sequences of text can be reached via the much shorter pairs of numbers we mentioned, representing address and length. But where are these numbers kept? The one location that doesn’t require any intelligence to find on a grid of paper is the start, which we shall call address 0 (zero). How can we use this to our advantage? We could write a list, starting at address 0, which contains the pairs of numbers needed to find all the other sequences of letters! It will never be possible to “lose” the sequence at the top corner of the sheet of paper so we will always be able to access this contents listing for the rest of the page.

But think again: if the sheet of paper is covered in many sequences of letters, it’s very conceivable that one of those sequences will contain map references or temperatures or other lists of numbers that could be mistaken for the pairs of numbers which we use to find the texts themselves. If a sequence of pairs of numbers starts at address 0 how do we know when it has ended and a different sequence of numbers has started? We know the numbered pairs start at 0 but we don’t know how long they continue after that. The solution is to enter the length of the first sequence at address zero itself. It then becomes possible to start at the top of the page and read off a list of starting locations for every piece of information on that page, without fear of accidentally reading data which isn’t relevant.

The organisation of real disks and memory cards is done along these lines, and the system of addresses and references which comprise the implementations are called a file system.

(Bonus question: Imagine that one of the sequences of text buried somewhere in our piece of paper was a list of location/length pairs just like the one at address zero. The location/length of this secondary list would be mentioned by the primary one, but the pairs of numbers it contain would not be mentioned. What use would this be? Can you think of a common instance of this sublist in computer file systems?)

Disk corruption and file recovery

In order to reuse a disk people often “format” it, making a clean slate. The formatting process generally doesn’t affect the data on disk, only the contents listings which help to make sense of it. We can format our grid of paper by altering the contents listing which we store as the first sequence. It’s not difficult — all we have to do is change the number at address zero, the one that represents the length of the contents listing, to a zero. From now on, reading that sheet of paper will immediately show that this sheet contains no data!

Quick and reliable access to the text written on the vast hypothetical sheet of paper hinges on the reliability of the contents listing at the top of the page. It allows us to make sense of the lines and lines of individual letters that follow. If this gets corrupted or erased in any way, our sheet of paper is turned from a useful store into a tedious wordsearch. And as you can see with the formatting example, it takes only a tiny change to render the file listing “wrong”.

So, if through chance or misadventure your disk becomes unreadable in this way, you will not be able to read the data straight off. But it’s all still there, somewhere. The process of file recovery is one of trawling through the disk looking for interesting sequences of text, like a word search. Thankfully many files such as images or videos are organised like a file system in miniature — they contain useful header and file length info which aids identification. For example, JPEG images, which is how digital cameras commonly store their images, start all their files with a particular number. File recovery programs can trawl through the disk looking for known sequences such as this and attempting to pull useful data from the mess. In many cases it works very well.

6 responses so far

Next »