## Interview Question

## Interview Answer

2 Answers

1) divide those integers into 5 equal set of numbers, so each set is less than 2MB (very important for later)

2) sort them using qsort or something

3) find the medium of all those 5 sets

4) then you are left with arrays with following

ie:

... 2, 5, 7, (10), 13, 14, 19...

... 4, 5, 9, (12), 19, 20, 33...

... 1, 2, 10, (14), 22, 23, 43... <= medium of mediums

... 10, 11, 12, (17), 19, 20, 35...

... 1, 2, 19, (21), 24, 29, 33...

where everything left and above medium of mediums(14) is smaller than 14, and everythign right and bleow medium of mediums(14) is larger than 14.

then, you merge everything left and above of 14 to a set of number smaller than 14 (samller set), you merge everything right and below to a set of sorted number which larger than 14 (larger set)

then you sort remaining two group of numbers (right below and left above), take numbers smaller than 14 to merge with smaller set, and take numbers larger than 14 to merge with larger set.

## Add Answers or Comments

To comment on this question, Sign In with Facebook or Sign Up

1 million integers = 4MB which is > 2MB RAM.

solution: external sort - divide and conquer

1. read half the list into 2MB ram and sort using quicksort (quicksort uses logn space - however 0.5m integers is less than 2MB (2000kb v 2048kb) so this should be okay).

2. write sorted data to disk

3. repeat for next chunk

4. merging results: we need an output buffer. lets say this is 1MB. then we read 512KB from each of your chunks into input buffers. we then perform a 2-way merge of the data. when an input buffer becomes empty we can pull in the remainder of the chunk.

5. when the output buffer is full we write this to disk.

6. when the process completes we are left with 2x 1MB files sorted in the correct order.

notes:

- when choosing an input buffer size, the smaller we make it the more I/O operations we will need to perform which would significantly slow down our sorting. a smaller output buffer would also do the same. we would rather a smaller output buffer as multi-threading would allow sorting to continue while writing to disk.

- we could use an additional thread to load next chunk from disk when input buffer drops below half size.

- HDD mirror raid would increase sequential read and write speed to disk as well as HDD speed

- compression of input would also reduce I/O

- we choose quicksort over mergesort as mergesort requires O(n) space. quicksort therefore reduces the number of merge operations.

Nov 9, 2010