Example: Indexing a CSV File

The Data

The CSV file I wish to index contains data from the UK Land Registry:

https://www.gov.uk/guidance/about-the-price-paid-data

This records the value of every house sale in England and Wales. I have gathered the data from 2000 onwards and extracted the following dataset: the postcode, year, and price of every transaction.

What I want is to be able to extract the list of year / price entries for a particular postcode.

postcode,year,price       
AL10 0AB,2000,63000       
AL10 0AB,2003,126500      
AL10 0AB,2003,167000      
AL10 0AB,2003,177000      
AL10 0AB,2004,125000      
AL10 0AB,2013,220000      
AL10 0AB,2014,180000      
⋮
YO8 9YB,2021,269950
YO8 9YD,2011,230000
YO8 9YD,2012,249999
YO8 9YD,2018,327500
YO8 9YE,2009,320000
YO8 9YE,2019,380000
YO8 9YE,2020,371500
YO90 1UU,2017,15500000
YO90 1WR,2015,28100000
YO91 1RT,2017,150000

The file has 13,193,754 lines, including the header, and is 267,473,752 bytes in size.
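If you fetch the data yourself, Base's countlines and filesize should confirm those figures:

julia> countlines("pc_year_price.csv")
13193754

julia> filesize("pc_year_price.csv")
267473752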

In the Index I can store a NamedTuple (data::UInt64, aux::UInt64) for each UInt64 Key.

So what I shall do is store, for each postcode, the file offset of its first row in data and its number of rows in aux.
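The Dict printed further down shows that the value type is NamedTuple{(:data, :aux), Tuple{UInt64, UInt64}}; create_kvs below refers to it as DataAux. A minimal sketch of that alias (the name DataAux presumably comes from the index module; the definition here is my assumption):

# assumed alias: a file offset (data) paired with a row count (aux)
const DataAux = NamedTuple{(:data, :aux), Tuple{UInt64, UInt64}}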

It just so happens that a postcode, once the space is dropped and the outward code is padded, is at most 7 characters long, so it can be packed into a UInt64 with a byte to spare for the tag.

By turning the postcodes from

YO8 9YE
YO90 1UU

into a fixed format

YO 8 9YE
YO90 1UU

(the letter and digit halves of the outward code are each left-padded to two characters) we can also make the process reversible.

Here are the functions to do this

# Pack a postcode into a UInt64: split it into its component parts, pad the
# letter and digit halves of the outward code to two characters each, and fold
# the resulting (at most 7) ASCII bytes into the low bytes of a UInt64.
# Returns zero for anything that doesn't parse as a postcode.
function postcode_to_UInt64(pc)
    m = match(r"([A-Z]+)([0-9]+([A-Z]+)?) ?([0-9]+)([A-Z]+)", replace(pc, " " => ""))
    m === nothing && return UInt64(0)
    reduce((a, c) -> (UInt64(a) << 8) + UInt8(c), collect(lpad(m[1], 2) * lpad(m[2], 2) * m[4] * m[5]), init=0)
end

# Reverse the packing: peel the bytes off the UInt64 (least significant
# first), then rebuild "OUTWARD INWARD" form, stripping the padding spaces.
function UInt64_to_postcode(u)
    cs = Char[]
    while u > 0
        push!(cs, Char(u & 0xff))
        u >>= 8
    end
    part(cs) = replace(String(reverse(cs)), " " => "")
    "$(part(cs[4:end])) $(part(cs[1:3]))"   # cs[1:3] holds the inward code, cs[4:end] the outward
end
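A quick round trip shows the encoding at work; the hex value is the same one that appears as a key in the Dict below:

julia> postcode_to_UInt64("YO8 9YE")
0x00594f2038395945

julia> UInt64_to_postcode(0x00594f2038395945)
"YO8 9YE"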

Creating the Keyset

So, now we iterate over the data and create the Keyset:

# Count the rows for pcode, whose first row was already consumed by the
# previous call. Returns the next postcode encountered, the total row count
# for pcode, and the file offset of the next postcode's first row.
function count_lines(io, pcode)
    lines = 1                # account for the row the previous call consumed
    pos = position(io)
    while !eof(io)           # readline never returns nothing in Julia, so test eof explicitly
        line = readline(io)
        newcode = split(line, ",", limit=2)[1]
        if newcode != pcode
            return newcode, lines, pos
        end
        lines += 1
        pos = position(io)
    end
    return "", lines, pos    # end of file: no next postcode
end

create_kvs(fname) = open(create_kvs, fname)

function create_kvs(io::IO)
    kvs = Dict{UInt64, DataAux}()
    readline(io)                  # skip the header row
    pos = position(io)            # offset of the first data row
    pcode, lines, nextpos = count_lines(io, "postcode")   # reads the first row to find the first postcode
    while pcode != ""
        newcode, lines, nextpos = count_lines(io, pcode)
        kvs[postcode_to_UInt64(pcode)] = (data=pos, aux=lines)
        pcode = newcode
        pos = nextpos
    end
    kvs
end

It's not really necessary to understand how all that works (in fact my first version was wrong!); just know that when we run it

julia> create_kvs("pc_year_price.csv")

we get something like the following Dictionary

Dict{UInt64, NamedTuple{(:data, :aux), Tuple{UInt64, UInt64}}} with 1424321 entries:
  0x00594f2038395942 => (data = 0x00000000000000e6, aux = 0x0000000000000007)
  0x00594f2038395945 => (data = 0x0000000000000136, aux = 0x0000000000000009)
  ⋮
  0x00414c3130304142 => (data = 0x0000000000000536, aux = 0x0000000000000027)
  0x00594f2038395944 => (data = 0x0000000000000bfa, aux = 0x0000000000000013)

The keys are the encoded postcodes, e.g.

"AL10 0AB" becomes 0x00414c3130304142 and, at offset 0x536 in the CSV file, has 0x27 (39) rows of data.

This Dict is exactly the keyset format we need to build the actual index.
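We can sanity-check one entry by seeking the CSV file to the stored offset; going by the sample rows above, the first line read should be the first "AL10 0AB" transaction:

julia> open("pc_year_price.csv") do io
           seek(io, 0x536)
           readline(io)
       end
"AL10 0AB,2000,63000"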

Building the index from the keyset is handled by the module:

julia> @time build_index_file("Postcode_Year_Price.index", create_kvs("pc_year_price.csv"))
 10.181108 seconds (83.94 M allocations: 6.096 GiB, 10.57% gc time)
42453022

build_index_file returns the number of bytes written.

So now in my pwd() there is the index file:

42453022 Jan 18 13:56 Postcode_Year_Price.index

At 41M it is about a sixth of the size of the 256M pc_year_price.csv, but the real point is that to explore it we don't need to have it in memory.

If we open the index we get

julia> idx = open_index("Postcode_Year_Price.index")
 NI key:0x1053522036384146 value: (LR left: NI key:0x2053522036384146 value:(Leaf data:0x000000000287c576 aux:0x0000000000000000) right: NI key:0x20594f3931315254 value:(Leaf data:0x000000000287c76e aux:0x0000000000000000))

This is the root node of the first page. This example is rather shallow, but that is an artefact of the number of data Leaf nodes.

We can search this idx

julia> get_leaf(idx, postcode_to_UInt64("YO8 9YB"))
 NI key:0x30594f2038395942 value:(Leaf data:0x000000000ff15226 aux:0x0000000000000007)

How quick is this?

julia> @benchmark get_leaf($idx, $(postcode_to_UInt64("YO8 9YB")))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  32.455 μs …  3.829 ms  ┊ GC (min … max): 0.00% … 97.92%
 Time  (median):     35.409 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.733 μs ± 52.923 μs  ┊ GC (mean ± σ):  2.01% ±  1.38%

    ▃█▆▄▅▄▃▆█▆▅▅▃     ▄▅▃                                      
  ▂▅██████████████▆▅▅█████▇▅▄▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▄
  32.5 μs         Histogram: frequency by time        46.5 μs <

 Memory estimate: 9.31 KiB, allocs estimate: 247.

Certainly a lot slower (around 20x) than reading it straight from an in-memory Dict.

But at least it doesn't need the whole kvs in memory.
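For comparison, a plausible in-memory baseline is a straight Dict lookup on the kvs itself, which can be timed the same way (results omitted; this is roughly the baseline behind the 20x figure above):

julia> kvs = create_kvs("pc_year_price.csv");

julia> @benchmark get($kvs, $(postcode_to_UInt64("YO8 9YB")), missing)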

What we don't have, though, is the actual data. So let's write an accessor function for that.

prices_for_postcode(idx, pcode, csvfile) = open(csvfile) do io
    prices_for_postcode(idx, pcode, io)
end

function prices_for_postcode(idx, pcode, csvio::IO)
    (offset, lines) = get(idx, postcode_to_UInt64(pcode), (0, 0))
    if lines > 0
        seek(csvio, offset)       # jump straight to the postcode's first row
        return CSV.File(csvio; header=["postcode", "year", "price"], limit=lines)
    end
end

We write two methods: one that takes a filename and one that takes an IO. That way we can perform either a single lookup, or multiple lookups without opening / closing the file each time (or use an IOBuffer instead of a file and go full circle!).
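For example, a batch of lookups over a single open file might look like this (a sketch; the postcodes are just ones taken from the sample data):

julia> open("pc_year_price.csv") do io
           [prices_for_postcode(idx, pc, io) for pc in ["AL10 0AB", "YO8 9YB", "YO8 9YD"]]
       end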

The IO method finds the Leaf node for the postcode. Remembering that data is the offset and aux the number of rows, we can then read just those rows as CSV from the file:

julia> prices_for_postcode(idx, "YO8 9YB", csvfile)
7-element CSV.File:
 CSV.Row: (postcode = "YO8 9YB", year = 2000, price = 59500)
 CSV.Row: (postcode = "YO8 9YB", year = 2000, price = 95000)
 CSV.Row: (postcode = "YO8 9YB", year = 2009, price = 230000)
 CSV.Row: (postcode = "YO8 9YB", year = 2014, price = 222000)
 CSV.Row: (postcode = "YO8 9YB", year = 2014, price = 237000)
 CSV.Row: (postcode = "YO8 9YB", year = 2018, price = 142500)
 CSV.Row: (postcode = "YO8 9YB", year = 2021, price = 269950)

We could also create a DataFrame from this data, if we were so inclined

julia> prices_for_postcode(idx, "YO8 9YB", csvfile) |> DataFrame
7×3 DataFrame
 Row │ postcode  year   price  
     │ String7   Int64  Int64  
─────┼─────────────────────────
   1 │ YO8 9YB    2000   59500
   2 │ YO8 9YB    2000   95000
   3 │ YO8 9YB    2009  230000
   4 │ YO8 9YB    2014  222000
   5 │ YO8 9YB    2014  237000
   6 │ YO8 9YB    2018  142500
   7 │ YO8 9YB    2021  269950

Creating this document revealed that node_range is broken, so I won't write that up at the moment.

In future work, I plan to incorporate SeaweedFS into the Index, so we could use this index along with Distributed. But that's another project!