Holding on to memory

This week I spent far too much of my spare time trying to track down where on earth I was "leaking" memory in some trivial Clojure code.

I was trying to load the GeoPlanet data into MongoDB, using Clojure of course, as part of my quest to learn the language.

Parsing was fairly easy - a neat sequence of reading the file line by line, splitting it up, cleaning up the columns, parsing integers, etc. Thanks to lazy sequences and the ->> macro this all looks almost trivial:

(defn load-data-file
  "General (for any geoplanet data file) loading file function"
  ([file-name]
     ;; load without any mappings
     (load-data-file file-name []))
  ([file-name mappings]
     (binding [duck-streams/*default-encoding* "UTF-8"]
       (->> file-name
            ;; assumed step: read the file lazily, line by line
            (duck-streams/read-lines)
            (map (fn [line] (str-utils2/split line #"\t")))
            (map (fn [line] (map maybe-strip line)))
            (parse-data-file file-name)
            ;; apply each mapping function to the parsed row in turn
            (map (fn [dict] (reduce (fn [dict mapper] (mapper dict)) dict mappings)))))))

There are some extra functions wrapping these to read the three types of files (places, adjacencies, and aliases) but that is irrelevant for the discussion here.
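The last step of load-data-file, the reduce over mappings, is perhaps the least obvious part, so here is a small self-contained sketch of just that idea. The mapper functions and the row keys (:woe-id, :iso, :name) are made up for illustration and are not the ones from my loader:

```clojure
(require '[clojure.string :as str])

;; Hypothetical mappers: each takes a row map and returns an updated one.
(defn parse-woe-id [row]
  (update-in row [:woe-id] #(Long/parseLong %)))

(defn upcase-iso [row]
  (update-in row [:iso] str/upper-case))

(defn apply-mappings
  "Threads row through every function in mappings, left to right."
  [row mappings]
  (reduce (fn [row mapper] (mapper row)) row mappings))

;; Illustrative row only; real rows come out of the file parser.
(def example-row
  (apply-mappings {:woe-id "44418" :iso "gb" :name "London"}
                  [parse-woe-id upcase-iso]))
```

With an empty mappings vector the reduce simply returns the row untouched, which is what the one-argument arity relies on.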

Writing data is also trivial thanks to congomongo. We partition the data we read into larger batches to push into the DB and also give some visual feedback as this will take ages (ignore the munge-place which only adds some extra fields I want):

(defn store-places
  [places-list collection-name]
  (doseq [some-places (partition 1000 1000 [] (map munge-place places-list))]
    (congo/mass-insert! collection-name some-places)
    (print ".")))
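In case the (partition 1000 1000 [] ...) call looks odd: the trailing [] is the pad collection, and with an empty pad the final, shorter batch is kept rather than silently dropped. A tiny sketch with batches of 3 instead of 1000 (the var names are just for the example):

```clojure
;; With an empty pad collection the last partial batch is kept.
(def padded (partition 3 3 [] (range 7)))

;; Without a pad, partition drops the incomplete final batch.
(def unpadded (partition 3 (range 7)))

;; partition-all keeps the final partial batch without needing a pad.
(def all-batches (partition-all 3 (range 7)))
```

Here padded and all-batches both contain three batches, the last one holding only the element 6, while unpadded contains just the two full batches.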

In combination, however, (store-places (load-places-file "7.4.1/uk_places.tsv" false) "uk") soon blew up with an exhausted heap exception. Yet the seemingly equivalent

(doseq [x (partition 1000 1000 []
                     (map munge-place-to-dict (load-places-file "7.4.1/uk_places.tsv" false)))]
  (congo/mass-insert! "uk" x)
  (print "."))

does work.

My guess is that the output of load-places-file is a lazy sequence and that the store-places function holds on to the head of that sequence (it is the places-list parameter), so none of the elements we have already processed can be garbage collected while we iterate over it. In the inline version nothing keeps a reference to the head, so consumed elements can be collected as we go. It took me ages to find this out. The first urge to factor your code into nice small bits is not always the best way to go.
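One way to keep the small function without handing it the sequence itself is to pass a zero-argument function that builds the sequence on demand. This is only a minimal sketch of that idea, not what I ended up running: store-in-batches, make-seq and insert-fn! are hypothetical names, and an atom stands in for the database so the example runs on its own.

```clojure
(defn store-in-batches
  "Calls insert-fn! on each batch of up to batch-size items.
  make-seq is a zero-argument function returning the (lazy) sequence,
  so no parameter of this function ever refers to the head of it."
  [make-seq batch-size insert-fn!]
  (doseq [batch (partition-all batch-size (make-seq))]
    (insert-fn! batch)))

;; Stand-in for the database: just count what would have been inserted.
(def inserted (atom 0))

(store-in-batches #(map inc (range 10000))
                  1000
                  (fn [batch] (swap! inserted + (count batch))))
```

Whether this is actually needed depends on the Clojure version and on how the compiler treats function arguments, so treat it as something to try rather than a guaranteed fix.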