Friday, January 6, 2012

Porting libs to ClojureCLR: an example

Responses to the 2011 ClojureCLR survey gave high priority to the porting of Clojure contrib libs to ClojureCLR.  One action item resulting from an analysis of the responses was to provide examples of the porting process.  As an example, I decided to port the data.json contrib lib, authored by Stuart Sierra.  In looking across all the authorized contrib libs on github.com/clojure, this port was above trivial but below daunting in complexity.  (In a later post, I will analyze all the contrib libs for complexity.)

The following recipe will work for simple ports.

Figure out how the original code works.  This step might be optional on the simplest ports involving just simple interop method renaming.  For this port, I had to modify one algorithm and think through extending a protocol to CLR types.

Create a project for the port.  I did not fork the original data.json project on github; the project needs its own identity and you're not going to be issuing pull requests back to the original.

You are on your own for picking project names and namespace structures for these ports.  I chose the name cljclr.data.json for both the project and the namespace.  You will find my project here.

I would appreciate input on naming these projects.  Here are my current thoughts.  I didn't want to call this project clr.data.json--I plan to use clr.X.X for ports of contrib libs that have names like java.X.X.  The namespace for data.json is actually clojure.data.json, so I decided to use cljclr.data.json.  I thought of data.json.clr and other variations adding clr as a component, but was offended by having an extra layer of subdirectory injected.  I don't actually like cljclr--I can't read it and I always mistype it.  Help!

In the absence of tools such as leiningen or a working Visual Studio extension, for now you will have to come up with your own project structure.  I just used

  • <root>
    •  src
    •  test

as the two main branches, with subdirectories corresponding to the namespaces.

Copy files from the source project.  Usually, you can preserve the basic code structure, relocating files appropriately for your namespace structure.  The data.json project really has only two files of significance:  json.clj and json_test.clj.  I copied them into src/cljclr/data/json.clj and test/cljclr/data/json_test.clj.

Modify file headers. I changed the ns directive in each file to match my namespace structure. If you are interested in things like copyrights, you will have to figure out how you want to acknowledge the original author according to the license on the original project.

Scan for trouble.  I look for trouble.  Here are some of the things I look for.

  • Imports in the ns directive. Imports of clojure.lang classes are usually okay--I've been careful to maintain class names and public method names to match Clojure as best I can.  Anything starting with java. will have to be replaced.
  • Interop calls. Scan for (.name,  (name.,  and (CapName/name.  If you are lucky, a simple capitalization will handle many of the (.name calls.
  • Type hints.  Most type hints of classes outside clojure.lang will need to be changed.  Carets feel like sticks.
  • Exceptions.  The only exception class that works unchanged is Exception itself.
  • CapitalLetters. Given that Clojure code is mostly lowercase, anything with CapitalLetters is likely to be trouble.

Do simple renamings.  Hack out the bad imports and add new ones as you work through the code.  Work through each interop call and determine equivalents; ditto for type hints and exceptions.
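
As a rough illustration of this kind of renaming, the sketch below shows a ported ns directive and one small interop-heavy function.  The namespace example.port-imports and the function first-non-whitespace are made up for illustration and are not taken from data.json; the comments record the Java names that would typically have stood in each place.

```clojure
(ns example.port-imports                                      ; hypothetical namespace
  (:import (System.IO StringReader EndOfStreamException)      ; replaces java.io StringReader, EOFException
           (clojure.lang PushbackTextReader)))                ; replaces java.io PushbackReader

(defn first-non-whitespace
  "Returns the first non-whitespace character read from s.
   Hypothetical helper, for illustration only."
  [^String s]
  (let [rdr (PushbackTextReader. (StringReader. s))]
    (loop [i (.Read rdr)]                                     ; was .read
      (cond
        ;; Read returns -1 at end of input, as on the JVM
        (= i -1) (throw (EndOfStreamException. "end of input"))
        (Char/IsWhiteSpace (char i)) (recur (.Read rdr))      ; was Character/isWhitespace, .read
        :else (char i)))))
```

For example, (first-non-whitespace "  {}") should skip the two spaces and return the opening brace.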

I adopt a uniform method of marking the places where I make changes.  This makes it easier to update the port as the original lib changes.  You will see lines like this in my code:

 (Char/IsWhiteSpace c) (recur (.Read stream) result)             ;DM: Character/isWhitespace .read   

where the comment shows what the original code contained.

For json.clj, I made the following simple changes.

Method renamings:

From                      To                   Count
(.read                    (.Read               30
(.append                  (.Append             14
(.print                   (.Write              11
(.unread                  (.Unread             3
(Character/isWhitespace   (Char/IsWhiteSpace   3
(.isArray                 (.IsArray            2
(.charAt                  (.get_Chars          1
(.toString                (.ToString           1

Type renamings:

From             To                     Count
PushbackReader   PushbackTextReader     9
PrintWriter      TextWriter             5
PrintWriter      StreamWriter           1
CharSequence     String                 1
EOFException     EndOfStreamException   5

PrintWriter was mostly used as a type hint, where TextWriter could be substituted.  In one place, it was constructed with new.  TextWriter is abstract and can't be constructed, so StreamWriter had to be used in that place.
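
The hint-versus-construction distinction can be sketched as follows.  The function names are hypothetical, and StringWriter stands in for StreamWriter only so that the example needs no file or stream; both are concrete subclasses of TextWriter.

```clojure
(import '(System.IO TextWriter StringWriter))

;; TextWriter is abstract, so it appears only as a type hint.
(defn write-null [^TextWriter out]
  (.Write out "null"))

;; Wherever a writer is actually constructed, a concrete subclass is needed.
(defn null-as-string []
  (let [sw (StringWriter.)]
    (write-null sw)
    (.ToString sw)))
```

Hinting with the abstract type keeps write-null usable with any writer the caller supplies.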

This project is obviously quite I/O-centric.  Fortunately, there is sufficient commonality between the I/O libraries on the JVM and CLR to make this part of the translation fairly routine.

At this point, you are probably very close to being able to load the file. You can try loading or compiling and let the errors guide you.

Check for reflection warnings.  I find it useful to  (set! *warn-on-reflection* true) at the start of each code file.  I don't do this for performance, but it can catch bad type hints and misnamed methods in the code.   The perf I care about is coding perf, not runtime.
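
A minimal sketch of what this looks like in a source file follows.  The function names are invented for the example.  The hinted version lets the compiler resolve .Read against TextReader, while the unhinted version still runs but should produce a reflection warning when the file is loaded.

```clojure
(set! *warn-on-reflection* true)

(import '(System.IO TextReader StringReader))

;; With the hint, .Read can be resolved at compile time.  A leftover Java
;; name such as .read would be flagged here instead of failing later.
(defn read-char-code [^TextReader rdr]
  (.Read rdr))

;; Without a hint the call is resolved at runtime, and the compiler
;; reports that it could not resolve the call statically.
(defn read-char-code-unhinted [rdr]
  (.Read rdr))
```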

Do the hard parts. The remaining changes were not as easy.  There were two other type renamings, not I/O-related.

  • java.util.Map => System.Collections.IDictionary
  • java.util.Collection => System.Collections.ICollection

If you see Java collection types, you are going to have to think about the correct alternatives.  In general, we have no way to deal with CLR generic collection types -- most uses of these type names will require the type parameters to be instantiated, so you can't say you want all System.Collections.Generic.ICollection<T> to be handled. However, most of the generic collection types will provide non-generic interfaces to work with.  That is the solution I chose for this code:

(defn- pprint-json-dispatch [x escape-unicode]
 (cond (nil? x) (print "null")
       (instance? System.Collections.IDictionary x) 
                        (pprint-json-object x escape-unicode)        ;DM: java.util.Map
       (instance? System.Collections.ICollection x) 
                        (pprint-json-array x escape-unicode)         ;DM: java.util.Collection
       (instance? clojure.lang.ISeq x) 
                        (pprint-json-array x escape-unicode)
       :else (pprint-json-generic x escape-unicode)))
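
The ordering of these tests matters on the CLR, because IDictionary itself extends ICollection.  The sketch below, with a hypothetical json-kind function that is not part of the port, shows the same dispatch reduced to keywords so that it can be tried on its own.

```clojure
(defn json-kind
  "Classifies x the way the dispatch above would.  Hypothetical helper."
  [x]
  (cond
    (nil? x) :null
    ;; Must come first: every IDictionary is also an ICollection.
    (instance? System.Collections.IDictionary x) :object
    (instance? System.Collections.ICollection x) :array
    :else :scalar))
```

A System.Collections.Hashtable should classify as :object and a System.Collections.ArrayList as :array; a string falls through to :scalar because String does not implement ICollection.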

Another common trouble spot is dealing with the Java primitive wrapper classes, such as Integer.  You most likely won't see many type hints, but you will see calls to static methods.  There is only one in this code.  In read-json-hex-character, I translated (Integer/parseInt s 16) to  (Int32/Parse s NumberStyles/HexNumber).  (I added an import for System.Globalization.NumberStyles.)  I don't have a list of rules for this.  You're on your own.
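
That translation can be tried in isolation with a sketch like the following.  The function hex->char is a hypothetical stand-in, not the body of read-json-hex-character.

```clojure
(import '(System.Globalization NumberStyles))

(defn hex->char
  "Converts a string of hex digits, such as the four digits after \\u, to a char."
  [^String s]
  (char (Int32/Parse s NumberStyles/HexNumber)))   ; was (Integer/parseInt s 16)
```

The two calls are not exact equivalents at the edges: NumberStyles/HexNumber tolerates leading and trailing whitespace and rejects a leading sign, whereas Integer/parseInt does the opposite.  For the fixed four-digit escapes in JSON this difference should not matter.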

Extending protocols to Java library classes will also cause you to spend some time thinking.  The primary difficulty here came in extensions to the Write-JSON protocol.  Extensions to things like nil and clojure.lang.Named were unchanged.  Other extensions had to be modified.

You will have a problem anytime you run into java.lang.Number.  This is a base class for all the primitive numeric wrapper classes such as Integer.  There is no equivalent in the CLR. I created extensions for each primitive numeric type.

(extend System.Byte Write-JSON {:write-json write-json-plain})
(extend System.SByte Write-JSON {:write-json write-json-plain})
(extend System.Int16 Write-JSON {:write-json write-json-plain})
...
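
Because extend is an ordinary function, one possible way to cut down the repetition is to loop over the types.  This is only a sketch: the protocol Write-Number-Sketch and the function number->plain are simplified, hypothetical stand-ins for Write-JSON and write-json-plain, and the list covers just the integer types.

```clojure
(defprotocol Write-Number-Sketch          ; hypothetical stand-in for Write-JSON
  (write-number [x]))

(defn- number->plain [x]                  ; hypothetical stand-in for write-json-plain
  (str x))

;; One extend call per CLR integer type, since there is no common Number base class.
(doseq [t [System.Byte  System.SByte
           System.Int16 System.UInt16
           System.Int32 System.UInt32
           System.Int64 System.UInt64]]
  (extend t Write-Number-Sketch {:write-number number->plain}))
```

Writing the extends out one per line, as in the port, has the advantage that each line can carry its own change marker.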

Also, java.math.BigInteger and java.math.BigDecimal will need to be translated to clojure.lang.BigInteger and clojure.lang.BigDecimal, and CharSequence usually goes to String.

The only significant algorithmic change was in write-json-string.  Stuart was careful to properly handle full Unicode, which means not just iterating through the string character by character but dealing with actual Unicode code points as expressed in the UTF-16 encoding used in Java strings.  CLR strings use the same encoding, but the proper way to iterate through Unicode code points is different.  This took the most time for me to translate due to my own ignorance.  I ended up with

(defn- write-json-string [^String s ^PrintWriter out escape-unicode?]
  (let [sb (StringBuilder. ^Integer(count s))]
    (.append sb \")
    (dotimes [i (count s)]
      (let [cp (Character/codePointAt s i)]
        (cond
         ;; Handle printable JSON escapes before ASCII
         (= cp 34) (.append sb "\\\"")
...
         (< 31 cp 127) (.append sb (.charAt s i))
...
         :else (if escape-unicode?
                 ;; Hexadecimal-escaped
                 (.append sb (format "\\u%04x" cp))
                 (.appendCodePoint sb cp)))))
 ...

becoming

(defn- write-json-string [^String s ^TextWriter out escape-unicode?]
  (let [sb (StringBuilder. ^Int32 (count s))
        chars32 (StringInfo/ParseCombiningCharacters s)]
    (.Append sb \")
    (dotimes [i (count chars32)]
      (let [cp (Char/ConvertToUtf32 s (aget chars32 i))]
        (cond
         ;; Handle printable JSON escapes before ASCII
         (= cp 34) (.Append sb "\\\"")
...
         (< 31 cp 127) (.Append sb (.get_Chars s i))
...
         :else (if escape-unicode?
                 ;; Hexadecimal-escaped
                 (.Append sb (format "\\u%04x" cp))
                 (.Append sb  (Char/ConvertFromUtf32 cp))))))
...
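
For experimenting with code-point iteration outside the port, here is a different technique from the one above, shown only as a sketch: step through the string by one or two UTF-16 chars depending on the size of each code point.  The function code-points is hypothetical, and the sketch assumes well-formed input, since Char/ConvertToUtf32 throws on an unpaired surrogate.

```clojure
(defn code-points
  "Returns a vector of the Unicode code points in s.
   Assumes s is well-formed UTF-16.  Hypothetical helper."
  [^String s]
  (loop [i 0, acc []]
    (if (< i (.Length s))
      (let [cp (Char/ConvertToUtf32 s (int i))]
        ;; Code points above U+FFFF occupy two chars (a surrogate pair),
        ;; so skip the low surrogate as well.
        (recur (+ i (if (> cp 0xFFFF) 2 1)) (conj acc cp)))
      acc)))
```

Unlike StringInfo/ParseCombiningCharacters, which reports the starting index of each base character, this walk visits combining marks as separate code points.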

Translate your tests.  This is generally easier.  The only changes I had to make were to some setup code for certain tests.  Only a handful of changes were required.

Test. Repair. Repeat.  Good luck.

After just a few iterations, I had all tests passing except one.  In the pretty-print test, I was getting output for some floating-point numbers that were exact integers without a .0 at the end, as the test required.  Having run into this before, I recalled that I had defined a helper function named fp-str to take care of this.  I added

(defn- write-json-float [x ^TextWriter out escape-unicode?] 
  (.Write out (fp-str x)))                                  

and then modified the two float-type extends:

(extend System.Double Write-JSON {:write-json write-json-float})
(extend System.Single Write-JSON {:write-json write-json-float})
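
The body of fp-str is not shown here.  The following is a guess at the kind of helper that would do the job, under the assumption that the goal is simply to append .0 when the formatted number contains only digits.  The name fp-str-sketch and its implementation are hypothetical and are not the helper used in the port.

```clojure
(import '(System.Globalization CultureInfo))

(defn fp-str-sketch
  "Formats a floating-point value so that integral values keep a trailing .0.
   Hypothetical; x is left unhinted so that both Double and Single are accepted."
  [x]
  (let [^String s (.ToString x "R" CultureInfo/InvariantCulture)]
    ;; Only a bare run of digits needs fixing; values that already contain
    ;; a decimal point or an exponent are left alone.
    (if (re-matches #"-?\d+" s)
      (str s ".0")
      s)))
```

Using the invariant culture keeps the decimal separator a period regardless of the machine's regional settings, which matters for JSON output.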

GREEN!!!!

Celebrate.  
Publish.

I leave these steps to you.


Next steps? Choose something to port and go for it.  


Many of the contrib libs will require significantly less work than this.  There are others that will require almost complete rewrites.  Outside of the core contrib libs, many of the requested ports such as leiningen and the web frameworks are going to be a lot of work.

Resources

* data.json:  https://github.com/clojure/data.json
* clrclj.data.json -- https://github.com/dmiller/clrclj.data.json
